Fine-Tuning Methods¶

A pretrained protein structure model (AlphaFold2 / OpenFold) has learned general-purpose representations from hundreds of thousands of proteins. But if you care about a specific protein family (kinases, antibodies, ion channels) or a specific property (binding affinity, thermostability), fine-tuning adapts those representations to your domain.

This page covers the full spectrum of fine-tuning approaches, from updating every parameter to injecting tiny low-rank adapters, with the mathematical foundations you need to make informed decisions.

Why Fine-Tune?¶

graph LR
    PRETRAINED["Pretrained Model<br/><small>General protein knowledge<br/>~93M parameters</small>"] --> |"Fine-tune on<br/>your domain"| ADAPTED["Adapted Model<br/><small>Domain-specific<br/>representations</small>"]
    ADAPTED --> HEAD["+ Prediction Head<br/><small>MLP for your property</small>"]
    HEAD --> PREDICTION["ΔG, Tm, IC50, ...<br/><small>Your prediction</small>"]

    style PRETRAINED fill:#3b82f6,stroke:#2563eb,color:#ffffff
    style ADAPTED fill:#7c3aed,stroke:#6d28d9,color:#ffffff
    style HEAD fill:#d97706,stroke:#b45309,color:#ffffff
    style PREDICTION fill:#16a34a,stroke:#15803d,color:#ffffff

The pretrained trunk encodes rich per-residue and pairwise features. But these features were optimized for structure prediction, not for predicting binding affinity or classifying enzyme function. Fine-tuning shifts the representations so they become more informative for your downstream task.

Analogy

Imagine a photographer who has mastered composition, lighting, and color theory (the pretrained model). You want them to specialize in astrophotography (your domain). You do not teach them photography from scratch --- you build on their existing skills and add domain-specific knowledge about long exposures, star trackers, and light pollution. That is fine-tuning.

1. Full Fine-Tuning¶

How It Works¶

All parameters in the model are unfrozen and updated by gradient descent. This gives the optimizer maximum freedom to adapt the model.

from molfun.training import FullFinetune

strategy = FullFinetune(
    lr=1e-5,               # Small LR to preserve pretrained knowledge
    weight_decay=0.01,
    warmup_steps=500,
    lr_decay_factor=0.95,  # Layer-wise LR decay
)

Layer-wise Learning Rate Decay¶

A critical technique for full fine-tuning: earlier layers (closer to the input) receive a smaller learning rate than later layers. This preserves the low-level features learned during pretraining while allowing the high-level features to adapt more aggressively.

\[ \text{lr}_l = \text{lr}_{\text{base}} \times \lambda^{L - l} \]

where \(l\) is the layer index, \(L\) is the total number of layers, and \(\lambda \in (0, 1)\) is the decay factor (typically 0.9--0.95).

Layer 48 (top):     lr = 1e-5
Layer 47:           lr = 1e-5 × 0.95 = 9.5e-6
Layer 46:           lr = 1e-5 × 0.95² = 9.0e-6
...
Layer 1 (bottom):   lr = 1e-5 × 0.95⁴⁷ = 8.8e-7

Catastrophic Forgetting¶

The main risk of full fine-tuning is catastrophic forgetting: the model overwrites its general-purpose protein knowledge with domain-specific patterns, losing its ability to generalize.

graph LR
    A["Pretrained<br/>General knowledge"] --> |"Full fine-tune<br/>(high LR, many epochs)"| B["Overfitted<br/>Domain-only knowledge"]
    A --> |"Full fine-tune<br/>(careful LR, early stopping)"| C["Balanced<br/>General + domain"]

    style A fill:#3b82f6,stroke:#2563eb,color:#ffffff
    style B fill:#dc2626,stroke:#b91c1c,color:#ffffff
    style C fill:#16a34a,stroke:#15803d,color:#ffffff

Mitigations:

Technique	How it helps
Low learning rate (1e-5 to 5e-6)	Limits the magnitude of weight updates
Layer-wise LR decay	Protects early layers more
Warmup (500--1000 steps)	Prevents large updates before the optimizer stabilizes
EMA (Exponential Moving Average)	Maintains a smoothed copy of weights
Early stopping	Stops training before overfitting
Weight decay (0.01--0.1)	Regularizes by penalizing large weights

When to Use Full Fine-Tuning¶

Large dataset (> 10,000 proteins)
Significant distribution shift from pretraining data
Compute budget is not a concern
You need maximum accuracy

2. Partial Fine-Tuning (Freezing)¶

How It Works¶

Freeze the earlier layers of the model and only update the last \(N\) blocks. This drastically reduces the number of trainable parameters and the risk of catastrophic forgetting.

from molfun.training import PartialFinetune

strategy = PartialFinetune(
    n_unfrozen_blocks=4,   # Only update last 4 of 48 Evoformer blocks
    lr=5e-5,
    warmup_steps=200,
)

Why It Works¶

The early Evoformer blocks learn low-level features (local amino acid context, secondary structure patterns) that are largely universal across proteins. The later blocks learn high-level features (global contacts, domain-specific patterns) that benefit most from adaptation.

Blocks 1-44:  FROZEN   (low-level, universal features)
Blocks 45-48: TRAINABLE (high-level, task-specific features)
Head:         TRAINABLE

Trainable Parameters¶

Configuration	Trainable params	% of total
Full fine-tune	~93M	100%
Last 8 blocks	~15M	16%
Last 4 blocks	~7.5M	8%
Last 1 block	~1.9M	2%
Head only	~0.1--1M	< 1%

When to Use Partial Fine-Tuning¶

Medium dataset (1,000--10,000 proteins)
Moderate distribution shift
Limited GPU memory
Good balance of quality and efficiency

3. Head-Only Fine-Tuning¶

The simplest approach: freeze the entire trunk and only train a new prediction head (MLP, linear layer) on top of the frozen representations.

from molfun.training import HeadOnlyFinetune

strategy = HeadOnlyFinetune(
    lr=1e-3,           # Higher LR since only head is trained
    weight_decay=0.01,
)

The trunk acts as a fixed feature extractor. This is fast and safe (no risk of forgetting), but the representations may not be optimal for your task.

When to Use¶

Small dataset (< 500 proteins)
Quick experiments
The pretrained representations are already close to what you need

4. LoRA (Low-Rank Adaptation)¶

The Core Idea¶

Instead of updating the full weight matrices, LoRA freezes all pretrained weights and injects small, trainable low-rank matrices into the attention layers. This achieves adaptation with a tiny fraction of the parameters.

For a pretrained weight matrix \(W_0 \in \mathbb{R}^{d \times d}\), LoRA adds a low-rank update:

\[ W = W_0 + \Delta W = W_0 + B \cdot A \]

where:

\(A \in \mathbb{R}^{r \times d}\) (down-projection)
\(B \in \mathbb{R}^{d \times r}\) (up-projection)
\(r \ll d\) is the rank (typically 4--16)

graph LR
    X["Input x<br/>(d-dimensional)"] --> W0["W₀ · x<br/><small>Frozen pretrained<br/>weights (d×d)</small>"]
    X --> A["A · x<br/><small>Down-project<br/>(d → r)</small>"]
    A --> B["B · (A·x)<br/><small>Up-project<br/>(r → d)</small>"]
    W0 --> SUM(("+"))
    B --> |"× α/r"| SUM
    SUM --> OUT["Output<br/>(d-dimensional)"]

    style X fill:#3b82f6,stroke:#2563eb,color:#ffffff
    style W0 fill:#475569,stroke:#64748b,color:#e2e8f0
    style A fill:#d97706,stroke:#b45309,color:#ffffff
    style B fill:#d97706,stroke:#b45309,color:#ffffff
    style SUM fill:#16a34a,stroke:#15803d,color:#ffffff
    style OUT fill:#16a34a,stroke:#15803d,color:#ffffff

Initialization¶

The initialization is asymmetric and critical for stable training:

Matrix A: Initialized from a Gaussian distribution \(\mathcal{N}(0, \sigma^2)\)
Matrix B: Initialized to zeros

This means \(\Delta W = B \cdot A = 0\) at the start of training --- the model begins exactly where the pretrained model left off. Training then gradually learns a task-specific perturbation.

Down-Projection and Up-Projection¶

The two matrices serve complementary roles:

Matrix	Shape	Role	Intuition
A (down-projection)	\(r \times d\)	Compress the input to a low-dimensional bottleneck	Find the \(r\) most important directions for the task
B (up-projection)	\(d \times r\)	Expand back to the original dimension	Map the task-specific signal back to the model's space

The bottleneck rank \(r\) controls the expressiveness of the adaptation. A rank of 8 means the model can only modify the weight matrix along 8 directions --- but for domain adaptation, this is often sufficient because the change from general proteins to a specific family is low-rank in nature.

Scaling Factor: Alpha and Rank¶

The LoRA output is scaled by \(\alpha / r\) before being added to the pretrained output:

\[ h = W_0 x + \frac{\alpha}{r} B A x \]

Parameter	Typical values	Effect
Rank (\(r\))	4, 8, 16, 32	Higher = more expressive but more parameters
Alpha (\(\alpha\))	\(r\), \(2r\), 16, 32	Higher = stronger adaptation signal
Effective scaling (\(\alpha / r\))	1.0, 2.0	The actual multiplier on the LoRA output

Best practices for rank and alpha

Start with rank 8, alpha 16 (effective scaling = 2.0). This is a robust default that works well across many tasks.
Increase rank if you see underfitting (training loss plateaus high). Try 16 or 32.
Decrease rank if you see overfitting on small datasets. Try 4.
Alpha = 2 × rank is a common heuristic. The scaling factor \(\alpha / r = 2\) provides a good balance between adaptation strength and stability.
Apply LoRA to all attention projections (Q, K, V, and output). Research shows this consistently outperforms applying it to Q and V only.

Parameter Efficiency¶

For an attention layer with \(d = 256\) and 4 projections (Q, K, V, O):

Method	Trainable params per layer	Total (48 layers)
Full fine-tune	\(4 \times 256 \times 256 = 262\text{K}\)	12.6M
LoRA (r=8)	\(4 \times 2 \times 256 \times 8 = 16\text{K}\)	786K
LoRA (r=4)	\(4 \times 2 \times 256 \times 4 = 8\text{K}\)	393K

LoRA reduces trainable parameters by 16--32x while retaining 90--95% of full fine-tuning quality.

Merging¶

After training, the LoRA matrices can be merged back into the pretrained weights for zero-overhead inference:

\[ W_{\text{merged}} = W_0 + \frac{\alpha}{r} B A \]

model.merge()    # Fold LoRA into base weights
model.save("merged_model/")  # No adapter overhead at inference

from molfun.training import LoRAFinetune

strategy = LoRAFinetune(
    rank=8,
    alpha=16.0,
    lr_lora=1e-4,
    lr_head=1e-3,
    warmup_steps=100,
    ema_decay=0.999,
)

5. QLoRA (Quantized LoRA)¶

QLoRA combines LoRA with 4-bit quantization of the base model weights, enabling fine-tuning of large models on consumer GPUs.

How It Works¶

Quantize the pretrained weights to 4-bit NormalFloat (NF4) format
Freeze the quantized weights
Add LoRA adapters in full precision (bfloat16) on top
During the forward pass, dequantize on the fly and add the LoRA contribution

\[ h = \text{dequant}(W_0^{\text{4bit}}) \cdot x + \frac{\alpha}{r} B A x \]

Key Innovations¶

Innovation	Description
4-bit NormalFloat (NF4)	An information-theoretically optimal data type for normally distributed weights. Quantizes values into quantiles, minimizing information loss.
Double quantization	Quantizes the quantization constants themselves, saving ~0.5 bits per parameter.
Paged optimizers	Uses CPU memory as overflow when GPU memory is exhausted, preventing OOM crashes.

Memory Comparison¶

Method	VRAM for 93M param model
Full fine-tune (fp32)	~1.4 GB
Full fine-tune (fp16)	~0.7 GB
LoRA (fp16)	~0.7 GB (base) + ~3 MB (adapters)
QLoRA (NF4)	~0.18 GB (base) + ~3 MB (adapters)

When to use QLoRA

QLoRA shines for very large models where even storing the base weights in fp16 is challenging. For protein models (~93M params), the memory savings are less dramatic than for LLMs (~7B+ params), but QLoRA can still be useful when running on limited hardware or when training multiple adapters simultaneously.

6. IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)¶

IA3 is an even more parameter-efficient method than LoRA. Instead of adding low-rank matrices, it learns element-wise scaling vectors that modulate the model's activations.

How It Works¶

For keys, values, and feed-forward layers, IA3 introduces a learned vector \(\ell \in \mathbb{R}^d\) that scales the output element-wise:

\[ h_{\text{keys}} = \ell_k \odot W_k x \qquad h_{\text{values}} = \ell_v \odot W_v x \qquad h_{\text{ffn}} = \ell_{ff} \odot f(W_{ff} x) \]

where \(\odot\) is element-wise multiplication and \(\ell\) is initialized to ones (so the model starts identical to the pretrained version).

Comparison with LoRA¶

Aspect	LoRA (r=8)	IA3
Trainable params per layer	~16K	~768 (3 vectors of 256)
Total trainable params (48 layers)	~786K	~37K
Expressiveness	Can change the direction of transformations	Can only scale existing directions
Best for	Moderate distribution shift	Minimal distribution shift, extremely small datasets

from molfun.training.peft import MolfunPEFT

peft = MolfunPEFT.ia3(
    target_modules=["linear_q", "linear_v"],
    feedforward_modules=["ffn"],
)
adapted_model = peft.apply(model)

7. BitFit (Bias-Term Fine-Tuning)¶

BitFit fine-tunes only the bias terms in the model, leaving all weight matrices frozen.

How It Works¶

Every linear layer \(y = Wx + b\) has a bias vector \(b\). BitFit freezes \(W\) and only updates \(b\). Since bias vectors are much smaller than weight matrices, this is extremely parameter-efficient.

Parameter Count¶

For a linear layer with input \(d_{\text{in}}\) and output \(d_{\text{out}}\):

Weight matrix: \(d_{\text{in}} \times d_{\text{out}}\) parameters (frozen)
Bias vector: \(d_{\text{out}}\) parameters (trainable)

This means BitFit trains roughly \(1/d_{\text{in}}\) of the parameters compared to full fine-tuning --- typically < 0.1% of the total.

When It Works¶

BitFit is surprisingly effective for small distribution shifts, especially in NLP tasks. For protein models, it can work well when:

Your domain is close to the pretraining distribution
You have a very small dataset (< 100 samples)
You want the absolute minimum risk of overfitting

However, for larger distribution shifts (e.g., from general proteins to a specific enzyme family), LoRA typically outperforms BitFit.

8. Practical Training Techniques¶

These techniques apply across all fine-tuning strategies and are essential for achieving good results.

8.1 Warmup¶

Gradually increase the learning rate from 0 to the target value over the first \(N\) steps. This prevents large, destructive updates when the optimizer is still cold.

\[ \text{lr}(t) = \text{lr}_{\text{target}} \cdot \min\!\left(1, \frac{t}{T_{\text{warmup}}}\right) \]

Warmup rule of thumb

Use 5--10% of total training steps for warmup. For 10 epochs of 100 batches = 1000 steps, warmup for 50--100 steps.

8.2 EMA (Exponential Moving Average)¶

Maintain a shadow copy of the model weights that is updated as a running average:

\[ \theta_{\text{EMA}}^{(t)} = \alpha \cdot \theta_{\text{EMA}}^{(t-1)} + (1 - \alpha) \cdot \theta^{(t)} \]

where \(\alpha\) is the decay rate (typically 0.999 or 0.9999). The EMA weights are used for evaluation and often produce smoother, more generalizable predictions than the raw training weights.

8.3 Cosine Learning Rate Schedule¶

After warmup, the learning rate follows a cosine decay:

\[ \text{lr}(t) = \text{lr}_{\min} + \frac{1}{2}(\text{lr}_{\text{max}} - \text{lr}_{\min})\left(1 + \cos\!\left(\frac{\pi t}{T}\right)\right) \]

This provides a smooth annealing that empirically works better than step decay for fine-tuning.

8.4 Gradient Clipping¶

Clip gradient norms to prevent training instability:

\[ \hat{g} = g \cdot \min\!\left(1, \frac{c}{\|g\|}\right) \]

where \(c\) is the clipping threshold (typically 1.0). This is particularly important for full fine-tuning where large gradients can destroy pretrained features.

8.5 Mixed Precision (AMP)¶

Train with automatic mixed precision (float16 forward pass, float32 for loss and optimizer state) to halve memory usage and increase throughput. All Molfun strategies support this via amp=True.

9. Decision Guide¶

graph TD
    START["How much data<br/>do you have?"] --> |"< 100 samples"| HEADONLY["Head-Only or BitFit<br/><small>Safest, no forgetting</small>"]
    START --> |"100 - 1,000"| LORA["LoRA (rank 4-8)<br/><small>Best efficiency/quality</small>"]
    START --> |"1,000 - 10,000"| PARTIAL["LoRA (rank 8-16) or<br/>Partial (last 4 blocks)<br/><small>More expressiveness</small>"]
    START --> |"> 10,000"| FULL["Full Fine-Tuning<br/><small>Maximum accuracy</small>"]

    HEADONLY --> EVAL["Evaluate on<br/>validation set"]
    LORA --> EVAL
    PARTIAL --> EVAL
    FULL --> EVAL

    EVAL --> |"Underfitting"| UP["↑ Rank, ↑ unfrozen blocks,<br/>or switch to full"]
    EVAL --> |"Overfitting"| DOWN["↓ Rank, ↑ weight decay,<br/>↑ dropout, early stopping"]
    EVAL --> |"Good fit"| DONE["Deploy!"]

    style START fill:#3b82f6,stroke:#2563eb,color:#ffffff
    style HEADONLY fill:#16a34a,stroke:#15803d,color:#ffffff
    style LORA fill:#d97706,stroke:#b45309,color:#ffffff
    style PARTIAL fill:#7c3aed,stroke:#6d28d9,color:#ffffff
    style FULL fill:#dc2626,stroke:#b91c1c,color:#ffffff
    style EVAL fill:#0891b2,stroke:#0e7490,color:#ffffff
    style UP fill:#d97706,stroke:#b45309,color:#ffffff
    style DOWN fill:#16a34a,stroke:#15803d,color:#ffffff
    style DONE fill:#16a34a,stroke:#15803d,color:#ffffff

Strategy Comparison Table¶

Strategy	Trainable Params	Memory	Risk of Forgetting	Best Dataset Size	Training Speed
Head-Only	< 1%	Lowest	None	< 500	Fastest
BitFit	< 0.1%	Lowest	Very low	< 100	Fastest
IA3	~0.04%	Low	Very low	< 500	Fast
LoRA (r=8)	~1%	Low	Low	100--10K	Fast
QLoRA (r=8)	~1% (4-bit base)	Lowest	Low	100--10K	Moderate
Partial (4 blocks)	~8%	Medium	Moderate	1K--10K	Moderate
Full	100%	Highest	High	> 10K	Slowest

10. The Big Picture: Domain Adaptation + Custom Heads¶

The real power of fine-tuning in Molfun comes from combining domain adaptation with custom prediction heads:

graph TD
    GENERAL["Pretrained OpenFold<br/><small>General protein knowledge</small>"] --> |"LoRA fine-tune on<br/>kinase structures"| KINASE["Kinase-Adapted Model<br/><small>Kinase-specific representations</small>"]

    KINASE --> HEAD_AFF["+ Affinity Head<br/><small>Predict Kd for kinase inhibitors</small>"]
    KINASE --> HEAD_STAB["+ Stability Head<br/><small>Predict ΔΔG for mutations</small>"]
    KINASE --> HEAD_FUNC["+ Function Head<br/><small>Classify kinase subfamily</small>"]

    HEAD_AFF --> PRED_AFF["IC₅₀ predictions"]
    HEAD_STAB --> PRED_STAB["Stability scores"]
    HEAD_FUNC --> PRED_FUNC["Subfamily labels"]

    style GENERAL fill:#3b82f6,stroke:#2563eb,color:#ffffff
    style KINASE fill:#7c3aed,stroke:#6d28d9,color:#ffffff
    style HEAD_AFF fill:#d97706,stroke:#b45309,color:#ffffff
    style HEAD_STAB fill:#d97706,stroke:#b45309,color:#ffffff
    style HEAD_FUNC fill:#d97706,stroke:#b45309,color:#ffffff
    style PRED_AFF fill:#16a34a,stroke:#15803d,color:#ffffff
    style PRED_STAB fill:#16a34a,stroke:#15803d,color:#ffffff
    style PRED_FUNC fill:#16a34a,stroke:#15803d,color:#ffffff

Start with a pretrained model that understands protein structure
Fine-tune on your domain (kinases, antibodies, GPCRs) using LoRA or partial fine-tuning, so the representations become domain-aware
Add a head for your specific property (affinity, stability, function)
Train the head on labeled data --- the domain-adapted representations make this dramatically more effective than training from scratch

This is exactly the workflow that Boltz-2 follows for binding affinity, and it is the core pattern that Molfun makes accessible through a simple, unified API.

References¶

Hu, E. J., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized Language Models. NeurIPS 2023.
Liu, H., et al. (2022). Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. NeurIPS 2022. (IA3)
Zaken, E. B., et al. (2022). BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. ACL 2022.
Wohlwend, J., et al. (2025). Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. bioRxiv.