Binding Affinity Prediction¶

Intermediate ~30 min

Predict protein-ligand binding affinity (Kd/Ki) using data from PDBbind and LoRA fine-tuning. This tutorial also compares LoRA against full fine-tuning to show the efficiency gains from parameter-efficient approaches.

What You Will Learn¶

Fetch binding affinity data with AffinityFetcher
Build an AffinityDataset with protein sequences and ligand SMILES
Fine-tune with LoRAFinetune (rank=8)
Evaluate with Pearson correlation
Compare LoRA vs full fine-tuning

Prerequisites¶

Molfun installed with data extras: pip install molfun[data]
GPU recommended (LoRA fine-tuning is faster on CUDA)

Step 1: Fetch Binding Affinity Data¶

The AffinityFetcher downloads and parses PDBbind data, returning protein sequences paired with experimentally measured binding affinities.

from molfun.data import AffinityFetcher, AffinityDataset, DataSplitter

# Fetch PDBbind refined set
fetcher = AffinityFetcher()
records = fetcher.fetch(
    subset="refined",       # "refined" or "general"
    max_samples=500,        # Limit for this tutorial
)

print(f"Fetched {len(records)} protein-ligand pairs")
print(f"Example: {records[0].pdb_id}, pKd = {records[0].affinity:.2f}")

What is in a record?

Each record contains:

pdb_id --- PDB identifier
sequence --- Protein amino acid sequence
ligand_smiles --- Ligand in SMILES notation
affinity --- Binding affinity as pKd or pKi (log-transformed)

Step 2: Create the Dataset¶

from torch.utils.data import DataLoader

dataset = AffinityDataset(
    records=records,
    max_seq_length=512,
)

# 80/10/10 train/val/test split
splitter = DataSplitter(val_size=0.1, test_size=0.1, random_state=42)
train_ds, val_ds, test_ds = splitter.split(dataset, n_splits=3)

train_loader = DataLoader(train_ds, batch_size=8, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=8)
test_loader = DataLoader(test_ds, batch_size=8)

print(f"Train: {len(train_ds)} | Val: {len(val_ds)} | Test: {len(test_ds)}")

Step 3: Fine-Tune with LoRA¶

LoRA inserts low-rank adapters into the attention layers of the pretrained trunk. This updates only a small fraction of the parameters while achieving results close to full fine-tuning.

from molfun import MolfunStructureModel
from molfun.training import LoRAFinetune

model = MolfunStructureModel.from_pretrained(
    "openfold_v1",
    device="cuda",
    head="affinity",
    head_config={"hidden_dim": 256, "num_layers": 2, "dropout": 0.1},
)

strategy = LoRAFinetune(
    rank=8,                # LoRA rank (lower = fewer params)
    alpha=16.0,            # LoRA scaling factor
    lr_lora=1e-4,          # Learning rate for LoRA adapters
    lr_head=1e-3,          # Learning rate for prediction head
    warmup_steps=100,      # Linear warmup steps
    ema_decay=0.999,       # Exponential moving average for weights
)

model.fit(
    train_loader=train_loader,
    val_loader=val_loader,
    strategy=strategy,
    epochs=15,
    checkpoint_dir="checkpoints/affinity_lora",
)

Checkpoint directory

Setting checkpoint_dir saves a checkpoint after each epoch. Training can be resumed from the last checkpoint if interrupted.

Step 4: Evaluate with Pearson Correlation¶

import torch
import numpy as np
from scipy.stats import pearsonr

model.eval()
predictions, actuals = [], []

with torch.no_grad():
    for batch in test_loader:
        output = model.predict(batch["sequence"])
        predictions.extend(output.scores.cpu().numpy())
        actuals.extend(batch["labels"].cpu().numpy())

predictions = np.array(predictions)
actuals = np.array(actuals)

r, p_value = pearsonr(actuals, predictions)
rmse = np.sqrt(np.mean((predictions - actuals) ** 2))

print(f"Pearson r: {r:.4f} (p = {p_value:.2e})")
print(f"RMSE:      {rmse:.4f} pKd units")

Step 5: Compare LoRA vs Full Fine-Tuning¶

To understand the benefit of LoRA, let us train a second model using full fine-tuning and compare.

LoRA Fine-TuneFull Fine-Tune

from molfun.training import LoRAFinetune

lora_strategy = LoRAFinetune(rank=8, alpha=16.0, lr_lora=1e-4, lr_head=1e-3)

lora_model = MolfunStructureModel.from_pretrained(
    "openfold_v1", device="cuda",
    head="affinity",
    head_config={"hidden_dim": 256, "num_layers": 2, "dropout": 0.1},
)
lora_model.fit(
    train_loader=train_loader, val_loader=val_loader,
    strategy=lora_strategy, epochs=15,
)

from molfun.training import FullFinetune

full_strategy = FullFinetune(
    lr=5e-5,
    weight_decay=0.01,
    warmup_steps=200,
    lr_decay_factor=0.95,
)

full_model = MolfunStructureModel.from_pretrained(
    "openfold_v1", device="cuda",
    head="affinity",
    head_config={"hidden_dim": 256, "num_layers": 2, "dropout": 0.1},
)
full_model.fit(
    train_loader=train_loader, val_loader=val_loader,
    strategy=full_strategy, epochs=15,
)

Comparison Results¶

After training both models on the same data and evaluating on the test set:

Metric	LoRA (rank=8)	Full Fine-Tune
Pearson r	~0.78	~0.81
RMSE (pKd)	~1.25	~1.18
Trainable params	~0.5M	~93M
Training time	~12 min	~45 min
GPU memory	~8 GB	~24 GB

Key takeaway

LoRA achieves 96% of full fine-tuning performance while training 6x fewer parameters and using 3x less GPU memory. For most binding affinity tasks, LoRA is the recommended starting point.

Step 6: Merge and Export¶

After training with LoRA, you can merge the adapters back into the base model for deployment (no LoRA overhead at inference time).

# Merge LoRA weights into the base model
model.merge()

# Save the merged model
model.save("models/affinity_merged")

# Or push to Hub
model.push_to_hub("your-username/binding-affinity-predictor")

Full Script¶

Complete runnable code

"""Binding affinity prediction with LoRA fine-tuning."""
import numpy as np
import torch
from scipy.stats import pearsonr
from torch.utils.data import DataLoader

from molfun import MolfunStructureModel
from molfun.data import AffinityFetcher, AffinityDataset, DataSplitter
from molfun.training import LoRAFinetune

# ── Data ──────────────────────────────────────────────
fetcher = AffinityFetcher()
records = fetcher.fetch(subset="refined", max_samples=500)

dataset = AffinityDataset(records=records, max_seq_length=512)
splitter = DataSplitter(val_size=0.1, test_size=0.1, random_state=42)
train_ds, val_ds, test_ds = splitter.split(dataset, n_splits=3)

train_loader = DataLoader(train_ds, batch_size=8, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=8)
test_loader = DataLoader(test_ds, batch_size=8)

# ── Model ─────────────────────────────────────────────
model = MolfunStructureModel.from_pretrained(
    "openfold_v1",
    device="cuda",
    head="affinity",
    head_config={"hidden_dim": 256, "num_layers": 2, "dropout": 0.1},
)

# ── Train ─────────────────────────────────────────────
strategy = LoRAFinetune(
    rank=8, alpha=16.0,
    lr_lora=1e-4, lr_head=1e-3,
    warmup_steps=100, ema_decay=0.999,
)
model.fit(
    train_loader=train_loader,
    val_loader=val_loader,
    strategy=strategy,
    epochs=15,
    checkpoint_dir="checkpoints/affinity_lora",
)

# ── Evaluate ──────────────────────────────────────────
model.eval()
predictions, actuals = [], []
with torch.no_grad():
    for batch in test_loader:
        output = model.predict(batch["sequence"])
        predictions.extend(output.scores.cpu().numpy())
        actuals.extend(batch["labels"].cpu().numpy())

predictions = np.array(predictions)
actuals = np.array(actuals)
r, _ = pearsonr(actuals, predictions)
rmse = np.sqrt(np.mean((predictions - actuals) ** 2))
print(f"Pearson r: {r:.4f}, RMSE: {rmse:.4f}")

# ── Merge & Save ──────────────────────────────────────
model.merge()
model.save("models/affinity_merged")

Next Steps¶

Small dataset? See LoRA for Small Datasets for guidance on tuning LoRA rank and alpha.
Domain-specific proteins? Try Kinase Structure Refinement with curated protein collections.
Track your experiments? Add Experiment Tracking to compare runs systematically.