YAML Pipelines¶

Intermediate ~20 min

Define reproducible, end-to-end workflows with Pipeline.from_yaml(). A single YAML file describes every step --- from data fetching through preprocessing, training, and evaluation --- making experiments easy to share, version, and rerun.

What You Will Learn¶

Write YAML pipeline recipes with Molfun's pipeline format
Use built-in steps: fetch, preprocess, train, evaluate
Run pipelines from Python or the CLI
Write custom pipeline steps
Organize recipes for common tasks

Pipeline Architecture¶

graph LR
    YAML["recipe.yaml"] --> PARSE["Pipeline.from_yaml()"]
    PARSE --> S1["Step 1: fetch"]
    S1 --> S2["Step 2: preprocess"]
    S2 --> S3["Step 3: train"]
    S3 --> S4["Step 4: evaluate"]
    S4 --> OUT["Results + Artifacts"]

    style YAML fill:#3b82f6,stroke:#2563eb,color:#ffffff
    style PARSE fill:#7c3aed,stroke:#6d28d9,color:#ffffff
    style S1 fill:#0d9488,stroke:#0f766e,color:#ffffff
    style S2 fill:#d97706,stroke:#b45309,color:#ffffff
    style S3 fill:#c026d3,stroke:#a21caf,color:#ffffff
    style S4 fill:#16a34a,stroke:#15803d,color:#ffffff
    style OUT fill:#0891b2,stroke:#0e7490,color:#ffffff

Step 1: Your First Pipeline¶

Create a file called stability_pipeline.yaml:

stability_pipeline.yaml

name: stability-prediction
description: Train a stability predictor with HeadOnly strategy

# ── Data ──────────────────────────────────────────────────
fetch:
  type: csv
  path: data/stability_data.csv
  columns:
    sequence: sequence
    label: ddg

preprocess:
  max_length: 512
  split:
    test_size: 0.2
    random_state: 42
  loader:
    batch_size: 4
    shuffle: true

# ── Model ─────────────────────────────────────────────────
model:
  pretrained: openfold_v1
  device: cuda
  head: affinity
  head_config:
    hidden_dim: 256
    num_layers: 2
    dropout: 0.1

# ── Training ──────────────────────────────────────────────
train:
  strategy: head_only
  strategy_config:
    lr: 1.0e-3
    weight_decay: 1.0e-4
  epochs: 20
  checkpoint_dir: checkpoints/stability

# ── Evaluation ────────────────────────────────────────────
evaluate:
  metrics:
    - pearson
    - rmse
    - mae
  save_predictions: results/stability_predictions.csv

# ── Output ────────────────────────────────────────────────
output:
  save_model: models/stability_headonly
  # push_to_hub: your-username/stability-predictor  # Uncomment to push

Step 2: Run the Pipeline¶

PythonCLI

from molfun.pipelines import Pipeline

pipeline = Pipeline.from_yaml("stability_pipeline.yaml")
results = pipeline.run()

print(f"Pearson r: {results.metrics['pearson']:.4f}")
print(f"RMSE:      {results.metrics['rmse']:.4f}")
print(f"Model saved to: {results.model_path}")

molfun pipeline run stability_pipeline.yaml

CLI output

[Pipeline] stability-prediction
[Step 1/4] Fetching data from data/stability_data.csv...
  Loaded 200 samples
[Step 2/4] Preprocessing...
  Train: 160 | Val: 40
[Step 3/4] Training (HeadOnly, 20 epochs)...
  Epoch  1/20 | Train: 4.21 | Val: 3.89
  Epoch 10/20 | Train: 0.82 | Val: 0.94
  Epoch 20/20 | Train: 0.41 | Val: 0.52
[Step 4/4] Evaluating...
  Pearson r: 0.8432
  RMSE:      0.7821
[Done] Model saved to models/stability_headonly

Step 3: Recipe Examples¶

Binding Affinity with LoRA¶

affinity_lora.yaml

name: affinity-lora
description: Binding affinity prediction with LoRA fine-tuning

fetch:
  type: pdbbind
  subset: refined
  max_samples: 500

preprocess:
  max_length: 512
  split:
    val_size: 0.1
    test_size: 0.1
    random_state: 42
  loader:
    batch_size: 8
    shuffle: true

model:
  pretrained: openfold_v1
  device: cuda
  head: affinity
  head_config:
    hidden_dim: 256
    num_layers: 2
    dropout: 0.1

train:
  strategy: lora
  strategy_config:
    rank: 8
    alpha: 16.0
    lr_lora: 1.0e-4
    lr_head: 1.0e-3
    warmup_steps: 100
    ema_decay: 0.999
  epochs: 15
  checkpoint_dir: checkpoints/affinity_lora

evaluate:
  metrics:
    - pearson
    - rmse
  save_predictions: results/affinity_predictions.csv

tracking:
  backend: wandb
  project: affinity-experiments

output:
  save_model: models/affinity_lora
  merge_lora: true

kinase_refinement.yaml

name: kinase-refinement
description: Refine kinase structures with partial fine-tuning

fetch:
  type: collection
  name: kinases_human

preprocess:
  max_length: 600
  split:
    test_size: 0.2
    random_state: 42
  loader:
    batch_size: 2
    shuffle: true

model:
  pretrained: openfold_v1
  device: cuda

train:
  strategy: partial
  strategy_config:
    n_unfrozen_blocks: 4
    lr: 5.0e-5
  epochs: 25
  checkpoint_dir: checkpoints/kinase

evaluate:
  metrics:
    - fape
    - lddt
  save_predictions: results/kinase_structures/

tracking:
  backend: mlflow
  experiment: kinase-refinement

output:
  save_model: models/kinase_refined

Step 4: Adding Tracking¶

Add a tracking section to any recipe to enable experiment tracking:

tracking:
  backend: wandb               # wandb, comet, or mlflow
  project: my-project          # Project / experiment name

For multiple backends, use a list:

tracking:
  backends:
    - type: wandb
      project: my-project
    - type: mlflow
      experiment: my-project

This automatically creates a CompositeTracker under the hood.

Step 5: Custom Pipeline Steps¶

You can extend the pipeline with custom steps by writing a Python function and referencing it in the YAML.

Define a Custom Step¶

custom_steps.py

from molfun.pipelines import register_step

@register_step("filter_by_length")
def filter_by_length(data, min_length=50, max_length=500):
    """Filter sequences by length."""
    filtered = [
        record for record in data
        if min_length <= len(record.sequence) <= max_length
    ]
    print(f"Filtered: {len(data)} -> {len(filtered)} sequences")
    return filtered

Use in YAML¶

pipeline_with_custom_step.yaml

name: filtered-stability
description: Stability prediction with length filtering

custom_steps:
  - custom_steps.py              # Load custom step definitions

fetch:
  type: csv
  path: data/stability_data.csv
  columns:
    sequence: sequence
    label: ddg

# Custom step inserted between fetch and preprocess
filter_by_length:
  min_length: 50
  max_length: 400

preprocess:
  max_length: 400
  split:
    test_size: 0.2
    random_state: 42
  loader:
    batch_size: 4
    shuffle: true

model:
  pretrained: openfold_v1
  device: cuda
  head: affinity
  head_config:
    hidden_dim: 256
    num_layers: 2

train:
  strategy: head_only
  strategy_config:
    lr: 1.0e-3
  epochs: 20

evaluate:
  metrics:
    - pearson
    - rmse

output:
  save_model: models/filtered_stability

Step 6: Pipeline Composition¶

Run multiple pipelines in sequence, passing outputs between them:

from molfun.pipelines import Pipeline

# Train pipeline
train_pipeline = Pipeline.from_yaml("train_recipe.yaml")
train_results = train_pipeline.run()

# Evaluation pipeline (uses the model from training)
eval_recipe = {
    "name": "evaluation",
    "model": {"path": train_results.model_path},
    "fetch": {"type": "csv", "path": "data/test_set.csv"},
    "evaluate": {"metrics": ["pearson", "rmse", "mae"]},
}
eval_pipeline = Pipeline.from_dict(eval_recipe)
eval_results = eval_pipeline.run()

YAML Reference¶

Complete YAML schema

Section	Key	Type	Description
Top-level	`name`	string	Pipeline name
	`description`	string	Human-readable description
	`custom_steps`	list[string]	Python files to load custom steps from
fetch	`type`	string	`csv`, `pdbbind`, `collection`, `pdb_ids`
	`path`	string	Path to CSV file (for `csv` type)
	`subset`	string	PDBbind subset: `refined` or `general`
	`name`	string	Collection name (for `collection` type)
	`pdb_ids`	list[string]	PDB IDs to fetch (for `pdb_ids` type)
	`max_samples`	int	Maximum number of samples
preprocess	`max_length`	int	Maximum sequence length
	`split.test_size`	float	Test split fraction
	`split.val_size`	float	Validation split fraction
	`split.random_state`	int	Random seed
	`loader.batch_size`	int	DataLoader batch size
	`loader.shuffle`	bool	Shuffle training data
model	`pretrained`	string	Pretrained model name
	`device`	string	`cuda` or `cpu`
	`head`	string	Prediction head type
	`head_config`	dict	Head configuration
train	`strategy`	string	`full`, `head_only`, `lora`, `partial`
	`strategy_config`	dict	Strategy-specific parameters
	`epochs`	int	Number of training epochs
	`checkpoint_dir`	string	Directory for checkpoints
evaluate	`metrics`	list[string]	Metric names: `pearson`, `rmse`, `mae`, `fape`, `lddt`
	`save_predictions`	string	Path to save predictions
tracking	`backend`	string	`wandb`, `comet`, `mlflow`
	`project`	string	Project name
	`experiment`	string	Experiment name (MLflow)
output	`save_model`	string	Path to save trained model
	`merge_lora`	bool	Merge LoRA weights before saving
	`push_to_hub`	string	HuggingFace Hub repo ID

Best Practices¶

Version your recipes

Store YAML recipes in version control alongside your code. Each recipe is a complete, reproducible experiment definition.

Use environment variables for paths

fetch:
  type: csv
  path: ${DATA_DIR}/stability_data.csv

output:
  save_model: ${MODEL_DIR}/stability_headonly

Start simple, add complexity

Begin with the minimal recipe (fetch + model + train) and add preprocessing, evaluation, and tracking as needed.

Next Steps¶

First time? Start with the Stability Prediction tutorial and convert it to a YAML pipeline.
Need custom steps? The custom step API supports any Python function.
Team workflows? Combine with Experiment Tracking to share results across team members.