DataSplitter

Utilities for splitting datasets into train/validation/test sets using different strategies appropriate for biological data.

Quick Start

from molfun.data.splits import DataSplitter

splitter = DataSplitter()

# Random split
train, val, test = splitter.random(dataset, fractions=[0.8, 0.1, 0.1])

# Temporal split (by deposition date)
train, val, test = splitter.temporal(dataset, cutoff_date="2020-05-01")

# Identity-based split (sequence clustering)
train, val, test = splitter.identity(dataset, threshold=0.3)

Class Reference

DataSplitter

Static methods for splitting protein datasets.

All methods return (train, val, test) Subsets of the input dataset.

Usage

train, val, test = DataSplitter.random(dataset)
train, val, test = DataSplitter.by_sequence_identity(dataset, threshold=0.3)
train, val, test = DataSplitter.by_family(dataset, families)

random staticmethod

random(dataset: Dataset, val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 42) -> tuple[Subset, Subset, Subset]

Simple random split. Fast but ignores sequence homology.
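
To illustrate what a seeded random split does, here is a minimal index-based sketch. It is not the library's implementation: the function name `random_split_indices` is hypothetical, and it returns plain index lists rather than Subset objects. The parameter names mirror the signature above.

```python
import random


def random_split_indices(n, val_frac=0.1, test_frac=0.1, seed=42):
    """Return (train, val, test) index lists for a dataset of size n.

    Hypothetical helper for illustration; not part of molfun.
    """
    if not 0.0 <= val_frac + test_frac <= 1.0:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    indices = list(range(n))
    # A local RNG keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(indices)
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


train_idx, val_idx, test_idx = random_split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 80 10 10
```

Because entries are assigned individually, two near-identical sequences can land in different splits, which is the homology leakage the other strategies are meant to avoid.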

by_sequence_identity staticmethod

by_sequence_identity(dataset, threshold: float = 0.3, val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 42, mmseqs_path: str = 'mmseqs') -> tuple[Subset, Subset, Subset]

Split by sequence clustering so that no two sequences in different splits share more than threshold identity.

Requires MMseqs2 (https://github.com/soedinglab/MMseqs2) to be installed. If MMseqs2 is not available, the method falls back to a random split.
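
Since the fallback produces a random split, it may be worth checking for the binary before relying on an identity split. The sketch below uses only the standard library; `mmseqs_available` is a hypothetical helper, and how the library itself detects MMseqs2 is not stated here.

```python
import shutil


def mmseqs_available(mmseqs_path="mmseqs"):
    """Return True if the executable is found on PATH or at the given path.

    Hypothetical helper for illustration; not part of molfun.
    """
    return shutil.which(mmseqs_path) is not None


if not mmseqs_available():
    print("MMseqs2 not found: an identity split would fall back to random")
```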

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset | | Must expose a .sequences property returning list[str]. | required |
| threshold | float | Sequence identity threshold (0.0–1.0). Sequences whose identity exceeds this threshold are clustered together and kept in the same split. | 0.3 |
| val_frac | float | Fraction of clusters for validation. | 0.1 |
| test_frac | float | Fraction of clusters for test. | 0.1 |
| seed | int | Random seed for cluster assignment. | 42 |
| mmseqs_path | str | Path to mmseqs binary. | 'mmseqs' |
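
The cluster-level assignment can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes the clustering is already available as two-column, tab-separated text (representative, member), the layout of the `_cluster.tsv` file that `mmseqs easy-cluster` writes. The names `parse_cluster_tsv` and `split_by_cluster` are hypothetical, and index lists stand in for Subset objects.

```python
import random
from collections import defaultdict


def parse_cluster_tsv(text):
    """Map each member ID to its cluster representative.

    Assumes two tab-separated columns per line: representative, member.
    """
    member_to_rep = {}
    for line in text.strip().splitlines():
        rep, member = line.split("\t")
        member_to_rep[member] = rep
    return member_to_rep


def split_by_cluster(seq_ids, member_to_rep, val_frac=0.1, test_frac=0.1, seed=42):
    """Assign whole clusters to train/val/test and return index lists."""
    clusters = defaultdict(list)
    for idx, seq_id in enumerate(seq_ids):
        clusters[member_to_rep[seq_id]].append(idx)

    reps = sorted(clusters)  # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(reps)
    n_val = int(len(reps) * val_frac)
    n_test = int(len(reps) * test_frac)

    val = [i for rep in reps[:n_val] for i in clusters[rep]]
    test = [i for rep in reps[n_val:n_val + n_test] for i in clusters[rep]]
    train = [i for rep in reps[n_val + n_test:] for i in clusters[rep]]
    return train, val, test


# Toy clustering: 7 sequences in 5 clusters (a+b, c, d+e, f, g).
tsv = "a\ta\na\tb\nc\tc\nd\td\nd\te\nf\tf\ng\tg\n"
seq_ids = ["a", "b", "c", "d", "e", "f", "g"]
mapping = parse_cluster_tsv(tsv)
train_idx, val_idx, test_idx = split_by_cluster(
    seq_ids, mapping, val_frac=0.2, test_frac=0.2
)
```

Note that the fractions apply to clusters, not to individual sequences, so the resulting split sizes are only approximate when cluster sizes vary.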

by_family staticmethod

by_family(dataset, families: list[str], val_families: set[str] | None = None, test_families: set[str] | None = None, val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 42) -> tuple[Subset, Subset, Subset]

Split by protein family: entire families go to one split.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset | | The dataset to split. | required |
| families | list[str] | list[str] of length len(dataset), family label per entry. | required |
| val_families | set[str] \| None | Explicit set of families for validation. If None, families are randomly assigned. | None |
| test_families | set[str] \| None | Explicit set of families for test. If None, families are randomly assigned. | None |
| val_frac | float | Fraction of families for validation (if auto-assigning). | 0.1 |
| test_frac | float | Fraction of families for test (if auto-assigning). | 0.1 |
| seed | int | Random seed. | 42 |
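
The interaction between explicit and auto-assigned families can be sketched as below. This is one plausible reading of the parameters, not the library's implementation: `split_by_family` is a hypothetical standalone function, it returns index lists, and the handling of the case where only one of the two sets is given is an assumption.

```python
import random


def split_by_family(families, val_families=None, test_families=None,
                    val_frac=0.1, test_frac=0.1, seed=42):
    """Return (train, val, test) index lists with whole families per split."""
    val_set = set(val_families) if val_families is not None else None
    test_set = set(test_families) if test_families is not None else None

    # Families not claimed explicitly are candidates for random assignment.
    pool = sorted(set(families) - (val_set or set()) - (test_set or set()))
    random.Random(seed).shuffle(pool)
    n_total = len(set(families))

    if val_set is None:
        n_val = int(n_total * val_frac)
        val_set, pool = set(pool[:n_val]), pool[n_val:]
    if test_set is None:
        n_test = int(n_total * test_frac)
        test_set, pool = set(pool[:n_test]), pool[n_test:]
    if val_set & test_set:
        raise ValueError("a family cannot be in both validation and test")

    train, val, test = [], [], []
    for idx, fam in enumerate(families):
        if fam in test_set:
            test.append(idx)
        elif fam in val_set:
            val.append(idx)
        else:
            train.append(idx)
    return train, val, test


labels = ["kinase", "kinase", "globin", "protease", "globin", "lectin"]
train_idx, val_idx, test_idx = split_by_family(
    labels, val_families={"globin"}, test_families={"lectin"}
)
print(train_idx, val_idx, test_idx)  # [0, 1, 3] [2, 4] [5]
```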

temporal staticmethod

temporal(dataset, years: list[int], val_cutoff: int = 2019, test_cutoff: int = 2020) -> tuple[Subset, Subset, Subset]

Temporal split based on deposition year.

Train: year < val_cutoff
Val: val_cutoff <= year < test_cutoff
Test: year >= test_cutoff

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset | | The dataset to split. | required |
| years | list[int] | list[int] of length len(dataset), year per entry. | required |
| val_cutoff | int | Year boundary for validation. | 2019 |
| test_cutoff | int | Year boundary for test. | 2020 |
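
The year rule stated above translates directly into three filters. The sketch below is illustrative only: `temporal_split_indices` is a hypothetical name, and it returns index lists rather than Subset objects.

```python
def temporal_split_indices(years, val_cutoff=2019, test_cutoff=2020):
    """Return (train, val, test) index lists based on deposition year."""
    if val_cutoff > test_cutoff:
        raise ValueError("val_cutoff must not be later than test_cutoff")
    train = [i for i, y in enumerate(years) if y < val_cutoff]
    val = [i for i, y in enumerate(years) if val_cutoff <= y < test_cutoff]
    test = [i for i, y in enumerate(years) if y >= test_cutoff]
    return train, val, test


years = [2015, 2019, 2021, 2018, 2020, 2019]
train_idx, val_idx, test_idx = temporal_split_indices(years)
print(train_idx, val_idx, test_idx)  # [0, 3] [1, 5] [2, 4]
```

With the default cutoffs, the validation set contains only entries from 2019, so its size depends entirely on how many structures were deposited that year.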

random

Standard random split.

train, val, test = splitter.random(
    dataset,
    fractions=[0.8, 0.1, 0.1],
    seed=42,
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset | Dataset | required | Dataset to split |
| fractions | list[float] | [0.8, 0.1, 0.1] | Train/val/test fractions (must sum to 1.0) |
| seed | int | 42 | Random seed for reproducibility |

Returns: Tuple of (train_dataset, val_dataset, test_dataset).
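
Because the fractions are floats, a sum-to-1.0 check is best done with a tolerance rather than exact equality. The sketch below shows one way to validate fractions and turn them into split sizes; `fractions_to_counts` is a hypothetical helper, and giving the rounding remainder to the training set is an assumption.

```python
import math


def fractions_to_counts(n, fractions):
    """Return (n_train, n_val, n_test) for a dataset of size n."""
    if len(fractions) != 3 or any(f < 0 for f in fractions):
        raise ValueError("expected three non-negative fractions")
    if not math.isclose(sum(fractions), 1.0, abs_tol=1e-9):
        raise ValueError("fractions must sum to 1.0")
    n_val = int(n * fractions[1])
    n_test = int(n * fractions[2])
    # Any remainder from rounding down goes to the training set.
    return n - n_val - n_test, n_val, n_test


print(fractions_to_counts(1000, [0.8, 0.1, 0.1]))  # (800, 100, 100)
```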


temporal

Split by deposition date to prevent data leakage from future structures.

train, val, test = splitter.temporal(
    dataset,
    cutoff_date="2020-05-01",
    val_cutoff_date="2021-01-01",
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset | Dataset | required | Dataset to split (must have date metadata) |
| cutoff_date | str | required | Train/val boundary date (ISO format) |
| val_cutoff_date | str \| None | None | Val/test boundary date. If None, remaining data is split 50/50. |

Returns: Tuple of (train_dataset, val_dataset, test_dataset).
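
A date-based split with an optional second boundary could look like the sketch below. Several details are assumptions, since they are not specified here: entries dated exactly on a boundary go to the later split, and the 50/50 case halves the remaining entries in date order. `split_by_date` is a hypothetical name and returns index lists.

```python
from datetime import date


def split_by_date(dates, cutoff_date, val_cutoff_date=None):
    """Return (train, val, test) index lists from ISO date strings."""
    parsed = [date.fromisoformat(d) for d in dates]
    cutoff = date.fromisoformat(cutoff_date)

    train = [i for i, d in enumerate(parsed) if d < cutoff]
    # Entries on or after the cutoff, oldest first.
    later = sorted(
        (i for i, d in enumerate(parsed) if d >= cutoff),
        key=lambda i: parsed[i],
    )

    if val_cutoff_date is None:
        # Assumption: halve the remaining entries chronologically.
        half = len(later) // 2
        return train, later[:half], later[half:]

    val_cutoff = date.fromisoformat(val_cutoff_date)
    val = [i for i in later if parsed[i] < val_cutoff]
    test = [i for i in later if parsed[i] >= val_cutoff]
    return train, val, test


dates = ["2019-03-14", "2020-05-01", "2020-11-30", "2021-02-02", "2018-07-09"]
print(split_by_date(dates, "2020-05-01", "2021-01-01"))  # ([0, 4], [1, 2], [3])
print(split_by_date(dates, "2020-05-01"))                # ([0, 4], [1], [2, 3])
```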


identity

Cluster sequences by identity and split at the cluster level. Ensures no test protein is similar to any training protein above the threshold.

train, val, test = splitter.identity(
    dataset,
    threshold=0.3,
    fractions=[0.8, 0.1, 0.1],
    seed=42,
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset | Dataset | required | Dataset to split (must have sequence data) |
| threshold | float | 0.3 | Maximum sequence identity between splits (0.0–1.0) |
| fractions | list[float] | [0.8, 0.1, 0.1] | Approximate train/val/test fractions |
| seed | int | 42 | Random seed |

Returns: Tuple of (train_dataset, val_dataset, test_dataset).


Choosing a Split Strategy

| Strategy | When to Use | Prevents |
| --- | --- | --- |
| random | Quick experiments, baselines | Nothing (data leakage possible) |
| temporal | Realistic evaluation, time-series data | Temporal leakage |
| identity | Generalization to novel proteins | Homology leakage |

For publication-quality results, prefer identity splits with a threshold of 0.3 (30% sequence identity), a cutoff commonly used when benchmarking protein structure prediction methods (e.g., CASP, CAMEO).
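
Whichever strategy is chosen, it can be useful to verify that no group (cluster, family, or similar label) ends up in more than one split. The following sketch is a generic check over index lists and is not part of the library; `shared_groups` and the toy labels are hypothetical.

```python
def shared_groups(groups, train_idx, val_idx, test_idx):
    """Return {label: [split names]} for labels found in more than one split."""
    splits = {"train": train_idx, "val": val_idx, "test": test_idx}
    seen = {}  # label -> set of split names it appears in
    for name, indices in splits.items():
        for i in indices:
            seen.setdefault(groups[i], set()).add(name)
    return {g: sorted(names) for g, names in seen.items() if len(names) > 1}


groups = ["c1", "c1", "c2", "c3", "c2"]

# Leaky split: c1 spans train/val and c2 spans train/test.
print(shared_groups(groups, [0, 2], [1], [3, 4]))
# {'c1': ['train', 'val'], 'c2': ['test', 'train']}

# Clean split: every cluster stays within one split.
print(shared_groups(groups, [0, 1], [2, 4], [3]))  # {}
```

An empty result means the grouping is respected; it does not by itself prove the absence of homology between groups, which depends on how the labels were produced.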