DataSplitter¶

Utilities for splitting datasets into train/validation/test sets using different strategies appropriate for biological data.

Quick Start¶

from molfun.data.splits import DataSplitter

# Random split (10% validation, 10% test, remainder for training)
train, val, test = DataSplitter.random(dataset, val_frac=0.1, test_frac=0.1)

# Temporal split (by deposition year; `years` holds one year per dataset entry)
train, val, test = DataSplitter.temporal(dataset, years, val_cutoff=2019, test_cutoff=2020)

# Identity-based split (sequence clustering)
train, val, test = DataSplitter.by_sequence_identity(dataset, threshold=0.3)

Class Reference¶

DataSplitter ¶

Static methods for splitting protein datasets.

All methods return (train, val, test) Subsets of the input dataset.

Usage

train, val, test = DataSplitter.random(dataset)
train, val, test = DataSplitter.by_sequence_identity(dataset, threshold=0.3)
train, val, test = DataSplitter.by_family(dataset, families)

random `staticmethod` ¶

random(dataset: Dataset, val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 42) -> tuple[Subset, Subset, Subset]

Simple random split. Fast but ignores sequence homology.
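The idea behind a seeded random split can be sketched in plain Python. This is a minimal illustration, not molfun's implementation: the helper name `random_split_indices` is hypothetical, and it returns index lists rather than Subset objects.

```python
import random


def random_split_indices(n, val_frac=0.1, test_frac=0.1, seed=42):
    """Return (train, val, test) index lists for a dataset of size n."""
    if not 0.0 <= val_frac + test_frac <= 1.0:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    indices = list(range(n))
    # A local RNG keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(indices)
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]  # everything left over goes to training
    return train, val, test


train_idx, val_idx, test_idx = random_split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

If the dataset is a PyTorch `Dataset`, index lists like these could be wrapped with `torch.utils.data.Subset(dataset, train_idx)`. Whether DataSplitter does this internally is an assumption.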

by_sequence_identity `staticmethod` ¶

by_sequence_identity(dataset, threshold: float = 0.3, val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 42, mmseqs_path: str = 'mmseqs') -> tuple[Subset, Subset, Subset]

Split by sequence clustering so that no two sequences in different splits share more than threshold identity.

Requires MMseqs2 installed (https://github.com/soedinglab/MMseqs2). Falls back to random split if MMseqs2 is not available.

Parameters:

Name	Type	Description	Default
`dataset`		Must expose a `.sequences` property returning list[str].	required
`threshold`	`float`	Sequence identity threshold (0.0–1.0). Sequences that cluster together at or above this identity are kept in the same split.	`0.3`
`val_frac`	`float`	Fraction of clusters for validation.	`0.1`
`test_frac`	`float`	Fraction of clusters for test.	`0.1`
`seed`	`int`	Random seed for cluster assignment.	`42`
`mmseqs_path`	`str`	Path to mmseqs binary.	`'mmseqs'`
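Once every entry has a cluster label, the split itself reduces to assigning whole clusters to train, validation, or test. The sketch below assumes the cluster labels are already available (for example, parsed from MMseqs2 output); the helper names `mmseqs_available` and `split_by_cluster` are hypothetical and are not part of molfun.

```python
import random
import shutil
from collections import defaultdict


def mmseqs_available(mmseqs_path="mmseqs"):
    """Return True if an executable named (or located at) mmseqs_path is found."""
    return shutil.which(mmseqs_path) is not None


def split_by_cluster(cluster_ids, val_frac=0.1, test_frac=0.1, seed=42):
    """Assign whole clusters to splits and return (train, val, test) index lists."""
    members = defaultdict(list)
    for idx, cluster_id in enumerate(cluster_ids):
        members[cluster_id].append(idx)

    clusters = sorted(members)  # fixed order before shuffling, for determinism
    random.Random(seed).shuffle(clusters)
    n_val = int(len(clusters) * val_frac)
    n_test = int(len(clusters) * test_frac)

    val_clusters = clusters[:n_val]
    test_clusters = clusters[n_val:n_val + n_test]
    train_clusters = clusters[n_val + n_test:]

    def gather(selected):
        # Collect the dataset indices belonging to the selected clusters.
        return sorted(i for c in selected for i in members[c])

    return gather(train_clusters), gather(val_clusters), gather(test_clusters)


# Hypothetical cluster labels, one per dataset entry (10 clusters, 12 entries).
cluster_ids = ["c0", "c0", "c1", "c2", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9"]
train_idx, val_idx, test_idx = split_by_cluster(cluster_ids)
```

Because the fractions apply to clusters rather than entries, the resulting split sizes are only approximate when cluster sizes vary.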

by_family `staticmethod` ¶

by_family(dataset, families: list[str], val_families: set[str] | None = None, test_families: set[str] | None = None, val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 42) -> tuple[Subset, Subset, Subset]

Split by protein family: entire families go to one split.

Parameters:

Name	Type	Description	Default
`dataset`		The dataset to split.	required
`families`	`list[str]`	Family label for each entry; must have length len(dataset).	required
`val_families`	`set[str] | None`	Explicit set of families for validation. If None, families are randomly assigned.	`None`
`test_families`	`set[str] | None`	Explicit set of families for test. If None, families are randomly assigned.	`None`
`val_frac`	`float`	Fraction of families for validation (if auto-assigning).	`0.1`
`test_frac`	`float`	Fraction of families for test (if auto-assigning).	`0.1`
`seed`	`int`	Random seed.	`42`
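The interaction between explicit family sets and automatic assignment can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper `split_by_family` is hypothetical, and it assumes that `val_frac` and `test_frac` are fractions of the number of unique families.

```python
import random


def split_by_family(families, val_families=None, test_families=None,
                    val_frac=0.1, test_frac=0.1, seed=42):
    """Return (train, val, test) index lists with each family in exactly one split."""
    val_set = set(val_families) if val_families is not None else None
    test_set = set(test_families) if test_families is not None else None
    if val_set and test_set and val_set & test_set:
        raise ValueError("a family cannot be in both validation and test")

    unique = sorted(set(families))
    taken = (val_set or set()) | (test_set or set())
    pool = [f for f in unique if f not in taken]  # families still unassigned
    random.Random(seed).shuffle(pool)

    # Auto-assign only the splits that were not given explicitly.
    if val_set is None:
        n_val = int(len(unique) * val_frac)
        val_set, pool = set(pool[:n_val]), pool[n_val:]
    if test_set is None:
        n_test = int(len(unique) * test_frac)
        test_set, pool = set(pool[:n_test]), pool[n_test:]

    train, val, test = [], [], []
    for idx, family in enumerate(families):
        if family in val_set:
            val.append(idx)
        elif family in test_set:
            test.append(idx)
        else:
            train.append(idx)
    return train, val, test


# Hypothetical family labels, one per dataset entry.
families = ["kinase", "kinase", "globin", "protease", "globin", "lectin"]
train_idx, val_idx, test_idx = split_by_family(
    families, val_families={"globin"}, test_families={"lectin"}
)
print(train_idx, val_idx, test_idx)  # [0, 1, 3] [2, 4] [5]
```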

temporal `staticmethod` ¶

temporal(dataset, years: list[int], val_cutoff: int = 2019, test_cutoff: int = 2020) -> tuple[Subset, Subset, Subset]

Temporal split based on deposition year.

Train: year < val_cutoff
Val: val_cutoff <= year < test_cutoff
Test: year >= test_cutoff

Parameters:

Name	Type	Description	Default
`dataset`		The dataset to split.	required
`years`	`list[int]`	Deposition year for each entry; must have length len(dataset).	required
`val_cutoff`	`int`	First year assigned to validation; earlier years go to training.	`2019`
`test_cutoff`	`int`	First year assigned to test.	`2020`
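The cutoff rules above translate directly into a few comparisons. The following is a minimal sketch using a hypothetical helper, `temporal_split_indices`, that returns index lists instead of Subset objects.

```python
def temporal_split_indices(years, val_cutoff=2019, test_cutoff=2020):
    """Return (train, val, test) index lists based on deposition year."""
    if val_cutoff > test_cutoff:
        raise ValueError("val_cutoff must not be later than test_cutoff")
    train, val, test = [], [], []
    for idx, year in enumerate(years):
        if year < val_cutoff:
            train.append(idx)      # strictly before the validation boundary
        elif year < test_cutoff:
            val.append(idx)        # val_cutoff <= year < test_cutoff
        else:
            test.append(idx)       # year >= test_cutoff
    return train, val, test


# Hypothetical deposition years, one per dataset entry.
years = [2015, 2018, 2019, 2020, 2021, 2019]
train_idx, val_idx, test_idx = temporal_split_indices(years)
print(train_idx, val_idx, test_idx)  # [0, 1] [2, 5] [3, 4]
```

With the default cutoffs, validation contains only entries from 2019. Widening the gap between `val_cutoff` and `test_cutoff` enlarges the validation window.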

random¶

Standard random split.

train, val, test = DataSplitter.random(
    dataset,
    val_frac=0.1,
    test_frac=0.1,
    seed=42,
)

Parameter	Type	Default	Description
`dataset`	`Dataset`	required	Dataset to split
`val_frac`	`float`	`0.1`	Fraction of entries for validation
`test_frac`	`float`	`0.1`	Fraction of entries for test; the remainder goes to training
`seed`	`int`	`42`	Random seed for reproducibility

Returns: Tuple of (train_dataset, val_dataset, test_dataset).

temporal¶

Split by deposition year to prevent data leakage from future structures. Entries before val_cutoff go to training, entries from val_cutoff up to (but not including) test_cutoff go to validation, and all later entries go to test.

train, val, test = DataSplitter.temporal(
    dataset,
    years,
    val_cutoff=2019,
    test_cutoff=2020,
)

Parameter	Type	Default	Description
`dataset`	`Dataset`	required	Dataset to split
`years`	`list[int]`	required	Deposition year for each entry; must have length len(dataset)
`val_cutoff`	`int`	`2019`	First year assigned to validation
`test_cutoff`	`int`	`2020`	First year assigned to test

Returns: Tuple of (train_dataset, val_dataset, test_dataset).

identity¶

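Cluster sequences by identity and split at the cluster level, with the goal that no test protein shares more than threshold sequence identity with any training protein. This strategy is exposed as DataSplitter.by_sequence_identity. If MMseqs2 is not installed, the method falls back to a random split, which provides no such separation.

train, val, test = DataSplitter.by_sequence_identity(
    dataset,
    threshold=0.3,
    val_frac=0.1,
    test_frac=0.1,
    seed=42,
)

Parameter	Type	Default	Description
`dataset`	`Dataset`	required	Dataset to split (must expose a `.sequences` property)
`threshold`	`float`	`0.3`	Maximum sequence identity between splits (0.0–1.0)
`val_frac`	`float`	`0.1`	Fraction of clusters for validation
`test_frac`	`float`	`0.1`	Fraction of clusters for test; the remaining clusters go to training
`seed`	`int`	`42`	Random seed for cluster assignment
`mmseqs_path`	`str`	`'mmseqs'`	Path to the mmseqs binary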

Returns: Tuple of (train_dataset, val_dataset, test_dataset).

Choosing a Split Strategy¶

Strategy	When to Use	Prevents
random	Quick experiments, baselines	Nothing (data leakage possible)
temporal	Realistic evaluation, time-series data	Temporal leakage
identity	Generalization to novel proteins	Homology leakage

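For publication-quality results, prefer identity splits with a threshold of 0.3 (30% sequence identity), a widely used cutoff for limiting homology between training and test proteins. Blind-prediction benchmarks such as CASP and CAMEO instead evaluate on newly released targets, which is closer in spirit to a temporal split.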
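Whichever strategy is used, leakage can be checked after the fact by comparing group labels (clusters or families) across splits. The sketch below is a generic check on index lists; the helper `shared_groups` is hypothetical and not part of DataSplitter.

```python
def shared_groups(groups, split_a, split_b):
    """Return the set of group labels that appear in both index lists."""
    return {groups[i] for i in split_a} & {groups[i] for i in split_b}


# Hypothetical cluster labels, one per dataset entry.
groups = ["a", "a", "b", "c", "b"]

# An entry-level split can place members of one cluster on both sides.
leaky = shared_groups(groups, [0, 2, 3], [1, 4])

# A cluster-level split keeps each cluster on one side only.
clean = shared_groups(groups, [0, 1, 3], [2, 4])

print(sorted(leaky), sorted(clean))  # ['a', 'b'] []
```

An empty result means no cluster or family is shared between the two splits. It does not by itself prove that sequences in different clusters fall below the identity threshold.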
