DataSplitter
Utilities for splitting datasets into train/validation/test sets using different strategies appropriate for biological data.
Quick Start
from molfun.data.splits import DataSplitter
splitter = DataSplitter()
# Random split
train, val, test = splitter.random(dataset, fractions=[0.8, 0.1, 0.1])
# Temporal split (by deposition date)
train, val, test = splitter.temporal(dataset, cutoff_date="2020-05-01")
# Identity-based split (sequence clustering)
train, val, test = splitter.identity(dataset, threshold=0.3)
Class Reference
DataSplitter
Static methods for splitting protein datasets.
All methods return (train, val, test) Subsets of the input dataset.
Usage:
train, val, test = DataSplitter.random(dataset)
train, val, test = DataSplitter.by_sequence_identity(dataset, threshold=0.3)
train, val, test = DataSplitter.by_family(dataset, families)
random (staticmethod)
random(dataset: Dataset, val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 42) -> tuple[Subset, Subset, Subset]
Simple random split. Fast but ignores sequence homology.
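To illustrate what a seeded random split involves, here is a minimal sketch using only the standard library. `random_split_indices` is a hypothetical helper, not part of molfun. It works on indices and assumes the fractions are applied with integer truncation, which may differ from the library's own rounding.

```python
import random


def random_split_indices(n, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle indices 0..n-1 with a fixed seed and cut them into three lists."""
    if not 0.0 <= val_frac + test_frac <= 1.0:
        raise ValueError("val_frac + test_frac must lie in [0, 1]")
    indices = list(range(n))
    # A local Random instance keeps the global RNG state untouched.
    random.Random(seed).shuffle(indices)
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


# Same seed, same split: the result is reproducible across runs.
train_idx, val_idx, test_idx = random_split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

Assuming the dataset is a PyTorch `Dataset`, each index list could be wrapped with `torch.utils.data.Subset(dataset, indices)` to obtain subsets like the ones this reference describes.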
by_sequence_identity (staticmethod)
by_sequence_identity(dataset, threshold: float = 0.3, val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 42, mmseqs_path: str = 'mmseqs') -> tuple[Subset, Subset, Subset]
Split by sequence clustering so that no two sequences in different splits share more than threshold identity.
Requires MMseqs2 to be installed (https://github.com/soedinglab/MMseqs2). Falls back to a random split if MMseqs2 is not available, in which case the identity guarantee no longer applies.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | | Must expose a | required |
| threshold | float | Sequence identity threshold (0.0–1.0). Clusters above this threshold are kept together. | 0.3 |
| val_frac | float | Fraction of clusters for validation. | 0.1 |
| test_frac | float | Fraction of clusters for test. | 0.1 |
| seed | int | Random seed for cluster assignment. | 42 |
| mmseqs_path | str | Path to mmseqs binary. | 'mmseqs' |
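The parameters above suggest a two-step procedure: cluster the sequences, then assign whole clusters to splits. The sketch below illustrates only the second step, plus a lookup for the binary. It assumes cluster labels are already available as one label per dataset entry. `mmseqs_available` and `split_by_cluster` are hypothetical names, and the actual molfun implementation may differ.

```python
import random
import shutil
from collections import defaultdict


def mmseqs_available(mmseqs_path="mmseqs"):
    """Return True if the binary can be located (on PATH or as a direct path)."""
    return shutil.which(mmseqs_path) is not None


def split_by_cluster(cluster_ids, val_frac=0.1, test_frac=0.1, seed=42):
    """Assign whole clusters to train/val/test and return index lists."""
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    # Sort before shuffling so the result depends only on the seed.
    clusters = sorted(members)
    random.Random(seed).shuffle(clusters)

    # Fractions apply to the number of clusters, not the number of entries.
    n_val = int(len(clusters) * val_frac)
    n_test = int(len(clusters) * test_frac)
    val_clusters = clusters[:n_val]
    test_clusters = clusters[n_val:n_val + n_test]
    train_clusters = clusters[n_val + n_test:]

    def gather(chosen):
        return sorted(i for cid in chosen for i in members[cid])

    return gather(train_clusters), gather(val_clusters), gather(test_clusters)


# Hypothetical labels: 100 entries spread over 20 clusters of 5 entries each.
cluster_ids = [f"c{i % 20}" for i in range(100)]
train_idx, val_idx, test_idx = split_by_cluster(cluster_ids)
```

Because the fractions count clusters, the resulting split sizes in entries can drift from 10% when cluster sizes are uneven.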
by_family (staticmethod)
by_family(dataset, families: list[str], val_families: set[str] | None = None, test_families: set[str] | None = None, val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 42) -> tuple[Subset, Subset, Subset]
Split by protein family: entire families go to one split.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | | The dataset to split. | required |
| families | list[str] | list[str] of length len(dataset), family label per entry. | required |
| val_families | set[str] \| None | Explicit set of families for validation. If None, families are randomly assigned. | None |
| test_families | set[str] \| None | Explicit set of families for test. If None, families are randomly assigned. | None |
| val_frac | float | Fraction of families for validation (if auto-assigning). | 0.1 |
| test_frac | float | Fraction of families for test (if auto-assigning). | 0.1 |
| seed | int | Random seed. | 42 |
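The interaction between explicit family sets and automatic assignment can be sketched as follows. This is an illustration under assumptions, not the library's code: `split_by_family` is a hypothetical standalone function that returns index lists, and it assumes the fractions are taken over the total number of distinct families.

```python
import random


def split_by_family(families, val_families=None, test_families=None,
                    val_frac=0.1, test_frac=0.1, seed=42):
    """Send every entry of a family to the same split; return index lists."""
    unique = sorted(set(families))
    val_set = set(val_families or ())
    test_set = set(test_families or ())
    if val_set & test_set:
        raise ValueError("a family cannot be in both val and test")

    # Families that are not pinned explicitly are shuffled, then assigned.
    free = [f for f in unique if f not in val_set and f not in test_set]
    random.Random(seed).shuffle(free)
    if val_families is None:
        n_val = int(len(unique) * val_frac)
        val_set = set(free[:n_val])
        free = free[n_val:]
    if test_families is None:
        n_test = int(len(unique) * test_frac)
        test_set = set(free[:n_test])

    train, val, test = [], [], []
    for idx, fam in enumerate(families):
        if fam in val_set:
            val.append(idx)
        elif fam in test_set:
            test.append(idx)
        else:
            train.append(idx)
    return train, val, test


# Made-up family labels, one per dataset entry.
families = ["kinase", "kinase", "globin", "protease", "globin", "lectin"]
train_idx, val_idx, test_idx = split_by_family(
    families, val_families={"globin"}, test_families={"lectin"}
)
print(train_idx, val_idx, test_idx)  # [0, 1, 3] [2, 4] [5]
```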
temporal (staticmethod)
temporal(dataset, years: list[int], val_cutoff: int = 2019, test_cutoff: int = 2020) -> tuple[Subset, Subset, Subset]
Temporal split based on deposition year.
- Train: year < val_cutoff
- Val: val_cutoff <= year < test_cutoff
- Test: year >= test_cutoff
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | | The dataset to split. | required |
| years | list[int] | list[int] of length len(dataset), year per entry. | required |
| val_cutoff | int | Year boundary for validation. | 2019 |
| test_cutoff | int | Year boundary for test. | 2020 |
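The three year ranges stated above translate directly into index lists. The following sketch uses a hypothetical helper, `temporal_split_indices`, that mirrors the documented rule; it is not the library's implementation.

```python
def temporal_split_indices(years, val_cutoff=2019, test_cutoff=2020):
    """Bucket entries by year: train < val_cutoff <= val < test_cutoff <= test."""
    if val_cutoff > test_cutoff:
        raise ValueError("val_cutoff must not exceed test_cutoff")
    train, val, test = [], [], []
    for idx, year in enumerate(years):
        if year < val_cutoff:
            train.append(idx)
        elif year < test_cutoff:
            val.append(idx)
        else:
            test.append(idx)
    return train, val, test


# Made-up deposition years, one per dataset entry.
years = [2015, 2019, 2021, 2018, 2020, 2019]
train_idx, val_idx, test_idx = temporal_split_indices(years)
print(train_idx, val_idx, test_idx)  # [0, 3] [1, 5] [2, 4]
```

With the default cutoffs, the validation set contains only entries from 2019, so it can be small or empty for datasets with few structures from that year.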
random
Standard random split.
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset | Dataset | required | Dataset to split |
| fractions | list[float] | [0.8, 0.1, 0.1] | Train/val/test fractions (must sum to 1.0) |
| seed | int | 42 | Random seed for reproducibility |
Returns: Tuple of (train_dataset, val_dataset, test_dataset).
temporal
Split by deposition date to prevent data leakage from future structures.
train, val, test = splitter.temporal(
dataset,
cutoff_date="2020-05-01",
val_cutoff_date="2021-01-01",
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset | Dataset | required | Dataset to split (must have date metadata) |
| cutoff_date | str | required | Train/val boundary date (ISO format) |
| val_cutoff_date | str \| None | None | Val/test boundary date. If None, remaining data is split 50/50. |
Returns: Tuple of (train_dataset, val_dataset, test_dataset).
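A date-based variant of this split can be sketched with `datetime.date.fromisoformat`. `split_by_date` is a hypothetical helper, and two details are assumptions not confirmed by this page: entries dated exactly on a boundary go to the later split, and the 50/50 fallback orders the remaining entries by date before halving them.

```python
from datetime import date


def split_by_date(dates, cutoff_date, val_cutoff_date=None):
    """Split entries by ISO date strings; return (train, val, test) index lists."""
    cutoff = date.fromisoformat(cutoff_date)
    parsed = [date.fromisoformat(d) for d in dates]

    train = [i for i, d in enumerate(parsed) if d < cutoff]
    # Everything on or after the cutoff, ordered from oldest to newest.
    later = sorted(
        (i for i, d in enumerate(parsed) if d >= cutoff),
        key=lambda i: parsed[i],
    )

    if val_cutoff_date is None:
        # Assumed fallback: older half to validation, newer half to test.
        half = len(later) // 2
        return train, later[:half], later[half:]

    val_cutoff = date.fromisoformat(val_cutoff_date)
    val = [i for i in later if parsed[i] < val_cutoff]
    test = [i for i in later if parsed[i] >= val_cutoff]
    return train, val, test


# Made-up deposition dates, one per dataset entry.
dates = ["2019-03-10", "2020-06-01", "2021-02-15", "2020-04-30", "2020-11-20"]
with_boundary = split_by_date(dates, "2020-05-01", "2021-01-01")
without_boundary = split_by_date(dates, "2020-05-01")
print(with_boundary)     # ([0, 3], [1, 4], [2])
print(without_boundary)  # ([0, 3], [1], [4, 2])
```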
identity
Cluster sequences by identity and split at the cluster level. Ensures no test protein is similar to any training protein above the threshold.
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset | Dataset | required | Dataset to split (must have sequence data) |
| threshold | float | 0.3 | Maximum sequence identity between splits (0.0–1.0) |
| fractions | list[float] | [0.8, 0.1, 0.1] | Approximate train/val/test fractions |
| seed | int | 42 | Random seed |
Returns: Tuple of (train_dataset, val_dataset, test_dataset).
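One way to sanity-check a cluster-level split is to confirm that no cluster contributes entries to more than one split. The sketch below assumes you have one cluster label per entry and the index lists of each split. `find_cluster_leaks` is a hypothetical helper, not a molfun function.

```python
def find_cluster_leaks(cluster_ids, train_idx, val_idx, test_idx):
    """Return the set of cluster IDs that occur in more than one split."""
    first_seen = {}  # cluster ID -> name of the first split it appeared in
    leaks = set()
    splits = (("train", train_idx), ("val", val_idx), ("test", test_idx))
    for name, indices in splits:
        for i in indices:
            cid = cluster_ids[i]
            if first_seen.setdefault(cid, name) != name:
                leaks.add(cid)
    return leaks


# Made-up cluster labels, one per dataset entry.
cluster_ids = ["a", "a", "b", "c", "c", "d"]
clean = find_cluster_leaks(cluster_ids, [0, 1, 2], [3, 4], [5])
leaky = find_cluster_leaks(cluster_ids, [0, 2], [1, 3], [4, 5])
print(clean)          # set()
print(sorted(leaky))  # ['a', 'c']
```

An empty result only shows that clusters were kept intact. Whether sequences in different clusters really fall below the identity threshold depends on the clustering itself.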
Choosing a Split Strategy
| Strategy | When to Use | Prevents |
|---|---|---|
| random | Quick experiments, baselines | Nothing (data leakage possible) |
| temporal | Realistic evaluation, time-series data | Temporal leakage |
| identity | Generalization to novel proteins | Homology leakage |
For publication-quality results, prefer identity splits with a threshold of 0.3 (30% sequence identity), which is a common choice in protein structure prediction benchmarks (e.g., CASP, CAMEO).