Collections

Curated dataset collections for common benchmarks and training sets. Collections provide a convenient way to fetch well-known datasets with a single function call.

Quick Start

from molfun.data.collections import (
    count_collection,
    fetch_collection,
    list_collections,
)

# See what's available (list_collections() returns CollectionSpec objects)
print([spec.name for spec in list_collections()])
# ["casp14", "casp15", "cameo", "pdbbind_refined", ...]

# Fetch a collection
dataset = fetch_collection("casp15", split="test")

# Check collection size before downloading
n = count_collection("pdbbind_refined")
print(f"PDBbind refined set: {n} complexes")

Functions

fetch_collection

fetch_collection(name: str, *, cache_dir: str | None = None, fmt: str = 'cif', max_structures: int | None = None, resolution_max: float | None = None, deduplicate: bool = False, identity: float = 0.3, workers: int = 4, progress: bool = False, storage_options: dict | None = None) -> list[str]

Fetch structures for a named collection.

Parameters:

Name	Type	Description	Default
`name`	`str`	Collection name (see `list_collections()`).	required
`cache_dir`	`str \| None`	Override cache directory.	`None`
`fmt`	`str`	File format (`"cif"` or `"pdb"`).	`'cif'`
`max_structures`	`int \| None`	Override collection default max.	`None`
`resolution_max`	`float \| None`	Override collection default resolution.	`None`
`deduplicate`	`bool`	If True, cluster by sequence and keep one representative.	`False`
`identity`	`float`	Sequence identity threshold for deduplication.	`0.3`
`workers`	`int`	Number of parallel download threads.	`4`
`progress`	`bool`	Show tqdm progress bar.	`False`
`storage_options`	`dict \| None`	fsspec options for remote storage.	`None`
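
The `deduplicate` and `identity` parameters describe clustering by sequence and keeping one representative per cluster. How molfun implements this is not stated here, so the following is only a minimal sketch of the general idea using greedy clustering. The function names `toy_identity` and `greedy_representatives` are hypothetical, and `difflib.SequenceMatcher.ratio` is a rough stand-in for a real alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def toy_identity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_representatives(sequences: dict[str, str], identity: float = 0.3) -> list[str]:
    """Keep one ID per cluster.

    A sequence is dropped if it matches an existing representative at or
    above `identity`; otherwise it becomes a new representative.
    """
    representatives: list[str] = []
    # Visit longer sequences first so they become the representatives.
    for seq_id in sorted(sequences, key=lambda k: len(sequences[k]), reverse=True):
        seq = sequences[seq_id]
        if all(toy_identity(seq, sequences[rep]) < identity for rep in representatives):
            representatives.append(seq_id)
    return representatives


# Hypothetical toy input: two near-identical chains and one unrelated chain.
toy = {
    "1abc_A": "MKTAYIAKQRQISFVKSHFSRQ",
    "2xyz_A": "MKTAYIAKQRQISFVKSHFSRA",
    "3def_A": "GGGGGGGGGGPPPPPPPPPP",
}
kept = greedy_representatives(toy, identity=0.9)
print(kept)
```

With a threshold of 0.9 the two near-identical chains collapse into one representative, while the unrelated chain is kept. A lower threshold such as the documented default of `0.3` merges more aggressively.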

Returns:

Type	Description
`list[str]`	List of file paths to downloaded structures.
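
To illustrate how a `workers` setting and a list-of-paths return value can fit together, here is a minimal sketch of a threaded fetch with a local cache. It does not reproduce molfun's internals: `fake_download` and `fetch_all` are hypothetical names, and the download step writes a placeholder file instead of contacting any server.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fake_download(pdb_id: str, cache_dir: Path, fmt: str = "cif") -> str:
    """Stand-in for a real download: write a placeholder unless already cached."""
    target = cache_dir / f"{pdb_id}.{fmt}"
    if not target.exists():
        target.write_text(f"# placeholder for {pdb_id}\n")
    return str(target)


def fetch_all(pdb_ids: list[str], cache_dir: Path, workers: int = 4) -> list[str]:
    """Fetch IDs concurrently and return file paths in input order."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of pdb_ids,
        # regardless of which thread finishes first.
        return list(pool.map(lambda i: fake_download(i, cache_dir), pdb_ids))


with tempfile.TemporaryDirectory() as tmp:
    paths = fetch_all(["1abc", "2xyz", "3def"], Path(tmp) / "collections")
    names = [Path(p).name for p in paths]
    all_exist = all(Path(p).exists() for p in paths)
    print(names)
```

Threads suit this kind of work because downloads are I/O-bound, and checking the cache before writing makes repeated calls cheap.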

Download and return a curated dataset collection.

from molfun.data.collections import fetch_collection

# Structure prediction benchmark
ds = fetch_collection("casp15", split="test", cache_dir="./data")

# Affinity dataset
ds = fetch_collection("pdbbind_refined", version="2020")

Parameter	Type	Default	Description
`name`	`str`	required	Collection name
`split`	`str \| None`	`None`	Dataset split (collection-dependent)
`version`	`str \| None`	`None`	Dataset version
`cache_dir`	`str \| Path`	`"~/.molfun/collections"`	Download cache directory

Returns: Dataset appropriate for the collection type.

count_collection

count_collection(name: str, resolution_max: float | None = None) -> int

Count how many structures are available in RCSB for a collection.

Makes a single lightweight HTTP request (no downloads).

Parameters:

Name	Type	Description	Default
`name`	`str`	Collection name (see `list_collections()`).	required
`resolution_max`	`float \| None`	Override collection default resolution.	`None`

Returns:

Type	Description
`int`	Total number of matching structures in RCSB.
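
As a rough idea of what a count-only request might look like, the sketch below builds a JSON payload for the RCSB Search API and reads the count out of a response. This is an assumption about how such a query could be formed, not molfun's actual request: the helper names are hypothetical, and the attribute name, the `return_counts` option, and the `total_count` response key reflect my understanding of the public RCSB Search API and should be checked against its documentation. No network call is made here.

```python
import json


def build_count_query(resolution_max: float) -> dict:
    """Build a count-only search payload (assumed RCSB Search API v2 shape)."""
    return {
        "query": {
            "type": "terminal",
            "service": "text",
            "parameters": {
                # Assumed attribute name for the entry resolution.
                "attribute": "rcsb_entry_info.resolution_combined",
                "operator": "less_or_equal",
                "value": resolution_max,
            },
        },
        "return_type": "entry",
        # Assumed option: ask for the hit count only, not the hit list.
        "request_options": {"return_counts": True},
    }


def parse_count(response_text: str) -> int:
    """Extract the hit count from a response body (assumed `total_count` key)."""
    return int(json.loads(response_text)["total_count"])


payload = build_count_query(resolution_max=2.5)
print(json.dumps(payload, indent=2))

# Hand-written sample body for illustration only, not real RCSB output.
sample_response = '{"result_type": "entry", "total_count": 1234}'
count = parse_count(sample_response)
print(count)
```

Because only a count is requested, the response stays small however many structures match, which is consistent with the "single lightweight HTTP request" described above.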

Return the number of entries in a collection without downloading it.

from molfun.data.collections import count_collection

n = count_collection("casp15")
print(f"CASP15 has {n} targets")

Parameter	Type	Description
`name`	`str`	Collection name

Returns: int

list_collections

list_collections(tag: str | None = None) -> list[CollectionSpec]

List available collections, optionally filtered by tag.

Parameters:

Name	Type	Description	Default
`tag`	`str \| None`	If given, return only collections containing this tag.	`None`

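List all available collections along with their sizes.

from molfun.data.collections import count_collection, list_collections

# list_collections() returns CollectionSpec objects; use .name for the identifier
for spec in list_collections():
    n = count_collection(spec.name)
    print(f"{spec.name}: {n} entries")

Returns: list[CollectionSpec]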
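
The `tag` filter can be pictured as a simple membership test over a registry of specs. The sketch below is illustrative only: `SpecSketch`, `REGISTRY`, and `list_specs` are hypothetical names, the `tags` field is an assumption (it is not listed among the documented `CollectionSpec` fields), and the tag values are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpecSketch:
    """Hypothetical stand-in for CollectionSpec; `tags` is an assumed field."""
    name: str
    tags: tuple[str, ...] = ()


# Illustrative registry; the tag values are invented for this example.
REGISTRY = [
    SpecSketch("casp15", tags=("benchmark", "structure")),
    SpecSketch("cameo", tags=("benchmark", "structure")),
    SpecSketch("pdbbind_refined", tags=("affinity",)),
]


def list_specs(tag: Optional[str] = None) -> list[SpecSketch]:
    """Return every spec, or only those whose tags contain `tag`."""
    if tag is None:
        return list(REGISTRY)
    return [spec for spec in REGISTRY if tag in spec.tags]


affinity_names = [spec.name for spec in list_specs("affinity")]
benchmark_names = [spec.name for spec in list_specs("benchmark")]
print(affinity_names, benchmark_names)
```

Returning a fresh list in the unfiltered case keeps callers from mutating the registry by accident.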

CollectionSpec

Dataclass that defines a reusable protein collection query and describes the collection's metadata.

Field	Type	Description
`name`	`str`	Collection identifier
`description`	`str`	Human-readable description
`task`	`str`	Task type: `"structure"`, `"affinity"`, `"property"`
`size`	`int`	Number of entries
`url`	`str`	Source URL
`citation`	`str`	BibTeX citation key
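
The fields in the table map naturally onto a frozen dataclass. The following sketch mirrors them under a hypothetical name, `CollectionSpecSketch`. The `task` validation in `__post_init__` is an assumption added for illustration (whether the real class validates anything is not stated), and the `url` and `citation` values are placeholders.

```python
from dataclasses import dataclass

ALLOWED_TASKS = ("structure", "affinity", "property")


@dataclass(frozen=True)
class CollectionSpecSketch:
    """Hypothetical mirror of the documented CollectionSpec fields."""
    name: str
    description: str
    task: str
    size: int
    url: str
    citation: str

    def __post_init__(self) -> None:
        # Assumed behaviour: reject task types outside the documented set.
        if self.task not in ALLOWED_TASKS:
            raise ValueError(
                f"unknown task {self.task!r}; expected one of {ALLOWED_TASKS}"
            )


spec = CollectionSpecSketch(
    name="casp15",
    description="CASP15 free-modeling targets",
    task="structure",
    size=109,
    url="https://example.org/casp15",  # placeholder, not the real source URL
    citation="placeholder_key",        # placeholder BibTeX key
)
print(spec.name, spec.task, spec.size)
```

Making the dataclass frozen means a spec cannot be modified after creation, which suits metadata shared across a registry.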

Available Collections

Name	Task	Size	Description
`casp14`	structure	87	CASP14 free-modeling targets
`casp15`	structure	109	CASP15 free-modeling targets
`cameo`	structure	~500/yr	CAMEO weekly structure prediction targets
`pdbbind_refined`	affinity	~5,300	PDBbind refined set
`pdbbind_core`	affinity	~290	PDBbind core set (test benchmark)
`pdbbind_general`	affinity	~19,400	PDBbind general set
