
Collections

Curated dataset collections for common benchmarks and training sets. Collections provide a convenient way to fetch well-known datasets with a single function call.

Quick Start

from molfun.data.collections import (
    count_collection,
    fetch_collection,
    list_collections,
)

# See what's available
print(list_collections())
# ["casp14", "casp15", "cameo", "pdbbind_refined", ...]

# Check collection size before downloading
n = count_collection("pdbbind_refined")
print(f"PDBbind refined set: {n} complexes")

# Fetch a collection
dataset = fetch_collection("casp15", split="test")

Functions

fetch_collection

fetch_collection(name: str, *, cache_dir: str | None = None, fmt: str = 'cif', max_structures: int | None = None, resolution_max: float | None = None, deduplicate: bool = False, identity: float = 0.3, workers: int = 4, progress: bool = False, storage_options: dict | None = None) -> list[str]

Fetch structures for a named collection.

Parameters:

Name Type Description Default
name str Collection name (see list_collections()). required
cache_dir str | None Override cache directory. None
fmt str File format ("cif" or "pdb"). 'cif'
max_structures int | None Override collection default max. None
resolution_max float | None Override collection default resolution. None
deduplicate bool If True, cluster by sequence and keep one representative. False
identity float Sequence identity threshold for deduplication. 0.3
workers int Number of parallel download threads. 4
progress bool Show tqdm progress bar. False
storage_options dict | None fsspec options for remote storage. None

Returns:

Type Description
list[str] List of file paths to downloaded structures.
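
The deduplicate and identity parameters above describe clustering by sequence and keeping one representative per cluster. This page does not say how molfun performs that clustering, so the following is only a minimal sketch of the general idea: a greedy pass that keeps a sequence when it falls below the identity threshold against every representative kept so far. The names approx_identity and pick_representatives are hypothetical, the sequences are made up, and difflib's similarity ratio is a stand-in for a real alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def approx_identity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_representatives(seqs: dict[str, str], identity: float = 0.3) -> list[str]:
    """Greedy clustering sketch: return one representative ID per cluster."""
    reps: list[str] = []
    # Visit longest sequences first; break ties by ID so the result is deterministic.
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        # Keep this sequence only if no existing representative is similar enough.
        if all(approx_identity(seqs[sid], seqs[r]) < identity for r in reps):
            reps.append(sid)
    return reps


# Made-up sequences for illustration only.
example = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQK",  # differs from "a" at one position
    "c": "GGGGGGGGGG",  # shares no residues with "a" or "b"
}
representatives = pick_representatives(example, identity=0.3)
print(representatives)
```

In this sketch a lower identity value merges more sequences into each cluster, so fewer representatives survive. Production tools usually rely on dedicated clustering software rather than pairwise comparison, which scales quadratically.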

Download and return a curated dataset collection.

from molfun.data.collections import fetch_collection

# Structure prediction benchmark
ds = fetch_collection("casp15", split="test", cache_dir="./data")

# Affinity dataset
ds = fetch_collection("pdbbind_refined", version="2020")

Parameter Type Default Description
name str required Collection name
split str | None None Dataset split (collection-dependent)
version str | None None Dataset version
cache_dir str | Path "~/.molfun/collections" Download cache directory

Returns: Dataset appropriate for the collection type.
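
The cache_dir and workers parameters suggest a common pattern: download only what is missing from the cache, using a small thread pool. The sketch below illustrates that pattern under stated assumptions and is not molfun's implementation. The function fetch_all, the download_one callback, and the entry IDs are all hypothetical; the downloader is a stub so the example runs without network access.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def fetch_all(
    ids: list[str],
    download_one: Callable[[str, Path], None],
    cache_dir: str,
    fmt: str = "cif",
    workers: int = 4,
) -> list[str]:
    """Download missing files in parallel and return paths in input order."""
    root = Path(cache_dir).expanduser()
    root.mkdir(parents=True, exist_ok=True)

    def ensure(entry_id: str) -> str:
        target = root / f"{entry_id}.{fmt}"
        if not target.exists():  # cache miss: download the file
            download_one(entry_id, target)
        return str(target)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(ensure, ids))


def fake_download(entry_id: str, target: Path) -> None:
    """Placeholder downloader: writes a stub file instead of using the network."""
    target.write_text(f"stub for {entry_id}\n")


with tempfile.TemporaryDirectory() as tmp:
    paths = fetch_all(["1abc", "2xyz"], fake_download, tmp, workers=2)
    print([Path(p).name for p in paths])
```

Threads suit this kind of work because downloads are I/O-bound; a second call with the same cache directory would skip every file already present.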


count_collection

count_collection(name: str, resolution_max: float | None = None) -> int

Count how many structures are available in RCSB for a collection.

Makes a single lightweight HTTP request (no downloads).

Parameters:

Name Type Description Default
name str Collection name (see list_collections()). required
resolution_max float | None Override collection default resolution. None

Returns:

Type Description
int Total number of matching structures in RCSB.
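
Because counting is cheap and fetching is not, one way to use count_collection is as a guard before a download. The sketch below shows that pattern with injected callables so it runs offline; fetch_if_small, the fake counts, and the collection names small_set and large_set are hypothetical and exist only for illustration.

```python
from typing import Callable


def fetch_if_small(
    name: str,
    limit: int,
    count_fn: Callable[[str], int],
    fetch_fn: Callable[[str], list[str]],
) -> list[str]:
    """Fetch a collection only when its reported size is within `limit`."""
    n = count_fn(name)
    if n > limit:
        raise ValueError(f"{name!r} has {n} structures, above the limit of {limit}")
    return fetch_fn(name)


# Stand-ins so the sketch runs without network access.
fake_counts = {"small_set": 3, "large_set": 50_000}


def fake_fetch(name: str) -> list[str]:
    """Pretend to download and return made-up file names."""
    return [f"{name}_{i}.cif" for i in range(fake_counts[name])]


result = fetch_if_small(
    "small_set", limit=100, count_fn=fake_counts.__getitem__, fetch_fn=fake_fetch
)
print(result)
```

With the library installed, count_collection and fetch_collection could presumably be passed as count_fn and fetch_fn, assuming they accept a bare collection name as the signatures on this page indicate.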

Return the number of entries in a collection without downloading it.

from molfun.data.collections import count_collection

n = count_collection("casp15")
print(f"CASP15 has {n} targets")
Parameter Type Description
name str Collection name

Returns: int


list_collections

list_collections(tag: str | None = None) -> list[CollectionSpec]

List available collections, optionally filtered by tag.

Parameters:

Name Type Description Default
tag str | None If given, return only collections containing this tag. None

List all available collection names.

from molfun.data.collections import count_collection, list_collections

for name in list_collections():
    n = count_collection(name)
    print(f"{name}: {n} entries")

Returns: list[str]


CollectionSpec

CollectionSpec dataclass

Defines a reusable protein collection query.

Dataclass describing a collection's metadata.

Field Type Description
name str Collection identifier
description str Human-readable description
task str Task type: "structure", "affinity", "property"
size int Number of entries
url str Source URL
citation str BibTeX citation key
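
To make the field table concrete, here is a rough dataclass mirroring it. This is an illustrative reconstruction from the table alone, not the library's definition: the class name CollectionSpecSketch is hypothetical, and the real CollectionSpec may have different fields, defaults, or methods. The example values come from the casp14 row of the Available Collections table; url and citation are left empty because this page does not give them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionSpecSketch:
    """Illustrative mirror of the fields table; not molfun's actual class."""
    name: str         # collection identifier
    description: str  # human-readable description
    task: str         # "structure", "affinity", or "property"
    size: int         # number of entries
    url: str          # source URL
    citation: str     # BibTeX citation key


casp14 = CollectionSpecSketch(
    name="casp14",
    description="CASP14 free-modeling targets",
    task="structure",
    size=87,
    url="",       # not given on this page, left empty
    citation="",  # not given on this page, left empty
)
summary = f"{casp14.name}: {casp14.size} entries ({casp14.task})"
print(summary)
```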

Available Collections

Name Task Size Description
casp14 structure 87 CASP14 free-modeling targets
casp15 structure 109 CASP15 free-modeling targets
cameo structure ~500/yr CAMEO weekly structure prediction targets
pdbbind_refined affinity ~5,300 PDBbind refined set
pdbbind_core affinity ~290 PDBbind core set (test benchmark)
pdbbind_general affinity ~19,400 PDBbind general set