Collections
Curated dataset collections for common benchmarks and training sets. Collections provide a convenient way to fetch well-known datasets with a single function call.
Quick Start
from molfun.data.collections import fetch_collection, list_collections
# See what's available
print(list_collections())
# ["casp14", "casp15", "cameo", "pdbbind_refined", ...]
# Fetch a collection
dataset = fetch_collection("casp15", split="test")
# Check collection size before downloading
from molfun.data.collections import count_collection
n = count_collection("pdbbind_refined")
print(f"PDBbind refined set: {n} complexes")
Functions
fetch_collection
fetch_collection(name: str, *, cache_dir: str | None = None, fmt: str = 'cif', max_structures: int | None = None, resolution_max: float | None = None, deduplicate: bool = False, identity: float = 0.3, workers: int = 4, progress: bool = False, storage_options: dict | None = None) -> list[str]
Fetch structures for a named collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Collection name (see …) | required |
| cache_dir | str \| None | Override cache directory. | None |
| fmt | str | File format (…) | 'cif' |
| max_structures | int \| None | Override collection default max. | None |
| resolution_max | float \| None | Override collection default resolution. | None |
| deduplicate | bool | If True, cluster by sequence and keep one representative. | False |
| identity | float | Sequence identity threshold for deduplication. | 0.3 |
| workers | int | Number of parallel download threads. | 4 |
| progress | bool | Show tqdm progress bar. | False |
| storage_options | dict \| None | fsspec options for remote storage. | None |
Returns:
| Type | Description |
|---|---|
| list[str] | List of file paths to downloaded structures. |
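The deduplicate and identity parameters describe clustering by sequence and keeping one representative per cluster. The sketch below illustrates that idea with a greedy pass over toy sequences. It is not the library's implementation: the helper names `positional_identity` and `pick_representatives` are hypothetical, and the position-wise identity measure is a crude stand-in for a real alignment-based one.

```python
def positional_identity(a: str, b: str) -> float:
    """Fraction of matching positions, relative to the longer sequence.

    Assumption: a crude stand-in for alignment-based sequence identity.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def pick_representatives(sequences: dict, threshold: float = 0.3) -> list:
    """Greedy clustering: keep an ID only if its sequence is below
    `threshold` identity to every representative kept so far."""
    representatives = []
    for seq_id, seq in sequences.items():
        if all(
            positional_identity(seq, sequences[rep]) < threshold
            for rep in representatives
        ):
            representatives.append(seq_id)
    return representatives


# Hypothetical toy input: the IDs and sequences are made up for illustration.
toy = {"a": "MKTAYIAK", "b": "MKTAYIAR", "c": "GGSGGSGG"}
print(pick_representatives(toy, threshold=0.3))
```

With a threshold of 0.3, sequence "b" is absorbed into the cluster represented by "a", while "c" shares no positions with "a" and becomes its own representative. A lower threshold merges more aggressively and leaves fewer representatives.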
Download and return a curated dataset collection.
from molfun.data.collections import fetch_collection
# Structure prediction benchmark
ds = fetch_collection("casp15", split="test", cache_dir="./data")
# Affinity dataset
ds = fetch_collection("pdbbind_refined", version="2020")
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | required | Collection name |
| split | str \| None | None | Dataset split (collection-dependent) |
| version | str \| None | None | Dataset version |
| cache_dir | str \| Path | "~/.molfun/collections" | Download cache directory |
Returns: Dataset appropriate for the collection type.
count_collection
Count how many structures are available in RCSB for a collection.
Makes a single lightweight HTTP request (no downloads).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Collection name (see …) | required |
| resolution_max | float \| None | Override collection default resolution. | None |
Returns:
| Type | Description |
|---|---|
| int | Total number of matching structures in RCSB. |
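Because counting needs only one lightweight request, it can act as a guard before a large download. The sketch below shows one way to structure that check. The helper `fetch_if_small` and its `max_entries` parameter are hypothetical, and the counting and fetching functions are passed in as callables so the example runs without network access or the library installed.

```python
from typing import Callable, List


def fetch_if_small(
    name: str,
    count_fn: Callable[[str], int],
    fetch_fn: Callable[[str], List[str]],
    max_entries: int = 1000,
) -> List[str]:
    """Fetch `name` only if its reported size is within `max_entries`."""
    n = count_fn(name)
    if n > max_entries:
        raise ValueError(f"{name} has {n} entries, above the limit of {max_entries}")
    return fetch_fn(name)


# Stand-in callables with invented sizes, so the sketch runs offline.
# In real use, count_collection and fetch_collection would be passed instead.
fake_counts = {"small_set": 3, "large_set": 50_000}


def fake_fetch(name: str) -> List[str]:
    return [f"{name}/{i}.cif" for i in range(fake_counts[name])]


paths = fetch_if_small("small_set", fake_counts.__getitem__, fake_fetch)
print(paths)
```

Passing the two functions as arguments keeps the guard easy to test and leaves the choice of limit to the caller.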
Return the number of entries in a collection without downloading it.
from molfun.data.collections import count_collection
n = count_collection("casp15")
print(f"CASP15 has {n} targets")
| Parameter | Type | Description |
|---|---|---|
| name | str | Collection name |
Returns: int
list_collections
List available collections, optionally filtered by tag.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tag | str \| None | If given, return only collections containing this tag. | None |
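The tag filter described above can be pictured as a membership test over each collection's tag set. The following is a minimal sketch under stated assumptions: the `REGISTRY` mapping, the `list_names` helper, and every tag value are invented for illustration and do not reflect how the library stores or labels its collections.

```python
from typing import Dict, List, Optional, Set

# Hypothetical registry: the tags below are invented for illustration.
REGISTRY: Dict[str, Set[str]] = {
    "casp15": {"benchmark", "structure"},
    "pdbbind_core": {"benchmark", "affinity"},
    "pdbbind_general": {"training", "affinity"},
}


def list_names(tag: Optional[str] = None) -> List[str]:
    """Return all names, or only those whose tag set contains `tag`."""
    return sorted(
        name for name, tags in REGISTRY.items() if tag is None or tag in tags
    )


print(list_names())
print(list_names(tag="benchmark"))
```

Leaving `tag` as None returns every name, which matches the documented default.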
List all available collection names.
from molfun.data.collections import count_collection, list_collections
# Print the size of every available collection
for name in list_collections():
    n = count_collection(name)
    print(f"{name}: {n} entries")
Returns: list[str]
CollectionSpec
Defines a reusable protein collection query.
Dataclass describing a collection's metadata.
| Field | Type | Description |
|---|---|---|
| name | str | Collection identifier |
| description | str | Human-readable description |
| task | str | Task type: "structure", "affinity", "property" |
| size | int | Number of entries |
| url | str | Source URL |
| citation | str | BibTeX citation key |
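To make the field table concrete, here is a sketch of a dataclass with the same six fields. It is an illustrative mirror, not the library's class: the name `CollectionSpecSketch` is hypothetical, making it frozen is an assumption, and all values in the example instance are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen is an assumption, chosen so specs stay immutable
class CollectionSpecSketch:
    """Illustrative mirror of the documented fields; not the library class."""

    name: str         # collection identifier
    description: str  # human-readable description
    task: str         # "structure", "affinity", or "property"
    size: int         # number of entries
    url: str          # source URL
    citation: str     # BibTeX citation key


# Placeholder values; the URL and citation key are invented.
spec = CollectionSpecSketch(
    name="example_set",
    description="Toy collection used only in this sketch",
    task="structure",
    size=42,
    url="https://example.org/example_set",
    citation="example2024",
)
print(f"{spec.name} ({spec.task}): {spec.size} entries")
```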
Available Collections
| Name | Task | Size | Description |
|---|---|---|---|
| casp14 | structure | 87 | CASP14 free-modeling targets |
| casp15 | structure | 109 | CASP15 free-modeling targets |
| cameo | structure | ~500/yr | CAMEO weekly structure prediction targets |
| pdbbind_refined | affinity | ~5,300 | PDBbind refined set |
| pdbbind_core | affinity | ~290 | PDBbind core set (test benchmark) |
| pdbbind_general | affinity | ~19,400 | PDBbind general set |