Data Fetchers¶

Fetchers download biological data from external sources: PDB structures, binding affinity datasets, and multiple sequence alignments.

Quick Start¶

from molfun.data.sources import PDBFetcher, AffinityFetcher, MSAProvider

# Fetch PDB structures
fetcher = PDBFetcher()
structures = fetcher.fetch(["1ABC", "2DEF", "3GHI"])

# Fetch affinity data
affinity = AffinityFetcher()
data = affinity.fetch(source="pdbbind", version="2020")

# Generate MSAs
msa = MSAProvider(backend="colabfold")
alignment = msa.fetch("MKFLILLFNILCLFPVLAADNH...")

PDBFetcher¶

PDBFetcher ¶

Download and cache PDB structures (mmCIF format) from RCSB.

Usage::

fetcher = PDBFetcher()

# By IDs
paths = fetcher.fetch(["1abc", "2xyz"])

# By Pfam family (e.g. protein kinases)
paths = fetcher.fetch_by_family("PF00069", max_structures=100)

# By EC number (e.g. EC 2.7, phosphotransferases, which include kinases)
paths = fetcher.fetch_by_ec("2.7.*", max_structures=200)

# By GO term (e.g. protein kinase activity)
paths = fetcher.fetch_by_go("GO:0004672")

# By organism (e.g. human only)
paths = fetcher.fetch_by_taxonomy(9606, max_structures=100)

# By keyword (free-text search)
paths = fetcher.fetch_by_keyword("tyrosine kinase", max_structures=50)

# With metadata + deduplication
records = fetcher.fetch_with_metadata(["1abc", "2xyz"])
paths = fetcher.fetch_deduplicated(pdb_ids, identity=0.3)

__init__ ¶

__init__(cache_dir: str | None = None, fmt: str = 'cif', workers: int = 4, progress: bool = False, storage_options: dict | None = None)

Parameters:

Name	Type	Description	Default
`cache_dir`	`str \| None`	Directory to cache downloaded files. Supports local paths or remote URIs (s3://, gs://). Default: ~/.molfun/pdb_cache	`None`
`fmt`	`str`	File format — "cif" (mmCIF, default) or "pdb".	`'cif'`
`workers`	`int`	Default number of parallel download threads (1 = sequential).	`4`
`progress`	`bool`	Show tqdm progress bar by default (requires tqdm).	`False`
`storage_options`	`dict \| None`	fsspec options for remote cache_dir (e.g. `{"endpoint_url": "http://localhost:9000"}` for MinIO).	`None`

fetch ¶

fetch(pdb_ids: list[str], workers: int | None = None, progress: bool | None = None) -> list[str]

Download structures by PDB ID.

Parameters:

Name	Type	Description	Default
`pdb_ids`	`list[str]`	PDB IDs to download.	required
`workers`	`int \| None`	Number of parallel download threads (1 = sequential). Defaults to the instance `workers` setting (4). RCSB handles 4-8 concurrent connections well.	`None`
`progress`	`bool \| None`	Show a `tqdm` progress bar (requires tqdm installed). Defaults to the instance `progress` setting.	`None`

Returns:

Type	Description
`list[str]`	List of file paths, in the same order as the input IDs. Paths may be local or remote depending on `cache_dir`. Files that are already cached are not downloaded again.
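
The two guarantees above (results in input order, no download on a cache hit) can be illustrated with a small standalone sketch. This is not the library's implementation: `fetch_all`, `fake_download`, and the `<id>.cif` file naming are hypothetical names chosen for the example, and the "download" only writes a placeholder string.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def fetch_all(pdb_ids, cache_dir, download, workers=4):
    """Return one path per ID, in input order, downloading only cache misses."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)

    def fetch_one(pdb_id):
        # Assumed naming scheme: lower-cased ID plus ".cif".
        path = cache / f"{pdb_id.lower()}.cif"
        if not path.exists():  # cache hit: skip the download
            path.write_text(download(pdb_id))
        return str(path)

    # Executor.map yields results in the order of its inputs,
    # regardless of which worker finishes first.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, pdb_ids))


calls = []  # records which IDs were actually "downloaded"


def fake_download(pdb_id):
    calls.append(pdb_id)
    return f"data_{pdb_id}\n"


cache_dir = tempfile.mkdtemp()
first = fetch_all(["1ABC", "2DEF", "3GHI"], cache_dir, fake_download)
second = fetch_all(["3GHI", "1ABC"], cache_dir, fake_download)  # all cache hits
```

After the second call, `calls` still holds only the three IDs from the first call, because both requested files were already present in the cache directory.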

fetch_by_family ¶

fetch_by_family(pfam_id: str, max_structures: int = 500, resolution_max: float = 3.0) -> list[Path]

Fetch PDB structures belonging to a Pfam family via RCSB Search API.

Parameters:

Name	Type	Description	Default
`pfam_id`	`str`	Pfam accession (e.g. "PF00069").	required
`max_structures`	`int`	Maximum number of structures to retrieve.	`500`
`resolution_max`	`float`	Filter by resolution (Angstrom).	`3.0`

Returns:

Type	Description
`list[Path]`	List of local file paths.

fetch_by_uniprot ¶

fetch_by_uniprot(uniprot_ids: list[str], max_per_accession: int = 50, resolution_max: float = 3.0) -> list[Path]

Fetch PDB structures mapped to UniProt accessions via RCSB Search API.

Parameters:

Name	Type	Description	Default
`uniprot_ids`	`list[str]`	List of UniProt accessions (e.g. ["P12345"]).	required
`max_per_accession`	`int`	Max structures per UniProt ID.	`50`
`resolution_max`	`float`	Filter by resolution (Angstrom).	`3.0`

Returns:

Type	Description
`list[Path]`	List of local file paths (deduplicated).

fetch_by_ec ¶

fetch_by_ec(ec_number: str, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]

Fetch structures by Enzyme Commission (EC) number.

Supports wildcards: "2.7.*" matches every entry under EC 2.7 (transferases that transfer phosphorus-containing groups), which includes the kinases.

Parameters:

Name	Type	Description	Default
`ec_number`	`str`	Full or partial EC number (e.g. `"2.7.11.1"` or `"2.7.*"`).	required
`max_structures`	`int`	Maximum number of structures.	`500`
`resolution_max`	`float`	Filter by resolution (Angstrom).	`3.0`
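
To make the wildcard semantics concrete, here is a minimal client-side sketch using shell-style pattern matching. It is only an illustration: how `fetch_by_ec` resolves the wildcard (locally or in the RCSB query) is not stated here, `ec_matches` is a hypothetical helper, and the ID-to-EC mapping is invented for the example.

```python
from fnmatch import fnmatchcase


def ec_matches(ec_number: str, pattern: str) -> bool:
    """True if a full EC number matches a full or wildcard pattern."""
    return fnmatchcase(ec_number, pattern)


# Invented annotations: placeholder PDB IDs mapped to EC numbers.
annotations = {
    "1abc": ["2.7.11.1"],
    "2xyz": ["3.4.21.4"],
    "3ghi": ["2.7.1.1", "1.1.1.1"],
}

# Keep every entry that has at least one EC number under 2.7.
hits = [
    pdb_id
    for pdb_id, ecs in annotations.items()
    if any(ec_matches(ec, "2.7.*") for ec in ecs)
]
```

Because the pattern ends in ".*", it requires the literal prefix "2.7.", so an EC number such as "2.70.1.1" would not match.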

fetch_by_go ¶

fetch_by_go(go_id: str, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]

Fetch structures annotated with a Gene Ontology (GO) term.

Parameters:

Name	Type	Description	Default
`go_id`	`str`	GO accession (e.g. `"GO:0004672"` for protein kinase activity).	required
`max_structures`	`int`	Maximum number of structures.	`500`
`resolution_max`	`float`	Filter by resolution (Angstrom).	`3.0`

fetch_by_taxonomy ¶

fetch_by_taxonomy(taxonomy_id: int, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]

Fetch structures from a specific organism via NCBI taxonomy ID.

Parameters:

Name	Type	Description	Default
`taxonomy_id`	`int`	NCBI taxonomy ID (e.g. 9606 for Homo sapiens, 10090 for Mus musculus).	required
`max_structures`	`int`	Maximum number of structures.	`500`
`resolution_max`	`float`	Filter by resolution (Angstrom).	`3.0`

fetch_by_keyword ¶

fetch_by_keyword(keyword: str, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]

Free-text search over RCSB metadata (title, abstract, etc.).

Parameters:

Name	Type	Description	Default
`keyword`	`str`	Search phrase (e.g. `"tyrosine kinase"`).	required
`max_structures`	`int`	Maximum number of structures.	`500`
`resolution_max`	`float`	Filter by resolution (Angstrom).	`3.0`

fetch_by_scop ¶

fetch_by_scop(scop_id: str, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]

Fetch structures by SCOPe classification ID.

Parameters:

Name	Type	Description	Default
`scop_id`	`str`	SCOPe sunid or lineage string (e.g. `"b.1.1.1"`).	required
`max_structures`	`int`	Maximum number of structures.	`500`
`resolution_max`	`float`	Filter by resolution (Angstrom).	`3.0`

fetch_combined ¶

fetch_combined(*, pfam_id: str | None = None, ec_number: str | None = None, go_id: str | None = None, taxonomy_id: int | None = None, keyword: str | None = None, uniprot_ids: list[str] | None = None, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]

Fetch structures matching ALL provided criteria (AND logic).

Allows combining multiple filters in a single RCSB query for precise domain-specific datasets.

Example::

# Human protein kinases at ≤2.5 Å
paths = fetcher.fetch_combined(
    pfam_id="PF00069",
    taxonomy_id=9606,
    resolution_max=2.5,
)
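
The AND logic can be pictured as a single grouped query. The sketch below builds a request body in the style of the RCSB Search API, with one terminal node per filter inside an "and" group. It is an assumption about how such a query could look, not the payload molfun sends: `build_query` and `terminal` are hypothetical helpers, and the attribute names should be checked against the RCSB schema before use.

```python
import json


def terminal(attribute, operator, value):
    """One attribute filter (a leaf node of the query tree)."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def build_query(*, taxonomy_id=None, keyword=None, resolution_max=3.0, max_structures=500):
    """Combine the provided filters into one AND group."""
    nodes = [
        terminal("rcsb_entry_info.resolution_combined", "less_or_equal", resolution_max)
    ]
    if taxonomy_id is not None:
        # Attribute name assumed from the RCSB schema.
        nodes.append(
            terminal(
                "rcsb_entity_source_organism.taxonomy_lineage.id",
                "exact_match",
                str(taxonomy_id),
            )
        )
    if keyword is not None:
        nodes.append(
            {"type": "terminal", "service": "full_text", "parameters": {"value": keyword}}
        )
    return {
        "query": {"type": "group", "logical_operator": "and", "nodes": nodes},
        "return_type": "entry",
        "request_options": {"paginate": {"start": 0, "rows": max_structures}},
    }


query = build_query(taxonomy_id=9606, keyword="tyrosine kinase", resolution_max=2.5)
print(json.dumps(query, indent=2))
```

Only the filters that are actually passed become nodes, so omitting an argument widens the result set rather than matching on an empty value.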

fetch_with_metadata ¶

fetch_with_metadata(pdb_ids: list[str]) -> list[StructureRecord]

Download structures and enrich them with RCSB metadata.

Returns a list of StructureRecord with resolution, organism, EC numbers, Pfam IDs, etc. populated via the RCSB GraphQL API.

search_ids ¶

search_ids(*, pfam_id: str | None = None, ec_number: str | None = None, go_id: str | None = None, taxonomy_id: int | None = None, keyword: str | None = None, max_results: int = 500, resolution_max: float = 3.0) -> list[str]

Like fetch_combined but returns PDB IDs without downloading.

Useful for deduplication pipelines where you want to filter IDs before committing to downloads.

fetch_deduplicated ¶

fetch_deduplicated(pdb_ids: list[str], identity: float = 0.3, coverage: float = 0.8) -> list[str]

Download structures and remove redundancy by sequence clustering.

Uses MMseqs2 easy-cluster if available, otherwise falls back to a simple hash-based greedy approach using RCSB sequence data.

Parameters:

Name	Type	Description	Default
`pdb_ids`	`list[str]`	PDB IDs to fetch.	required
`identity`	`float`	Sequence identity threshold for clustering (0-1).	`0.3`
`coverage`	`float`	Minimum coverage for clustering.	`0.8`

Returns:

Type	Description
`list[str]`	Paths to representative structures (one per cluster).
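
The idea of greedy redundancy removal can be sketched in a few lines. This is a simplified stand-in, not the MMseqs2 path and not necessarily the library's fallback: `greedy_representatives` is a hypothetical function, and `difflib.SequenceMatcher.ratio()` is used as a rough similarity score in place of true alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def greedy_representatives(sequences, identity=0.3, coverage=0.8):
    """Pick one representative ID per cluster, longest sequences first."""
    reps = []
    # Longest first, ties broken by ID, so the result is deterministic.
    ordered = sorted(sequences.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for pdb_id, seq in ordered:
        for rep_id in reps:
            rep_seq = sequences[rep_id]
            # seq is never longer than rep_seq because of the sort order.
            cov = len(seq) / max(len(rep_seq), 1)
            sim = SequenceMatcher(None, rep_seq, seq, autojunk=False).ratio()
            if cov >= coverage and sim >= identity:
                break  # redundant with an existing representative
        else:
            reps.append(pdb_id)  # starts a new cluster
    return reps


# Placeholder IDs with invented sequences.
sequences = {
    "1abc": "MKFLILLFNILCLFPVLAADNH",
    "2xyz": "MKFLILLFNILCLFPVLAADNH",  # exact duplicate of 1abc
    "3ghi": "GGGGSSSSGGGGSSSSGGGGSS",  # shares no residues with 1abc
}
representatives = greedy_representatives(sequences, identity=0.9)
```

Here "2xyz" is absorbed into the cluster represented by "1abc", while "3ghi" starts its own cluster.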

list_cached ¶

list_cached() -> list[str]

Return all cached structure files.

clear_cache ¶

clear_cache() -> int

Remove all cached files. Returns number of files removed.

count ¶

count(*, pfam_id: str | None = None, ec_number: str | None = None, go_id: str | None = None, taxonomy_id: int | None = None, keyword: str | None = None, resolution_max: float = 3.0) -> int

Count how many structures match a query without downloading.

Uses the RCSB total_count field — a single cheap HTTP request.

Example::

n = fetcher.count(pfam_id="PF00069")
print(f"Kinases available: {n}")

Download PDB/mmCIF files from the RCSB PDB.

fetch¶

fetcher = PDBFetcher(cache_dir="./pdb_cache", format="mmcif")

# Fetch by PDB IDs
structures = fetcher.fetch(["1ABC", "2DEF"])

# Fetch with filters
structures = fetcher.fetch(
    ids=None,
    resolution_max=2.5,
    organism="Homo sapiens",
    method="X-RAY DIFFRACTION",
    max_results=100,
)

Parameter	Type	Default	Description
`cache_dir`	`str \| Path`	`"~/.molfun/pdb"`	Local cache directory
`format`	`str`	`"mmcif"`	File format: `"pdb"` or `"mmcif"`

fetch() Parameters¶

Parameter	Type	Default	Description
`ids`	`list[str] \| None`	`None`	PDB IDs to download
`resolution_max`	`float \| None`	`None`	Maximum resolution in Angstroms
`organism`	`str \| None`	`None`	Source organism filter
`method`	`str \| None`	`None`	Experimental method filter
`max_results`	`int \| None`	`None`	Limit number of results (for queries)

Returns: list[Path] of downloaded file paths.

AffinityFetcher¶

AffinityFetcher ¶

Parse and serve binding affinity data.

Supports:

- PDBbind index files (v2016–v2020): provide the path to INDEX_general_PL_data.{year} or the refined set index.
- CSV files with columns: pdb_id, affinity, [resolution, year, ...].

Usage::

fetcher = AffinityFetcher()

# From PDBbind index file
records = fetcher.from_pdbbind_index("path/to/INDEX_refined_data.2020")

# From CSV
records = fetcher.from_csv("my_dataset.csv")

# Filter
refined = fetcher.filter(records, resolution_max=2.5, min_year=2015)

from_pdbbind_index `staticmethod` ¶

from_pdbbind_index(index_path: str, storage_options: dict | None = None) -> list[AffinityRecord]

Parse a PDBbind INDEX file (local or remote).

Expected format (space-separated; lines starting with # are comments):

PDB_code resolution release_year -logKd/Ki Kd/Ki/IC50=value reference ligand_name
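
As an illustration of the line format described above, the following sketch parses the first five columns of an index file. It is not the library's parser: `IndexEntry` and `parse_index` are hypothetical names, the sample lines are invented, and treating a non-numeric resolution field as missing is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexEntry:  # hypothetical stand-in, not the library's AffinityRecord
    pdb_id: str
    resolution: Optional[float]
    year: int
    affinity: float   # the -logKd/Ki column
    measurement: str  # the raw Kd/Ki/IC50=value column


def parse_index(text):
    """Parse index lines, skipping comments and blank lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 5:
            continue  # too short to hold the five leading columns
        try:
            resolution = float(fields[1])
        except ValueError:
            resolution = None  # assumption: non-numeric means "not available"
        entries.append(
            IndexEntry(fields[0].lower(), resolution, int(fields[2]), float(fields[3]), fields[4])
        )
    return entries


# Invented sample lines in the documented column order.
sample = """\
# PDB_code resolution release_year -logKd/Ki Kd/Ki/IC50=value reference ligand_name
1abc  2.00  2010  6.00  Kd=1uM  // 1abc.pdf (LIG)
2xyz  NMR   2012  9.00  Ki=1nM  // 2xyz.pdf (LIG)
"""
entries = parse_index(sample)
```

The reference and ligand-name columns are left unparsed here because they contain free text that does not split cleanly on whitespace.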

from_csv `staticmethod` ¶

from_csv(csv_path: str, pdb_col: str = 'pdb_id', affinity_col: str = 'affinity', resolution_col: str = 'resolution', sequence_col: str = 'sequence', delimiter: str = ',', storage_options: dict | None = None) -> list[AffinityRecord]

Load affinity records from a CSV file (local or remote).

At minimum needs columns for pdb_id and affinity.

filter `staticmethod` ¶

filter(records: list[AffinityRecord], resolution_max: float | None = None, min_year: int | None = None, pdb_ids: set[str] | None = None) -> list[AffinityRecord]

Filter records by resolution, year, or PDB ID whitelist.

to_label_dict `staticmethod` ¶

to_label_dict(records: list[AffinityRecord]) -> dict[str, float]

Convert records to {pdb_id: affinity} dict for dataset construction.
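
The filter-then-label flow described for `filter` and `to_label_dict` can be sketched with a stand-in record type. Everything here is hypothetical: `Record` is not the library's `AffinityRecord`, the values are invented, and dropping records whose resolution or year is missing when the corresponding filter is set is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:  # hypothetical stand-in for AffinityRecord
    pdb_id: str
    affinity: float
    resolution: Optional[float] = None
    year: Optional[int] = None


def filter_records(records, resolution_max=None, min_year=None, pdb_ids=None):
    """Keep records that pass every filter that was provided."""
    kept = []
    for r in records:
        if resolution_max is not None and (r.resolution is None or r.resolution > resolution_max):
            continue
        if min_year is not None and (r.year is None or r.year < min_year):
            continue
        if pdb_ids is not None and r.pdb_id not in pdb_ids:
            continue
        kept.append(r)
    return kept


def to_label_dict(records):
    """Map each PDB ID to its affinity value."""
    return {r.pdb_id: r.affinity for r in records}


records = [
    Record("1abc", 6.0, resolution=2.0, year=2010),  # too old
    Record("2xyz", 9.0, resolution=1.8, year=2018),  # passes both filters
    Record("3ghi", 7.5, resolution=3.1, year=2019),  # resolution too coarse
]
labels = to_label_dict(filter_records(records, resolution_max=2.5, min_year=2015))
```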

Download protein-ligand binding affinity datasets (e.g., PDBbind).

fetch¶

fetcher = AffinityFetcher(cache_dir="./affinity_cache")

data = fetcher.fetch(
    source="pdbbind",
    version="2020",
    split="refined",
)

Parameter	Type	Default	Description
`cache_dir`	`str \| Path`	`"~/.molfun/affinity"`	Local cache directory

fetch() Parameters¶

Parameter	Type	Default	Description
`source`	`str`	`"pdbbind"`	Data source name
`version`	`str`	`"2020"`	Dataset version
`split`	`str`	`"refined"`	Dataset split: `"general"`, `"refined"`, `"core"`

Returns: AffinityDataset ready for training.

MSAProvider¶

MSAProvider ¶

Generates or loads MSAs for protein sequences.

Usage::

# Pre-computed .a3m files
msa = MSAProvider("precomputed", msa_dir="msas/")
features = msa.get("MKFL...", "1abc")

# ColabFold server (no local DB)
msa = MSAProvider("colabfold")
features = msa.get("MKFL...", "1abc")

# Single-sequence dummy (fast prototyping)
msa = MSAProvider("single")
features = msa.get("MKFL...", "1abc")

get ¶

get(sequence: str, pdb_id: str) -> dict

Return MSA features for a sequence.

Returns a dict with:

- "msa": LongTensor [N, L], residue indices
- "deletion_matrix": FloatTensor [N, L], deletion counts
- "msa_mask": FloatTensor [N, L], 1 for valid positions

Generate or fetch multiple sequence alignments for protein sequences.

fetch¶

msa = MSAProvider(backend="colabfold", cache_dir="./msa_cache")

# Single sequence
alignment = msa.fetch("MKFLILLFNILCLFPVLAADNH...")

# Batch
alignments = msa.fetch_batch(
    ["MKFLILLFNILCLFPVLAADNH...", "MKTAYIAKQRQISFVKSH..."],
    num_workers=4,
)

Parameter	Type	Default	Description
`backend`	`str`	`"colabfold"`	MSA generation backend
`cache_dir`	`str \| Path`	`"~/.molfun/msa"`	Cache directory for alignments
`database`	`str`	`"uniref30"`	Sequence database to search

fetch() Parameters¶

Parameter	Type	Default	Description
`sequence`	`str`	required	Amino acid sequence
`max_seqs`	`int`	`512`	Maximum number of sequences in MSA

Returns: str -- path to the generated A3M file.
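
To show how the contents of an A3M file relate to the "msa", "deletion_matrix", and "msa_mask" features listed under `get`, here is a minimal pure-Python sketch. In A3M, lowercase letters are insertions relative to the query, so each aligned column records how many lowercase letters preceded it. The function name `a3m_to_features` and the residue ordering in `ALPHABET` are assumptions, and plain nested lists stand in for the tensors the library returns.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYVX-"  # assumed residue ordering, gap last
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def a3m_to_features(a3m_text):
    """Convert A3M text into [N, L] index, deletion-count and mask rows."""
    # Collect one sequence string per ">" header.
    sequences = []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            sequences.append("")
        elif sequences:
            sequences[-1] += line

    msa, deletion_matrix = [], []
    for seq in sequences:
        row, deletions, pending = [], [], 0
        for ch in seq:
            if ch.islower():
                pending += 1  # insertion: counted, not aligned
            else:
                row.append(INDEX.get(ch, INDEX["X"]))
                deletions.append(float(pending))
                pending = 0
        msa.append(row)
        deletion_matrix.append(deletions)

    msa_mask = [[1.0] * len(row) for row in msa]  # every position is valid
    return {"msa": msa, "deletion_matrix": deletion_matrix, "msa_mask": msa_mask}


# Tiny invented alignment: the hit has one insertion ("a") and one gap.
a3m = ">query\nMKF\n>hit1\nMaK-\n"
features = a3m_to_features(a3m)
```

Both rows have length 3, the query length, because the inserted residue in the hit contributes to the deletion count of the following column instead of adding a column.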
