Data Fetchers

Fetchers download biological data from external sources: PDB structures, binding affinity datasets, and multiple sequence alignments.

Quick Start

from molfun.data.sources import PDBFetcher, AffinityFetcher, MSAProvider

# Fetch PDB structures
fetcher = PDBFetcher()
structures = fetcher.fetch(["1ABC", "2DEF", "3GHI"])

# Fetch affinity data
affinity = AffinityFetcher()
data = affinity.fetch(source="pdbbind", version="2020")

# Generate MSAs
msa = MSAProvider(backend="colabfold")
alignment = msa.fetch("MKFLILLFNILCLFPVLAADNH...")

PDBFetcher

Download and cache PDB structures (mmCIF format) from RCSB.

Usage::

fetcher = PDBFetcher()

# By IDs
paths = fetcher.fetch(["1abc", "2xyz"])

# By Pfam family (e.g. protein kinases)
paths = fetcher.fetch_by_family("PF00069", max_structures=100)

# By EC number (e.g. EC 2.7, transferases of phosphorus-containing groups, which include kinases)
paths = fetcher.fetch_by_ec("2.7.*", max_structures=200)

# By GO term (e.g. protein kinase activity)
paths = fetcher.fetch_by_go("GO:0004672")

# By organism (e.g. human only)
paths = fetcher.fetch_by_taxonomy(9606, max_structures=100)

# By keyword (free-text search)
paths = fetcher.fetch_by_keyword("tyrosine kinase", max_structures=50)

# With metadata + deduplication
pdb_ids = ["1abc", "2xyz"]
records = fetcher.fetch_with_metadata(pdb_ids)
paths = fetcher.fetch_deduplicated(pdb_ids, identity=0.3)

__init__

__init__(cache_dir: str | None = None, fmt: str = 'cif', workers: int = 4, progress: bool = False, storage_options: dict | None = None)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| cache_dir | str \| None | Directory to cache downloaded files. Supports local paths or remote URIs (s3://, gs://). Default: ~/.molfun/pdb_cache | None |
| fmt | str | File format: "cif" (mmCIF, default) or "pdb". | 'cif' |
| workers | int | Default number of parallel download threads (1 = sequential). | 4 |
| progress | bool | Show tqdm progress bar by default (requires tqdm). | False |
| storage_options | dict \| None | fsspec options for remote cache_dir (e.g. {"endpoint_url": "http://localhost:9000"} for MinIO). | None |

fetch

fetch(pdb_ids: list[str], workers: int | None = None, progress: bool | None = None) -> list[str]

Download structures by PDB ID.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pdb_ids | list[str] | PDB IDs to download. | required |
| workers | int \| None | Number of parallel download threads (1 = sequential). Defaults to the instance workers setting (4). RCSB handles 4-8 concurrent connections well. | None |
| progress | bool \| None | Show a tqdm progress bar (requires tqdm installed). Defaults to the instance progress setting. | None |

Returns:

| Type | Description |
| --- | --- |
| list[str] | List of file paths (same order as input). Paths may be local or remote depending on cache_dir. Skips download if file already cached. |
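
The behaviour described above (parallel threads, cache skip, results in input order) can be illustrated with a minimal sketch. This is not the library's implementation: `fetch_cached`, the `download` callable, and the lowercase `<id>.<fmt>` file naming are all assumptions made for the example, and only a local cache directory is handled.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def fetch_cached(
    pdb_ids: list[str],
    cache_dir: Path,
    download: Callable[[str], bytes],  # hypothetical: returns the file contents for one ID
    fmt: str = "cif",
    workers: int = 4,
) -> list[str]:
    """Return one cached file path per ID, in the same order as the input."""
    cache_dir.mkdir(parents=True, exist_ok=True)

    def fetch_one(pdb_id: str) -> str:
        # Assumed naming scheme: lowercase ID plus format extension.
        path = cache_dir / f"{pdb_id.lower()}.{fmt}"
        if not path.exists():  # already cached: skip the download
            path.write_bytes(download(pdb_id))
        return str(path)

    # Executor.map yields results in input order, whatever order the threads finish in.
    with ThreadPoolExecutor(max_workers=max(1, workers)) as pool:
        return list(pool.map(fetch_one, pdb_ids))
```

Passing the downloader in as a callable keeps the sketch testable offline; a real downloader would issue an HTTP request to RCSB.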

fetch_by_family

fetch_by_family(pfam_id: str, max_structures: int = 500, resolution_max: float = 3.0) -> list[Path]

Fetch PDB structures belonging to a Pfam family via RCSB Search API.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pfam_id | str | Pfam accession (e.g. "PF00069"). | required |
| max_structures | int | Maximum number of structures to retrieve. | 500 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |

Returns:

| Type | Description |
| --- | --- |
| list[Path] | List of local file paths. |

fetch_by_uniprot

fetch_by_uniprot(uniprot_ids: list[str], max_per_accession: int = 50, resolution_max: float = 3.0) -> list[Path]

Fetch PDB structures mapped to UniProt accessions via RCSB Search API.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| uniprot_ids | list[str] | List of UniProt accessions (e.g. ["P12345"]). | required |
| max_per_accession | int | Max structures per UniProt ID. | 50 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |

Returns:

| Type | Description |
| --- | --- |
| list[Path] | List of local file paths (deduplicated). |

fetch_by_ec

fetch_by_ec(ec_number: str, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]

Fetch structures by Enzyme Commission (EC) number.

Supports wildcards: "2.7.*" matches every entry under EC 2.7 (transferases of phosphorus-containing groups, which include kinases).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ec_number | str | Full or partial EC number (e.g. "2.7.11.1" or "2.7.*"). | required |
| max_structures | int | Maximum number of structures. | 500 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |
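
How the wildcard is translated into an RCSB query is not documented here. The sketch below only illustrates the matching semantics on the client side, level by level, so that "2.7.*" matches "2.7.11.1" but not "2.70.1.1". The function name `ec_matches` is hypothetical.

```python
def ec_matches(pattern: str, ec_number: str) -> bool:
    """True if ec_number falls under pattern; '*' matches all remaining levels."""
    want = pattern.split(".")
    have = ec_number.split(".")
    for i, level in enumerate(want):
        if level == "*":
            return True  # everything below this level is accepted
        if i >= len(have) or have[i] != level:
            return False
    # No wildcard: the two numbers must have the same depth.
    return len(have) == len(want)
```

Comparing dot-separated levels rather than raw string prefixes avoids treating class 2.70 as part of class 2.7.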

fetch_by_go

fetch_by_go(go_id: str, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]

Fetch structures annotated with a Gene Ontology (GO) term.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| go_id | str | GO accession (e.g. "GO:0004672" for protein kinase activity). | required |
| max_structures | int | Maximum number of structures. | 500 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |

fetch_by_taxonomy

fetch_by_taxonomy(taxonomy_id: int, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]

Fetch structures from a specific organism via NCBI taxonomy ID.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| taxonomy_id | int | NCBI taxonomy ID (e.g. 9606 for Homo sapiens, 10090 for Mus musculus). | required |
| max_structures | int | Maximum number of structures. | 500 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |

fetch_by_keyword

fetch_by_keyword(keyword: str, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]

Free-text search over RCSB metadata (title, abstract, etc.).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| keyword | str | Search phrase (e.g. "tyrosine kinase"). | required |
| max_structures | int | Maximum number of structures. | 500 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |

fetch_by_scop

fetch_by_scop(scop_id: str, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]

Fetch structures by SCOPe classification ID.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| scop_id | str | SCOPe sunid or lineage string (e.g. "b.1.1.1"). | required |
| max_structures | int | Maximum number of structures. | 500 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |

fetch_combined

fetch_combined(*, pfam_id: str | None = None, ec_number: str | None = None, go_id: str | None = None, taxonomy_id: int | None = None, keyword: str | None = None, uniprot_ids: list[str] | None = None, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]

Fetch structures matching ALL provided criteria (AND logic).

Allows combining multiple filters in a single RCSB query for precise domain-specific datasets.

Example::

# Human protein kinases at ≤2.5 Å
paths = fetcher.fetch_combined(
    pfam_id="PF00069",
    taxonomy_id=9606,
    resolution_max=2.5,
)
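
To make the AND logic concrete, here is a rough sketch of how such a request body could be assembled for the RCSB Search API, with each filter as a terminal node inside one "and" group. The helper names (`terminal`, `build_and_query`) are hypothetical, and the attribute names are recalled from the RCSB search schema rather than taken from this library, so verify them against the RCSB documentation before relying on them.

```python
import json
from typing import Optional


def terminal(attribute: str, operator: str, value) -> dict:
    """One attribute filter in the Search API's 'terminal' node form."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def build_and_query(
    *,
    pfam_id: Optional[str] = None,
    taxonomy_id: Optional[int] = None,
    resolution_max: float = 3.0,
    max_structures: int = 500,
) -> dict:
    # The resolution filter is always present; other filters are added only if given.
    nodes = [terminal("rcsb_entry_info.resolution_combined", "less_or_equal", resolution_max)]
    if pfam_id is not None:
        # Assumed attribute for Pfam annotations.
        nodes.append(terminal("rcsb_polymer_entity_annotation.annotation_id", "exact_match", pfam_id))
    if taxonomy_id is not None:
        # Assumed attribute for the source-organism taxonomy lineage.
        nodes.append(
            terminal("rcsb_entity_source_organism.taxonomy_lineage.id", "exact_match", str(taxonomy_id))
        )
    return {
        "query": {"type": "group", "logical_operator": "and", "nodes": nodes},
        "return_type": "entry",
        "request_options": {"paginate": {"start": 0, "rows": max_structures}},
    }


query = build_and_query(pfam_id="PF00069", taxonomy_id=9606, resolution_max=2.5)
print(json.dumps(query, indent=2))
```

Because every criterion is a sibling node under one "and" group, adding a filter can only narrow the result set.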

fetch_with_metadata

fetch_with_metadata(pdb_ids: list[str]) -> list[StructureRecord]

Download structures and enrich them with RCSB metadata.

Returns a list of StructureRecord with resolution, organism, EC numbers, Pfam IDs, etc. populated via the RCSB GraphQL API.

search_ids

search_ids(*, pfam_id: str | None = None, ec_number: str | None = None, go_id: str | None = None, taxonomy_id: int | None = None, keyword: str | None = None, max_results: int = 500, resolution_max: float = 3.0) -> list[str]

Like fetch_combined but returns PDB IDs without downloading.

Useful for deduplication pipelines where you want to filter IDs before committing to downloads.

fetch_deduplicated

fetch_deduplicated(pdb_ids: list[str], identity: float = 0.3, coverage: float = 0.8) -> list[str]

Download structures and remove redundancy by sequence clustering.

Uses MMseqs2 easy-cluster if available, otherwise falls back to a simple hash-based greedy approach using RCSB sequence data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pdb_ids | list[str] | PDB IDs to fetch. | required |
| identity | float | Sequence identity threshold for clustering (0-1). | 0.3 |
| coverage | float | Minimum coverage for clustering. | 0.8 |

Returns:

| Type | Description |
| --- | --- |
| list[str] | Paths to representative structures (one per cluster). |
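
The fallback is only described as "hash-based greedy", so the following is one plausible reading, not the library's code: walk the sequences in order and keep the first PDB ID seen for each distinct sequence hash. The function name and the input mapping are hypothetical. Note that such a pass collapses exact duplicates only, which is far coarser than MMseqs2 clustering at identity=0.3.

```python
import hashlib


def greedy_representatives(sequences: dict[str, str]) -> list[str]:
    """Keep the first PDB ID encountered for each distinct sequence."""
    seen: set[str] = set()
    representatives: list[str] = []
    for pdb_id, seq in sequences.items():  # dicts preserve insertion order
        digest = hashlib.sha256(seq.upper().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            representatives.append(pdb_id)
    return representatives
```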

list_cached

list_cached() -> list[str]

Return all cached structure files.

clear_cache

clear_cache() -> int

Remove all cached files. Returns number of files removed.

count

count(*, pfam_id: str | None = None, ec_number: str | None = None, go_id: str | None = None, taxonomy_id: int | None = None, keyword: str | None = None, resolution_max: float = 3.0) -> int

Count how many structures match a query without downloading.

Uses the RCSB total_count field — a single cheap HTTP request.

Example::

n = fetcher.count(pfam_id="PF00069")
print(f"Kinases available: {n}")

Download PDB/mmCIF files from the RCSB PDB.

fetch

fetcher = PDBFetcher(cache_dir="./pdb_cache", format="mmcif")

# Fetch by PDB IDs
structures = fetcher.fetch(["1ABC", "2DEF"])

# Fetch with filters
structures = fetcher.fetch(
    ids=None,
    resolution_max=2.5,
    organism="Homo sapiens",
    method="X-RAY DIFFRACTION",
    max_results=100,
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| cache_dir | str \| Path | "~/.molfun/pdb" | Local cache directory |
| format | str | "mmcif" | File format: "pdb" or "mmcif" |

fetch() Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| ids | list[str] \| None | None | PDB IDs to download |
| resolution_max | float \| None | None | Maximum resolution in Angstroms |
| organism | str \| None | None | Source organism filter |
| method | str \| None | None | Experimental method filter |
| max_results | int \| None | None | Limit number of results (for queries) |

Returns: list[Path] of downloaded file paths.


AffinityFetcher

Parse and serve binding affinity data.

Supports:

- PDBbind index files (v2016–v2020): provide the path to INDEX_general_PL_data.{year} or the refined set index.
- CSV files with columns: pdb_id, affinity, [resolution, year, ...].

Usage::

fetcher = AffinityFetcher()

# From PDBbind index file
records = fetcher.from_pdbbind_index("path/to/INDEX_refined_data.2020")

# From CSV
records = fetcher.from_csv("my_dataset.csv")

# Filter
refined = fetcher.filter(records, resolution_max=2.5, min_year=2015)

from_pdbbind_index staticmethod

from_pdbbind_index(index_path: str, storage_options: dict | None = None) -> list[AffinityRecord]

Parse a PDBbind INDEX file (local or remote).

Expected format (space-separated, lines starting with # are comments):

PDB_code resolution release_year -logKd/Ki Kd/Ki/IC50=value reference ligand_name
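
As an illustration of that layout, a single index line can be split on whitespace and the first five fields converted. This is a minimal sketch, not the library's parser: `IndexEntry` is a hypothetical stand-in for AffinityRecord, and treating a non-numeric resolution (for example an NMR entry) as missing is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexEntry:  # hypothetical stand-in for AffinityRecord
    pdb_id: str
    resolution: Optional[float]
    year: int
    affinity: float  # the -logKd/Ki column
    measurement: str  # raw "Kd=...", "Ki=..." or "IC50=..." token


def parse_index_line(line: str) -> Optional[IndexEntry]:
    """Parse one index line; return None for comments, blanks, and short lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 5:
        return None
    try:
        resolution = float(fields[1])
    except ValueError:
        resolution = None  # assumption: non-numeric values mean no resolution
    return IndexEntry(fields[0].lower(), resolution, int(fields[2]), float(fields[3]), fields[4])
```

The reference and ligand name columns are ignored here because their spacing varies; a fuller parser would need to handle them separately.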

from_csv staticmethod

from_csv(csv_path: str, pdb_col: str = 'pdb_id', affinity_col: str = 'affinity', resolution_col: str = 'resolution', sequence_col: str = 'sequence', delimiter: str = ',', storage_options: dict | None = None) -> list[AffinityRecord]

Load affinity records from a CSV file (local or remote).

At minimum, the file needs a PDB ID column and an affinity column.
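
The column-mapping idea can be sketched with the standard csv module. This is an illustration under stated assumptions, not the library's loader: `load_affinity_csv` is a hypothetical name, it reads from a string rather than a path, it returns plain dicts instead of AffinityRecord objects, and lowercasing the PDB ID is a choice made for the example.

```python
import csv
import io
from typing import Optional


def load_affinity_csv(
    text: str,
    pdb_col: str = "pdb_id",
    affinity_col: str = "affinity",
    resolution_col: str = "resolution",
    delimiter: str = ",",
) -> list[dict]:
    """Read affinity rows; only the PDB ID and affinity columns are required."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        # The resolution column may be absent or empty; treat both as missing.
        raw_resolution = (row.get(resolution_col) or "").strip()
        resolution: Optional[float] = float(raw_resolution) if raw_resolution else None
        rows.append(
            {
                "pdb_id": row[pdb_col].strip().lower(),
                "affinity": float(row[affinity_col]),
                "resolution": resolution,
            }
        )
    return rows
```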

filter staticmethod

filter(records: list[AffinityRecord], resolution_max: float | None = None, min_year: int | None = None, pdb_ids: set[str] | None = None) -> list[AffinityRecord]

Filter records by resolution, year, or PDB ID whitelist.

to_label_dict staticmethod

to_label_dict(records: list[AffinityRecord]) -> dict[str, float]

Convert records to {pdb_id: affinity} dict for dataset construction.
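
The filter and label-dict steps above can be sketched together. `Record` is a hypothetical stand-in for AffinityRecord, and dropping records whose resolution or year is missing when that filter is active is an assumption; the library may treat missing values differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:  # hypothetical stand-in for AffinityRecord
    pdb_id: str
    affinity: float
    resolution: Optional[float] = None
    year: Optional[int] = None


def filter_records(records, resolution_max=None, min_year=None, pdb_ids=None):
    """Keep records that pass every filter that was actually supplied."""
    kept = []
    for r in records:
        if resolution_max is not None and (r.resolution is None or r.resolution > resolution_max):
            continue
        if min_year is not None and (r.year is None or r.year < min_year):
            continue
        if pdb_ids is not None and r.pdb_id not in pdb_ids:
            continue
        kept.append(r)
    return kept


def to_label_dict(records):
    """Map each PDB ID to its affinity value."""
    return {r.pdb_id: r.affinity for r in records}
```

Chaining the two gives a label mapping restricted to the records that survived filtering, for example `to_label_dict(filter_records(records, resolution_max=2.5))`.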

Download protein-ligand binding affinity datasets (e.g., PDBbind).

fetch

fetcher = AffinityFetcher(cache_dir="./affinity_cache")

data = fetcher.fetch(
    source="pdbbind",
    version="2020",
    split="refined",
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| cache_dir | str \| Path | "~/.molfun/affinity" | Local cache directory |

fetch() Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| source | str | "pdbbind" | Data source name |
| version | str | "2020" | Dataset version |
| split | str | "refined" | Dataset split: "general", "refined", "core" |

Returns: AffinityDataset ready for training.


MSAProvider

Generates or loads MSAs for protein sequences.

Usage::

# Pre-computed .a3m files
msa = MSAProvider("precomputed", msa_dir="msas/")
features = msa.get("MKFL...", "1abc")

# ColabFold server (no local DB)
msa = MSAProvider("colabfold")
features = msa.get("MKFL...", "1abc")

# Single-sequence dummy (fast prototyping)
msa = MSAProvider("single")
features = msa.get("MKFL...", "1abc")

get

get(sequence: str, pdb_id: str) -> dict

Return MSA features for a sequence.

Returns a dict with:

- "msa": LongTensor [N, L], residue indices
- "deletion_matrix": FloatTensor [N, L], deletion counts
- "msa_mask": FloatTensor [N, L], 1 for valid positions

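To show what the three entries of that dict mean, here is a dependency-free sketch that turns A3M text into the same three [N, L] arrays, using nested lists instead of tensors. It is not the library's featurizer: `parse_a3m` and the residue ordering in `ALPHABET` are assumptions. In A3M, lowercase letters are insertions relative to the query, so they are counted into the deletion matrix rather than given a column.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYVX-"  # hypothetical residue-to-index ordering
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def parse_a3m(text: str) -> dict:
    """Convert A3M text into msa, deletion_matrix and msa_mask as nested lists."""
    # Collect one sequence string per '>' header, joining wrapped lines.
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))

    msa, deletion_matrix = [], []
    for seq in sequences:
        row, deletions, pending = [], [], 0
        for ch in seq:
            if ch.islower():
                pending += 1  # insertion: no column of its own
            else:
                row.append(INDEX.get(ch, INDEX["X"]))  # unknown letters map to X
                deletions.append(float(pending))  # insertions seen just before this column
                pending = 0
        msa.append(row)
        deletion_matrix.append(deletions)

    msa_mask = [[1.0] * len(row) for row in msa]
    return {"msa": msa, "deletion_matrix": deletion_matrix, "msa_mask": msa_mask}
```

Wrapping the lists with `torch.tensor(..., dtype=torch.long)` for "msa" and a float dtype for the other two would give the documented tensor types.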
Generate or fetch multiple sequence alignments for protein sequences.

fetch

msa = MSAProvider(backend="colabfold", cache_dir="./msa_cache")

# Single sequence
alignment = msa.fetch("MKFLILLFNILCLFPVLAADNH...")

# Batch
alignments = msa.fetch_batch(
    ["MKFLILLFNILCLFPVLAADNH...", "MKTAYIAKQRQISFVKSH..."],
    num_workers=4,
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| backend | str | "colabfold" | MSA generation backend |
| cache_dir | str \| Path | "~/.molfun/msa" | Cache directory for alignments |
| database | str | "uniref30" | Sequence database to search |

fetch() Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sequence | str | required | Amino acid sequence |
| max_seqs | int | 512 | Maximum number of sequences in MSA |

Returns: str -- path to the generated A3M file.