Data Fetchers¶
Fetchers download biological data from external sources: PDB structures, binding affinity datasets, and multiple sequence alignments.
Quick Start¶
from molfun.data.sources import PDBFetcher, AffinityFetcher, MSAProvider
# Fetch PDB structures
fetcher = PDBFetcher()
structures = fetcher.fetch(["1ABC", "2DEF", "3GHI"])
# Fetch affinity data
affinity = AffinityFetcher()
data = affinity.fetch(source="pdbbind", version="2020")
# Generate MSAs
msa = MSAProvider(backend="colabfold")
alignment = msa.fetch("MKFLILLFNILCLFPVLAADNH...")
PDBFetcher¶
Download and cache PDB structures (mmCIF format) from RCSB.
Usage::
fetcher = PDBFetcher()
# By IDs
paths = fetcher.fetch(["1abc", "2xyz"])
# By Pfam family (e.g. protein kinases)
paths = fetcher.fetch_by_family("PF00069", max_structures=100)
# By EC number (e.g. EC 2.7: transferases of phosphorus-containing groups, which include kinases)
paths = fetcher.fetch_by_ec("2.7.*", max_structures=200)
# By GO term (e.g. protein kinase activity)
paths = fetcher.fetch_by_go("GO:0004672")
# By organism (e.g. human only)
paths = fetcher.fetch_by_taxonomy(9606, max_structures=100)
# By keyword (free-text search)
paths = fetcher.fetch_by_keyword("tyrosine kinase", max_structures=50)
# With metadata + deduplication
records = fetcher.fetch_with_metadata(["1abc", "2xyz"])
paths = fetcher.fetch_deduplicated(pdb_ids, identity=0.3)
__init__ ¶
__init__(cache_dir: str | None = None, fmt: str = 'cif', workers: int = 4, progress: bool = False, storage_options: dict | None = None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cache_dir | str \| None | Directory to cache downloaded files. Supports local paths or remote URIs (s3://, gs://). Default: ~/.molfun/pdb_cache | None |
| fmt | str | File format: "cif" (mmCIF, default) or "pdb". | 'cif' |
| workers | int | Default number of parallel download threads (1 = sequential). | 4 |
| progress | bool | Show tqdm progress bar by default (requires tqdm). | False |
| storage_options | dict \| None | fsspec options for remote cache_dir (e.g. | None |
fetch ¶
Download structures by PDB ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdb_ids | list[str] | PDB IDs to download. | required |
| workers | int \| None | Number of parallel download threads (1 = sequential). Defaults to the instance | None |
| progress | bool \| None | Show a | None |
Returns:
| Type | Description |
|---|---|
| list[str] | List of file paths (same order as input). Paths may be local or remote depending on cache_dir. Skips download if file already cached. |
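The caching and ordering behaviour described above can be illustrated with a minimal sketch. It is not molfun's implementation: `fetch_cached` and the injected `download` callable are hypothetical names, and the file naming scheme (`<id>.<fmt>`, lowercased) is an assumption made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def fetch_cached(
    pdb_ids: list[str],
    cache_dir: Path,
    download: Callable[[str], bytes],  # hypothetical: returns file bytes for one ID
    workers: int = 4,
    fmt: str = "cif",
) -> list[str]:
    """Return one cached file path per ID, in the same order as the input."""
    cache_dir.mkdir(parents=True, exist_ok=True)

    def fetch_one(pdb_id: str) -> str:
        path = cache_dir / f"{pdb_id.lower()}.{fmt}"
        if not path.exists():  # only download on a cache miss
            path.write_bytes(download(pdb_id))
        return str(path)

    if workers <= 1:
        # Sequential mode, matching "1 = sequential" in the parameter table.
        return [fetch_one(pdb_id) for pdb_id in pdb_ids]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(fetch_one, pdb_ids))
```

Passing the downloader in as a callable keeps the sketch independent of any HTTP client; a second call with the same IDs touches only the cache.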
fetch_by_family ¶
Fetch PDB structures belonging to a Pfam family via RCSB Search API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pfam_id | str | Pfam accession (e.g. "PF00069"). | required |
| max_structures | int | Maximum number of structures to retrieve. | 500 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |
Returns:
| Type | Description |
|---|---|
| list[Path] | List of local file paths. |
fetch_by_uniprot ¶
fetch_by_uniprot(uniprot_ids: list[str], max_per_accession: int = 50, resolution_max: float = 3.0) -> list[Path]
Fetch PDB structures mapped to UniProt accessions via RCSB Search API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uniprot_ids | list[str] | List of UniProt accessions (e.g. ["P12345"]). | required |
| max_per_accession | int | Max structures per UniProt ID. | 50 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |
Returns:
| Type | Description |
|---|---|
| list[Path] | List of local file paths (deduplicated). |
fetch_by_ec ¶
Fetch structures by Enzyme Commission (EC) number.
Supports wildcards: "2.7.*" matches every entry under EC 2.7 (transferases of phosphorus-containing groups, which include kinases).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ec_number | str | Full or partial EC number (e.g. | required |
| max_structures | int | Maximum number of structures. | 500 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |
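To make the wildcard semantics concrete, here is a small sketch of glob-style matching on EC numbers using the standard library. How `fetch_by_ec` translates a wildcard into an RCSB query is not shown in this page, so `matches_ec` is a hypothetical helper that only illustrates which EC numbers a pattern such as "2.7.*" would cover.

```python
from fnmatch import fnmatchcase


def matches_ec(pattern: str, ec_number: str) -> bool:
    """Glob-style match of an EC number against a pattern such as '2.7.*'."""
    return fnmatchcase(ec_number, pattern)


# Illustrative EC numbers: two under EC 2.7, one hydrolase under EC 3.4.
candidates = ["2.7.11.1", "2.7.10.2", "3.4.21.4"]
selected = [ec for ec in candidates if matches_ec("2.7.*", ec)]
```

Note that the pattern is anchored on the dots: "2.7.*" does not match an EC number beginning with "2.70.".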
fetch_by_go ¶
Fetch structures annotated with a Gene Ontology (GO) term.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| go_id | str | GO accession (e.g. | required |
| max_structures | int | Maximum number of structures. | 500 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |
fetch_by_taxonomy ¶
fetch_by_taxonomy(taxonomy_id: int, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]
Fetch structures from a specific organism via NCBI taxonomy ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| taxonomy_id | int | NCBI taxonomy ID (e.g. 9606 for Homo sapiens, 10090 for Mus musculus). | required |
| max_structures | int | Maximum number of structures. | 500 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |
fetch_by_keyword ¶
Free-text search over RCSB metadata (title, abstract, etc.).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| keyword | str | Search phrase (e.g. | required |
| max_structures | int | Maximum number of structures. | 500 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |
fetch_by_scop ¶
Fetch structures by SCOPe classification ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scop_id | str | SCOPe sunid or lineage string (e.g. | required |
| max_structures | int | Maximum number of structures. | 500 |
| resolution_max | float | Filter by resolution (Angstrom). | 3.0 |
fetch_combined ¶
fetch_combined(*, pfam_id: str | None = None, ec_number: str | None = None, go_id: str | None = None, taxonomy_id: int | None = None, keyword: str | None = None, uniprot_ids: list[str] | None = None, max_structures: int = 500, resolution_max: float = 3.0) -> list[str]
Fetch structures matching ALL provided criteria (AND logic).
Allows combining multiple filters in a single RCSB query for precise domain-specific datasets.
Example::
# Human protein kinases at ≤2.5 Å
paths = fetcher.fetch_combined(
pfam_id="PF00069",
taxonomy_id=9606,
resolution_max=2.5,
)
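The AND logic can be pictured as a single grouped query in which every supplied filter becomes one node. The sketch below builds such a payload in the shape used by the RCSB Search API (group node with `logical_operator`, terminal nodes with attribute, operator and value). The attribute names are my assumptions and should be checked against the RCSB search schema; `terminal` and `build_query` are hypothetical helpers, not molfun functions.

```python
import json
from typing import Optional


def terminal(attribute: str, operator: str, value) -> dict:
    """One filter node: a single attribute comparison."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def build_query(
    pfam_id: Optional[str] = None,
    taxonomy_id: Optional[int] = None,
    resolution_max: float = 3.0,
    max_structures: int = 500,
) -> dict:
    # Attribute names below are assumptions; verify them against the RCSB schema.
    nodes = [terminal("rcsb_entry_info.resolution_combined", "less_or_equal", resolution_max)]
    if pfam_id is not None:
        nodes.append(
            terminal("rcsb_polymer_entity_annotation.annotation_id", "exact_match", pfam_id)
        )
    if taxonomy_id is not None:
        nodes.append(
            terminal(
                "rcsb_entity_source_organism.taxonomy_lineage.id",
                "exact_match",
                str(taxonomy_id),
            )
        )
    return {
        # A group node with "and" requires every child node to match.
        "query": {"type": "group", "logical_operator": "and", "nodes": nodes},
        "return_type": "entry",
        "request_options": {"paginate": {"start": 0, "rows": max_structures}},
    }


payload = build_query(pfam_id="PF00069", taxonomy_id=9606, resolution_max=2.5, max_structures=100)
body = json.dumps(payload)  # JSON body that would be sent to the search endpoint
```

Filters left as `None` add no node, so the same builder covers any subset of criteria.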
fetch_with_metadata ¶
Download structures and enrich them with RCSB metadata.
Returns a list of StructureRecord with resolution, organism,
EC numbers, Pfam IDs, etc. populated via the RCSB GraphQL API.
search_ids ¶
search_ids(*, pfam_id: str | None = None, ec_number: str | None = None, go_id: str | None = None, taxonomy_id: int | None = None, keyword: str | None = None, max_results: int = 500, resolution_max: float = 3.0) -> list[str]
Like fetch_combined but returns PDB IDs without downloading.
Useful for deduplication pipelines where you want to filter IDs before committing to downloads.
fetch_deduplicated ¶
Download structures and remove redundancy by sequence clustering.
Uses MMseqs2 easy-cluster if available, otherwise falls back to
a simple hash-based greedy approach using RCSB sequence data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdb_ids | list[str] | PDB IDs to fetch. | required |
| identity | float | Sequence identity threshold for clustering (0-1). | 0.3 |
| coverage | float | Minimum coverage for clustering. | 0.8 |
Returns:
| Type | Description |
|---|---|
| list[str] | Paths to representative structures (one per cluster). |
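The fallback is described only as "hash-based greedy", so the following is a guess at the general idea rather than the library's code: hash each sequence and greedily keep the first ID seen for every distinct hash. This collapses exact duplicates only; honouring the `identity` and `coverage` thresholds requires real clustering such as MMseqs2. `dedupe_exact` is a hypothetical helper and the sequences are made up.

```python
import hashlib


def dedupe_exact(sequences: dict[str, str]) -> list[str]:
    """Keep the first ID for each distinct sequence (exact duplicates only)."""
    seen: set[str] = set()
    representatives: list[str] = []
    for pdb_id, sequence in sequences.items():
        digest = hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()
        if digest not in seen:  # greedy: first occurrence becomes the representative
            seen.add(digest)
            representatives.append(pdb_id)
    return representatives


# Made-up sequences for illustration; "2xyz" duplicates "1abc" apart from case.
example = {"1abc": "MKFLIL", "2xyz": "mkflil", "3ghi": "MKTAYI"}
reps = dedupe_exact(example)
```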
count ¶
count(*, pfam_id: str | None = None, ec_number: str | None = None, go_id: str | None = None, taxonomy_id: int | None = None, keyword: str | None = None, resolution_max: float = 3.0) -> int
Count how many structures match a query without downloading.
Uses the RCSB total_count field — a single cheap HTTP request.
Example::
n = fetcher.count(pfam_id="PF00069")
print(f"Kinases available: {n}")
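Both `count` and `search_ids` only need the search response, not the structure files. As a sketch, assuming a response shaped like the RCSB Search API output (a `total_count` field plus a `result_set` list of hits with an `identifier`), the two values can be extracted as follows. `parse_search_response` is a hypothetical helper and the payload is an illustrative sample, not real query output.

```python
def parse_search_response(payload: dict) -> tuple[int, list[str]]:
    """Return (total matches, IDs on this page) from a search response dict."""
    total = int(payload.get("total_count", 0))
    ids = [hit["identifier"] for hit in payload.get("result_set", [])]
    return total, ids


# Illustrative payload: the total can exceed the number of hits on one page.
sample = {
    "total_count": 3,
    "result_set": [{"identifier": "1ABC"}, {"identifier": "2XYZ"}],
}
total, ids = parse_search_response(sample)
```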
Download PDB/mmCIF files from the RCSB PDB.
fetch¶
fetcher = PDBFetcher(cache_dir="./pdb_cache", format="mmcif")
# Fetch by PDB IDs
structures = fetcher.fetch(["1ABC", "2DEF"])
# Fetch with filters
structures = fetcher.fetch(
ids=None,
resolution_max=2.5,
organism="Homo sapiens",
method="X-RAY DIFFRACTION",
max_results=100,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| cache_dir | str \| Path | "~/.molfun/pdb" | Local cache directory |
| format | str | "mmcif" | File format: "pdb" or "mmcif" |
fetch() Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| ids | list[str] \| None | None | PDB IDs to download |
| resolution_max | float \| None | None | Maximum resolution in Angstroms |
| organism | str \| None | None | Source organism filter |
| method | str \| None | None | Experimental method filter |
| max_results | int \| None | None | Limit number of results (for queries) |
Returns: list[Path] of downloaded file paths.
AffinityFetcher¶
Parse and serve binding affinity data.
Supports:
- PDBbind index files (v2016–v2020): provide the path to
INDEX_general_PL_data.{year} or the refined set index.
- CSV files with columns: pdb_id, affinity, [resolution, year, ...].
Usage::
fetcher = AffinityFetcher()
# From PDBbind index file
records = fetcher.from_pdbbind_index("path/to/INDEX_refined_data.2020")
# From CSV
records = fetcher.from_csv("my_dataset.csv")
# Filter
refined = fetcher.filter(records, resolution_max=2.5, min_year=2015)
from_pdbbind_index staticmethod ¶
Parse a PDBbind INDEX file (local or remote).
Expected format (space-separated, lines starting with # are comments): PDB_code resolution release_year -logKd/Ki Kd/Ki/IC50=value reference ligand_name
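A parser for this layout can be sketched in a few lines. This is an illustration of the stated format, not the library's parser: `IndexRow` and `parse_index` are hypothetical names, the sample line is made up, and treating a non-numeric resolution field as missing is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexRow:
    pdb_id: str
    resolution: Optional[float]
    year: int
    neg_log_affinity: float  # the -logKd/Ki column
    measurement: str         # raw "Kd=...", "Ki=..." or "IC50=..." token


def parse_index(text: str) -> list[IndexRow]:
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        fields = line.split()
        if len(fields) < 5:  # not a data line
            continue
        try:
            resolution = float(fields[1])
        except ValueError:
            resolution = None  # assumption: non-numeric resolution means "missing"
        rows.append(
            IndexRow(fields[0].lower(), resolution, int(fields[2]), float(fields[3]), fields[4])
        )
    return rows


# Made-up sample in the documented column order.
sample = """# PDB code, resolution, release year, -logKd/Ki, Kd/Ki, reference, ligand name
1abc  2.10  2010  6.50  Kd=316nM  // 1abc.pdf (LIG)
"""
rows = parse_index(sample)
```

Only the first five columns are parsed; the reference and ligand name are free-form and are left out of the sketch.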
from_csv staticmethod ¶
from_csv(csv_path: str, pdb_col: str = 'pdb_id', affinity_col: str = 'affinity', resolution_col: str = 'resolution', sequence_col: str = 'sequence', delimiter: str = ',', storage_options: dict | None = None) -> list[AffinityRecord]
Load affinity records from a CSV file (local or remote).
At minimum needs columns for pdb_id and affinity.
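For readers preparing their own CSV, the sketch below shows the minimal two-column layout and how it maps to a `{pdb_id: affinity}` dictionary using only the standard library. `load_labels` is a hypothetical helper that mirrors the `pdb_col`, `affinity_col` and `delimiter` parameters above; lowercasing the IDs is an assumption, and the values are made up.

```python
import csv
import io
from typing import TextIO


def load_labels(
    handle: TextIO,
    pdb_col: str = "pdb_id",
    affinity_col: str = "affinity",
    delimiter: str = ",",
) -> dict[str, float]:
    """Read a CSV and return {pdb_id: affinity}; extra columns are ignored."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    labels = {}
    for row in reader:
        # Assumption for this sketch: IDs are normalised to lowercase.
        labels[row[pdb_col].strip().lower()] = float(row[affinity_col])
    return labels


# Made-up data with the two required columns plus an optional one.
csv_text = "pdb_id,affinity,resolution\n1ABC,6.5,2.1\n2xyz,7.25,1.8\n"
labels = load_labels(io.StringIO(csv_text))
```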
filter staticmethod ¶
filter(records: list[AffinityRecord], resolution_max: float | None = None, min_year: int | None = None, pdb_ids: set[str] | None = None) -> list[AffinityRecord]
Filter records by resolution, year, or PDB ID whitelist.
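A filter of this kind might look like the sketch below. `Record` is a hypothetical stand-in for `AffinityRecord`, whose fields are not listed here. Two behaviours are assumptions of the sketch: the criteria are combined with AND, and a record with a missing value is dropped whenever the corresponding criterion is active.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:  # hypothetical stand-in for AffinityRecord
    pdb_id: str
    affinity: float
    resolution: Optional[float] = None
    year: Optional[int] = None


def filter_records(
    records: list[Record],
    resolution_max: Optional[float] = None,
    min_year: Optional[int] = None,
    pdb_ids: Optional[set[str]] = None,
) -> list[Record]:
    kept = []
    for rec in records:
        if resolution_max is not None and (rec.resolution is None or rec.resolution > resolution_max):
            continue  # too coarse, or resolution unknown
        if min_year is not None and (rec.year is None or rec.year < min_year):
            continue  # too old, or year unknown
        if pdb_ids is not None and rec.pdb_id not in pdb_ids:
            continue  # not on the whitelist
        kept.append(rec)
    return kept


# Made-up records for illustration.
records = [
    Record("1abc", 6.5, resolution=2.1, year=2016),
    Record("2xyz", 7.2, resolution=3.2, year=2018),
    Record("3ghi", 5.0, resolution=1.9, year=2009),
]
refined = filter_records(records, resolution_max=2.5, min_year=2015)
```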
to_label_dict staticmethod ¶
Convert records to {pdb_id: affinity} dict for dataset construction.
Download protein-ligand binding affinity datasets (e.g., PDBbind).
fetch¶
fetcher = AffinityFetcher(cache_dir="./affinity_cache")
data = fetcher.fetch(
source="pdbbind",
version="2020",
split="refined",
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| cache_dir | str \| Path | "~/.molfun/affinity" | Local cache directory |
fetch() Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| source | str | "pdbbind" | Data source name |
| version | str | "2020" | Dataset version |
| split | str | "refined" | Dataset split: "general", "refined", "core" |
Returns: AffinityDataset ready for training.
MSAProvider¶
Generates or loads MSAs for protein sequences.
Usage::
# Pre-computed .a3m files
msa = MSAProvider("precomputed", msa_dir="msas/")
features = msa.get("MKFL...", "1abc")
# ColabFold server (no local DB)
msa = MSAProvider("colabfold")
features = msa.get("MKFL...", "1abc")
# Single-sequence dummy (fast prototyping)
msa = MSAProvider("single")
features = msa.get("MKFL...", "1abc")
get ¶
Return MSA features for a sequence.
Returns a dict with:
- "msa": LongTensor [N, L], residue indices
- "deletion_matrix": FloatTensor [N, L], deletion counts
- "msa_mask": FloatTensor [N, L], 1 for valid positions
Generate or fetch multiple sequence alignments for protein sequences.
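To show how an A3M alignment relates to the "msa" and "deletion_matrix" features, here is a sketch of the conventional A3M reading rule: lowercase letters are insertions relative to the query, so they are dropped from the aligned row and counted against the next aligned column. `parse_a3m` is a hypothetical helper returning plain Python lists rather than tensors, and the two-sequence alignment is made up.

```python
def parse_a3m(text: str) -> tuple[list[str], list[list[int]]]:
    """Return (aligned rows, per-column deletion counts) from A3M text."""
    sequences: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            sequences.append("")       # header starts a new record
        elif sequences:
            sequences[-1] += line      # sequence data may span several lines

    aligned, deletions = [], []
    for seq in sequences:
        row, counts, pending = [], [], 0
        for ch in seq:
            if ch.islower():
                pending += 1           # insertion: not an aligned column
            else:
                row.append(ch)         # uppercase residue or gap "-"
                counts.append(pending) # insertions seen just before this column
                pending = 0
        aligned.append("".join(row))
        deletions.append(counts)
    return aligned, deletions


# Made-up alignment: the hit has one inserted residue ("k") and one gap.
a3m_text = ">query\nMKFL\n>hit\nMkK-L\n"
aligned, deletions = parse_a3m(a3m_text)
```

After parsing, every aligned row has the query length L, which is what makes the [N, L] feature shapes possible.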
fetch¶
msa = MSAProvider(backend="colabfold", cache_dir="./msa_cache")
# Single sequence
alignment = msa.fetch("MKFLILLFNILCLFPVLAADNH...")
# Batch
alignments = msa.fetch_batch(
["MKFLILLFNILCLFPVLAADNH...", "MKTAYIAKQRQISFVKSH..."],
num_workers=4,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| backend | str | "colabfold" | MSA generation backend |
| cache_dir | str \| Path | "~/.molfun/msa" | Cache directory for alignments |
| database | str | "uniref30" | Sequence database to search |
fetch() Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| sequence | str | required | Amino acid sequence |
| max_seqs | int | 512 | Maximum number of sequences in MSA |
Returns: str -- path to the generated A3M file.