Parsers¶
File parsers for common structural biology and cheminformatics formats. Each parser reads a file and returns a structured Python object.
Quick Start¶
from molfun.data.parsers import PDBParser, CIFParser, A3MParser
# Parse a PDB file
parser = PDBParser()
structure = parser.parse("protein.pdb")
# Parse an mmCIF file
parser = CIFParser()
structure = parser.parse("protein.cif")
# Parse an MSA
parser = A3MParser()
msa = parser.parse("alignment.a3m")
PDBParser¶
Bases: BaseStructureParser
Parse PDB format files without BioPython.
Handles standard ATOM records (columns are fixed-width per PDB spec). For full mmCIF support, use CIFParser instead.
Usage::
parser = PDBParser(max_seq_len=512)
structure = parser.parse_file("1abc.pdb")
features = structure.to_dict() # ready for dataset
Parse PDB format files into structured data.
from molfun.data.parsers import PDBParser
parser = PDBParser()
structure = parser.parse("protein.pdb")
print(structure.sequence) # "MKFLILLFNILCLFPVLAADNH..."
print(structure.positions.shape) # (N_residues, 37, 3)
print(structure.resolution) # 2.1
| Parameter | Type | Default | Description |
|---|---|---|---|
| `include_hetatm` | `bool` | `False` | Include HETATM records |
| `model_idx` | `int` | `0` | NMR model index to use |
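The fixed-width layout of ATOM records mentioned above can be illustrated with a minimal standalone sketch. This is not molfun's implementation: `AtomRecord`, `parse_atom_line`, and `iter_atoms` are hypothetical names, and the sketch ignores MODEL/ENDMDL blocks, alternate locations, and occupancy. The column slices follow the published PDB coordinate-record layout.

```python
from typing import Iterator, NamedTuple


class AtomRecord(NamedTuple):
    """Subset of the fields carried by one ATOM/HETATM line (hypothetical type)."""
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> AtomRecord:
    """Slice one coordinate line by column position rather than by whitespace."""
    return AtomRecord(
        serial=int(line[6:11]),       # columns 7-11
        name=line[12:16].strip(),     # columns 13-16
        res_name=line[17:20].strip(), # columns 18-20
        chain_id=line[21],            # column 22
        res_seq=int(line[22:26]),     # columns 23-26
        x=float(line[30:38]),         # columns 31-38
        y=float(line[38:46]),         # columns 39-46
        z=float(line[46:54]),         # columns 47-54
    )


def iter_atoms(pdb_text: str, include_hetatm: bool = False) -> Iterator[AtomRecord]:
    """Yield ATOM records, and HETATM records only when requested."""
    wanted = ("ATOM  ", "HETATM") if include_hetatm else ("ATOM  ",)
    for line in pdb_text.splitlines():
        if line.startswith(wanted):
            yield parse_atom_line(line)
```

Slicing by column matters because adjacent fields in real PDB files can touch without a separating space (for example, large negative coordinates), so splitting on whitespace is unreliable.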
CIFParser¶
Bases: BaseStructureParser
Parse mmCIF files via BioPython.
Falls back to PDBParser-style parsing for .pdb files if BioPython is available.
Usage::
parser = CIFParser(max_seq_len=512)
structure = parser.parse_file("1abc.cif")
features = structure.to_dict()
parse_text ¶
Parse mmCIF text (writes to temp file for BioPython).
Parse mmCIF format files (PDBx/mmCIF).
from molfun.data.parsers import CIFParser
parser = CIFParser()
structure = parser.parse("protein.cif")
| Parameter | Type | Default | Description |
|---|---|---|---|
| `include_hetatm` | `bool` | `False` | Include non-polymer entities |
| `auth_chains` | `bool` | `True` | Use author chain IDs (vs label chain IDs) |
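The `parse_text` note above says the mmCIF text is written to a temporary file so that a path-based parser can read it. A minimal sketch of that pattern follows; `parse_text_via_tempfile` and its `parse_path` argument are hypothetical names, not part of molfun, and the actual implementation may differ.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def parse_text_via_tempfile(
    text: str,
    parse_path: Callable[[str], T],
    suffix: str = ".cif",
) -> T:
    """Write text to a temporary file, hand its path to a parser, then clean up."""
    # delete=False lets the parser reopen the file by name on every platform;
    # the file is removed explicitly in the finally block instead.
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as handle:
        handle.write(text)
        path = handle.name
    try:
        return parse_path(path)
    finally:
        os.remove(path)
```

With BioPython installed, `parse_path` could for instance be `lambda p: MMCIFParser(QUIET=True).get_structure("query", p)` using `Bio.PDB.MMCIFParser`; any other callable that accepts a file path works the same way.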
A3MParser¶
Bases: BaseAlignmentParser
Parse A3M multiple sequence alignments → MSA tensors.
Usage::
parser = A3MParser(max_depth=512)
alignment = parser.parse_file("msas/1abc.a3m")
msa_dict = alignment.to_dict() # {msa, deletion_matrix, msa_mask}
Parse A3M format multiple sequence alignments.
from molfun.data.parsers import A3MParser
parser = A3MParser()
msa = parser.parse("alignment.a3m")
print(msa.sequences) # list of aligned sequences
print(msa.descriptions) # list of sequence headers
print(len(msa)) # number of sequences
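To make the `msa` and `deletion_matrix` outputs more concrete, here is a minimal standalone sketch of how A3M rows are commonly converted. It assumes the usual A3M convention that lowercase letters are insertions relative to the query, and it treats the deletion matrix as the number of inserted residues preceding each aligned column. `parse_a3m_text` is a hypothetical function and returns plain lists, not the tensors produced by A3MParser.

```python
from typing import Dict, List, Optional


def parse_a3m_text(text: str, max_depth: Optional[int] = None) -> Dict[str, list]:
    """Split A3M text into aligned rows plus per-column insertion counts."""
    descriptions: List[str] = []
    raw: List[List[str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            descriptions.append(line[1:])
            raw.append([])
        elif raw:
            raw[-1].append(line)  # sequences may wrap over several lines

    sequences: List[str] = []
    deletion_matrix: List[List[int]] = []
    for parts in raw[:max_depth]:  # max_depth=None keeps every row
        row: List[str] = []
        counts: List[int] = []
        pending = 0
        for ch in "".join(parts):
            if ch.islower():
                pending += 1       # insertion: counted, not kept as a column
            elif ch == ".":
                continue           # assumption: '.' only pads insert columns
            else:
                row.append(ch)     # uppercase residue or '-' gap
                counts.append(pending)
                pending = 0
        sequences.append("".join(row))
        deletion_matrix.append(counts)

    return {
        "descriptions": descriptions[:max_depth],
        "sequences": sequences,
        "deletion_matrix": deletion_matrix,
    }
```

After this step every row has the same length as the query, which is what allows the alignment to be stacked into a rectangular array.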
FASTAParser¶
Bases: BaseAlignmentParser
Parse FASTA sequences → single-row ParsedAlignment.
Multi-sequence FASTA is treated as a pre-aligned MSA: all sequences must have the same length after parsing, so shorter sequences are padded with gaps if needed.
Usage::
parser = FASTAParser()
alignment = parser.parse_text(">query\nMKFLAGHRT")
Parse FASTA format sequence files.
from molfun.data.parsers import FASTAParser
parser = FASTAParser()
entries = parser.parse("sequences.fasta")
for entry in entries:
    print(entry.header, entry.sequence)
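The gap-padding behaviour described for multi-sequence FASTA can be sketched as follows. This is an illustration under assumptions, not FASTAParser's code: `read_fasta` and `pad_rows` are hypothetical helpers, and right-padding with `-` is an assumed choice of where the gaps go.

```python
from typing import List, Tuple


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs; wrapped sequence lines are joined."""
    entries: List[Tuple[str, List[str]]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            entries.append((line[1:], []))
        elif entries:
            entries[-1][1].append(line)
    return [(header, "".join(parts)) for header, parts in entries]


def pad_rows(sequences: List[str], gap: str = "-") -> List[str]:
    """Right-pad every sequence with gap characters to the longest length."""
    width = max((len(seq) for seq in sequences), default=0)
    return [seq.ljust(width, gap) for seq in sequences]
```

For example, `pad_rows([seq for _, seq in read_fasta(text)])` yields equal-length rows that can be stacked like an MSA.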
SDFParser¶
Bases: BaseLigandParser
Parse SDF/MOL V2000 format files.
Handles multi-molecule SDF files (separated by $$$$). Extracts atoms, bonds, 3D coordinates, and SD property fields.
Usage::
parser = SDFParser()
molecules = parser.parse_file("ligands.sdf")
for mol in molecules:
    print(mol.name, mol.num_atoms, mol.coords.shape)
Parse SDF (Structure-Data File) format for small molecules.
from molfun.data.parsers import SDFParser
parser = SDFParser()
molecules = parser.parse("ligands.sdf")
for mol in molecules:
    print(mol.name, mol.num_atoms, mol.coordinates.shape)
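As an illustration of the record structure described above (molecules separated by `$$$$`, a V2000 counts line, atom and bond blocks, then SD property fields), here is a minimal standalone sketch. `split_sdf_records` and `parse_v2000_record` are hypothetical helpers, not SDFParser's API; the sketch handles only well-formed V2000 records and returns plain dictionaries.

```python
from typing import Dict, List


def split_sdf_records(text: str) -> List[List[str]]:
    """Split SDF text into one list of lines per molecule, using $$$$ as delimiter."""
    records: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):  # trailing record without $$$$
        records.append(current)
    return records


def parse_v2000_record(lines: List[str]) -> Dict[str, object]:
    """Read name, atoms, bonds, and SD properties from one V2000 record."""
    name = lines[0].strip()
    # Counts line (4th line): atom count in columns 1-3, bond count in columns 4-6.
    num_atoms = int(lines[3][0:3])
    num_bonds = int(lines[3][3:6])

    atoms = []
    for line in lines[4:4 + num_atoms]:
        xyz = (float(line[0:10]), float(line[10:20]), float(line[20:30]))
        atoms.append((line[31:34].strip(), xyz))  # (element symbol, coordinates)

    bonds = []
    for line in lines[4 + num_atoms:4 + num_atoms + num_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))  # (i, j, order)

    properties: Dict[str, str] = {}
    collected: Dict[str, List[str]] = {}
    key = None
    for line in lines[4 + num_atoms + num_bonds:]:
        if line.startswith(">") and "<" in line:
            key = line[line.index("<") + 1:line.rindex(">")]
            collected[key] = []
        elif not line.strip():
            key = None  # a blank line ends the current property value
        elif key is not None:
            collected[key].append(line)
    for prop, values in collected.items():
        properties[prop] = "\n".join(values)

    return {"name": name, "atoms": atoms, "bonds": bonds, "properties": properties}
```

A whole file can then be processed with `[parse_v2000_record(r) for r in split_sdf_records(text)]`.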
Mol2Parser¶
MOL2Parser ¶
Bases: BaseLigandParser
Parse Tripos MOL2 format files.
Handles multi-molecule MOL2 files (multiple @<TRIPOS>MOLECULE blocks).
Usage::
parser = MOL2Parser()
molecules = parser.parse_file("ligand.mol2")
mol = molecules[0]
print(mol.name, mol.num_atoms, mol.coords.shape)
Parse Tripos Mol2 format files.
from molfun.data.parsers import Mol2Parser
parser = Mol2Parser()
molecules = parser.parse("ligand.mol2")
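For readers unfamiliar with the format, the following minimal sketch shows how a multi-molecule MOL2 file can be split on its `@<TRIPOS>MOLECULE` record markers and how atom lines can be read. `split_mol2_blocks` and `parse_mol2_block` are hypothetical helpers, not the parser's real API; only the MOLECULE name and the ATOM section are read, and all other sections are skipped.

```python
from typing import Dict, List

TRIPOS = "@<TRIPOS>"


def split_mol2_blocks(text: str) -> List[List[str]]:
    """Start a new block of lines at every @<TRIPOS>MOLECULE marker."""
    blocks: List[List[str]] = []
    for line in text.splitlines():
        if line.startswith(TRIPOS + "MOLECULE"):
            blocks.append([])
        if blocks:  # ignore anything before the first molecule
            blocks[-1].append(line)
    return blocks


def parse_mol2_block(lines: List[str]) -> Dict[str, object]:
    """Extract the molecule name and its atoms (name, coordinates, atom type)."""
    section = None
    seen_in_section = 0
    name = None
    atoms = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(TRIPOS):
            section = stripped[len(TRIPOS):]
            seen_in_section = 0
            continue
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments carry no data here
        if section == "MOLECULE":
            if seen_in_section == 0:
                name = stripped  # first data line of the section is the name
            seen_in_section += 1
        elif section == "ATOM":
            # Free-format fields: atom_id atom_name x y z atom_type ...
            fields = stripped.split()
            atoms.append({
                "name": fields[1],
                "xyz": (float(fields[2]), float(fields[3]), float(fields[4])),
                "type": fields[5],
            })
    return {"name": name, "num_atoms": len(atoms), "atoms": atoms}
```

Unlike PDB and SDF atom lines, MOL2 data lines are whitespace-delimited, so `str.split()` is sufficient in this sketch.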
Summary¶
| Parser | Format | Use Case |
|---|---|---|
| `PDBParser` | `.pdb` | Legacy protein structures |
| `CIFParser` | `.cif` | Modern protein structures (recommended) |
| `A3MParser` | `.a3m` | Multiple sequence alignments |
| `FASTAParser` | `.fasta` / `.fa` | Protein/nucleotide sequences |
| `SDFParser` | `.sdf` | Small molecule structures |
| `Mol2Parser` | `.mol2` | Small molecules with atom types |
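The mapping in the summary table can be turned into a small lookup when the parser has to be chosen from a file name. The sketch below returns parser names as strings so that it runs without molfun installed; `PARSER_BY_SUFFIX` and `parser_name_for` are hypothetical and are not provided by the library.

```python
from pathlib import Path

# File extension -> parser name, copied from the summary table.
PARSER_BY_SUFFIX = {
    ".pdb": "PDBParser",
    ".cif": "CIFParser",
    ".a3m": "A3MParser",
    ".fasta": "FASTAParser",
    ".fa": "FASTAParser",
    ".sdf": "SDFParser",
    ".mol2": "Mol2Parser",
}


def parser_name_for(path: str) -> str:
    """Return the name of the parser listed for this file's extension."""
    suffix = Path(path).suffix.lower()
    try:
        return PARSER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no parser listed for extension {suffix!r}") from None
```

In real code the dictionary values would be the parser classes themselves, so the result could be instantiated directly.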