Parsers

File parsers for common structural biology and cheminformatics formats. Each parser reads a file and returns a structured Python object.

Quick Start

from molfun.data.parsers import PDBParser, CIFParser, A3MParser

# Parse a PDB file
parser = PDBParser()
structure = parser.parse("protein.pdb")

# Parse an mmCIF file
parser = CIFParser()
structure = parser.parse("protein.cif")

# Parse an MSA
parser = A3MParser()
msa = parser.parse("alignment.a3m")

PDBParser

Bases: BaseStructureParser

Parse PDB format files without BioPython.

Handles standard ATOM records (columns are fixed-width per PDB spec). For full mmCIF support, use CIFParser instead.
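
To illustrate what "fixed-width" means here, the following is a minimal sketch of slicing a single ATOM line by column position, using the column ranges from the PDB format specification. It is not molfun's implementation: `AtomRecord` and `parse_atom_line` are hypothetical names introduced only for this example, and the sample line is made up.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    """Hypothetical container for one ATOM line (not a molfun type)."""
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> AtomRecord:
    """Slice one ATOM/HETATM line using the fixed PDB column layout."""
    # PDB columns are 1-based and inclusive; Python slices are 0-based.
    return AtomRecord(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Made-up sample line, laid out according to the column positions above.
line = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"
atom = parse_atom_line(line)
print(atom.name, atom.res_name, atom.chain_id, atom.res_seq, (atom.x, atom.y, atom.z))
```

Splitting on whitespace would be shorter, but it breaks on valid files where adjacent fields touch (for example a four-digit residue number next to a chain ID), which is why column slicing is the usual choice.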

Usage:

parser = PDBParser(max_seq_len=512)
structure = parser.parse_file("1abc.pdb")
features = structure.to_dict()  # ready for dataset

Parse PDB format files into structured data.

from molfun.data.parsers import PDBParser

parser = PDBParser()
structure = parser.parse("protein.pdb")

print(structure.sequence)        # "MKFLILLFNILCLFPVLAADNH..."
print(structure.positions.shape) # (N_residues, 37, 3)
print(structure.resolution)      # 2.1

| Parameter | Type | Default | Description |
|---|---|---|---|
| include_hetatm | bool | False | Include HETATM records |
| model_idx | int | 0 | NMR model index to use |

CIFParser

Bases: BaseStructureParser

Parse mmCIF files via BioPython.

Falls back to PDBParser-style parsing for .pdb files if BioPython is available.

Usage:

parser = CIFParser(max_seq_len=512)
structure = parser.parse_file("1abc.cif")
features = structure.to_dict()

parse_text

parse_text(text: str) -> ParsedStructure

Parse mmCIF text (writes to temp file for BioPython).
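
The temp-file approach mentioned above is a common way to feed in-memory text to a parser that only accepts file paths. The sketch below shows the general pattern with the standard library only. It is an assumption about the technique, not molfun's code: `parse_text_via_tempfile` and `count_atom_lines` are hypothetical helpers, and the latter merely stands in for a real path-based parser.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def parse_text_via_tempfile(
    text: str, parse_file: Callable[[str], T], suffix: str = ".cif"
) -> T:
    """Write text to a temporary file, pass its path to parse_file, then clean up."""
    # delete=False lets the file be reopened by path on every platform.
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as handle:
        handle.write(text)
        path = handle.name
    try:
        return parse_file(path)
    finally:
        os.remove(path)  # always remove the file, even if parsing fails


def count_atom_lines(path: str) -> int:
    """Stand-in for a path-based parser: counts lines that start with ATOM."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith("ATOM"))


# Toy text, not a complete mmCIF file.
cif_text = "data_demo\nATOM 1 N N . MET A 1\nATOM 2 C CA . MET A 1\n"
n_rows = parse_text_via_tempfile(cif_text, count_atom_lines)
print(n_rows)
```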

Parse mmCIF format files (PDBx/mmCIF).

from molfun.data.parsers import CIFParser

parser = CIFParser()
structure = parser.parse("protein.cif")

| Parameter | Type | Default | Description |
|---|---|---|---|
| include_hetatm | bool | False | Include non-polymer entities |
| auth_chains | bool | True | Use author chain IDs (vs label chain IDs) |

A3MParser

Bases: BaseAlignmentParser

Parse A3M multiple sequence alignments → MSA tensors.

Usage:

parser = A3MParser(max_depth=512)
alignment = parser.parse_file("msas/1abc.a3m")
msa_dict = alignment.to_dict()  # {msa, deletion_matrix, msa_mask}
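
For readers unfamiliar with A3M, the sketch below shows how an aligned MSA and a deletion matrix are conventionally derived from it: lowercase letters mark insertions relative to the query, so they are dropped from the aligned row and counted against the next aligned column. This is an illustration of the format under that convention, not molfun's implementation. The function names are hypothetical, the output uses plain Python lists rather than tensors, and `msa_mask` is not reproduced because its exact definition is not given here.

```python
from typing import List, Tuple


def read_a3m_sequences(text: str) -> List[str]:
    """Return the raw sequence strings of an A3M file, one per '>' header."""
    sequences: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            sequences.append("")  # start a new record
        elif sequences:
            sequences[-1] += line  # a sequence may wrap over several lines
    return sequences


def to_msa_and_deletions(sequences: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Strip lowercase insertions and count them per aligned column."""
    msa: List[str] = []
    deletion_matrix: List[List[int]] = []
    for seq in sequences:
        row: List[str] = []
        counts: List[int] = []
        pending = 0
        for ch in seq:
            if ch.islower():
                pending += 1  # insertion: counted, not kept in the aligned row
            else:
                row.append(ch)          # uppercase residue or '-' gap
                counts.append(pending)  # insertions seen just before this column
                pending = 0
        msa.append("".join(row))
        deletion_matrix.append(counts)
    return msa, deletion_matrix


# Made-up two-sequence alignment; 'a' in the second row is an insertion.
a3m_text = ">query\nMKFLA\n>hit1\nMKaF-A\n"
msa_rows, deletions = to_msa_and_deletions(read_a3m_sequences(a3m_text))
print(msa_rows)
print(deletions)
```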

Parse A3M format multiple sequence alignments.

from molfun.data.parsers import A3MParser

parser = A3MParser()
msa = parser.parse("alignment.a3m")

print(msa.sequences)    # list of aligned sequences
print(msa.descriptions) # list of sequence headers
print(len(msa))         # number of sequences

FASTAParser

Bases: BaseAlignmentParser

Parse FASTA sequences → single-row ParsedAlignment.

Multi-sequence FASTA is treated as a pre-aligned MSA: after parsing, all sequences must have the same length, so shorter ones are padded with gaps if needed.
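
The padding behaviour can be sketched as follows. This is a minimal illustration, not molfun's code: `parse_fasta` and `pad_to_same_length` are hypothetical helpers, and padding on the right with `-` is an assumption about how the gaps are added.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> Tuple[List[str], List[str]]:
    """Return (headers, sequences) from FASTA text; sequences may span lines."""
    headers: List[str] = []
    sequences: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            headers.append(line[1:])
            sequences.append("")
        elif sequences:
            sequences[-1] += line
    return headers, sequences


def pad_to_same_length(sequences: List[str], gap: str = "-") -> List[str]:
    """Right-pad every sequence with gap characters up to the longest one."""
    width = max((len(seq) for seq in sequences), default=0)
    return [seq.ljust(width, gap) for seq in sequences]


# Made-up input with two sequences of different length.
fasta_text = ">query\nMKFLAGHRT\n>short\nMKFLA\n"
headers, sequences = parse_fasta(fasta_text)
padded = pad_to_same_length(sequences)
print(headers)
print(padded)
```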

Usage:

parser = FASTAParser()
alignment = parser.parse_text(">query\nMKFLAGHRT")

Parse FASTA format sequence files.

from molfun.data.parsers import FASTAParser

parser = FASTAParser()
entries = parser.parse("sequences.fasta")

for entry in entries:
    print(entry.header, entry.sequence)

SDFParser

Bases: BaseLigandParser

Parse SDF/MOL V2000 format files.

Handles multi-molecule SDF files (separated by $$$$). Extracts atoms, bonds, 3D coordinates, and SD property fields.
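
As a rough picture of what that involves, the sketch below splits SDF text on `$$$$` and reads the V2000 counts line, atom block, bond block, and SD property fields from each record. It is a simplified illustration rather than molfun's parser: the function names and the dictionary layout are hypothetical, the sample molecules are made up, and charges, isotopes, and other `M  ...` property lines are ignored.

```python
from typing import Dict, List


def split_sdf_records(text: str) -> List[List[str]]:
    """Split SDF text into per-molecule line lists at '$$$$' delimiters."""
    records: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if any(part.strip() for part in current):
                records.append(current)
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        records.append(current)  # tolerate a missing final delimiter
    return records


def parse_record(lines: List[str]) -> Dict[str, object]:
    """Parse one V2000 record: name, atoms, bonds, and SD properties."""
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])  # counts line
    atom_end = 4 + n_atoms
    bond_end = atom_end + n_bonds
    atoms = [
        (ln[31:34].strip(), float(ln[0:10]), float(ln[10:20]), float(ln[20:30]))
        for ln in lines[4:atom_end]
    ]
    # Each bond: (first atom index, second atom index, bond order), 1-based.
    bonds = [(int(ln[0:3]), int(ln[3:6]), int(ln[6:9])) for ln in lines[atom_end:bond_end]]
    props: Dict[str, List[str]] = {}
    key = None
    for ln in lines[bond_end:]:
        if ln.startswith(">"):
            key = ln[ln.find("<") + 1 : ln.rfind(">")]  # "> <SOURCE>" -> "SOURCE"
            props[key] = []
        elif key is not None and ln.strip():
            props[key].append(ln.strip())
        else:
            key = None  # blank line ends the current field; "M  END" is skipped
    return {
        "name": lines[0].strip(),
        "atoms": atoms,
        "bonds": bonds,
        "props": {k: "\n".join(v) for k, v in props.items()},
    }


# Made-up two-molecule file, built line by line to keep the column widths visible.
sdf_text = "\n".join([
    "carbon_monoxide", "  example", "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.1280    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  3  0",
    "M  END", "> <SOURCE>", "example", "", "$$$$",
    "neon", "  example", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 Ne  0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END", "$$$$",
])
molecules = [parse_record(rec) for rec in split_sdf_records(sdf_text)]
for mol in molecules:
    print(mol["name"], len(mol["atoms"]), mol["props"])
```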

Usage:

parser = SDFParser()
molecules = parser.parse_file("ligands.sdf")
for mol in molecules:
    print(mol.name, mol.num_atoms, mol.coords.shape)

Parse SDF (Structure-Data File) format for small molecules.

from molfun.data.parsers import SDFParser

parser = SDFParser()
molecules = parser.parse("ligands.sdf")

for mol in molecules:
    print(mol.name, mol.num_atoms, mol.coordinates.shape)

Mol2Parser

MOL2Parser

Bases: BaseLigandParser

Parse Tripos MOL2 format files.

Handles multi-molecule MOL2 files (multiple @<TRIPOS>MOLECULE sections). Extracts atom coordinates, Sybyl atom types, partial charges, and bonds.
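
The section-based layout of MOL2 can be sketched as below: each molecule starts at a `@<TRIPOS>MOLECULE` record, and the ATOM and BOND sections are whitespace-separated rows. This is an illustrative reading of the Tripos format, not molfun's implementation. `split_mol2` and `parse_mol2_block` are hypothetical names, the returned dictionaries are an assumed layout, and the sample data is made up.

```python
from typing import Dict, List

TAG = "@<TRIPOS>"


def split_mol2(text: str) -> List[List[str]]:
    """Split MOL2 text into one line list per @<TRIPOS>MOLECULE section."""
    blocks: List[List[str]] = []
    for line in text.splitlines():
        if line.startswith(TAG + "MOLECULE"):
            blocks.append([])  # a new molecule starts here
        if blocks:
            blocks[-1].append(line)
    return blocks


def parse_mol2_block(lines: List[str]) -> Dict[str, object]:
    """Read the molecule name, atoms, and bonds from one MOL2 block."""
    section = ""
    header: List[str] = []
    atoms: List[Dict[str, object]] = []
    bonds: List[tuple] = []
    for line in lines:
        if line.startswith(TAG):
            section = line[len(TAG):].strip()
            continue
        if not line.strip():
            continue
        fields = line.split()
        if section == "MOLECULE":
            header.append(line.strip())
        elif section == "ATOM":
            # atom_id atom_name x y z atom_type [subst_id subst_name charge]
            atoms.append({
                "name": fields[1],
                "xyz": (float(fields[2]), float(fields[3]), float(fields[4])),
                "type": fields[5],  # Sybyl atom type, e.g. "O.3"
                "charge": float(fields[8]) if len(fields) > 8 else 0.0,
            })
        elif section == "BOND":
            # bond_id origin_atom_id target_atom_id bond_type
            bonds.append((int(fields[1]), int(fields[2]), fields[3]))
    return {"name": header[0] if header else "", "atoms": atoms, "bonds": bonds}


# Made-up two-molecule file for illustration.
mol2_text = """@<TRIPOS>MOLECULE
water
3 2 1
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1 O 0.0000 0.0000 0.0000 O.3 1 HOH -0.8340
2 H1 0.9572 0.0000 0.0000 H 1 HOH 0.4170
3 H2 -0.2400 0.9266 0.0000 H 1 HOH 0.4170
@<TRIPOS>BOND
1 1 2 1
2 1 3 1
@<TRIPOS>MOLECULE
helium
1 0 1
SMALL
NO_CHARGES
@<TRIPOS>ATOM
1 HE 0.0000 0.0000 0.0000 He 1 HE 0.0000
"""
mol2_molecules = [parse_mol2_block(block) for block in split_mol2(mol2_text)]
for mol in mol2_molecules:
    print(mol["name"], len(mol["atoms"]), len(mol["bonds"]))
```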

Usage:

parser = MOL2Parser()
molecules = parser.parse_file("ligand.mol2")
mol = molecules[0]
print(mol.name, mol.num_atoms, mol.coords.shape)

Parse Tripos Mol2 format files.

from molfun.data.parsers import Mol2Parser

parser = Mol2Parser()
molecules = parser.parse("ligand.mol2")

Summary

| Parser | Format | Use Case |
|---|---|---|
| PDBParser | .pdb | Legacy protein structures |
| CIFParser | .cif | Modern protein structures (recommended) |
| A3MParser | .a3m | Multiple sequence alignments |
| FASTAParser | .fasta / .fa | Protein/nucleotide sequences |
| SDFParser | .sdf | Small molecule structures |
| Mol2Parser | .mol2 | Small molecules with atom types |