Library Usage¶
Representations¶
pygen-structures contains classes to represent CHARMM data, molecular (and molecular-like) structures and atoms.
Representing CHARMM File Types¶
The CHARMM forcefield views molecules as collections of _residues_, with _patches_ usually applied to define linkages and modifications (though for some classes of residue, such as the amino acids, linkages between residues are implicit).
The residue topology file (.rtf
) contains the definition of these
patches and residues - the information required to build the structure of the
molecule. This includes the types of the atoms, the necessary bonds, the
chirality of the tetrahedral centres (through the internal coordinate table),
the charges, and which internal coordinates are necessary for the simulation
(i.e. which improper torsions/cross-map terms are required). The masses
and charges are also present, as these are required to build the .psf
file.
The parameter file (.prm
) contains the definitions of all the parameterised
interactions (i.e. which bonds, angles, dihedrals, impropers, cross maps are
defined, and what the radii of the atoms are for the non-bonded terms).
For some residue and parameter definitions, particularly extensions to existing
files, the topology and parameter files are combined into a stream file
(.str
) which contains a topology and residue section concatenated together.
A good overview of the actual layout of the CHARMM forcefield file types (both residue topology files and parameter files) is given in the NAMD tutorial.
As pygen-structures is concerned with building structures and not with running
simulations, the representation of the .rtf
files is complete,
but the representation of .prm
files is limited to verifying that
the parameters are present rather than probing their values.
Residue Files¶
CHARMM .rtf
files are represented using the
CHARMMResidueTopologyFile
class,
which can be instantiated from a .rtf
or .str
file using its
from_file()
method, and additional files can be added using the
read_file()
method. Residue and patch definitions within the file are stored as instances of
CHARMMResidueDefinition
and
CHARMMPatchResidueDefinition
respectively. Additional residue definitions and patch
definitions can be added using
add_residue_definition()
and
add_patch_definition()
.
The residue definitions are stored in the dict
attribute
residues
,
and the patch definitions in the dict
attribute
patches
,
where the keys are the relevant residue/patch name. The masses of the atoms in
Daltons are stored in dict
attribute
masses
,
where keys are the atom types.
Parameter Files¶
CHARMM .prm
files are represented using the
CHARMMParameterFile
class.
Like
CHARMMResidueTopologyFile
,
this class can be instantiated using its
from_file()
method and additional files can be read using the
read_file()
method. Methods
get_unmatched()
and
check_parameters()
can be used to assert that all parameters of a Molecule exist in the parameter
file, with the former returning a dict
of internal coordinate type
to a list
of unmatched tuples and the latter
a bool
(True
if all parameters exist, False
otherwise).
This class contains five set
objects:
bonds
,
angles
,
dihedrals
,
impropers
,
and
cross_maps
,
which contain tuples of atom types from the forcefield. In the first three sets,
the tuples are also stored in reverse, as these degrees of freedom can be
specified in reverse order without altering the relevant parameters.
Representing Residues and Patches¶
The Difference Between Definitions and Residues¶
CHARMM residues as presented in .rtf
files are an abstract residue
definition, rather than physical residues. As such, these lack an associated
residue index, and their degrees of freedom (the bonds, impropers and cross
maps) refer to the atom names from the residue definition. These are represented
using the CHARMMResidueDefinition
class.
CHARMM residues in an actual structure are not abstract, and refer to real
residues in a physical molecule. As such, these _do_ have an associated index
and their degrees of freedom refer to atom IDs - tuples of
(residue_index: int, atom_name: str)
. This is to allow for easier removal of
dangling bonds and easier application of patches. These residues are represented
using the CHARMMResidue
class.
Patch definitions can only be applied to real residues.
The patch definitions, instead of the atom IDs used in the residues, use
atom references. These take the same format as atom IDs, but the residue index
is replaced by an integer referring to the order the residues are provided to
the patch. For the RAFF
patch, which creates the sugar raffinose from
[AGLC, BFRU, AGAL]
, atoms in AGLC
would take the form
(0, atom_id: str)
(1
for BFRU
, 2
for AGAL
). Patch
definitions are represented using the
CHARMMPatchResidueDefinition
class.
CHARMM Residue Definitions¶
CHARMM residue definitions
(CHARMMResidueDefinition
)
represent all of the data from a RESI
block in a .rtf
file.
This includes:
the
name
a
list
of atoms from the ATOM records (represented as(atom_name: str, atom_type: str, partial_charge: float)
)a
list
of bonds -tuple
of 2 atom_names - from BOND/DOUB records. Bonds in DOUB records appear twice.a
list
of impropers –tuple
of 4 atom_name from IMPR records.a
list
of cross_maps –tuple
of 8 atom_name from CMAP recordsa list of IC records. The first 4 items are atom names, the last 5 are floats containing bond length/angle information. Where angle i-j-k-l is an improper, the third atom_name is prefaced with
*
. The floats are:the bond length, i-j
the angle, i-j-k
the dihedral (or improper) i-j-k-l
the angle j-k-l
the bond length k-l
the default patch if residue is first in the chain. If
None
, this defaults to the default first patch of theCHARMMResidueTopologyFile
the residue definition is in. There is a semantic difference betweenNone
and'NONE'
- the former represents missing information, and the latter is an explicit “no first patch”.the default patch if residue is last in the chain. Treated much the same way as the first.
Residue definitions are usually created using the class method
from_block()
when a CHARMMResidueTopologyFile
is created. Residue definitions can be converted to residues using the
to_residue()
.
Residue definitions can be represented as
Structure
objects,
as RDKit Mol
objects, and as SMARTS patterns using methods
to_structure()
,
to_fragment_mol()
,
and to_smarts()
.
Not much is done with this at present, but in principle this could be used to
match missing residues in a .pdb
file.
CHARMM Residues¶
CHARMM residues (CHARMMResidue
)
contain the information from the residue definitions, but in terms
of atom IDs instead of atom names. The representation of the atoms themselves
is unchanged. Residues have an associated index, and may be altered using
patches. As such, they may contain new atoms and bonds which are not present in
the definition alone.
CHARMM Patches¶
CHARMM patch residue definitions
(CHARMMPatchResidueDefinition
)
contain the information from a PRES
block in a .rtf
file.
In general, the representation is similar to
CHARMMResidueDefinition
, but
rather than a `list
of tuple
for each degree of freedom, these are
stored as n_residues
long lists of these lists. For a patch which is to be
applied to three residues, bonds
might look something like this:
bonds = [
[((0, 'C'), (1, 'N')), ...],
[((1, 'CA'), (1, 'CB')), ...],
[((2, 'CA'), (2, 'CB')), ...]
]
When the patch is applied, 0
, 1
and 2
are replaced with the indices
of the residues the patch is applied to, and these new bonds are added to those
residues. Patches also contain deletions
, atom names to be deleted in each
residue. When atoms are deleted, all degrees of freedom containing these atoms
are removed.
Patches have two important methods:
apply()
, the method which applies the patch to a list of residues, andis_applicable_to()
, which can be used to test what positions a residue can be supplied to the patch in. This is based on the presence of atoms in thedeletions
list for each residue position.
What’s Special About FIRST and LAST?¶
If patches are specified as being applied to 'FIRST'
or 'LAST'
rather than 0
or -1
, the default first/last patch is not applied.
This is to distinguish between cases where a different terminal patch is
being applied (such as if protein patch CT2
was being applied instead of
CTER
) and cases where patches just happen to affect the first or last
residue.
Representing Molecules¶
pygen-structures uses three main classes for molecular representation.
Molecules¶
Complete molecules from the perspective of the CHARMM forcefield are stored as
Molecule
classes. These
are directly representable as .pdb
or .psf
files.
There are not many cases where it is advantageous to create a molecule directly. In most cases, it makes more sense to use the convenience functions to create molecules.
Important methods for the molecule class include:
finalize()
, which removes dangling bonds, applies patches, and instantiates theStructure
used to generate coordinates.to_pdb_block()
/to_pdb_file()
, used to create a PDB file from the molecule.to_psf_block()
/to_psf_file()
, used to create a PSF file from the molecule.check_parameters()
, used to check parameters against aCHARMMParameterFile
to_mol()
, used to obtain the RDKitMol
generated by the underlyingStructure
.
Molecule-like Structures¶
Molecular-like structures can be whole molecules or simply fragments of
molecules. These are represented using the
Structure
class.
This class generates molecular connection tables and instantiates RDKit Mol
objects based on the provided structures. This enables 3D coordinate generation.
Chirality is set using information from the IC table. Atoms are reordered so
that hydrogen atoms come after their parent heavy atom. This is in line with
the psfgen order.
Structures can be created from a list
of
Atom
and a list
of bond
tuple
containing two 0-based atom indices. Generated 3D coordinates are set
in the RDKit Mol
, stored in
mol
and in the x
, y
, and z
attributes of the Atom
objects.
The RDKit Mol
can also be retrieved using the
to_mol()
method.
Atoms¶
Atoms, represented using the Atom
class, are mostly based on the fields stored in a .pdb
file, with some extra
.psf
specific fields. These fields appear in PDB order, with PSF fields
attached on the end.
Atoms can be instantiated from PDB lines using the
from_pdb_line()
class method,
though this will not contain the PSF data.
Atoms can be represented as PDB atom lines using
to_pdb_line()
and as PSF lines
using to_psf_line()
.
Where the atom serial number exceeds 99999, PDB atom serials are switched to hexadecimal encoding. This is unlikely to happen with the current functionality (which is limited to around 15 residues). PSF lines are in the extended format.
Convenience Functions¶
Convenience functions represent the easiest way to interact with the code as a library.
Loading CHARMM Data¶
The CHARMM forcefield topology and parameters can be loaded using
load_charmm_dir()
. This function,
called with no arguments, will load the most recent (July 2019) CHARMM
distribution (with some modifications to the D-amino acid parameters)
and return it as a
CHARMMResidueTopologyFile
and a
CHARMMParameterFile
.
By specifying a path as a positional argument, that directory will be loaded
instead. The function will pick the latest versions of the parameter and
topology files (36 over 27, 36m over 36), so if you plan on using an older
version of the forcefield (this is not recommended) you will have to remove the
newer versions and change the file extensions to match the current conventions
(.rtf
for topology files and, .prm
for parameter files).
Creating Molecules¶
Molecule
objects can be
generated using code_to_mol()
(from a str
of single amino acid codes) or
sequence_to_mol()
(from a list
of generic CHARMM residue names).
These functions accept the same arguments, with the exception that
code_to_mol()
expects a string
protein sequence as the first arguemnt and
sequence_to_mol()
expects a
list of names. code_to_mol()
accepts an additional keyword argument, default_histidine
, which determines
the residue definition which is used for H
.
Both functions require as a minimum the sequence/code to create and a
CHARMMResidueTopologyFile
.
Creating Molecules from PDB Files¶
Molecule
objects can also be
loaded from PDB files using
pdb_to_mol()
, though at present
this only works for small molecules and makes no attempt to pattern match the
missing residues.
Examples¶
All examples should be prefaced with
>>> import pygen_structures as ps
>>> # Using the supplied toppar directory
>>> rtf, prm = ps.load_charmm_dir()
Basic Usage¶
To make HIS-GLU-TYR
, as outlined in the Command Line Usage, writing
the structure out to HEY.psf
and HEY.pdb
:
>>> mol = ps.code_to_mol('HEY', rtf)
>>> mol.to_pdb_file('HEY.pdb')
>>> mol.to_psf_file('HEY.psf')
And should we want to use the protonated form of histidine:
>>> mol = ps.code_to_mol('HEY', rtf, default_histidine='HSP')
>>> # or use the sequence
>>> sequence = ['HSP', 'GLU', 'TYR']
>>> mol = ps.sequence_to_mol(sequence, rtf)
>>> mol.to_pdb_file('HEY.pdb')
>>> mol.to_psf_file('HEY.psf')
Back to non-protein examples, we could create the trisaccharide raffinose.
This requires the use of a patch. The default segment ID, U
, is not
particularly descriptive, so we can specify a new segment, RAFF
. While
we’re at it, we can set the name for the COMPND
header to Raffinose
:
>>> sequence = ['AGLC', 'BFRU', 'AGAL']
>>> patches = {'RAFF': [0, 1, 2]}
>>> mol = ps.sequence_to_mol(sequence, rtf, patches, 'Raffinose', 'RAFF')
>>> mol.to_pdb_file('RAFF.pdb')
>>> mol.to_psf_file('RAFF.psf')
Loading raffinose back from PDB currently requires us to specify the patches:
>>> patches = {'RAFF': [0, 1, 2]}
>>> mol = ps.convenience_functions.pdb_to_mol('RAFF.pdb', rtf, patches)
Verification that parameters exist for created structures is simple:
>>> mol = ps.code_to_mol('AdP', rtf)
>>> # This is fixed in 0.2.3, and will return True
>>> mol.check_parameters(prm)
False
>>> unmatched = prm.get_unmatched(mol)
>>> unmatched
{'bonds': {('CPD1', 'CC')}, 'angles': set(), 'dihedrals': set(), 'impropers': set(), 'cross_maps': set()}
>>> mol2 = ps.code_to_mol('AP', rtf)
>>> prm.check_parameters(mol2)
True
>>> unmatched = prm.get_unmatched(mol2)
>>> unmatched
{'bonds': set(), 'angles': set(), 'dihedrals': set(), 'impropers': set(), 'cross_maps': set()}
For linkages between protein and sugars, we need to use a patch. We
can make the tripeptide ALA-ASN-ALA
, where glucose is linked to asparagine.
This requires use of the NGLA
patch and is a good example of one of the
quirks of the CHARMM forcefield definitions. If we specify AGLC
between
ASN
and ALA
, the protein chain will be broken - bonds between residues
in the peptide are defined using links to the previous (-C
) and next
(+N
) residues and these don’t exist where there is a sugar in the way.
Unfortunately, this means there is a “correct” way to specify these residues.
>>> sequence = ['ALA', 'ASN', 'ALA', 'AGLC']
>>> patches = {'CTER': [-2], 'NGLA': [1, -1]}
>>> mol = ps.sequence_to_mol(sequence, rtf, patches)
>>> # This order wouldn't link ASN and ALA because they aren't adjacent,
>>> # leading to a free-floating ALA residue.
>>> sequence = ['ALA', 'ASN', 'AGLC', 'ALA']
Manual Molecule Creation¶
If when generating a glycopeptide you absolutely require the sugar to be
in the middle of the protein chain, it is possible to obtain this by creating
the residues manually, and specifying the last_index
and next_index
.
>>> def make_residue(residue_name, index, last_index=None, next_index=None):
... return ps.CHARMMResidue.from_residue_definition(
... rtf.residues[residue_name],
... index,
... last_index,
... next_index
... )
...
>>> first = make_residue('ALA', 0)
>>> second = make_residue('ASN', 1, next_index=3)
>>> third = make_residue('AGLC', 2)
>>> last = make_residue('ALA', 3, last_index=1)
>>> residues = [first, second, third, last]
>>> patches = {'NGLA': [1, 2]}
>>> mol = ps.Molecule(residues=residues, topology=rtf, patches=patches)
>>> mol.finalize()
For the most part, though, creating molecules manually is not much more complex than using the convenience functions, just slightly more verbose.
>>> sequence = ['HSE', 'GLU', 'TYR']
>>> residues = []
>>> for index, residue_name in enumerate(sequence):
>>> residue = rtf.residues[residue_name].to_residue(index)
>>> residues.append(residue)
>>> mol = ps.Molecule('HEY', residues, rtf)
>>> mol.finalize()