atomium's Introduction

atomium

atomium is a molecular modeller and file parser, capable of reading from and writing to .pdb, .cif and .mmtf files.

Example

>>> import atomium
>>> pdb = atomium.fetch("5HVD")
>>> pdb.model
<Model (1 chain, 6 ligands)>
>>> pdb.model.chain("A")
<Chain A (255 residues)>

Installing

pip

atomium can be installed using pip:

$ pip3 install atomium

atomium is written for Python 3, and does not support Python 2.

If you get permission errors, try using sudo:

$ sudo pip3 install atomium

Development

The repository for atomium, containing the most recent iteration, can be found here. To clone the atomium repository directly from there, use:

$ git clone https://github.com/samirelanduk/atomium.git

Requirements

atomium requires requests for fetching structures from the RCSB, paramiko for fetching structures over SSH, msgpack for parsing .mmtf files, and valerius for dealing with sequences.

Testing

To test a local version of atomium, cd to the atomium directory and run:

$ python -m unittest discover tests

You can opt to only run unit tests or integration tests:

$ python -m unittest discover tests.unit
$ python -m unittest discover tests.integration

You can run the 'big test' to fetch 1,000 random structures, parse them all, and report any problems:

$ python tests/big.py

Finally, to perform speed profiles you can run:

$ python tests/time/time.py

...which creates various profiles that SnakeViz can visualise.

Overview

atomium is a Python library for opening and saving .pdb, .cif and .mmtf files, and presenting and manipulating the information contained within.

Loading Data

While you can use atomium to create models from scratch to build an entirely de novo structure, in practice you would generally use it to load molecular data from an existing file...

>>> import atomium
>>> pdb1 = atomium.open('../1LOL.pdb')
>>> mmtf1 = atomium.open('/structures/glucose.mmtf')
>>> cif1 = atomium.open('/structures/1XDA.cif')
>>> pdb3 = atomium.open('./5CPA.pdb.gz')
>>> pdb2 = atomium.fetch('5XME.pdb')
>>> cif2 = atomium.fetch('5XME')

In the last two cases (the atomium.fetch calls), you don't need the file to be saved locally - atomium will download the structure with that code from the RCSB.

atomium will use the file extension you provide to decide how to parse it. If there isn't one, or it doesn't recognise the extension, it will peek at the file contents and try to guess whether it should be interpreted as .pdb, .cif or .mmtf.
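
The extension-then-contents fallback described above could look roughly like this. It is a hypothetical sketch, not atomium's actual detection code: the name guess_filetype and the content heuristics are assumptions made for illustration.

```python
from pathlib import Path

KNOWN_EXTENSIONS = {".pdb": "pdb", ".cif": "cif", ".mmtf": "mmtf"}


def guess_filetype(path, contents):
    """Return 'pdb', 'cif' or 'mmtf' for a path and its raw contents.

    Hypothetical helper: a recognised extension is trusted, otherwise
    the contents are inspected.
    """
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Ignore a trailing compression suffix such as .gz
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if suffixes and suffixes[-1] in KNOWN_EXTENSIONS:
        return KNOWN_EXTENSIONS[suffixes[-1]]
    # No usable extension: peek at the contents instead
    if isinstance(contents, bytes):
        try:
            contents = contents.decode("utf-8")
        except UnicodeDecodeError:
            return "mmtf"  # MMTF is a binary (MessagePack) format
    if contents.lstrip().startswith("data_") or "_atom_site." in contents:
        return "cif"
    return "pdb"


print(guess_filetype("./5CPA.pdb.gz", b""))
print(guess_filetype("structure", "data_1XDA\n#\n"))
```

Treating "anything else" as .pdb is a deliberately crude default; a real parser would probably want to check for PDB record names before committing.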

Using Data

Once you've got your File object, what can you do with it?

Annotation

There is meta information contained within the File object:

>>> pdb1.title
'CRYSTAL STRUCTURE OF OROTIDINE MONOPHOSPHATE DECARBOXYLASE COMPLEX WITH XMP'
>>> pdb1.deposition_date
datetime.date(2002, 5, 6)
>>> pdb1.keywords
['TIM BARREL', 'LYASE']
>>> pdb1.classification
'LYASE'
>>> pdb1.source_organism
'METHANOTHERMOBACTER THERMAUTOTROPHICUS STR. DELTA H'
>>> pdb1.resolution
1.9
>>> pdb1.rvalue
0.193
>>> pdb1.rfree
0.229

atomium doesn't currently parse every bit of information from these files, but there is more than those shown above. See the full API docs for more details. In particular, you can access the processed intermediate MMCIF dictionary to get any attribute of these structures.

Models and Assembly

All .pdb files contain one or more models - little universes containing a molecular scene.

>>> pdb1.model
<Model (2 chains, 4 ligands)>
>>> pdb1.models
(<Model (2 chains, 4 ligands)>,)

Most just contain one - it's generally those that come from NMR experiments which contain multiple models. You can easily iterate through these to get their individual metrics:

>>> for model in pdb2.models:
...     print(model.center_of_mass)

This model contains the 'asymmetric unit' - this is one or more protein (usually) chains arranged in space, which may not be how the molecule arranges itself in real life. It might just be how they arranged themselves in the experiment. To create the 'real thing' from the asymmetric unit, you use biological assemblies.

Most .pdb files contain one or more biological assemblies - instructions for how to create a more realistic structure from the chains present, which in atomium are accessed using File.assemblies.

In practice, what you need to know is that you can create a new model - not the one already there containing the asymmetric unit - as follows...

>>> pdb3 = atomium.fetch('1XDA')
>>> pdb3.model
<Model (8 chains, 16 ligands)>
>>> pdb3.generate_assembly(1)
<Model (2 chains, 4 ligands)>
>>> pdb3.generate_assembly(10)
<Model (6 chains, 12 ligands)>
>>> [pdb3.generate_assembly(n + 1) for n in range(len(pdb3.assemblies))]
[<Model (2 chains, 4 ligands)>, <Model (2 chains, 4 ligands)>, <Model (2 cha
ins, 4 ligands)>, <Model (2 chains, 4 ligands)>, <Model (12 chains, 24 ligan
ds)>, <Model (12 chains, 24 ligands)>, <Model (6 chains, 12 ligands)>, <Mode
l (6 chains, 12 ligands)>, <Model (6 chains, 12 ligands)>, <Model (6 chains,
 12 ligands)>, <Model (4 chains, 8 ligands)>, <Model (4 chains, 8 ligands)>]

Here you load a structure with multiple possible assemblies, have a quick look at the asymmetric unit with 1,842 atoms, and then generate the first, then the tenth, and then all of its possible biological assemblies by passing in their IDs.
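
Conceptually, generating an assembly means applying each of the assembly's transformations (a rotation matrix plus a translation vector) to copies of the relevant chains. The following is a minimal sketch of that idea on plain coordinate tuples. The function names and data layout are assumptions for illustration, not atomium's implementation.

```python
def transform(coords, matrix, vector):
    """Apply a 3x3 rotation matrix and a translation vector to (x, y, z) tuples."""
    return [
        tuple(
            sum(matrix[row][col] * point[col] for col in range(3)) + vector[row]
            for row in range(3)
        )
        for point in coords
    ]


def build_assembly(chains, transformations):
    """Return one transformed copy of every chain per transformation.

    chains: dict mapping a chain ID to a list of (x, y, z) coordinates.
    transformations: list of (matrix, vector) pairs.
    """
    assembly = []
    for matrix, vector in transformations:
        for chain_id, coords in chains.items():
            assembly.append((chain_id, transform(coords, matrix, vector)))
    return assembly


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
two_fold_z = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]  # 180 degree rotation about z
chains = {"A": [(1.0, 2.0, 3.0)]}
dimer = build_assembly(chains, [(identity, (0, 0, 0)), (two_fold_z, (0, 0, 0))])
print(dimer)
```

Note that both copies in the result carry the chain ID "A", which is why models generated from assemblies can end up with duplicated IDs.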

Model Contents

The basic structures within a model are chains, residues, ligands, and atoms.

>>> pdb1.model.chains()
{<Chain A (204 residues)>, <Chain B (214 residues)>}
>>> pdb1.model.chain('B')
<Chain B (214 residues)>
>>> pdb1.model.residues(name='TYR')
{<Residue TYR (A.37)>, <Residue TYR (B.1037)>, <Residue TYR (A.45)>, <Residu
e TYR (A.154)>, <Residue TYR (B.1206)>, <Residue TYR (B.1154)>, <Residue TYR
 (B.1045)>, <Residue TYR (A.206)>}
>>> pdb1.model.residues(name__regex='TYR|PRO')
{<Residue PRO (A.101)>, <Residue PRO (A.46)>, <Residue PRO (A.161)>, <Residu
e TYR (A.45)>, <Residue PRO (B.1046)>, <Residue TYR (A.154)>, <Residue TYR (
B.1206)>, <Residue TYR (B.1045)>, <Residue PRO (B.1189)>, <Residue TYR (A.37
)>, <Residue PRO (B.1129)>, <Residue PRO (B.1077)>, <Residue PRO (A.211)>, <
Residue PRO (B.1180)>, <Residue PRO (B.1157)>, <Residue PRO (B.1211)>, <Resi
due PRO (B.1228)>, <Residue PRO (B.1101)>, <Residue TYR (B.1154)>, <Residue
PRO (A.157)>, <Residue PRO (A.77)>, <Residue PRO (A.180)>, <Residue TYR (B.1
037)>, <Residue PRO (A.129)>, <Residue PRO (B.1161)>, <Residue TYR (A.206)>}
>>> pdb1.model.chain('B').residue('B.1206')
<Residue TYR (B.1206)>
>>> pdb1.model.chain('B').residue('B.1206').helix
True
>>> pdb1.model.ligands()
{<Ligand BU2 (A.5001)>, <Ligand XMP (A.2001)>, <Ligand BU2 (B.5002)>, <Ligan
d XMP (B.2002)>}
>>> pdb1.model.ligand(name='BU2').atoms()
{<Atom 3196 (O3)>, <Atom 3192 (C1)>, <Atom 3193 (O1)>, <Atom 3197 (C4)>, <At
om 3194 (C2)>, <Atom 3195 (C3)>}
>>> pdb1.model.ligand(name='BU2').atoms(mass__gt=12)
{<Atom 3196 (O3)>, <Atom 3192 (C1)>, <Atom 3193 (O1)>, <Atom 3197 (C4)>, <At
om 3194 (C2)>, <Atom 3195 (C3)>}
>>> pdb1.model.ligand(name='BU2').atoms(mass__gt=14)
{<Atom 3196 (O3)>, <Atom 3193 (O1)>}

The examples above demonstrate atomium's selection language. In the case of the molecules - Model, Chain, Residue and Ligand - you can pass in an id or name, or search by regex pattern with id__regex or name__regex.

These structures have an even more powerful syntax too - you can pass in any property such as charge=1, any comparator of a property such as mass__lt=100, or any regex of a property such as name__regex='[^C]'.
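
To make the keyword syntax concrete, here is a small stand-alone sketch of how such lookups can be evaluated against an object's attributes. It is an illustration only: the matches function, the lte/gte operators and the use of re.search are assumptions, not a description of atomium's internals.

```python
import re
from collections import namedtuple


def matches(obj, **criteria):
    """Return True if obj satisfies every keyword criterion.

    Supports plain equality (charge=1), comparisons (mass__lt=100) and
    regular expressions (name__regex='...').
    """
    comparisons = {
        "lt": lambda a, b: a < b,
        "lte": lambda a, b: a <= b,
        "gt": lambda a, b: a > b,
        "gte": lambda a, b: a >= b,
        "regex": lambda a, b: re.search(b, str(a)) is not None,
    }
    for key, expected in criteria.items():
        # 'mass__gt' -> ('mass', 'gt'); 'charge' -> ('charge', '')
        attr, _, op = key.partition("__")
        actual = getattr(obj, attr)
        if op:
            if not comparisons[op](actual, expected):
                return False
        elif actual != expected:
            return False
    return True


# A stand-in for a real atom object, with made-up values
Atom = namedtuple("Atom", ["name", "mass", "charge"])
atoms = [Atom("C1", 12.0107, 0), Atom("O1", 15.9994, 0), Atom("O3", 15.9994, -1)]
heavy = [a.name for a in atoms if matches(a, mass__gt=14)]
print(heavy)
```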

For pairwise comparisons, structures also have the AtomStructure.pairwise_atoms generator, which yields every unique pair of atoms in the structure. The number of pairs grows quadratically - a 5,000 atom PDB file has about 12.5 million unique pairs.
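
The pair count follows from the formula n(n - 1)/2. The sketch below uses only the standard library to show both the generator idea and the arithmetic; the pairwise function here is a stand-in, not atomium's pairwise_atoms.

```python
from itertools import combinations
from math import comb


def pairwise(items):
    """Yield every unique unordered pair of items exactly once."""
    yield from combinations(items, 2)


pairs = list(pairwise(["N", "CA", "C", "O"]))
print(len(pairs))     # 4 items give 4 * 3 / 2 pairs

n_pairs_5000 = comb(5000, 2)
print(n_pairs_5000)   # 5000 * 4999 / 2
```

Because the pairs are generated lazily, memory use stays small even when the total number of pairs is in the millions.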

Structures can be moved around and otherwise compared with each other...

>>> pdb1.model.ligand(id='B.2002').mass
351.1022
>>> pdb1.model.ligand(id='B.2002').formula
Counter({'C': 10, 'O': 9, 'N': 4, 'P': 1})
>>> pdb1.model.ligand(id='B.2002').nearby_atoms(2.8)
{<Atom 3416 (O)>, <Atom 3375 (O)>, <Atom 1635 (OD1)>}
>>> pdb1.model.ligand(id='B.2002').nearby_atoms(2.8, name='OD1')
{<Atom 1635 (OD1)>}
>>> pdb1.model.ligand(id='B.2002').nearby_residues(2.8)
{<Residue ASP (B.1020)>}
>>> pdb1.model.ligand(id='B.2002').nearby_structures(2.8, waters=True)
{<Residue ASP (B.1020)>, <Water HOH (B.3155)>, <Water HOH (B.3059)>}
>>> import math
>>> pdb1.model.ligand(id='B.2002').rotate(math.pi / 2, 'x')
>>> pdb1.model.ligand(id='B.2002').translate(10, 10, 15)
>>> pdb1.model.ligand(id='B.2002').center_of_mass
(-9.886734282781484, -42.558415679537184, 77.33400578435568)
>>> pdb1.model.ligand(id='B.2002').radius_of_gyration
3.6633506511540825
>>> pdb1.model.ligand(id='B.2002').rmsd_with(pdb1.model.ligand(id='A.2001'))
0.133255572356

Here we look at one of the ligands, identify its mass and molecular formula, look at what atoms are within 2.8 Angstroms of it, and what residues are within that same distance, rotate it and translate it through space, see where its new center of mass is, and then finally get its RMSD with the other similar ligand in the model.
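
For readers who want to see what these metrics mean numerically, here is a minimal sketch on plain (mass, coordinates) tuples. It assumes an unweighted radius of gyration and an RMSD over already paired atoms without any superposition. atomium's own properties may differ in those details.

```python
import math


def center_of_mass(atoms):
    """atoms: list of (mass, (x, y, z)). Returns the mass-weighted mean position."""
    total = sum(mass for mass, _ in atoms)
    return tuple(
        sum(mass * xyz[i] for mass, xyz in atoms) / total for i in range(3)
    )


def radius_of_gyration(atoms):
    """Root-mean-square distance of the atoms from their center of mass."""
    center = center_of_mass(atoms)
    squared = [math.dist(xyz, center) ** 2 for _, xyz in atoms]
    return math.sqrt(sum(squared) / len(squared))


def rmsd(coords1, coords2):
    """RMSD between two equally long, already paired lists of coordinates."""
    if len(coords1) != len(coords2):
        raise ValueError("Structures must have the same number of atoms")
    squared = [math.dist(a, b) ** 2 for a, b in zip(coords1, coords2)]
    return math.sqrt(sum(squared) / len(squared))


# Two equal masses two Angstroms apart (made-up coordinates)
pair = [(1.0, (0.0, 0.0, 0.0)), (1.0, (2.0, 0.0, 0.0))]
print(center_of_mass(pair), radius_of_gyration(pair))
```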

Any operation which involves identifying nearby structures or atoms can be sped up - dramatically in the case of very large structures - by calling Model.optimise_distances on the Model first. This prevents atomium from having to compare every atom with every other atom every time a proximity check is made.
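
One common way to avoid comparing every atom with every other atom is to bucket the coordinates into a spatial grid. The sketch below shows that general technique. It is not necessarily what Model.optimise_distances does internally, and the Grid class and its cell_size parameter are made up for this example.

```python
import math
from collections import defaultdict


class Grid:
    """Buckets points into cubic cells so that a sphere query only has to
    inspect the cells overlapping the sphere, not every point."""

    def __init__(self, points, cell_size=10.0):
        self.cell_size = cell_size
        self.cells = defaultdict(list)
        for point in points:
            self.cells[self._cell(point)].append(point)

    def _cell(self, point):
        return tuple(math.floor(c / self.cell_size) for c in point)

    def points_in_sphere(self, center, radius):
        low = self._cell([c - radius for c in center])
        high = self._cell([c + radius for c in center])
        found = []
        for i in range(low[0], high[0] + 1):
            for j in range(low[1], high[1] + 1):
                for k in range(low[2], high[2] + 1):
                    # .get avoids creating empty cells in the defaultdict
                    for point in self.cells.get((i, j, k), []):
                        if math.dist(point, center) <= radius:
                            found.append(point)
        return found


points = [(0, 0, 0), (1, 1, 1), (25, 25, 25), (-3, 0, 0)]
grid = Grid(points, cell_size=5)
print(sorted(grid.points_in_sphere((0, 0, 0), 3.5)))
```

Building the grid costs one pass over the points, so it pays off when many proximity queries are made against the same model.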

The Atom objects themselves have their own useful properties.

>>> pdb1.model.atom(97)
<Atom 97 (CA)>
>>> pdb1.model.atom(97).mass
12.0107
>>> pdb1.model.atom(97).anisotropy
[0, 0, 0, 0, 0, 0]
>>> pdb1.model.atom(97).bvalue
24.87
>>> pdb1.model.atom(97).location
(-12.739, 31.201, 43.016)
>>> pdb1.model.atom(97).distance_to(pdb1.model.atom(1))
26.18289982030257
>>> pdb1.model.atom(97).nearby_atoms(2)
{<Atom 96 (N)>, <Atom 98 (C)>, <Atom 100 (CB)>}
>>> pdb1.model.atom(97).is_metal
False
>>> pdb1.model.atom(97).structure
<Residue ASN (A.23)>
>>> pdb1.model.atom(97).chain
<Chain A (204 residues)>

Chains are a bit different from other structures in that they are iterable, indexable, and return their residues as a tuple, not a set...

>>> pdb1.model.atom(97).chain
<Chain A (204 residues)>
>>> pdb1.model.chain('A')
<Chain A (204 residues)>
>>> len(pdb1.model.chain('A'))
204
>>> pdb1.model.chain('A')[10]
<Residue LEU (A.21)>
>>> pdb1.model.chain('A').residues()[:5]
(<Residue VAL (A.11)>, <Residue MET (A.12)>, <Residue ASN (A.13)>, <Residue
ARG (A.14)>, <Residue LEU (A.15)>)
>>> pdb1.model.chain('A').sequence
'LRSRRVDVMDVMNRLILAMDLMNRDDALRVTGEVREYIDTVKIGYPLVLSEGMDIIAEFRKRFGCRIIADFKVAD
IPETNEKICRATFKAGADAIIVHGFPGADSVRACLNVAEEMGREVFLLTEMSHPGAEMFIQGAADEIARMGVDLGV
KNYVGPSTRPERLSRLREIIGQDSFLISPGVGAQGGDPGETLRFADAIIVGRSIYLADNPAAAAAGIIESIKDLLI
PE'

The sequence is the 'real' sequence that exists in nature. Some of its residues may be missing from the model for practical reasons.

Residues can generate name information based on their three letter code, and are aware of their immediate neighbors.

>>> pdb1.model.residue('A.100')
<Residue PHE (A.100)>
>>> pdb1.model.residue('A.100').name
'PHE'
>>> pdb1.model.residue('A.100').code
'F'
>>> pdb1.model.residue('A.100').full_name
'phenylalanine'
>>> pdb1.model.residue('A.100').next
<Residue PRO (A.101)>
>>> pdb1.model.residue('A.100').previous
<Residue GLY (A.99)>

Saving Data

A model can be saved to file using:

>>> model.save("new.cif")
>>> model.save("new.pdb")

Any structure can be saved in this way, so you can save chains or molecules to their own separate files if you so wish.

>>> model.chain("A").save("chainA.pdb")
>>> model.chain("B").save("chainB.cif")
>>> model.ligand(name="XMP").save("ligand.mmtf")

Note that if the model you are saving is one from a biological assembly, it will likely have many duplicated IDs, so saving to file may create unexpected results.
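
If you want to check for this yourself before saving, a duplicate-ID check is easy to write. The helper below is a hypothetical stand-alone sketch that works on any list of ID strings; it is not part of atomium's API.

```python
import warnings
from collections import Counter


def warn_on_duplicate_ids(ids):
    """Emit a warning naming any ID that occurs more than once, and return them."""
    duplicates = sorted(id_ for id_, count in Counter(ids).items() if count > 1)
    if duplicates:
        warnings.warn("Duplicate IDs: " + ", ".join(map(str, duplicates)))
    return duplicates


# Made-up residue IDs, as might result from copying chain A in an assembly
print(warn_on_duplicate_ids(["A.1", "A.2", "B.1"]))
```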

Changelog

Release 1.0.11

27 November 2021

  • Optimised distance lookup for finding atoms within sphere.

Release 1.0.10

29 May 2021

  • Fixed secondary structure parsing for multi character asym IDs in mmCIF.

Release 1.0.9

4 February 2021

  • Fixed temperature factor zero-padding in PDB saving.
  • Fixed MMTF decode bug in Ubuntu.

Release 1.0.8

9 December 2020

  • HETATM identity now preserved when parsing PDB files

Release 1.0.7

5 November 2020

  • Fixed blank ANISOU values in PDB saving.
  • Fixed negative residue IDs in PDB saving.
  • Fixed SyntaxWarning messages on PDB saving.

Release 1.0.6

8 September 2020

  • Added handling of new branched entities in MMCIF/MMTF.

Release 1.0.5

21 July 2020

  • Added ability to open compressed .gz files.

Release 1.0.4

1 May 2020

  • Made TER records more compliant in saved PDB files.
  • Specified required msgpack version to fix MMTF parsing issue.

Release 1.0.3

5 December 2019

  • Made quality information detection more broad.
  • Improved documentation.

Release 1.0.2

1 October 2019

  • Added distance optimiser for proximity checks.
  • Improved test coverage.

Release 1.0.1

26 September 2019

  • Added a pdb2json script for converting local structure files to JSON.
  • Improved speed comparison checks.

Release 1.0.0

23 June 2019

  • Saving now issues warning if the structure has duplicate IDs.
  • Missing residues parsed for all three file types.
  • Crystallographic information now parsed.
  • Refactor of atomic structures.
  • Refactor of .mmtf parsing.
  • Structure copying now retains all properties.
  • Fixed bug in parsing .cif expression systems.
  • Full names of ligands and modified residues now parsed.
  • Secondary structure information parsed and available now.
  • Atoms now have covalent radius property for calculating bond cutoffs.
  • .pdb parsing can now handle heavy water (DOD).
  • General speed improvements.

Release 0.12.2

4 February 2019

  • Angle between superimposed atoms now possible.
  • Fixed source species lookup in .cif files.
  • Fixed bug relating to embedded quotes in .cif files.

Release 0.12.1

13 January 2019

  • Fixed assembly parsing bug in small number of .cif files.

Release 0.12.0

2 January 2019

  • Refactored parse utilities to improve speed.
  • Added support for .mmtf files.
  • Added file writing for all three file types (.pdb, .cif, .mmtf).
  • Made .cif the default file type.
  • General library restructuring.

Release 0.11.1

13 September 2018

  • Fixed bug pertaining to residues with ID 0.
  • Fixed bug pertaining to SEQRES parsing when chain ID is numeric.
  • Changed format of residue IDs to include colon.
  • Considerable speed improvements in .mmcif parsing.

Release 0.11.0

22 August 2018

  • Added .mmcif parsing.
  • Changed how parsing in general is done under the hood.
  • Added atom angle calculation.
  • Fixed bug where modified residues were treated as ligands if authors used HETATM records.

Release 0.10.2

29 July 2018

  • Added function for getting PDBs over SSH.
  • Fixed biological assembly parsing bug.
  • Fixed chain copying of sequence bug.

Release 0.10.1

25 June 2018

  • Added function for returning best biological assembly.
  • Fixed bug with sorting None energy assemblies.
  • Fixed bug pertaining to excessive atom duplication when creating assembly.

Release 0.10.0

22 June 2018

  • Parsing of .pdb keywords.
  • Parsing of atom anisotropy.
  • Parsing of .pdb sequence information.
  • More R-factor information.
  • Biological assembly parsing and generation.
  • More powerful transformations rather than just simple rotation.
  • Backend simplifications.
  • Powerful new atom querying syntax.

Release 0.9.1

17 May 2018

  • Added Residue one-letter codes.
  • Fixed stray print statement.

Release 0.9.0

10 April 2018

  • Turned many methods into properties.
  • Added full residue name generation.
  • Made bind site detection more picky.
  • Added coordinate rounding to deal with floating point rounding errors.
  • Atomic structures now 'copy'able.
  • Refactored atom querying.
  • Added grid generation.
  • Implemented Kabsch superposition/rotation.
  • Implemented RMSD comparison.
  • Created Complex class (for later).

Release 0.8.0

2 December 2017

  • Added option to get water residues in binding sites.

  • Added extra PDB meta information parsing, such as:

    • Classification
    • Experimental Technique
    • Source Organism
    • Expression Organism
    • R-factor

Release 0.7.0

2 November 2017

  • PDBs with multiple occupancy can now be parsed correctly.
  • Added pairwise atom generator.
  • PDB parser now extracts resolution.
  • Further speed increased to PDB parser.
  • Miscellaneous bug fixes.
  • Implemented Continuous Integration.

Release 0.6.0

3 October 2017

  • Now allows for fetching and opening of PDB data dictionaries.
  • Added parsing/saving of HEADER and TITLE records in PDB files.
  • Added ability to exclude elements from atom search.
  • Added ability to get nearby atoms in a model.
  • Added bind site identification.
  • Fixed chain length bottleneck in PDB model saving.
  • Overhauled PDB parsing by replacing classes with built in Python types.
  • Fixed bug where numerical residue names were interpreted as integers.
  • Changed atoms so that they can allow negative B factors.
  • Added loading of .xyz data dictionaries.
  • Miscellaneous speed increases.

Release 0.5.0

16 September 2017

  • Added atom temperature factors.
  • Added bond vector production.
  • Added parse time tests and reduced parse time by over a half.
  • Changed way atoms are stored in structures to make ID lookup orders of magnitude faster.
  • Made IDs immutable.
  • Added multiple model parsing and saving.
  • Added option to fetch PDBs from PDBe rather than RCSB.

Release 0.4.0

26 August 2017

  • Added PDB parsing.
  • Added PDB saving.
  • Gave atoms ability to get specific bond with other atom.
  • Added bond angle calculation.
  • Added ability to filter out water molecules.

Release 0.3.0

11 August 2017

  • Added classes for Molecules, Chains, Residues, and their interfaces.
  • Added charges to atoms and structures.
  • Add ability to create AtomicStructures from AtomicStructures.

Release 0.2.0

14 June 2017

  • Made all Atomic Structures savable.
  • Added Atom IDs and uniqueness constraints.
  • Added Atom Bonds.

Release 0.1.1

1 June 2017

  • Fixed setup.py
  • Minor typos

Release 0.1.0

1 June 2017

  • Added basic Model and Atom classes.
  • Added .xyz parsing.

atomium's People

Contributors

filipemaia, juvilius, mnahinkhan, samirelanduk

atomium's Issues

Add clustering of chains into complexes

Description of Feature

PDB files contain COMPND records, detailing the names of complexes and the chains that make them up. These should be parsed to add complexes to models.

Proposed Example Code

>>> model.complexes()
set(<Complex 1>, <Complex 2>)
>>> model.complex("1").chains()
set(<Chain A>, <Chain B>)

Standalone version of atomium

I am having some trouble with people installing atomium from pip (It seems to be a Blender related issue) so I was hoping to try and bundle a version of atomium inside of the addon itself.

I know atomium requires a couple of other packages etc for fetching - would you have advice on how to go about doing this? Do you have the ability to build a standalone version of atomium that I could include, rather than having to install using pip? I'm still early on my python journey (die-hard R fan) so package management inside of python is still new territory for me.

the "add_secondary_structure_to_polymers" function in the mmcif.py module only recognises single-letter chain identifiers

For Bug Reports

the "add_secondary_structure_to_polymers" function in the mmcif.py module only recognises single-letter chain identifiers

Problem

cryoEM derived mmcif files often have multi-letter chain identifiers; those chains are not recognised and secondary structure information is not forwarded to downstream functions

the reason can be found in line 567:
<chain = model["polymer"].get(segment[0][0])>
it should be exchanged by:
<chain = model["polymer"].get(segment[0].split('.')[0])>
or similar behaving commands to allow multi-letter chain identifiers
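
The difference between the two expressions is easy to see on a bare string. This sketch assumes residue IDs of the form '<chain>.<number>', as used elsewhere in these docs; the helper name is made up.

```python
def chain_id_from_residue_id(residue_id):
    """Return everything before the first dot, e.g. 'AA.12' -> 'AA'."""
    return residue_id.split(".")[0]


residue_id = "AA.12"
first_character = residue_id[0]                    # what indexing with [0] yields
full_chain_id = chain_id_from_residue_id(residue_id)
print(first_character, full_chain_id)
```

For single-letter chains both approaches agree, which is why the problem only shows up with multi-letter identifiers.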

Python Version/Operating System

atomium 1.0.6

TER line doesn't comply to PDB TER records spec

Description of Feature

The current terminal line only contains

TER

and doesn't comply to the PDB TER records spec where the TER line is specified as

The TER record has the same residue name, chain identifier, sequence number and insertion code as the terminal residue. The serial number of the TER record is one number greater than the serial number of the ATOM/HETATM preceding the TER.

This leads to errors in programs using PDBs written by atomium. I found this problem trying to use MMligner with atomium written PDBs
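
For reference, a TER line following the column layout of the PDB format (serial in columns 7-11, residue name in 18-20, chain ID in 22, sequence number in 23-26, insertion code in 27) could be built as below. This is an illustrative sketch with a made-up function name, not the code atomium uses.

```python
def ter_line(last_atom_serial, res_name, chain_id, res_seq, insert_code=""):
    """Build a TER record for the residue that ends a chain.

    The serial number is one greater than that of the preceding
    ATOM/HETATM record, as the spec requires.
    """
    line = "TER   {:>5}      {:>3} {:1}{:>4}{:1}".format(
        last_atom_serial + 1, res_name, chain_id, res_seq, insert_code
    )
    return line.ljust(80)  # PDB records are 80 columns wide


# Hypothetical values: the last atom of chain A has serial 1000
print(ter_line(1000, "ALA", "A", 100).rstrip())
```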

Loading 6LU7 from mmtf fails in decode_dict

Expected behaviour

Should read the 6lu7.mmtf file and return an atomium.data.File instance.

Actual behaviour

Crashes within decode_dict.

~/.pyenv/versions/3.8.6/envs/helico-base/lib/python3.8/site-packages/atomium/mmtf.py in decode_dict(d)
     41             elif isinstance(new_value[0], bytes):
     42                 new_value = [x.decode() for x in new_value]
---> 43         new[key.decode()] = new_value
     44     return new
     45 

AttributeError: 'str' object has no attribute 'decode'

I think the .decode() call is unnecessary.
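
Whether keys arrive as bytes or str depends on the msgpack version and its raw option, so a tolerant approach is to decode only when needed. The function below is a simplified sketch of that idea, not a patch for atomium's decode_dict: it only normalises keys and deliberately leaves values alone, since some MMTF values are binary arrays rather than text.

```python
def decode_keys(d):
    """Return a copy of d whose keys are all str, whether they arrived as
    bytes or as str. Nested dicts (also inside lists) are handled too."""
    new = {}
    for key, value in d.items():
        if isinstance(value, dict):
            value = decode_keys(value)
        elif isinstance(value, list):
            value = [decode_keys(v) if isinstance(v, dict) else v for v in value]
        new[key.decode() if isinstance(key, bytes) else key] = value
    return new


mixed = {b"title": "6LU7", "groups": [{b"name": "ALA"}, b"raw"]}
print(decode_keys(mixed))
```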

Example code to reproduce

mmtf1 = atomium.open('/home/zach/Downloads/6lu7.mmtf')

Python Version/Operating System

Python 3.8.6 in a venv, on Ubuntu 20.04.1.

Getting the Residue to which an Atom belongs

The Overview Docs show an example of how to get a Structure to which an Atom belongs:

>>> pdb1.model.atom(97).structure
<Residue ASN (A.23)>

Unfortunately it seems Atom objects have no attribute "structure". Indeed, I cannot find a reference to such an attribute within the source code.

Was this a feature that was removed? Are there other supported ways in which I might be able to get the residue to which an Atom belongs?

Thank you very much!

Auto-creation of bonds.

I'm sorry if this is obvious, but I have looked around and tried to have a go at it myself, but I am unable to find a way to "auto" generate bonds from a pdb, based on proximity to other atoms etc. Is this implemented somewhere by any chance?

I am currently testing out & implementing atomium for use inside of blender which has enabled me to import pdb files which I couldn't do previously. Any extra information that I can squeeze out of this library would be a great help with not having to reimplement it myself!
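
A common distance-based approach is to call two atoms bonded when they are closer than the sum of their covalent radii plus a small tolerance. The sketch below shows that heuristic on plain tuples. The radii table is an approximate illustrative subset, and the function name and tolerance value are assumptions, not atomium features.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in Angstroms (illustrative subset)
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05, "P": 1.07}


def infer_bonds(atoms, tolerance=0.4):
    """atoms: list of (element, (x, y, z)). Returns index pairs judged bonded.

    Two atoms count as bonded when their distance is at most the sum of
    their covalent radii plus the tolerance.
    """
    bonds = []
    for (i, (el1, xyz1)), (j, (el2, xyz2)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[el1] + COVALENT_RADII[el2] + tolerance
        if math.dist(xyz1, xyz2) <= cutoff:
            bonds.append((i, j))
    return bonds


# A water molecule with rough, made-up coordinates
water = [("O", (0.0, 0.0, 0.0)), ("H", (0.957, 0.0, 0.0)), ("H", (-0.24, 0.927, 0.0))]
print(infer_bonds(water))
```

This checks every pair, so for a whole protein it would be worth combining with some kind of spatial lookup to limit the candidates.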

README error

Just spotted a documentation error, in the README you have an example containing:

>>> model.translate(34, -12, 3.5)
>>> model.rotate("x", 45)
>>> model.atom(element="O")

Unfortunately
model.rotate("x", 45)
returns:
ValueError: 45 is not a valid axis

Switching the order to:
model.rotate(45, "x")
runs as expected though.

Great work on atomium, really simple to integrate within other scripts.. cheers!

Implement parallel processing

The multiprocessing library could speed up parts of the PDB parsing process - especially those parts that are just processing thousands of records.

KeyError: 'branched' when fetching some pdb files

Expected behaviour

fetch a mmCIF structure

Actual behaviour

Key error

Example code to reproduce

s =atomium.fetch('6xlu')

Python Version/Operating System

3.8.3

For Bug Reports


KeyError Traceback (most recent call last)
~/anaconda3/envs/Py3/lib/python3.8/site-packages/atomium/mmcif.py in add_atom_to_non_polymer(atom, aniso, model, mol_type, names)
519 try:
--> 520 model[mol_type][mol_id]["atoms"][
521 int(atom["id"])

KeyError: 'branched'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in
1 # Downloading PDB
----> 2 s =atomium.fetch('6xlu')

~/anaconda3/envs/Py3/lib/python3.8/site-packages/atomium/utilities.py in fetch(code, *args, **kwargs)
73 if response.status_code == 200:
74 text = response.content if code.endswith(".mmtf") else response.text
---> 75 return parse_string(text, code, *args, **kwargs)
76 raise ValueError("Could not find anything at {}".format(url))
77

~/anaconda3/envs/Py3/lib/python3.8/site-packages/atomium/utilities.py in parse_string(filestring, path, file_dict, data_dict)
122 parsed = file_func(filestring)
123 if not file_dict:
--> 124 parsed = data_func(parsed)
125 if not data_dict:
126 filetype = data_func.__name__.split("_")[0].replace("mmc", "c")

~/anaconda3/envs/Py3/lib/python3.8/site-packages/atomium/mmcif.py in mmcif_dict_to_data_dict(mmcif_dict)
206 update_quality_dict(mmcif_dict, data_dict)
207 update_geometry_dict(mmcif_dict, data_dict)
--> 208 update_models_list(mmcif_dict, data_dict)
209 return data_dict
210

~/anaconda3/envs/Py3/lib/python3.8/site-packages/atomium/mmcif.py in update_models_list(mmcif_dict, data_dict)
432 add_atom_to_polymer(atom, aniso, model, names)
433 else:
--> 434 add_atom_to_non_polymer(atom, aniso, model, mol_type, names)
435 data_dict["models"].append(model)
436 for model in data_dict["models"]:

~/anaconda3/envs/Py3/lib/python3.8/site-packages/atomium/mmcif.py in add_atom_to_non_polymer(atom, aniso, model, mol_type, names)
523 except:
524 name = atom["auth_comp_id"]
--> 525 model[mol_type][mol_id] = {
526 "name": name, "full_name": names.get(name),
527 "internal_id": atom["label_asym_id"],

KeyError: 'branched'

Get PDB files from remote machines

Description of Feature

Hey, I was wondering if you've considered implementing a helper function to fetch data with ssh. This could be done easily with a helper function that accepts file objects, because there is a library that provides the same python file API to handle files in a remote machine.
This could be really useful in our lab because we mirror pdb files and accessing these files is quicker than downloading them from RCSB. However, these are in a remote machine and cannot just be opened with open from the python standard library.

Proposed Example Code

The implementation could be something like this:

def fetch_io(file):
    # read() returns the whole file as one string; readlines() would give a list
    filestring = file.read()
    return pdb_string_to_pdb_dict(filestring)

And then maybe add a check to see if the file is open? And then the user is responsible for closing it.

This could be used like this

import paramiko

# create the remote file object to handle files in a remote machine
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect('remote-machine', username='user123')

# and then have a function that accepts this object
pdb = fetch_io(ssh)

And could accept file objects in general, so:

with open('my_pdb.pdb', 'r') as f:
  pdb = fetch_io(f)

getting the residue details from an atom

it would be great if while iterating over nearby atoms it was possible to access the chain and residues for these atoms. Is there a way to do this?

for example:

for atom_i in model.atoms():
  for atom_j in atom_i.nearby(6.):
    print(atom_j.chain, atom_j.residue_name, atom_j.residue_number)

Parsing the Rfree Value

It would also be useful to parse the Rfree and number of reflections used to generate the Rfree from the PDB remarks.

FREE R VALUE
FREE R VALUE TEST SET COUNT

Something like:

import re

def extract_rfree(pdb_dict, lines):
    """Takes a ``dict`` and adds rfree information to it by parsing
    REMARK 3.
    :param dict pdb_dict: the ``dict`` to update.
    :param list lines: the file lines to read from."""

    # get_lines is assumed to be the parser's existing record-lookup helper
    remark_lines = get_lines("REMARK", lines)
    pattern = r"FREE R VALUE\s+:(.+)"
    for remark in remark_lines:
        if int(remark[7:10]) == 3 and remark[10:].strip():
            matches = re.findall(pattern, remark)
            if matches:
                try:
                    pdb_dict["rfree"] = float(matches[0].strip())
                    break
                except ValueError:
                    pass  # not a number (e.g. NULL), so keep looking
    else:
        # The loop ended without a break, so no value was found
        pdb_dict["rfree"] = None


def extract_rfree_count(pdb_dict, lines):
    """Takes a ``dict`` and adds rfree information to it by parsing
    REMARK 3.
    :param dict pdb_dict: the ``dict`` to update.
    :param list lines: the file lines to read from."""

    remark_lines = get_lines("REMARK", lines)
    pattern = r"FREE R VALUE TEST SET COUNT\s+:(.+)"
    for remark in remark_lines:
        if int(remark[7:10]) == 3 and remark[10:].strip():
            matches = re.findall(pattern, remark)
            if matches:
                try:
                    pdb_dict["freecount"] = float(matches[0].strip())
                    break
                except ValueError:
                    pass  # not a number (e.g. NULL), so keep looking
    else:
        # The loop ended without a break, so no value was found
        pdb_dict["freecount"] = None

Error with save

I am attempting to save a model I created from a set of atoms I got using the .atoms_in_sphere() function. This is the code and the traceback I am getting.

pdb = atomium.open(pdb_path)
pdb.model.optimise_distances()

pocket_atoms = pdb.model.atoms_in_sphere((25.55,13.88,-29.54), 8)

pocket = atomium.structures.Residue(*pocket_atoms, id='A', name='A')
x = set()
x.add(pocket)

new_chain = atomium.Chain(id='A', *x)
new_model = atomium.Model(new_chain)
new_model.save("test.pdb")

Traceback (most recent call last):
  File "/home/ray/Pocket-Prediction/src/describe_pocket.py", line 182, in <module>
    describe_pockets('./data/4bcf.pdb')
  File "/home/ray/Pocket-Prediction/src/describe_pocket.py", line 178, in describe_pockets
    get_pocket_atoms(pdb_path)
  File "/home/ray/Pocket-Prediction/src/describe_pocket.py", line 129, in get_pocket_atoms
    new_model.save("pocket_ctsb_test.pdb")
  File "/home/ray/anaconda3/lib/python3.8/site-packages/atomium/structures.py", line 231, in save
    string = structure_to_pdb_string(self)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/atomium/pdb.py", line 565, in structure_to_pdb_string
    atom_to_atom_line(atom, lines)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/atomium/pdb.py", line 606, in atom_to_atom_line
    residue_id = int("".join([c for c in id_ if c.isdigit() or c == "-"]))
ValueError: invalid literal for int() with base 10: ''

atomium breaks on RCSB generated assembly files

For some reason, the assembly-specific .cif files that you can download from the RCSB don't have a struct_asym table. atomium needs this table because there is no other way to map atom chain IDs to entity types, so parsing these files raises a KeyError.

I am not writing a workaround for this now because these assemblies can be generated anyway, but it should be fixed at some point.

Add option for offline mirror of PDB

Using an environment variable, user should be able to specify a directory that atomium will look in first before fetching a file over the internet.
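
A lookup of this kind might be sketched as follows. Everything here is hypothetical: ATOMIUM_MIRROR is a made-up variable name that atomium does not read, and the flat directory layout with lower-case file names is an assumption.

```python
import os
from pathlib import Path


def find_in_mirror(code, env_var="ATOMIUM_MIRROR",
                   extensions=(".cif", ".pdb", ".mmtf")):
    """Return the path of a local copy of `code`, or None if there isn't one.

    The directory is read from the environment variable named by env_var
    (a hypothetical name chosen for this sketch).
    """
    directory = os.environ.get(env_var)
    if not directory:
        return None
    for extension in extensions:
        candidate = Path(directory) / (code.lower() + extension)
        if candidate.is_file():
            return candidate
    return None


local_copy = find_in_mirror("1LOL")
print(local_copy if local_copy else "not mirrored; fetch over the internet instead")
```

A caller would try find_in_mirror first and only fall back to a network fetch when it returns None.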

atomium should not depend on `strptime` to parse dates in PDB files

For Bug Reports

Expected behaviour

Read PDB file if using non English locale
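
The underlying problem is that the %b directive of strptime matches month abbreviations in the current locale, while PDB HEADER dates always use English ones. A locale-independent parser could look like this sketch. The function name is made up and this is not the fix that was applied in atomium.

```python
from datetime import date

MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def parse_pdb_date(text):
    """Parse a PDB 'DD-MMM-YY' date such as '06-MAY-02' without using the locale."""
    day, month, year = text.strip().split("-")
    year = int(year)
    # Same two-digit-year pivot as strptime's %y: 69-99 -> 19xx, 00-68 -> 20xx
    year += 1900 if year >= 69 else 2000
    return date(year, MONTHS[month.upper()], int(day))


print(parse_pdb_date("06-MAY-02"))
```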

Actual behaviour

~/.pyenv/versions/3.8.3/envs/partseg3.8/lib/python3.8/site-packages/atomium/utilities.py in open(path, *args, **kwargs)
     39         except:
     40             with builtins.open(path, "rb") as f: filestring = f.read()
---> 41         return parse_string(filestring, path, *args, **kwargs)
     42 
     43 

~/.pyenv/versions/3.8.3/envs/partseg3.8/lib/python3.8/site-packages/atomium/utilities.py in parse_string(filestring, path, file_dict, data_dict)
    122     parsed = file_func(filestring)
    123     if not file_dict:
--> 124         parsed = data_func(parsed)
    125         if not data_dict:
    126             filetype = data_func.__name__.split("_")[0].replace("mmc", "c")

~/.pyenv/versions/3.8.3/envs/partseg3.8/lib/python3.8/site-packages/atomium/pdb.py in pdb_dict_to_data_dict(pdb_dict)
     80       "geometry": {"assemblies": [], "crystallography": {}}, "models": []
     81     }
---> 82     update_description_dict(pdb_dict, data_dict)
     83     update_experiment_dict(pdb_dict, data_dict)
     84     update_quality_dict(pdb_dict, data_dict)

~/.pyenv/versions/3.8.3/envs/partseg3.8/lib/python3.8/site-packages/atomium/pdb.py in update_description_dict(pdb_dict, data_dict)
     95     :param dict data_dict: The data dictionary to update."""
     96 
---> 97     extract_header(pdb_dict, data_dict["description"])
     98     extract_title(pdb_dict, data_dict["description"])
     99     extract_keywords(pdb_dict, data_dict["description"])

~/.pyenv/versions/3.8.3/envs/partseg3.8/lib/python3.8/site-packages/atomium/pdb.py in extract_header(pdb_dict, description_dict)
    174         line = pdb_dict["HEADER"][0]
    175         if line[50:59].strip():
--> 176             description_dict["deposition_date"] = datetime.strptime(
    177              line[50:59], "%d-%b-%y"
    178             ).date()

~/.pyenv/versions/3.8.3/lib/python3.8/_strptime.py in _strptime_datetime(cls, data_string, format)
    566     """Return a class cls instance based on the input string and the
    567     format string."""
--> 568     tt, fraction, gmtoff_fraction = _strptime(data_string, format)
    569     tzname, gmtoff = tt[-2:]
    570     args = tt[:6] + (fraction,)

~/.pyenv/versions/3.8.3/lib/python3.8/_strptime.py in _strptime(data_string, format)
    347     found = format_regex.match(data_string)
    348     if not found:
--> 349         raise ValueError("time data %r does not match format %r" %
    350                          (data_string, format))
    351     if len(data_string) != found.end():

ValueError: time data '24-JUL-96' does not match format '%d-%b-%y'

For my case

In[11]: date(1996, 6, 15).strftime("%d-%b-%y")
Out[11]: '15-cze-96'

Not '15-Jun-96'

Example code to reproduce

On an OS with a non-English locale:

import atomium
import locale

locale.setlocale(locale.LC_ALL, "")
atomium.open("/home/czaki/Pobrane/tmp/3mht.pdb")

The documentation at https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes states explicitly that strptime uses the locale's abbreviated month names for %b.
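
A locale-independent alternative is to map the English month abbreviations by hand. This is a minimal sketch, not atomium's code: `parse_pdb_date` and `MONTHS` are hypothetical names, and the two-digit year uses the same pivot as strptime's %y (69-99 become 1969-1999, 00-68 become 2000-2068).

```python
from datetime import date

# English month abbreviations as they appear in PDB HEADER records.
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def parse_pdb_date(text):
    """Parse a PDB 'DD-MON-YY' date without consulting the current locale."""
    day, month, year = text.strip().split("-")
    year = int(year)
    # Same two-digit year pivot that strptime's %y applies.
    year += 1900 if year >= 69 else 2000
    return date(year, MONTHS[month.upper()], int(day))
```

With this, '24-JUL-96' parses to the same date whatever `locale.setlocale` has been set to.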

Python Version/Operating System

Python 3.8.3, Ubuntu 21.10

Zero-pad pdb temperature factor

For Bug Reports

Hi, it seems that when saving a structure to PDB format, the temperature factor does not get zero padded correctly, e.g.:

ATOM    194  O   LYS A  20      34.379  53.562  37.179  1.00 53.10           O  # expected
ATOM    194  O   LYS A  20      34.379  53.562  37.179  1.00  53.1           O  # actual

Example code to reproduce

import atomium

model = atomium.fetch("5LXR").model
print(atomium.pdb.structure_to_pdb_string(model.residue("A.7")).split("\n")[4])

Python Version/Operating System

Python 3.7.7, macOS, atomium v1.0.8
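
The PDB format stores the temperature factor in a six-character field with two decimal places, so a fixed-width format specifier gives the expected padding. This is a sketch only; `format_b_factor` is a hypothetical name and not the function atomium uses.

```python
def format_b_factor(value):
    """Format a temperature factor as a right-aligned, 6-character, 2-decimal field."""
    return "{:6.2f}".format(value)
```

Here `format_b_factor(53.1)` gives `' 53.10'`, which matches the expected line above.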

PDB format output with numbers as chain ID

Hi. I am using atomium to extract molecules from mmCIF files and write them into PDB format. Generally, this works really well, but I encountered an issue with structures where the chain ID is a number instead of a letter.

Expected behaviour

The chain ID should not be written as part of the residue number, but only in the column reserved for the chain ID.

Actual behaviour

When the chain ID is a number, it is written into the PDB string twice (once as the chain ID and once as part of the residue number). The resulting lines are wider than the PDB specification allows and are parsed badly by many other programs.

Example code to reproduce

import atomium
cif = atomium.fetch("6L4T")
lig = [l for l in cif.model.ligands() if l.id == "13.308"][0]
print(atomium.pdb.structure_to_pdb_string(lig))

Output (truncated):

HETATM20582  NB  KC1 1313308     208.930 314.544 325.109  1.00 90.18           N  
HETATM20583  ND  KC1 1313308     205.979 312.067 326.352  1.00 90.18           N  
HETATM20584  C1A KC1 1313308     208.131 312.489 328.676  1.00 90.18           C  
HETATM20585  C1B KC1 1313308     209.880 315.122 325.835  1.00 90.18           C  
HETATM20586  C1C KC1 1313308     206.761 314.055 322.987  1.00 90.18           C  
HETATM20587  C1D KC1 1313308     204.767 311.511 325.824  1.00 90.18           C  

Note that the chain ID ("13") is written twice.
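
For reference, the PDB format reserves column 22 for the chain ID, columns 23-26 for the residue number and column 27 for the insertion code. The sketch below builds those six columns and refuses values that do not fit, rather than letting them shift the rest of the line. `residue_columns` is a hypothetical helper, not part of atomium.

```python
def residue_columns(chain_id, number, insert_code=""):
    """Build columns 22-27 of an ATOM/HETATM line, or raise if a value will not fit."""
    if len(chain_id) > 1:
        raise ValueError("chain ID %r does not fit the single chain column" % chain_id)
    if len(insert_code) > 1:
        raise ValueError("insertion code %r is longer than one character" % insert_code)
    # Chain ID (1 column), right-aligned residue number (4 columns), insertion code (1 column).
    field = "{:1}{:>4}{:1}".format(chain_id, number, insert_code)
    if len(field) != 6:
        raise ValueError("residue number %r does not fit in four columns" % number)
    return field
```

A two-character chain ID such as "13" cannot be represented in this layout at all, so raising an error (or remapping the chain to a single character) is a design choice the writer has to make explicitly.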

Python Version/Operating System

I am using atomium 1.0.11 (from conda-forge) on Python 3.10 / Linux

Thanks in advance for your support, and thanks for publishing atomium as open-source :-)

Unable to write atoms with negative residue id

In atom_to_atom_line (atomium.pdb) there is a fragment:

if a.het:
    id_, residue_name = a.het.id, a.het._name
    chain_id = a.chain.id if a.chain is not None else ""
    residue_id = int("".join([c for c in id_ if c.isdigit()]))
    insert_code = id_[-1] if id_ and id_[-1].isalpha() else ""

It handles the insert codes nicely (e.g. 10A) but removes the minus sign in negative residue ids. If we have residues -1 and 1, both will end up being 1 in the PDB file.
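
A regular expression can keep the sign and the insert code while also separating an optional chain prefix. The sketch below assumes ids of the form `chain.number` with an optional trailing insert code (such as "A.10A" or "13.308"), or a bare residue number; `split_residue_id` and `ID_PATTERN` are hypothetical names, not atomium's.

```python
import re

# Optional "chain." prefix, a possibly negative integer, an optional one-letter insert code.
ID_PATTERN = re.compile(
    r"^(?:(?P<chain>.*)\.)?(?P<number>-?\d+)(?P<insert>[A-Za-z]?)$"
)


def split_residue_id(id_):
    """Split an id into (chain, residue number, insert code), keeping the sign."""
    match = ID_PATTERN.match(id_)
    if match is None:
        raise ValueError("unrecognised residue id: %r" % id_)
    chain = match.group("chain") or ""
    return chain, int(match.group("number")), match.group("insert")
```

Because the chain prefix is matched separately, its digits never leak into the residue number, and an id with no number at all raises a descriptive error instead of failing inside `int()`.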

HETATM records change to ATOM when saving model

Hey, first of all thanks for this great tool.

I found the following issue with this use case: I'd like to extract a single model from a multi-model structure and save it as a .pdb file.

Expected behaviour

The output file keeps the distinction between HETATM and ATOM records:

HETATM    1  C   FVA A   1      -3.595   0.079   3.555  1.00  0.00           C  
HETATM    2  N   FVA A   1      -2.330  -0.205   1.496  1.00  0.00           N  
HETATM    3  O   FVA A   1      -3.911  -1.055   3.906  1.00  0.00           O  
HETATM    4  CA  FVA A   1      -3.501   0.435   2.070  1.00  0.00           C  
HETATM    5  CB  FVA A   1      -4.752   0.042   1.281  1.00  0.00           C  
HETATM    6  CG1 FVA A   1      -5.974   0.826   1.764  1.00  0.00           C  
HETATM    7  CG2 FVA A   1      -4.535   0.232  -0.221  1.00  0.00           C  
HETATM    8  O1  FVA A   1      -1.445   1.720   0.692  1.00  0.00           O  
HETATM    9  CN  FVA A   1      -1.409   0.501   0.859  1.00  0.00           C  
ATOM     10  N   GLY A   2      -3.315   1.072   4.387  1.00  0.00           N  
ATOM     11  CA  GLY A   2      -3.364   0.879   5.826  1.00  0.00           C  
ATOM     12  C   GLY A   2      -2.503   1.917   6.549  1.00  0.00           C  
ATOM     13  O   GLY A   2      -3.009   2.947   6.992  1.00  0.00           O  
ATOM     14  N   ALA A   3      -1.218   1.610   6.645  1.00  0.00           N  
ATOM     15  CA  ALA A   3      -0.282   2.504   7.305  1.00  0.00           C  
ATOM     16  C   ALA A   3       1.118   2.286   6.729  1.00  0.00           C  
ATOM     17  O   ALA A   3       1.488   1.161   6.395  1.00  0.00           O  
ATOM     18  CB  ALA A   3      -0.333   2.271   8.817  1.00  0.00           C  

Actual behaviour

All atom coordinates are written as ATOM records

ATOM      1  C   FVA A   1      -3.595   0.079   3.555  1.00   0.0           C  
ATOM      2  N   FVA A   1      -2.330  -0.205   1.496  1.00   0.0           N  
ATOM      3  O   FVA A   1      -3.911  -1.055   3.906  1.00   0.0           O  
ATOM      4  CA  FVA A   1      -3.501   0.435   2.070  1.00   0.0           C  
ATOM      5  CB  FVA A   1      -4.752   0.042   1.281  1.00   0.0           C  
ATOM      6  CG1 FVA A   1      -5.974   0.826   1.764  1.00   0.0           C  
ATOM      7  CG2 FVA A   1      -4.535   0.232  -0.221  1.00   0.0           C  
ATOM      8  O1  FVA A   1      -1.445   1.720   0.692  1.00   0.0           O  
ATOM      9  CN  FVA A   1      -1.409   0.501   0.859  1.00   0.0           C  
ATOM     10  N   GLY A   2      -3.315   1.072   4.387  1.00   0.0           N  
ATOM     11  CA  GLY A   2      -3.364   0.879   5.826  1.00   0.0           C  
ATOM     12  C   GLY A   2      -2.503   1.917   6.549  1.00   0.0           C  
ATOM     13  O   GLY A   2      -3.009   2.947   6.992  1.00   0.0           O  
ATOM     14  N   ALA A   3      -1.218   1.610   6.645  1.00   0.0           N  
ATOM     15  CA  ALA A   3      -0.282   2.504   7.305  1.00   0.0           C  
ATOM     16  C   ALA A   3       1.118   2.286   6.729  1.00   0.0           C  
ATOM     17  O   ALA A   3       1.488   1.161   6.395  1.00   0.0           O  
ATOM     18  CB  ALA A   3      -0.333   2.271   8.817  1.00   0.0           C  

Example code to reproduce

import atomium
structure = atomium.fetch("1GRM")
mod = structure.models[0]
mod.save("1GRM_model1.pdb")

Then compare with https://files.rcsb.org/view/1GRM.pdb

Python Version/Operating System

Python 3.6, atomium 1.0.7
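
As a stop-gap, the saved text can be post-processed so that atoms in non-standard residues become HETATM records again. This is a sketch of a workaround, not a fix inside atomium: `restore_hetatm` and `STANDARD_RESIDUES` are hypothetical names, and the rule "any residue name outside the standard set is a HETATM" is an assumption that approximates how files such as 1GRM mark modified residues like FVA.

```python
# Standard amino acids and nucleotides; everything else is treated as a hetero residue.
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
    "A", "C", "G", "U", "DA", "DC", "DG", "DT",
}


def restore_hetatm(pdb_text, standard=STANDARD_RESIDUES):
    """Rewrite ATOM records of non-standard residues as HETATM records."""
    lines = []
    for line in pdb_text.splitlines():
        # The residue name occupies columns 18-20, i.e. the slice [17:20].
        if line.startswith("ATOM  ") and line[17:20].strip() not in standard:
            line = "HETATM" + line[6:]
        lines.append(line)
    return "\n".join(lines)
```

The record name fills exactly six columns in both cases, so swapping it leaves every other column in place. Note that the function returns the text without a trailing newline.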

Converting PDB/mmtf with a specific dict

I would like to convert between mmtf and PDB (in both directions) following exactly the https://mmtf.rcsb.org/v1.0 conventions/dictionaries. As a simple test I downloaded an mmtf file from the RCSB and re-saved it, but the output is almost twice as large (I am guessing this is related to the bit encoding). I need to convert PDB to mmtf and mmtf to PDB while conforming exactly to the RCSB dictionary. Is this possible with your nice tool?

Thanks for your help

Pablo

My contact details are at http://chaconlab.org in case you want to get in touch.

Proposed Example Code

python3.6 converter.py 2fxm.mmtf temp.mmtf

import sys

import atomium

# Usage: python3.6 converter.py <input file> <output file>
source, target = sys.argv[1], sys.argv[2]
print("Converting", source, target)
pdb = atomium.open(source)
print(" ", pdb.model)
pdb.model.save(target)
