Selecting and grouping entities

Section author: Marcin Cieslik

The feature that distinguishes PyCogent’s approach to handling macromolecular structures is its flexible and concise way of selecting, grouping and retrieving data from entities. The concepts of entity and hierarchy are similar.

Overview of methods and functions covered.

The methods covered in this section of the manual deal with selecting entities for purposes such as “select all hydrogen atoms from chain B”, “mask all hetero atoms” or “remove all water molecules”. We start with the high-level functions, which are concise and standard, and then move on to low-level methods for fine-grained manipulations.

Selection based on hierarchy.

Let’s start by accessing a PDB file and creating a structure entity. We establish a connection to the PDB file server, download a file and parse it.

>>> from cogent.parse.pdb import PDBParser
>>> from cogent.db.pdb import Pdb
>>> pdb = Pdb()
>>> socket_handle = pdb['2E1F']
>>> structure = PDBParser(socket_handle)

Let’s see what we got:

>>> print structure.header['name']
>>> print structure.header['experiment_type']

Wow, that’s descriptive. At least we know it is an X-ray structure. Now, how many chains does it have?

>>> structure[(0,)].getChildren()
[<Chain id=A>]

We found the ‘A’ chain of the first model (indexing is 0-based). We can dig deeper:

>>> structure[(0,)][('A',)].sortedkeys()[0:2]
[(('H_HOH', 1, ' '),), (('H_HOH', 2, ' '),)]

Only waters? Probably not. You can see what is inside a chain by looking inside the dictionary to get the list of short ids and child entities:

>>> chain_A = structure[(0,)][('A',)]
>>> # chain_A.keys() # get the short_ids
>>> # chain_A.values() # get the children
>>> len(chain_A)

This number is too high because we counted water molecules, not only amino acids. Typing structure[(0,)][('A',)] is also tedious, and it requires inspecting the number of models and the chain ids first. The function that selects entities from the hierarchy based on their identity is called einput; its second argument names the level to collect (here 'R' for residues and 'A' for atoms):

>>> from cogent.struct.selection import einput
>>> all_residues = einput(structure, 'R', 'my_residues')
>>> all_atoms = einput(structure, 'A')
>>> len(all_residues)

Waters are still included.
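The idea of collecting every entity at one level, regardless of how many models and chains sit above it, can be sketched with plain nested dictionaries. This is only an illustration under stated assumptions: the nested dicts stand in for entities, the short-id tuples imitate the keys shown above, and collect_level is a hypothetical helper, not PyCogent's einput.

```python
# Hypothetical stand-in for an entity hierarchy: nested dicts keyed by
# short ids (tuples). This is NOT PyCogent's implementation.
structure = {
    (0,): {                                # model 0
        ('A',): {                          # chain A
            (('SER', 1, ' '),): 'SER',
            (('H_HOH', 1, ' '),): 'H_HOH',
        },
        ('B',): {                          # chain B
            (('GLY', 1, ' '),): 'GLY',
        },
    },
}


def collect_level(entity, depth, prefix=()):
    """Return {full_id: child} for everything `depth` levels below `entity`.

    The full id is built by concatenating the short ids met on the way down.
    """
    if depth == 0:
        return {prefix: entity}
    collected = {}
    for short_id, child in entity.items():
        collected.update(collect_level(child, depth - 1, prefix + short_id))
    return collected


# Residues sit three levels below the structure: model -> chain -> residue.
all_residues = collect_level(structure, 3)

# Step-by-step access still works, but needs the model and chain ids.
chain_A = structure[(0,)][('A',)]
```

In this toy hierarchy all_residues holds the residues of both chains, water included, while chain_A only holds the two children of chain A.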

Selection based on properties.

We already have a collection of entities all_residues which contains all residues in the structure regardless of the number of chains and models. Our task is to determine the number of non-water residues. The property which allows us to distinguish a water molecule from an amino acid is the name, which is stored as the name attribute.

>>> first_child = chain_A.sortedvalues()[0]

We could write a loop to select those residues. We can loop over the residues in either chain_A or all_residues, as they are the same:

>>> non_water = []
>>> for residue in chain_A:
...     if residue.name != 'H_HOH':
...         non_water.append(residue)
>>> len(non_water)

To make this more convenient, each entity (e.g. a Chain instance) has a selectChildren method, which selects children based on a property. The equivalent of the above loop is:

>>> non_water = chain_A.selectChildren('H_HOH', 'ne', 'name').values()

The same call works on the all_residues holder:

>>> non_water = all_residues.selectChildren('H_HOH', 'ne', 'name').values()
>>> len(non_water)

The first argument is a value. The second is an operator name from the operator module; here ‘ne’ stands for ‘Not Equal’. The last argument, ‘name’, is resolved by the data_children method, which allows the user to retrieve data from a child entity’s attributes, xtra dictionary or methods. Here we get the data from the ‘name’ attribute. The selectChildren method returns a dictionary, where the keys are the short ids and the values are the child entities. The result can be put into a new entity holder.

>>> non_water_holder = einput(non_water, 'R')
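How the three selectChildren arguments fit together can be sketched in plain Python. The Residue class, get_data and select_children below are hypothetical stand-ins and not PyCogent code. The lookup order (attribute or method first, then the xtra dictionary) and placing the child's datum on the left of the comparison are assumptions made for this sketch. Only the use of getattr on the standard operator module is certain.

```python
import operator


class Residue(object):
    """Hypothetical minimal child entity; not a PyCogent class."""

    def __init__(self, name, **xtra):
        self.name = name
        self.xtra = xtra


def get_data(child, key):
    """Fetch `key` from an attribute or method, else from the xtra dict.

    The lookup order is an assumption made for this sketch.
    """
    if hasattr(child, key):
        value = getattr(child, key)
        return value() if callable(value) else value
    return child.xtra.get(key)


def select_children(children, value, op_name, key):
    """Return {short_id: child} for children whose datum satisfies the test."""
    compare = getattr(operator, op_name)   # e.g. 'ne' -> operator.ne
    # Assumption: the child's datum is the left operand of the comparison.
    return dict((short_id, child) for short_id, child in children.items()
                if compare(get_data(child, key), value))


children = {
    (('SER', 1, ' '),): Residue('SER'),
    (('GLY', 2, ' '),): Residue('GLY', buried=True),
    (('H_HOH', 1, ' '),): Residue('H_HOH'),
}

non_water = select_children(children, 'H_HOH', 'ne', 'name')
buried = select_children(children, True, 'eq', 'buried')
```

Passing the operator by name keeps the call signature uniform: the same function covers 'eq', 'ne', 'gt' and the other comparisons in the operator module without a separate method for each.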

But having to group the entities via einput, select from them, and then put the result into a new container seems awkward. All of this can be done in one step using the select function.

>>> from cogent.struct.selection import select
>>> non_water_holder = select(structure, 'R', 'H_HOH', 'ne', 'name')
>>> len(non_water_holder)

Are there any serines in the sequence?

>>> serines = select(structure, 'R', 'SER', 'eq', 'name')
>>> serines.sortedkeys()[0]
('2E1F', 0, 'A', ('SER', 1146, ' '))

The select function raises a ValueError if no entities can be selected.
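Because an empty selection raises an exception instead of returning an empty holder, calling code usually wraps the selection in try/except. A minimal sketch of that pattern follows. It uses a hypothetical select_by_name function over a toy dictionary of full ids, since the real select needs a parsed structure. The ids, the function and the error message are all illustrative assumptions.

```python
import operator

# Toy records standing in for residues: full_id -> residue name.
# The structure id 'TOY' and the residue numbers are made up.
residues = {
    ('TOY', 0, 'A', ('SER', 1, ' ')): 'SER',
    ('TOY', 0, 'A', ('H_HOH', 1, ' ')): 'H_HOH',
}


def select_by_name(records, value, op_name):
    """Hypothetical one-step selection that refuses to return nothing."""
    compare = getattr(operator, op_name)
    selected = dict((full_id, name) for full_id, name in records.items()
                    if compare(name, value))
    if not selected:
        raise ValueError('no entities selected for %r with %r'
                         % (value, op_name))
    return selected


# Guard a selection that may legitimately come back empty.
try:
    tryptophans = select_by_name(residues, 'TRP', 'eq')
    found_tryptophan = True
except ValueError:
    found_tryptophan = False

non_water = select_by_name(residues, 'H_HOH', 'ne')
```

Here found_tryptophan ends up False because the toy data contain no TRP residue, while the non-water selection succeeds and returns the single serine.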