# Community analysis¶

## alpha diversity¶

### Phylogenetic Diversity (PD)¶

For each environment (i.e. sample), calculates the amount of branch length in a phylogenetic tree that lead to its sequences.

First we will load in a Newick formatted tree.

```
>>> from cogent.parse.tree import DndParser
>>> from cogent.maths.unifrac.fast_tree import UniFracTreeNode
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> tree = DndParser(tree_in, UniFracTreeNode)
```

Next we will load information on which sequences in the tree come from which environment.

```
>>> from cogent.maths.unifrac.fast_unifrac import count_envs
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> envs = count_envs(envs_in)
```

Finally, we can calculate the PD values for each environment in the tree

```
>>> from cogent.maths.unifrac.fast_unifrac import PD_whole_tree
>>> envs, PD_values = PD_whole_tree(tree, envs)
>>> print envs
['E_FL', 'E_PA', 'O_FL', 'O_UN', 'R_FL', 'R_PA']
>>> print PD_values
[ 5.85389 7.60352 2.19215 2.81821 3.93728 3.7534 ]
```

`PD_vals`

is a `numpy`

`array`

with the values representing each environment in envs.

## beta diversity¶

### Unifrac¶

The Fast UniFrac implementation in PyCogent is the source code for the Fast UniFrac web tool and the QIIME pipeline for Microbial community analysis.

### Calculate a UniFrac Distance Matrix and apply PCoA and UPGMA¶

The UniFrac analysis is run on open tree and environment file objects. The resulting dictionary has a distance matrix of pairwise UniFrac values (‘distance_matrix’), a Newick string representing the results of performing UPGMA clustering on this distance matrix (‘cluster_envs’) and the results of running Principal Coordinates Analysis on the distance matrix (‘pcoa’). One can specify weighted UniFrac with `weighted=True`

. Here we run an unweighted analysis.

```
>>> from cogent.maths.unifrac.fast_unifrac import fast_unifrac_file
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> result = fast_unifrac_file(tree_in, envs_in, weighted=False)
>>> print result['cluster_envs']
((((('E_FL':0.339607103063,'R_FL':0.339607103063):0.0279991540511,'R_PA':0.367606257114):0.0103026524101,'E_PA':0.377908909524):0.0223322024492,'O_UN':0.400241111973):0.00976759866402,'O_FL':0.410008710637);
>>> print result['pcoa']
=================================================================================================
Type Label vec_num-0 vec_num-1 vec_num-2 vec_num-3 vec_num-4 vec_num-5
-------------------------------------------------------------------------------------------------
Eigenvectors E_FL 0.05 0.22 -0.09 -0.26 -0.29 -0.00
Eigenvectors E_PA -0.36 0.24 0.21 -0.08 0.18 -0.00
Eigenvectors O_FL 0.32 -0.26 0.30 -0.13 0.05 -0.00
Eigenvectors O_UN -0.28 -0.40 -0.24 -0.04 0.01 -0.00
Eigenvectors R_FL 0.29 0.18 -0.28 0.09 0.22 -0.00
Eigenvectors R_PA -0.02 0.02 0.11 0.42 -0.17 -0.00
Eigenvalues eigenvalues 0.40 0.36 0.29 0.27 0.19 -0.00
Eigenvalues var explained (%) 26.34 23.84 19.06 18.02 12.74 -0.00
-------------------------------------------------------------------------------------------------
```

### Perform pairwise tests of whether samples are significantly different with UniFrac¶

The analysis is run on open tree and environment file objects. In this example, we use unweighted unifrac (`weighted=False`

), we permute the environment assignments on the tree 50 times (`num_iters=50`

) and we perform UniFrac on all pairs of environments (`test_on="Pairwise"`

). A list is returned with a tuple for each pairwise comparison with items: 0 - the first environment, 1 - the second environment, 2- the uncorrected p-value and 3 - the p-value after correcting for multiple comparisons with the Bonferroni correction.

```
>>> from cogent.maths.unifrac.fast_unifrac import fast_unifrac_permutations_file
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> result = fast_unifrac_permutations_file(tree_in, envs_in, weighted=False, num_iters=50, test_on="Pairwise")
>>> print result[0]
('E_FL', 'E_PA', 0.17999999999999999, 1.0)
```

### Perform a single UniFrac significance test on the whole tree¶

The analysis is run on open tree and environment file objects. In this example, we use weighted unifrac (`weighted=True`

), we permute the environment assignments on the tree 50 times (`num_iters=50`

) and we perform a unifrac significance test on the whole tree (`test_on="Tree"`

). The resulting list has only one item since a single test was performed. It is a 3 item tuple where the second and third values are the p-value.

```
>>> from cogent.maths.unifrac.fast_unifrac import fast_unifrac_permutations_file
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> result = fast_unifrac_permutations_file(tree_in, envs_in, weighted=True, num_iters=50, test_on="Tree")
>>> print result
[('whole tree', 0.56000000000000005, 0.56000000000000005)]
```

### P-test¶

### Perform pairwise tests of whether samples are significantly different with the P-test (Martin, 2002)¶

The analysis is run on open tree and environment file objects. In this example, we permute the environment assignments on the tree 50 times (`num_iters=50`

) and perform the p test for all pairs of environments (`test_on="Pairwise"`

). A list is returned with a tuple for each pairwise comparison with items: 0 - the first environment, 1 - the second environment, 2- the uncorrected p-value and 3 - the p-value after correcting for multiple comparisons with the Bonferroni correction.

```
>>> from cogent.maths.unifrac.fast_unifrac import fast_p_test_file
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> result = fast_p_test_file(tree_in, envs_in, num_iters=50, test_on="Pairwise")
>>> print result[0]
('E_FL', 'E_PA', 0.040000000000000001, 0.59999999999999998)
```

### Taxon-based¶

#### Computing a distance matrix between samples¶

PyCogent provides many different ways to compute pairwise distances between objects. `cogent/maths/distance_transform.py`

provides a set of functions to calculate dissimilarities/distances between samples, given an abundance matrix. Here is one example:

```
>>> from cogent.maths.distance_transform import dist_euclidean
>>> from numpy import array
>>> abundance_data = array([[1, 3],
... [5, 2],
... [0.1, 22]],'float')
```

Note

see `distance_transform.py`

for other metrics than euclidean

We now have 3 samples, and the abundance of each column (e.g.: species) in that sample. The first sample has 1 individual of species 1, 3 individuals of species 2. We now compute the relatedness between these samples, using euclidean distance between the rows:

```
>>> dists = dist_euclidean(abundance_data)
>>> print str(dists.round(2))
[[ 0. , 4.12, 19.02]
[ 4.12, 0. , 20.59 ]
[ 19.02, 20.59 , 0. ]]
```

this distance matrix can be visualized via multivariate reduction techniques such as Multivariate data analysis.