Dealing with HTS data¶
FASTQ formatted files¶
Parsing¶
FASTQ format can be exported by Illumina’s pipeline software.
>>> from cogent.parse.fastq import MinimalFastqParser
>>> for label, seq, qual in MinimalFastqParser('data/fastq.txt'):
... print label
... print seq
... print qual
GAPC_0015:6:1:1259:10413#0/1
AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
````Y^T]`]c^cabcacc`^Lb^ccYT\T\Y\WF
GAPC_0015:6:1:1283:11957#0/1
TATGTATATATAACATATACATATATACATACATA
]KZ[PY]_[YY^```ac^\\`bT``c`\aT``bbb...
Converting quality scores to numeric data¶
In FASTQ format, ASCII characters are used to represent base-call quality. Unfortunately, vendors differ in the range of characters used. According to their documentation, Illumina uses the character range from 64-104. We parse the sequence file and convert the characters into integers on the fly.
>>> from cogent.parse.fastq import MinimalFastqParser
>>> for label, seq, qual in MinimalFastqParser('data/fastq.txt'):
... qual = map(lambda x: ord(x)-64, qual)
... print label
... print seq
... print qual
GAPC_0015:6:1:1259:10413#0/1
AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
[32, 32, 32, 32, 25, 30, 20, 29, 32, 29, 35, 30, 35, 33, 34, 35, 33, ...