Biopython begins


Warning: preg_match_all(): Compilation failed: invalid range in character class at offset 7 in /home/customer/www/phd.organelas.com/public_html/wp-content/themes/p2/inc/mentions.php on line 77

Learning a bit of bioinformatics to deal with transcriptome data. Idea is to use Biopython tools to automate sequence BLASTing with Python scripts. Here are some notes that might be useful to keep; extracted from the Tutorial and Cookbook.

Sequence

Sequence objects Seq are composed of a string of letters and an alphabet, which determines what the letters mean (eg. is A a nucleotide or Alanine?). Seq objects act like “read only” python strings (use str(seq) to compare sequences). To modify sequences use from Bio.Seq import MutableSeq.

>>> from Bio.Seq import Seq
>>> my_seq = Seq('AGTACACTGGT')
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> print my_seq
AGTACACTGGT
>>> my_seq.alphabet
Alphabet()

When creating sequences the alphabet should be explicitly specified following the IUPAC alphabet recommendations defined in Bio.Alphabet module.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('AGTACACTGGT', IUPAC.unambiguous_dna)
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()
>>> my_prot = Seq('AGTACACTGGT', IUPAC.protein)
>>> my_prot
Seq('AGTACACTGGT', IUPACProtein())
>>> my_prot.alphabet
IUPACProtein()

There are methods to obtain the complement and reverse_complement of a sequence:

>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())

The biological transcription process run a reverse complement on the template strand. On bioinformatics it is easier to only replace the coding strand with Us.

 	DNA coding strand (aka Crick strand, strand +1)	 
5’	ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG	3’
 	|||||||||||||||||||||||||||||||||||||||	 
3’	TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC	5’
 	DNA template strand (aka Watson strand, strand −1)	 
 
 	                  |	 
 	            Transcription	 
 	                  ↓	 
 
5’	AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG	3’
 	Single stranded messenger RNA	 

Whenever a transcribe (or back_transcribe) is executed, the alphabet is changed accordingly:

>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())

You can translate a messenger rna or the coding dna directly. A translation table can be specified to correctly deal with differing stop codons (see NCBI Genetic Codes):

>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table='Vertebrate Mitochondrial')
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

Data for each table can be accessed via from Bio.Data import CodonTable.

SeqRecord

Class that allows the association of features and identifiers with a sequence. Has the following attributes:

  • seq – The sequence itself [Seq object].
  • id – The primary ID to identify the sequence [string].
  • name – A “common” name/id for the sequence [string].
  • description – A human readable description for the sequence [string].
  • letter_annotations – Per-letter-annotations, keys are the name and information is contained as Python sequence [dictionary].
  • annotations – Additional information about the sequence [dictionary].
  • features – List of SeqFeature objects with more structured information about the features on a sequence [list].
  • dbxrefs – List of database cross-references [list].

Parse this information from files using SeqIO (GenBank files provide annotations and features, FASTA are more compact):

>>> from Bio import SeqIO
>>> record = SeqIO.read('NC_005816.gb', 'genbank')
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=['Project:10638'])
<ol>
<li>Use &quot;SeqIO.parse()&quot; if the file has more than 1 entry.</li>
</ol>


SeqFeature

Essential for describing aspects of a sequence. See here.

Parsing

Sequences from a taxon can be manually queried from the Nucleotide database, which collects data from several sources and results exported in different formats.

SeqIO can parse results from GenBank and FASTA files returning an iterator. First argument is the file (or handle) and second argument is the file format (important!). In case you need to read a one entry file use Bio.SeqIO.read().

from Bio import SeqIO
for seq_record in SeqIO.parse('clypeasteroida.gb', 'genbank'):
    # Just print some info.
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)
<ol>
<li>Use &quot;SeqIO.parse('clypeasteroida.fasta', 'fasta')&quot; for FASTA files.</li>
</ol>


Load the whole records into memory with:

from Bio import SeqIO
records = list(SeqIO.parse('clypeasteroida.gb', 'genbank'))

Get the organism scientific name from a record:

>>> print first_record.annotations["organism"]
Clypeaster subdepressus

Sequence as dictionaries

Bio.SeqIO.to_dict() – flexible and memory demanding. Keys can be specified if needed (for FASTA) using the argument key_function.

from Bio import SeqIO
orchid_dict = SeqIO.to_dict(SeqIO.parse('clypeasteroida.gb', 'genbank'))

Bio.SeqIO.index() – middle ground, dictionary parsing on demand. Use only files, not handles.

from Bio import SeqIO
orchid_dict = SeqIO.index('clypeasteroida.gb', 'genbank')

Bio.SeqIO.index_db() – low memory and bit slower; stores identifiers on a SQLite database. Three arguments, name of the database (preferably .idx), file (or list of files) to read, and format.

from Bio import SeqIO
indexed_dict = SeqIO.index_db('database.idx', files, 'genbank')

Use get_raw() method with a record id to access the raw text entry.

BLASTing

Internet

Use qblast() function from Bio.Blast.NCBIWWW with 3 arguments: blast program (blastn, blastp, blastx, tblast, tblastx; see descriptions), databases to be searched (see descriptions), and string containing the sequence (sequence itself, sequence in FASTA or identifier). Function returns XML.

from Bio.Blast import NCBIWWW
from Bio import SeqIO
record = SeqIO.read(open('m_cold.fasta'), format='fasta')
result_handle = NCBIWWW.qblast('blastn', 'nr', record.format('fasta'))
<ol>
<li>Also: result_handle = NCBIWWW.qblast('blastn', 'nr', record.seq)</li>
</ol>


It is better to save the results to a file to avoid reconnection to NCBI:

<ol>
<li>Save results to file.</li>
</ol>


save_file = open('my_blast.xml', 'w')
save_file.write(result_handle.read())
save_file.close()
result_handle.close()
<ol>
<li>Open file to read results.</li>
</ol>


result_handle = open(&quot;my_blast.xml&quot;)

Local

BLAST+ is the standalone program and Bio.Blast.Applications has wrappers for it (except for makeblastdb, which could be useful in the future).

This is a regular command line, assuming nr database is installed and configured:

blastx -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001 -outfmt 5

And this is the same command wrapped:

>>> from Bio.Blast.Applications import NcbiblastxCommandline
>>> blastx_cline = NcbiblastxCommandline(query='opuntia.fasta', db='nr', evalue=0.001,
...                                      outfmt=5, out='opuntia.xml')
>>> print blastx_cline
blastx -out opuntia.xml -outfmt 5 -query opuntia.fasta -db nr -evalue 0.001
>>> stdout, stderr = blastx_cline()

BLAST parsing

XML is the more stable format for parsing results. It does not matter how the data is fetched, Biopython will be able to parse it if it is XML. To parse the handle directly or an opened .xml file:

from Bio.Blast import NCBIXML
blast_records = NCBIXML.parse(result_handle)
for record in blast_records:
    <do something>

BLAST class

Example of information stored in a BLAST record:

E_VALUE_THRESH = 0.04
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_THRESH:
            print '****Alignment****'
            print 'sequence:', alignment.title
            print 'length:', alignment.length
            print 'e value:', hsp.expect
            print hsp.query[0:75] + '...'
            print hsp.match[0:75] + '...'
            print hsp.sbjct[0:75] + '...'

Error handling

Not easy to get a ValueError in the middle of a huge file, so use BlastErrorParser:

<ol>
<li>Define filenames.</li>
</ol>


import os
blast_file = 'big_blast.out'
error_file = 'big_blast.problems'
<ol>
<li>Create error handler.</li>
</ol>


from Bio.Blast import NCBIStandalone
error_handle = open(error_file, 'w')
blast_error_parser = NCBIStandalone.BlastErrorParser(error_handle)
<ol>
<li>Trigger iterator with the error parser.</li>
</ol>


result_handle = open(blast_file)
iterator = NCBIStandalone.Iterator(result_handle, blast_error_parser)

Entrez

Still need to properly read Chapter 8 and take notes.

Parse sequence(s) from id(s) using Entrez:

from Bio import Entrez
from Bio import SeqIO
Entrez.email = 'A.N.Other@example.com'
handle = Entrez.efetch(db='nucleotide', rettype='gb', \
                       id='6273291,6273290,6273289')
for seq_record in SeqIO.parse(handle, 'gb'):
    print seq_record.id, seq_record.description[:50] + '...'
    print 'Sequence length %i,' % len(seq_record),
    print '%i features,' % len(seq_record.features),
    print 'from: %s' % seq_record.annotations['source']
handle.close()