Updates from July, 2011 Toggle Comment Threads | Keyboard Shortcuts

  • Bruno Vellutini 11:24 on 2011/07/27 Permalink
    Tags: , , ,   

    First dirty script 

    Following the previous entry I started to write a script to programatically BLAST sequences against a local database and GenBank database. It is a Python script using Biopython module and BLAST+.

    As a test case I downloaded sequences from a Bugula neritina transcriptome (from this study(20540116)), selected a few candidate-genes from another B. neritina study(10.1186/2041-9139-2-13), and executed local BLASTs at the database.

    Transcriptome sequences are in FASTA format and need to be converted to a database BLAST+ can understand. This is accomplished with the following command:

    makeblastdb -in bugula.fasta -parse_seqids -dbtype nucl
    

    Now that the local database is ready to be BLASTed I put GenBank (.gb) files from single candidate-genes into the candidates folder. Then run the script with: python blaster.py or ./blaster.py. It will pickup each candidate-gene from the folder, BLAST against the database, and write the results to the local_blasts folder.

    I need to process (select, limit, etc) these local BLASTS results and run the obtained sequences against GenBank. This last part is still missing from the script, but should be ok to write.

    Below is the zip file with the test case (except for the Bugula database) and the script itself:

    BLAST Batch Script [zip]

    #!/usr/bin/python
    <ol>
    <li>-*- coding: utf-8 -*-</li>
    </ol>
    
    
    
    '''BLASTing batch script.
    
    Algorithm:
        - Take several genes of interest.
        - BLAST them locally against a transcriptome assembly.
        - Filter and process results to extract relevant sequences.
        - BLAST resulting sequences against GenBank.
        - Retrieve results for further analysis.
    
    Notes:
        - Read files with sequences from a folder (FASTA or GenBank format). Could also be a list of IDs or
          an pickled object, etc...
    
    Dependencies:
        - Python 2.7.1
        - Biopython 1.56
        - BLAST+ 2.2.25
    
    Configuration (Ubuntu):
        - Regular packages for Python and Biopython.
        - Add BLAST+ commands to path putting this code to .bashrc:
    
            PATH=$PATH:/home/nelas/Downloads/ncbi-blast-2.2.25+/bin
            export PATH
    '''
    
    import os
    
    from Bio import SeqIO
    from Bio.Blast import NCBIWWW
    from Bio.Blast import NCBIXML
    from Bio.Blast.Applications import NcbiblastnCommandline
    
    <ol>
    <li>Folder with candidate-genes.</li>
    </ol>
    
    
    CANDIDATE_FOLDER = 'candidates'
    
    <ol>
    <li>Folder with candidate-genes BLASTs.</li>
    </ol>
    
    
    CANDIDATE_BLASTS = 'local_blasts'
    
    <ol>
    <li>Folder with candidate-genes BLASTs.</li>
    </ol>
    
    
    GENBANK_BLASTS = 'genbank_blasts'
    
    <ol>
    <li>Database name.</li>
    </ol>
    
    
    DATABASE = 'bugula.fasta'
    <ol>
    <li>The source file from Bugula neritina transcriptome was downloaded from here:</li>
    <li>http://www.ncbi.nlm.nih.gov/sra/SRX015704</li>
    </ol>
    
    
    
    <ol>
    <li>Note: the original FASTA file needs to be converted to a BLASTable database.</li>
    <li>This can be done via command line and preferably before the script is run:</li>
    <li>makeblastdb -in bugula.fasta -parse_seqids -dbtype nucl</li>
    </ol>
    
    
    
    <ol>
    <li>Deprecated way of doing it, just for reference:</li>
    </ol>
    
    
    
    <ol>
    <li>Format a FASTA file to a readable database.</li>
    <li>formatdb -i bugula.fasta -p F -o</li>
    <li>BLAST against nucleotide with XML output:</li>
    <li>blast2 -p blastn -d bugula.fasta -o blastbugula.xml -i candidates/BnFoxA.gb -m 7</li>
    </ol>
    
    
    
    <ol>
    <li>Build list with path to candidate-genes.</li>
    </ol>
    
    
    candidates = os.listdir(CANDIDATE_FOLDER)
    <ol>
    <li>Only read GenBank files.</li>
    </ol>
    
    
    candidates = [file for file in candidates if file.endswith('.gb')]
    
    print '\n%d genes to be BLASTed against %s database!' % (len(candidates),
            DATABASE)
    
    for gene_file in candidates:
        # Define filepath.
        gene_filepath = os.path.join(CANDIDATE_FOLDER, gene_file)
    
        # Create record object.
        record = SeqIO.read(gene_filepath, 'genbank')
    
        # Create temporary FASTA file, if not available.
        try:
            fasta_file = open(gene_filepath.split('.')[0] + '.fasta', 'r')
        except IOError:
            fasta_file = open(gene_filepath.split('.')[0] + '.fasta', 'w')
            fasta_file.write(record.format('fasta'))
            fasta_file.close()
            fasta_file = open(gene_filepath.split('.')[0] + '.fasta', 'r')
    
        # Instantiate the BLASTn command.
        cline = NcbiblastnCommandline(
                query=fasta_file.name, db=DATABASE,
                out=os.path.join(
                    CANDIDATE_BLASTS, gene_file.split('.')[0]),
                    #CANDIDATE_BLASTS, gene_file.split('.')[0] + '.xml'),
                #outfmt=5 # Export in XML
                )
    
        # Execute BLAST.
        stdout, stderr = cline()
    
        # Close file.
        fasta_file.close()
    
        print '\t' + gene_file
    
    #TODO Process local BLASTs
    
    #TODO BLAST results against GenBank
    
    #TODO Process remote BLASTs
    
    print 'Done, bye!\n'
    

    Bibliography

     
  • Bruno Vellutini 21:32 on 2011/07/25 Permalink
    Tags: , , ,   

    Biopython begins 

    Learning a bit of bioinformatics to deal with transcriptome data. Idea is to use Biopython tools to automate sequence BLASTing with Python scripts. Here are some notes that might be useful to keep; extracted from the Tutorial and Cookbook.

    Sequence

    Sequence objects Seq are composed of a string of letters and an alphabet, which determines what the letters mean (eg. is A a nucleotide or Alanine?). Seq objects act like “read only” python strings (use str(seq) to compare sequences). To modify sequences use from Bio.Seq import MutableSeq.

    >>> from Bio.Seq import Seq
    >>> my_seq = Seq('AGTACACTGGT')
    >>> my_seq
    Seq('AGTACACTGGT', Alphabet())
    >>> print my_seq
    AGTACACTGGT
    >>> my_seq.alphabet
    Alphabet()
    

    When creating sequences the alphabet should be explicitly specified following the IUPAC alphabet recommendations defined in Bio.Alphabet module.

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq('AGTACACTGGT', IUPAC.unambiguous_dna)
    >>> my_seq
    Seq('AGTACACTGGT', IUPACUnambiguousDNA())
    >>> my_seq.alphabet
    IUPACUnambiguousDNA()
    >>> my_prot = Seq('AGTACACTGGT', IUPAC.protein)
    >>> my_prot
    Seq('AGTACACTGGT', IUPACProtein())
    >>> my_prot.alphabet
    IUPACProtein()
    

    There are methods to obtain the complement and reverse_complement of a sequence:

    >>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
    >>> my_seq
    Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
    >>> my_seq.complement()
    Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
    >>> my_seq.reverse_complement()
    Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
    

    The biological transcription process run a reverse complement on the template strand. On bioinformatics it is easier to only replace the coding strand with Us.

     	DNA coding strand (aka Crick strand, strand +1)	 
    5’	ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG	3’
     	|||||||||||||||||||||||||||||||||||||||	 
    3’	TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC	5’
     	DNA template strand (aka Watson strand, strand −1)	 
     
     	                  |	 
     	            Transcription	 
     	                  ↓	 
     
    5’	AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG	3’
     	Single stranded messenger RNA	 
    

    Whenever a transcribe (or back_transcribe) is executed, the alphabet is changed accordingly:

    >>> coding_dna
    Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
    >>> messenger_rna = coding_dna.transcribe()
    >>> messenger_rna
    Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
    >>> messenger_rna.back_transcribe()
    Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
    

    You can translate a messenger rna or the coding dna directly. A translation table can be specified to correctly deal with differing stop codons (see NCBI Genetic Codes):

    >>> messenger_rna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(table='Vertebrate Mitochondrial')
    Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
    

    Data for each table can be accessed via from Bio.Data import CodonTable.

    SeqRecord

    Class that allows the association of features and identifiers with a sequence. Has the following attributes:

    • seq – The sequence itself [Seq object].
    • id – The primary ID to identify the sequence [string].
    • name – A “common” name/id for the sequence [string].
    • description – A human readable description for the sequence [string].
    • letter_annotations – Per-letter-annotations, keys are the name and information is contained as Python sequence [dictionary].
    • annotations – Additional information about the sequence [dictionary].
    • features – List of SeqFeature objects with more structured information about the features on a sequence [list].
    • dbxrefs – List of database cross-references [list].

    Parse this information from files using SeqIO (GenBank files provide annotations and features, FASTA are more compact):

    &gt;&gt;&gt; from Bio import SeqIO
    &gt;&gt;&gt; record = SeqIO.read('NC_005816.gb', 'genbank')
    &gt;&gt;&gt; record
    SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
    IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
    description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
    dbxrefs=['Project:10638'])
    <ol>
    <li>Use &quot;SeqIO.parse()&quot; if the file has more than 1 entry.</li>
    </ol>
    
    
    

    SeqFeature

    Essential for describing aspects of a sequence. See here.

    Parsing

    Sequences from a taxon can be manually queried from the Nucleotide database, which collects data from several sources and results exported in different formats.

    SeqIO can parse results from GenBank and FASTA files returning an iterator. First argument is the file (or handle) and second argument is the file format (important!). In case you need to read a one entry file use Bio.SeqIO.read().

    from Bio import SeqIO
    for seq_record in SeqIO.parse('clypeasteroida.gb', 'genbank'):
        # Just print some info.
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)
    <ol>
    <li>Use &quot;SeqIO.parse('clypeasteroida.fasta', 'fasta')&quot; for FASTA files.</li>
    </ol>
    
    
    

    Load the whole records into memory with:

    from Bio import SeqIO
    records = list(SeqIO.parse('clypeasteroida.gb', 'genbank'))
    

    Get the organism scientific name from a record:

    >>> print first_record.annotations["organism"]
    Clypeaster subdepressus
    

    Sequence as dictionaries

    Bio.SeqIO.to_dict() – flexible and memory demanding. Keys can be specified if needed (for FASTA) using the argument key_function.

    from Bio import SeqIO
    orchid_dict = SeqIO.to_dict(SeqIO.parse('clypeasteroida.gb', 'genbank'))
    

    Bio.SeqIO.index() – middle ground, dictionary parsing on demand. Use only files, not handles.

    from Bio import SeqIO
    orchid_dict = SeqIO.index('clypeasteroida.gb', 'genbank')
    

    Bio.SeqIO.index_db() – low memory and bit slower; stores identifiers on a SQLite database. Three arguments, name of the database (preferably .idx), file (or list of files) to read, and format.

    from Bio import SeqIO
    indexed_dict = SeqIO.index_db('database.idx', files, 'genbank')
    

    Use get_raw() method with a record id to access the raw text entry.

    BLASTing

    Internet

    Use qblast() function from Bio.Blast.NCBIWWW with 3 arguments: blast program (blastn, blastp, blastx, tblast, tblastx; see descriptions), databases to be searched (see descriptions), and string containing the sequence (sequence itself, sequence in FASTA or identifier). Function returns XML.

    from Bio.Blast import NCBIWWW
    from Bio import SeqIO
    record = SeqIO.read(open('m_cold.fasta'), format='fasta')
    result_handle = NCBIWWW.qblast('blastn', 'nr', record.format('fasta'))
    <ol>
    <li>Also: result_handle = NCBIWWW.qblast('blastn', 'nr', record.seq)</li>
    </ol>
    
    
    

    It is better to save the results to a file to avoid reconnection to NCBI:

    <ol>
    <li>Save results to file.</li>
    </ol>
    
    
    save_file = open('my_blast.xml', 'w')
    save_file.write(result_handle.read())
    save_file.close()
    result_handle.close()
    <ol>
    <li>Open file to read results.</li>
    </ol>
    
    
    result_handle = open(&quot;my_blast.xml&quot;)
    

    Local

    BLAST+ is the standalone program and Bio.Blast.Applications has wrappers for it (except for makeblastdb, which could be useful in the future).

    This is a regular command line, assuming nr database is installed and configured:

    blastx -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001 -outfmt 5
    

    And this is the same command wrapped:

    >>> from Bio.Blast.Applications import NcbiblastxCommandline
    >>> blastx_cline = NcbiblastxCommandline(query='opuntia.fasta', db='nr', evalue=0.001,
    ...                                      outfmt=5, out='opuntia.xml')
    >>> print blastx_cline
    blastx -out opuntia.xml -outfmt 5 -query opuntia.fasta -db nr -evalue 0.001
    >>> stdout, stderr = blastx_cline()
    

    BLAST parsing

    XML is the more stable format for parsing results. It does not matter how the data is fetched, Biopython will be able to parse it if it is XML. To parse the handle directly or an opened .xml file:

    from Bio.Blast import NCBIXML
    blast_records = NCBIXML.parse(result_handle)
    for record in blast_records:
        <do something>
    

    BLAST class

    Example of information stored in a BLAST record:

    E_VALUE_THRESH = 0.04
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < E_VALUE_THRESH:
                print '****Alignment****'
                print 'sequence:', alignment.title
                print 'length:', alignment.length
                print 'e value:', hsp.expect
                print hsp.query[0:75] + '...'
                print hsp.match[0:75] + '...'
                print hsp.sbjct[0:75] + '...'
    

    Error handling

    Not easy to get a ValueError in the middle of a huge file, so use BlastErrorParser:

    <ol>
    <li>Define filenames.</li>
    </ol>
    
    
    import os
    blast_file = 'big_blast.out'
    error_file = 'big_blast.problems'
    <ol>
    <li>Create error handler.</li>
    </ol>
    
    
    from Bio.Blast import NCBIStandalone
    error_handle = open(error_file, 'w')
    blast_error_parser = NCBIStandalone.BlastErrorParser(error_handle)
    <ol>
    <li>Trigger iterator with the error parser.</li>
    </ol>
    
    
    result_handle = open(blast_file)
    iterator = NCBIStandalone.Iterator(result_handle, blast_error_parser)
    

    Entrez

    Still need to properly read Chapter 8 and take notes.

    Parse sequence(s) from id(s) using Entrez:

    from Bio import Entrez
    from Bio import SeqIO
    Entrez.email = 'A.N.Other@example.com'
    handle = Entrez.efetch(db='nucleotide', rettype='gb', \
                           id='6273291,6273290,6273289')
    for seq_record in SeqIO.parse(handle, 'gb'):
        print seq_record.id, seq_record.description[:50] + '...'
        print 'Sequence length %i,' % len(seq_record),
        print '%i features,' % len(seq_record.features),
        print 'from: %s' % seq_record.annotations['source']
    handle.close()
    
     
  • Bruno Vellutini 15:03 on 2011/07/15 Permalink
    Tags: , ,   

    Polypide development in ancestrula and budding 

    Are the developmental mechanisms regulating the polypide development in the ancestrula similar to the mechanisms regulating polypide development after asexual budding?

    Look for possibly related articles.

     
  • Bruno Vellutini 14:52 on 2011/07/15 Permalink
    Tags: comparative,   

    Comparative development of cyphonautes and coronates 

    In-depth comparison of cyphonautes and coronates body patterning.

    Early development

    • How different are the fate maps between the two types of larvae?
    • What are the key differences in the regulation of development that might be related to the differing morphology?

    Metamorphosis

    • Which tissues originate which adult structures and is there any variation when comparing cyphonautes and coronates?
    • Which organs will form the cystid in different larvae?
    • Which tissues form the polypide rudiment in different larvae?

    Questions notes

    Develop a method to discover which larval type is ancestral. How? Only six genera have cyphonautes and they are considered to be “more primitive”, Cheilostomes and Ctenostomes. Can the comparison of the early development patterning and gene expression between cyphonautes and coronates reveal any key regulatory mechanisms? Can I use cooption events to solve this? Absence of spiral cleavage in all bryozoans might indicate an early feature of the group. Brooding could have disrupted early development, or in a Buss (or Metchnikoff?) sense, allowed variation to appear. If first bryozoans were brooders, then are cyphonautes derived/specialized larvae? Does having similar cleavage mean something? Would cyphonautes have evolved by paedomorphosis of coronate-like larvae? Or vice-versa?

    Extra: Cyphonautes morphogenesis

    • How cyphonautes got flattened?
     
  • Bruno Vellutini 14:36 on 2011/07/15 Permalink
    Tags: , cell lineage,   

    Bryozoan cell lineage tracing 

    Elucidate the origins of larval and adult tissues of a bryozoan.

    [short background]

    Embryonic origin of larval structures

    Use cell lineage tracing to find the fate of cells after initial cleavages.

    • Which structures do they form and can this be understood as a regular ecto/meso/endo layer schema?
    • What is the fate of internalized cells after gastrulation?

    Development and embryonic origin of adult tissues

    • What is the fate of larval tissues, specially the blastemas?
    • Which adult structures do they form and how development occurs?
    • Formation of coelomic cavities during metamorphosis?

    Questions

    • Use gene expression patterns from typical spiral cleavage embryos to compare with bryozoan cleavage and maybe identify what might have happened during its loss.
     
c
Compose new post
j
Next post/Next comment
k
Previous post/Previous comment
r
Reply
e
Edit
o
Show/Hide comments
t
Go to top
l
Go to login
h
Show/Hide help
shift + esc
Cancel