Tagged: blast Toggle Comment Threads | Keyboard Shortcuts

  • Bruno Vellutini 20:28 on 2012/02/21 Permalink
    Tags: , blast, ,   

    BLASTer script improvements 

    I made some improvements in the BLASTer script, described below.

    • To deal with some NCBI updates I changed the rettype and retmode of the Entrez.efetch command.
    • Since I am now trying candidate genes from other species I set the reciprocal BLAST to the respective organism database. Now each locus shows each reciprocal BLAST separately, when needed.
    • The loci frame and strand is now stored and printed along.
    • Major output formatting adjustments.

    An output example:

    Locus_978_Transcript_1/1_Confidence_1.000_Length_2580 | frame: +1 | candidates: DDX4, vas, vasa, pl10
    
    				gene                id                  accession           e-value
      Homo sapiens
    				DDX4                54514               NP_001160005.1      3.78e-141
    				DDX4                54514               NP_001136021.1      3.78e-141
    				DDX4                54514               NP_077726.1         3.78e-141 <<
    				DDX4                54514               NP_001160006.1      4.93e-141
    				DDX3X               1654                NP_001180346.1      7.40e-121
    				DDX3X               1654                NP_001347.3         7.40e-121
    				DDX3X               1654                NP_001180345.1      7.40e-121
      Danio rerio
    				vasa                30263               NP_571132.1         1.36e-146 <<
    				ddx3                566947              NP_001119895.1      3.05e-122
    				pl10                30116               NP_571016.2         1.16e-121 <<
    				ddx42               503932              NP_001032894.2      5.45e-79
    				ddx5                322206              NP_997777.1         4.32e-76
      Drosophila melanogaster
    				bel                 45826               NP_536783.1         8.20e-120
    				CG10077             38756               NP_648062.2         1.32e-77
    				CG6418              39218               NP_648413.1         2.11e-75
    				CG7878              40959               NP_649767.1         8.04e-75
    				Rm62                40739               NP_731034.1         2.19e-72
    				Rm62                40739               NP_731035.2         2.19e-72
    				Rm62                40739               NP_731033.1         2.19e-72
    				Rm62                40739               NP_001189182.1      2.19e-72
    				Rm62                40739               NP_731032.1         2.19e-72
    				Rm62                40739               NP_001163528.1      2.19e-72
    				Rm62                40739               NP_731031.1         2.19e-72
    				Rm62                40739               NP_524243.2         2.19e-72

    Now the next steps will be to integrate better with the Zim Desktop Wiki.

     
  • Bruno Vellutini 18:00 on 2011/12/01 Permalink
    Tags: blast, , , , ,   

    Bugula neritina germline BLASTing 

    Candidate genes related to germline development were batch BLASTed against the transcriptome of Bugula neritina using the new BLASTer script. Only 2 genes yield transcriptome alignments, Piwi and Pumilio. Here are the reverse BLAST (Bugula sequence against human protein database) results that returned matches (hit the same gene_id of the candidate gene):

    gnl|SRA|SRR034781.65778.2		(candidates: PIWIL2, PIWIL1)
    	gene		id		accession		e-value
    	PIWIL1		9271		NP_004755.2		2.77e-31 <<
    	PIWIL1		9271		NP_004755.2		2.77e-31 <<
    	PIWIL1		9271		NP_001177900.1		2.77e-31
    	PIWIL1		9271		NP_001177900.1		2.77e-31
    	PIWIL2		55124		NP_001129193.1		2.45e-29 <<
    	PIWIL2		55124		NP_001129193.1		2.45e-29 <<
    	PIWIL2		55124		NP_060538.2		2.45e-29
    	PIWIL2		55124		NP_060538.2		2.45e-29
    
    gnl|SRA|SRR034781.74499.2		(candidates: PIWIL2)
    	gene		id		accession		e-value
    	PIWIL1		9271		NP_004755.2		4.21e-33
    	PIWIL3		440822		NP_001008496.2		8.22e-29
    	PIWIL4		143689		NP_689644.2		2.39e-28
    	PIWIL2		55124		NP_001129193.1		1.31e-26 <<
    	PIWIL2		55124		NP_060538.2		1.31e-26
    
    gnl|SRA|SRR034781.49286.2		(candidates: PUM2, PUM1)
    	gene		id		accession		e-value
    	PUM1		9698		NP_001018494.1		1.67e-54 <<
    	PUM1		9698		NP_001018494.1		1.67e-54 <<
    	PUM1		9698		NP_001018494.1		3.43e-11 <<
    	PUM1		9698		NP_001018494.1		1.22e-08 <<
    	PUM1		9698		NP_055491.1		1.67e-54
    	PUM1		9698		NP_055491.1		1.67e-54
    	PUM1		9698		NP_055491.1		2.62e-11
    	PUM1		9698		NP_055491.1		1.10e-09
    
    gnl|SRA|SRR034781.80037.2		(candidates: PIWIL2, PIWIL1)
    	gene		id		accession		e-value
    	PIWIL1		9271		NP_004755.2		4.61e-32 <<
    	PIWIL3		440822		NP_001008496.2		5.27e-28
    	PIWIL4		143689		NP_689644.2		2.00e-27
    	PIWIL2		55124		NP_001129193.1		8.42e-26 <<
    	PIWIL2		55124		NP_060538.2		8.42e-26

    Complete reverse BLAST results are at results.txt and the sequences themselves at results.fa. Manually selected alignments file is selected.txt.

    Obs: low number of hits may be due to the quality of the transcriptome assembly, sequences seem to be short.

    Primers

    PUM1

    5′ and 3′ RACE primers needed

    Bn PUM1 F1    48-77      5'- CCAGACCAGACGGATGTGATTCTCTCAGAG -3'
    	                    30 nt primer on plus strand
    	                    pct G+C:   53.3 	Tm:   72.4
    Bn PUM1 F2    136-164    5'- ACGTGCTGGAACACGGTAGACTGGAGGAG -3'
    	                    29 nt primer on plus strand
    	                    pct G+C:   58.6 	Tm:   74.1
    
    Bn PUM1 R1    456-429    5'- ACTTCCTCAGTGTAGAAACATGGGGGCG -3'
    	                    28 nt primer on minus strand
    	                    pct G+C:   53.6 	Tm:   72.1
    Bn PUM1 R2    120-93     5'- TGCCATACTGATCCATCACCAGTCGGTC -3'
    	                    28 nt primer on minus strand
    	                    pct G+C:   53.6 	Tm:   73.3

    PIWIL1

    Only 5′ RACE primers needed.

    Bn PIWIL1 R1             5'- AACTGCCCATCTCCAACTCCATCACGGTAG -3'
    Bn PIWIL1 R2             5'- TTAGGGAAGCCACAAAACCACCGACAG -3'
     
  • Bruno Vellutini 18:00 on 2011/11/28 Permalink
    Tags: , blast, , genbank, ncbi,   

    BLASTer script update 

    Updated the previous dirty script to add missing features. Now it also makes the reciprocal BLAST for fragments similar to candidate genes.

    1. BLAST each candidate gene against your local database of interest (eg, transcriptome of an organism).
    2. Best 3 loci hits for the gene are BLASTed against the reciprocal candidate genes database (eg, human protein).
    3. Each locus is printed showing the best 5 genes (by gene id) and their e-values from the reciprocal BLAST and a FASTA file with the loci sequences for further analysis.

    Installation

    1. Install dependencies.
      • Python 2.7 (not Python 3)
      • Biopython 1.56
      • BLAST+ 2.2.25
    2. Copy the blaster.py to a new folder where it will run.
    3. Make sure you know where the local and reciprocal databases are (or copy them to the folder).

    Usage

    1. Convert databases to be BLASTable (if needed).
      makeblastdb -in membranipora.fa -parse_seqids -dbtype nucl
      makeblastdb -in human_protein.fa -parse_seqids -dbtype prot
      
    2. Copy FASTA files with 1 gene sequence per file into the candidates folder.
    3. Run the BLASTer command (database could also be in another folder).
      python blaster.py -d membranipora.fa -b tblastn
      
    4. Results (loci) are printed to 2 files. results.txt shows each locus with reciprocal BLAST output and e-values. results.fa aggregates the loci and their sequences.

    Arguments

    -c, --candidates — Folder with “.fa” or “.gb” files of candidate genes.
    One gene per file.
    -d, --database — Local database with new data (eg, transcriptome).
    -b, -blast — BLAST command (blastn, blastp, blastx, tblastn, tblastx).

    Code

    Copy the code below and save it as a new file blaster.py OR download BLASTer.

    #!/usr/bin/python
    <ol>
    	<li>-*- coding: utf-8 -*-</li>
    </ol>
    '''BLASTing batch script.
    
    Dependencies:
    <ul>
    <li>Python 2.7.1</li>
    <li>Biopython 1.56</li>
    <li>BLAST+ 2.2.25</li>
    </ul>
    
    
    
    Configuration (Linux):
    <ul>
    <li>Regular packages for Python and Biopython.</li>
    <li>Add BLAST+ commands to path putting this code to .bashrc:</li>
    </ul>
    
    
    
    PATH=$PATH:/home/nelas/Downloads/ncbi-blast-2.2.25+/bin
    export PATH
    
    See README file for details or execute the command for usage and arguments:
    
    python blaster.py -h
    
    '''
    
    import getopt
    import logging
    import os
    import pickle
    import re
    import subprocess
    import sys
    
    from Bio import Entrez, SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.Blast import NCBIWWW, NCBIXML
    from Bio.Blast.Applications import NcbiblastnCommandline, NcbiblastpCommandline, NcbiblastxCommandline, NcbitblastnCommandline, NcbitblastxCommandline
    
    class Sequence(object):
    '''Represents a known sequence.
    
    filepath: path to fasta file.
    ref: sequence reference number.
    description: long description.
    sequence: the sequence itself.
    gene_id: NCBI gene identification.
    gene_name: NCBI gene name.
    blast_output: path to .xml file.
    loci: list of loci with recurrency?.
    '''
    Entrez.email = 'your@email.com'
    EVALUE_THRESH = 0.001
    
    <ol>
    <li>Cache folder.</li>
    </ol>
    
    
    cache_folder = 'seqs'
    <ol>
    <li>Check if cache folder exists.</li>
    </ol>
    
    
    if not os.path.isdir(cache_folder):
    os.mkdir(cache_folder)
    
    #TODO add __str__ to classes.
    
    def __init__(self, filepath=None, ref=None, database=None):
    #TODO Make limit a variable variable.
    self.limit = 3
    self.loci = {}
    self.gene_name = ''
    self.database = database
    if filepath:
    self.filepath = filepath
    self.parse_fasta(self.filepath)
    self.set_gene_id()
    self.set_gene_name()
    else:
    cache_path = os.path.join(self.cache_folder, ref)
    try:
    cache_file = open(cache_path, 'rb')
    cached = pickle.load(cache_file)
    self.load_cache(cached)
    cache_file.close()
    except:
    self.fetch(ref)
    cache_file = open(cache_path, 'wb')
    pickle.dump(self, cache_file)
    cache_file.close()
    
    def load_cache(self, cached):
    '''Load attributes saved in cache files.'''
    self.description = cached.description
    self.ref = cached.ref
    self.sequence = cached.sequence
    self.gene_name = cached.gene_name
    self.gene_id = cached.gene_id
    
    def parse_fasta(self, filepath):
    '''Parse data from FASTA file.'''
    record = SeqIO.read(filepath, 'fasta')
    <ol>
    <li>Define attributes.</li>
    </ol>
    
    
    self.description = record.description.split('|')[-1]
    #XXX Better parse ref!
    self.ref = record.id.split('|')[3]
    self.sequence = str(record.seq)
    self.gene_name = filepath.split('/')[-1][:-3]
    
    def set_gene_id(self):
    '''Get and set gene id from NCBI.'''
    #import pdb; pdb.set_trace()
    #FIXME Handle error when not returning id.
    handle = Entrez.esearch(db='gene', term=self.ref)
    record = Entrez.read(handle)
    if record['IdList']:
    self.gene_id = record['IdList'][0]
    else:
    self.gene_id = None
    
    def set_gene_name(self):
    '''Get and set gene name from filepath.'''
    filepath = os.path.basename(self.filepath)
    self.gene_name = os.path.splitext(filepath)[0]
    
    def fetch(self, ref):
    '''Fetch data from NCBI.'''
    handle = Entrez.efetch(db='protein', id=ref, rettype='gb')
    record = SeqIO.read(handle, 'genbank')
    #TODO maybe just save file locally...
    handle = Entrez.efetch(db='protein', id=ref, rettype='gb')
    handle_string = handle.read()
    <ol>
    <li>Define attributes.</li>
    </ol>
    
    
    #FIXME See if this is ok!
    self.description = record.description
    self.ref = record.id
    self.sequence = str(record.seq)
    self.set_gene_id()
    <ol>
    <li>Find gene name by regular expression.</li>
    </ol>
    
    
    try:
    pattern = re.compile(r'gene="(.+)"')
    reobj = pattern.search(handle_string)
    self.gene_name = reobj.group(1)
    except:
    self.gene_name = 'Not found'
    <ol>
    <li>To extract gene description, name and other info from xml parsing.</li>
    </ol>
    
    
    #&gt;&gt;&gt; record['Bioseq-set_seq-set'][0]['Seq-entry_set']['Bioseq-set']['Bioseq-set_seq-set'][0]['Seq-entry_seq']['Bioseq']['Bioseq_annot'][0]['Seq-annot_data']['Seq-annot_data_ftable'][1]['Seq-feat_data']['SeqFeatData']['SeqFeatData_gene']['Gene-ref']
    #&gt;&gt;&gt; {u'Gene-ref_desc': 'DEAD (Asp-Glu-Ala-Asp) box polypeptide 4', u'Gene-ref_syn': ['VASA'], u'Gene-ref_locus': 'DDX4'}
    
    def parse_blast(self):
    '''Parse BLAST output file.'''
    blast_records = NCBIXML.parse(open(self.blast_output))
    n = 0
    <ol>
    <li>Iterate over BLAST results from database.</li>
    </ol>
    
    
    for blast_record in blast_records:
    for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
    if n &lt; self.limit:
    if hsp.expect &lt; self.EVALUE_THRESH:
    
    print 'Instantiating &gt;&gt; %s' % alignment.title
    locus_id = alignment.title.split()[0]
    
    <ol>
    <li>Save locus to dictionary if key does not exists.</li>
    </ol>
    
    
    #NOTE This means that only the first hsp found in the locus is added,
    <ol>
    <li>while the other ones are skipped</li>
    </ol>
    
    
    if not locus_id in self.loci.keys():
    self.loci[locus_id] = {
    'score': hsp.score,
    'evalue': hsp.expect
    }
    
    n += 1
    
    class Locus(object):
    '''Represents a new sequence.
    
    candidate: instance of Candidate.
    description: long description.
    sequence: the sequence itself.
    score: from Candidate BLAST.
    evalue: from Candidate BLAST.
    reverse_blast_output: reverse BLAST to human protein db.
    '''
    EVALUE_THRESH = 0.001
    
    <ol>
    <li>Reverse BLAST folder.</li>
    </ol>
    
    
    reverse_folder = 'reverse'
    
    def __init__(self, id, candidate, database):
    <ol>
    <li>Linkback to candidates.</li>
    </ol>
    
    
    self.candidates = {}
    
    self.reciprocals = []
    
    self.update_candidates(candidate)
    
    self.reciprocal = False
    self.rank = 0
    self.id = id
    self.database = database
    
    <ol>
    <li>Create filename before the rest.</li>
    </ol>
    
    
    self.set_filename()
    
    self.reverse_results_folder = os.path.join(self.reverse_folder, 'results')
    
    <ol>
    <li>Check if reverse folder exists.</li>
    </ol>
    
    
    if not os.path.isdir(self.reverse_folder):
    os.mkdir(self.reverse_folder)
    <ol>
    <li>Check if reverse results folder exists.</li>
    </ol>
    
    
    if not os.path.isdir(self.reverse_results_folder):
    os.mkdir(self.reverse_results_folder)
    
    self.set_filepath()
    self.set_blast_output()
    self.set_sequence()
    
    self.write_fasta()
    
    def update_candidates(self, candidate):
    '''Update list of linked candidate genes.'''
    if not candidate.gene_name in self.candidates.keys():
    self.candidates[candidate.gene_name] = candidate
    
    def set_sequence(self):
    '''Get and set sequence from database.'''
    parsed = SeqIO.parse(self.database, 'fasta')
    for locus in parsed:
    if locus.id == self.id:
    self.sequence = str(locus.seq)
    if not self.sequence:
    print 'SEQUENCE NOT SET! Check the IDs.'
    
    def set_blast_output(self):
    '''Build filepath for reverse blast output.'''
    blast_output = os.path.join(self.reverse_results_folder, '%s.xml' % self.filename)
    self.reverse_blast_output = blast_output
    
    def set_filepath(self):
    '''Build filepath from folder and filename.'''
    locus_filepath = os.path.join(self.reverse_folder, '%s.fa' % self.filename)
    self.filepath = locus_filepath
    
    def set_filename(self):
    '''Extract filename from ID.'''
    filename = '_'.join(self.id.split('_')[:2])
    self.filename = filename
    
    def write_fasta(self):
    '''Write FASTA file to filepath.'''
    #TODO Insert line breaks for sequences.
    locus_file = open(self.filepath, 'w')
    locus_file.write('&gt;%s\n%s' % (self.id, self.sequence))
    locus_file.close()
    
    def parse_blast(self):
    '''Parse reciprocal BLAST.'''
    print 'Parsing reciprocal BLAST: %s' % self.filename
    blast_records = NCBIXML.parse(open(self.reverse_blast_output))
    for blast_record in blast_records:
    for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
    if hsp.expect &lt; self.EVALUE_THRESH:
    #XXX Better parse ref!
    ref = alignment.title.split('|')[3]
    #print 'Creating object: %s' % ref
    sequence = Sequence(ref=ref)
    sequence.evalue = hsp.expect
    sequence.score = hsp.score
    self.reciprocals.append(sequence)
    <ol>
    <li>Process results.</li>
    </ol>
    
    
    #self.process()
    
    def process(self):
    '''Define if locus is reciprocal to the gene.
    
    True: if it matches the same gene_id and ref from the candidate gene.
    '''
    #TODO Handle when gene_id is None?
    rank = 0
    print '\nrank\tgene\tref\t\thit'
    for sequence in self.reciprocals:
    if sequence.gene_id == current_gene_id:
    #DOSTUFF
    continue
    else:
    break
    current_gene_id = sequence.gene_id
    
    rank += 1
    #if sequence.gene_id == self.candidate.gene_id:
    <ol>
    <li>if sequence.ref == self.candidate.ref:</li>
    <li>print '%d\t%s\t%s\t%s' % (rank, sequence.gene_id, sequence.ref, '+')</li>
    <li>print '\nFound reciprocal! Gene: %s, Ref: %s\n' % (sequence.gene_id, sequence.ref)</li>
    <li>self.reciprocal = True</li>
    <li>self.rank = rank</li>
    <li>break</li>
    </ol>
    
    
    print '%d\t%s\t%s\t%s' % (rank, sequence.gene_id, sequence.ref, '-')
    
    def blast(blast_type, arguments):
    '''Execute BLAST command.'''
    <ol>
    <li>Instantiate the BLAST command.</li>
    </ol>
    
    
    if blast_type == 'blastn':
    cline = NcbiblastnCommandline(**arguments)
    elif blast_type == 'blastp':
    cline = NcbiblastpCommandline(**arguments)
    elif blast_type == 'blastx':
    cline = NcbiblastxCommandline(**arguments)
    elif blast_type == 'tblastn':
    cline = NcbitblastnCommandline(**arguments)
    elif blast_type == 'tblastx':
    cline = NcbitblastnCommandline(**arguments)
    
    <ol>
    <li>Execute BLAST.</li>
    </ol>
    
    
    stdout, stderr = cline()
    print '%s BLASTed!' % arguments['query']
    
    def prepare(candidates, candidates_folder):
    '''Check candidate gene files (convert to FASTA, if needed).'''
    for gene in candidates:
    gene_filepath = os.path.join(candidates_folder, gene)
    
    if gene.endswith('.gb'):
    <ol>
    <li>Create record object.</li>
    </ol>
    
    
    record = SeqIO.read(gene_filepath, 'genbank')
    <ol>
    <li>Create FASTA file.</li>
    </ol>
    
    
    try:
    fasta_file = open(gene_filepath.split('.')[0] + '.fa', 'r')
    except IOError:
    fasta_file = open(gene_filepath.split('.')[0] + '.fa', 'w')
    fasta_file.write(record.format('fasta'))
    fasta_file.close()
    elif gene.endswith('.fa'):
    continue
    else:
    logger.debug('File type not supported: %s', gene)
    
    def usage():
    '''Explanation for arguments.'''
    
    print
    print 'USAGE: ./blaster.py -d Membranipora.fasta -b tblastn'
    print
    print ' -c, --candidates \n\tFolder with `.fa` or `.gb` files of candidate genes. One gene per file.'
    print ' -d, --database \n\tLocal database with new data (eg, transcriptome).'
    print ' -b, --blast \n\tBLAST command (blastn, blastp, blastx, tblastn, tblastx).'
    print
    
    #TODO Omit the candidates folder option? Should not be needed.
    #TODO Ability to specify limit for candidate blast results to be analysed.
    #TODO Choose evalue threshold.
    
    def main(argv):
    
    <ol>
    <li>Available BLASTs.</li>
    </ol>
    
    
    blast_types = ['blastn', 'blastp', 'blastx', 'tblastn', 'tblastx']
    <ol>
    <li>Database name.</li>
    </ol>
    
    
    database = None
    <ol>
    <li>BLAST command.</li>
    </ol>
    
    
    blast_type = None
    
    <ol>
    <li>Folder with candidate-genes.</li>
    </ol>
    
    
    candidates_folder = 'candidates'
    <ol>
    <li>Folder with candidate-genes BLASTs.</li>
    </ol>
    
    
    candidates_results_folder = os.path.join(candidates_folder, 'results')
    
    <ol>
    <li>Parse arguments.</li>
    </ol>
    
    
    try:
    opts, args = getopt.getopt(argv, 'hc:d:b:', [
    'help',
    'candidates=',
    'database=',
    'blast=',
    ])
    except getopt.GetoptError:
    usage()
    logger.critical('Argument error: "%s". Aborting...',
    ' '.join(argv))
    sys.exit(2)
    
    <ol>
    <li>Argument handler.</li>
    </ol>
    
    
    for opt, arg in opts:
    if opt in ('-h', '--help'):
    usage()
    sys.exit()
    elif opt in ('-c', '--candidates'):
    candidates_folder = arg
    elif opt in ('-d', '--database'):
    database = os.path.abspath(arg)
    elif opt in ('-b', '--blast'):
    blast_type = arg
    
    <ol>
    <li>Print summary of arguments.</li>
    </ol>
    
    
    logger.debug('Arguments: candidates=%s, database=%s, blast=%s', candidates_folder, database, blast_type)
    
    <ol>
    <li>Check if BLAST command was specified.</li>
    </ol>
    
    
    if not blast_type:
    logger.critical('BLAST command was not specified (use "-b"). Aborting...')
    sys.exit(2)
    else:
    if not blast_type in blast_types:
    logger.critical('Unknown BLAST command: %s', blast_type)
    logger.info('Available BLAST commands: %s', ', '.join(blast_types))
    sys.exit(2)
    
    <ol>
    <li>Get candidate genes.</li>
    </ol>
    
    
    try:
    candidates = os.listdir(candidates_folder)
    except OSError:
    logger.critical('Folder "%s" does not exist! Aborting...', candidates_folder)
    sys.exit(2)
    
    <ol>
    <li>Issue error if there are no candidate genes.</li>
    </ol>
    
    
    if not candidates:
    logger.critical('There are no candidate genes at %s folder! Aborting...',
    candidates_folder)
    sys.exit(2)
    
    <ol>
    <li>Check if results exists.</li>
    </ol>
    
    
    if not os.path.isdir(candidates_results_folder):
    os.mkdir(candidates_results_folder)
    
    <ol>
    <li>Process candidate genes.</li>
    </ol>
    
    
    prepare(candidates, candidates_folder)
    <ol>
    <li>Get proper genes, now.</li>
    </ol>
    
    
    candidates = os.listdir(candidates_folder)
    <ol>
    <li>Only read FASTA files.</li>
    </ol>
    
    
    candidates = [file for file in candidates if file.endswith('.fa')]
    #TODO make it recognize more FASTA extensions.
    
    <ol>
    <li>Print info before starting.</li>
    </ol>
    
    
    logger.info('%d genes to be BLASTed against %s database!', len(candidates),
    database)
    
    <ol>
    <li>Main object to store genes and loci.</li>
    </ol>
    
    
    genes = {}
    loci = {}
    
    <ol>
    <li>BLAST each gene against local database.</li>
    </ol>
    
    
    for genefile in candidates:
    print
    logger.info('BLASTing %s against %s...', genefile, database)
    
    <ol>
    <li>Instantiate Sequence.</li>
    </ol>
    
    
    gene_filepath = os.path.join(candidates_folder, genefile)
    candidate = Sequence(filepath=gene_filepath, database=database)
    
    <ol>
    <li>Define variables.</li>
    </ol>
    
    
    output_filepath = os.path.join(candidates_results_folder, '%s.xml' % candidate.gene_name)
    candidate.blast_output = output_filepath
    
    <ol>
    <li>Only BLAST if needed.</li>
    </ol>
    
    
    try:
    blastfile = open(candidate.blast_output)
    blastfile.close()
    except:
    arguments = {
    'query': candidate.filepath,
    'db': database,
    'out': candidate.blast_output,
    'outfmt': 5, # Export in XML
    }
    <ol>
    <li>Execute BLAST command.</li>
    </ol>
    
    
    blast(blast_type, arguments)
    
    logger.info('Parsing XML...')
    print '\nCandidate &gt;&gt; Gene: %s, ID:%s, Ref: %s, Description: %s' % (
    candidate.gene_name, candidate.gene_id,
    candidate.ref, candidate.description
    )
    candidate.parse_blast()
    
    #DO LOCUS STUFF
    for locus_id in candidate.loci.keys():
    <ol>
    <li>If locus was already specified.</li>
    </ol>
    
    
    if locus_id in loci.keys():
    locus = loci[locus_id]
    locus.update_candidates(candidate)
    else:
    <ol>
    <li>Instantiate Locus object.</li>
    </ol>
    
    
    locus = Locus(locus_id, candidate, database)
    
    <ol>
    <li>Reciprocal BLAST querying locus sequence against human database.</li>
    </ol>
    
    
    try:
    blastfile = open(locus.reverse_blast_output)
    blastfile.close()
    except:
    #XXX Find a better way to overcome problem when path has a "|".
    reverse_args = {
    'query': locus.filepath.replace('|', '\|'),
    'db': '/home/nelas/Biologia/Doutorado/genomic_databases/human_protein.fa',
    'out': locus.reverse_blast_output.replace('|', '\|'),
    'outfmt': 5,
    }
    if blast_type == 'tblastn':
    blast('blastx', reverse_args)
    elif blast_type == 'blastp':
    blast('blastp', reverse_args)
    
    <ol>
    <li>Parse reverse BLAST output.</li>
    </ol>
    
    
    locus.parse_blast()
    
    <ol>
    <li>Add locus to main list.</li>
    </ol>
    
    
    loci[locus_id] = locus
    
    <ol>
    <li>Add to main dictionary.</li>
    </ol>
    
    
    genes[candidate.gene_name] = candidate
    
    <ol>
    <li>Print loci equivalent to candidate genes.</li>
    </ol>
    
    
    logger.info('Creating result files...')
    
    fasta_loci = []
    output_file = open('results.txt', 'w')
    
    #TODO Print date and variables used to results.txt.
    for locus_id, locus in loci.iteritems():
    fasta_loci.append(SeqRecord(Seq(locus.sequence), id=locus_id,
    description='related to: %s' % ', '.join(locus.candidates.keys())))
    print
    print locus_id, locus.candidates.keys()
    output_file.write('%s\t\t(candidates: %s)\n\n' % (locus_id,
    ', '.join(locus.candidates.keys())))
    try:
    first = locus.reciprocals[0]
    except:
    print '\tNo reciprocals.'
    current_gene_id = first.gene_id
    n_genes = 0
    output_file.write('\tgene\t\tid\t\taccession\t\te-value\n')
    for sequence in locus.reciprocals:
    <ol>
    <li>Limits to 5 most similar genes.</li>
    </ol>
    
    
    if sequence.gene_id == current_gene_id or n_genes &lt; 5:
    print '\t%s, %s, %s, %s' % (
    sequence.gene_name, sequence.gene_id,
    sequence.ref, sequence.evalue
    )
    
    output_file.write('\t%s\t\t%s\t\t%s\t\t%.2e' % (
    sequence.gene_name, sequence.gene_id,
    sequence.ref, sequence.evalue
    ))
    accession_numbers = [seq.ref for seq in locus.candidates.values()]
    if sequence.ref in accession_numbers:
    output_file.write(' &lt;&lt;')
    output_file.write('\n')
    else:
    break
    current_gene_id = sequence.gene_id
    n_genes += 1
    output_file.write('\n\n')
    print
    output_file.close()
    
    wrote_fasta = SeqIO.write(fasta_loci, 'results.fa', 'fasta')
    
    logger.info('%d records written to results.fa', wrote_fasta)
    logger.info('Done, bye!')
    
    if __name__ == '__main__':
    <ol>
    <li>Create logger.</li>
    </ol>
    
    
    logger = logging.getLogger('blaster')
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    <ol>
    <li>Define message format.</li>
    </ol>
    
    
    formatter = logging.Formatter('[%(levelname)s] %(asctime)s @ %(module)s %(funcName)s (l%(lineno)d): %(message)s')
    
    <ol>
    <li>Create logger console handler.</li>
    </ol>
    
    
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.DEBUG)
    <ol>
    <li>Format console output.</li>
    </ol>
    
    
    console_handler.setFormatter(formatter)
    <ol>
    <li>Add console to logger.</li>
    </ol>
    
    
    logger.addHandler(console_handler)
    
    <ol>
    <li>Create logger file handler.</li>
    </ol>
    
    
    file_handler = logging.FileHandler('log_blaster.log')
    file_handler.setLevel(logging.DEBUG)
    <ol>
    <li>Format file output.</li>
    </ol>
    
    
    file_handler.setFormatter(formatter)
    <ol>
    <li>Add file to logger.</li>
    </ol>
    
    
    logger.addHandler(file_handler)
    
    <ol>
    <li>Initiates main function.</li>
    </ol>
    
    
    logger.info('BLASTer is running...')
    main(sys.argv[1:])
    
    
     
  • Bruno Vellutini 07:01 on 2011/11/23 Permalink
    Tags: blast,   

    BLAST+ programs 

    nucleotide blast: Search a nucleotide database using a nucleotide query [blastn, megablast, discontiguous megablast]
    protein blast: Search protein database using a protein query [blastp, psi-blast, phi-blast]
    blastx: Search protein database using a translated nucleotide query
    tblastn: Search translated nucleotide database using a protein query
    tblastx: Search translated nucleotide database using a translated nucleotide query

    Reference: http://blast.ncbi.nlm.nih.gov/Blast.cgi

     
  • Bruno Vellutini 11:24 on 2011/07/27 Permalink
    Tags: , blast, ,   

    First dirty script 

    Following the previous entry I started to write a script to programatically BLAST sequences against a local database and GenBank database. It is a Python script using Biopython module and BLAST+.

    As a test case I downloaded sequences from a Bugula neritina transcriptome (from this study(20540116)), selected a few candidate-genes from another B. neritina study(10.1186/2041-9139-2-13), and executed local BLASTs at the database.

    Transcriptome sequences are in FASTA format and need to be converted to a database BLAST+ can understand. This is accomplished with the following command:

    makeblastdb -in bugula.fasta -parse_seqids -dbtype nucl
    

    Now that the local database is ready to be BLASTed I put GenBank (.gb) files from single candidate-genes into the candidates folder. Then run the script with: python blaster.py or ./blaster.py. It will pickup each candidate-gene from the folder, BLAST against the database, and write the results to the local_blasts folder.

    I need to process (select, limit, etc) these local BLASTS results and run the obtained sequences against GenBank. This last part is still missing from the script, but should be ok to write.

    Below is the zip file with the test case (except for the Bugula database) and the script itself:

    BLAST Batch Script [zip]

    #!/usr/bin/python
    <ol>
    <li>-*- coding: utf-8 -*-</li>
    </ol>
    
    
    
    '''BLASTing batch script.
    
    Algorithm:
        - Take several genes of interest.
        - BLAST them locally against a transcriptome assembly.
        - Filter and process results to extract relevant sequences.
        - BLAST resulting sequences against GenBank.
        - Retrieve results for further analysis.
    
    Notes:
        - Read files with sequences from a folder (FASTA or GenBank format). Could also be a list of IDs or
          an pickled object, etc...
    
    Dependencies:
        - Python 2.7.1
        - Biopython 1.56
        - BLAST+ 2.2.25
    
    Configuration (Ubuntu):
        - Regular packages for Python and Biopython.
        - Add BLAST+ commands to path putting this code to .bashrc:
    
            PATH=$PATH:/home/nelas/Downloads/ncbi-blast-2.2.25+/bin
            export PATH
    '''
    
    import os
    
    from Bio import SeqIO
    from Bio.Blast import NCBIWWW
    from Bio.Blast import NCBIXML
    from Bio.Blast.Applications import NcbiblastnCommandline
    
    <ol>
    <li>Folder with candidate-genes.</li>
    </ol>
    
    
    CANDIDATE_FOLDER = 'candidates'
    
    <ol>
    <li>Folder with candidate-genes BLASTs.</li>
    </ol>
    
    
    CANDIDATE_BLASTS = 'local_blasts'
    
    <ol>
    <li>Folder with candidate-genes BLASTs.</li>
    </ol>
    
    
    GENBANK_BLASTS = 'genbank_blasts'
    
    <ol>
    <li>Database name.</li>
    </ol>
    
    
    DATABASE = 'bugula.fasta'
    <ol>
    <li>The source file from Bugula neritina transcriptome was downloaded from here:</li>
    <li>http://www.ncbi.nlm.nih.gov/sra/SRX015704</li>
    </ol>
    
    
    
    <ol>
    <li>Note: the original FASTA file needs to be converted to a BLASTable database.</li>
    <li>This can be done via command line and preferably before the script is run:</li>
    <li>makeblastdb -in bugula.fasta -parse_seqids -dbtype nucl</li>
    </ol>
    
    
    
    <ol>
    <li>Deprecated way of doing it, just for reference:</li>
    </ol>
    
    
    
    <ol>
    <li>Format a FASTA file to a readable database.</li>
    <li>formatdb -i bugula.fasta -p F -o</li>
    <li>BLAST against nucleotide with XML output:</li>
    <li>blast2 -p blastn -d bugula.fasta -o blastbugula.xml -i candidates/BnFoxA.gb -m 7</li>
    </ol>
    
    
    
    <ol>
    <li>Build list with path to candidate-genes.</li>
    </ol>
    
    
    candidates = os.listdir(CANDIDATE_FOLDER)
    <ol>
    <li>Only read GenBank files.</li>
    </ol>
    
    
    candidates = [file for file in candidates if file.endswith('.gb')]
    
    print '\n%d genes to be BLASTed against %s database!' % (len(candidates),
            DATABASE)
    
    for gene_file in candidates:
        # Define filepath.
        gene_filepath = os.path.join(CANDIDATE_FOLDER, gene_file)
    
        # Create record object.
        record = SeqIO.read(gene_filepath, 'genbank')
    
        # Create temporary FASTA file, if not available.
        try:
            fasta_file = open(gene_filepath.split('.')[0] + '.fasta', 'r')
        except IOError:
            fasta_file = open(gene_filepath.split('.')[0] + '.fasta', 'w')
            fasta_file.write(record.format('fasta'))
            fasta_file.close()
            fasta_file = open(gene_filepath.split('.')[0] + '.fasta', 'r')
    
        # Instantiate the BLASTn command.
        cline = NcbiblastnCommandline(
                query=fasta_file.name, db=DATABASE,
                out=os.path.join(
                    CANDIDATE_BLASTS, gene_file.split('.')[0]),
                    #CANDIDATE_BLASTS, gene_file.split('.')[0] + '.xml'),
                #outfmt=5 # Export in XML
                )
    
        # Execute BLAST.
        stdout, stderr = cline()
    
        # Close file.
        fasta_file.close()
    
        print '\t' + gene_file
    
    #TODO Process local BLASTs
    
    #TODO BLAST results against GenBank
    
    #TODO Process remote BLASTs
    
    print 'Done, bye!\n'
    

    Bibliography

     
  • Bruno Vellutini 21:32 on 2011/07/25 Permalink
    Tags: , blast, ,   

    Biopython begins 

    Learning a bit of bioinformatics to deal with transcriptome data. Idea is to use Biopython tools to automate sequence BLASTing with Python scripts. Here are some notes that might be useful to keep; extracted from the Tutorial and Cookbook.

    Sequence

    Sequence objects Seq are composed of a string of letters and an alphabet, which determines what the letters mean (eg. is A a nucleotide or Alanine?). Seq objects act like “read only” python strings (use str(seq) to compare sequences). To modify sequences use from Bio.Seq import MutableSeq.

    >>> from Bio.Seq import Seq
    >>> my_seq = Seq('AGTACACTGGT')
    >>> my_seq
    Seq('AGTACACTGGT', Alphabet())
    >>> print my_seq
    AGTACACTGGT
    >>> my_seq.alphabet
    Alphabet()
    

    When creating sequences the alphabet should be explicitly specified following the IUPAC alphabet recommendations defined in Bio.Alphabet module.

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq('AGTACACTGGT', IUPAC.unambiguous_dna)
    >>> my_seq
    Seq('AGTACACTGGT', IUPACUnambiguousDNA())
    >>> my_seq.alphabet
    IUPACUnambiguousDNA()
    >>> my_prot = Seq('AGTACACTGGT', IUPAC.protein)
    >>> my_prot
    Seq('AGTACACTGGT', IUPACProtein())
    >>> my_prot.alphabet
    IUPACProtein()
    

    There are methods to obtain the complement and reverse_complement of a sequence:

    >>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
    >>> my_seq
    Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
    >>> my_seq.complement()
    Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
    >>> my_seq.reverse_complement()
    Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
    

    The biological transcription process run a reverse complement on the template strand. On bioinformatics it is easier to only replace the coding strand with Us.

     	DNA coding strand (aka Crick strand, strand +1)	 
    5’	ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG	3’
     	|||||||||||||||||||||||||||||||||||||||	 
    3’	TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC	5’
     	DNA template strand (aka Watson strand, strand −1)	 
     
     	                  |	 
     	            Transcription	 
     	                  ↓	 
     
    5’	AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG	3’
     	Single stranded messenger RNA	 
    

    Whenever a transcribe (or back_transcribe) is executed, the alphabet is changed accordingly:

    >>> coding_dna
    Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
    >>> messenger_rna = coding_dna.transcribe()
    >>> messenger_rna
    Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
    >>> messenger_rna.back_transcribe()
    Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
    

    You can translate a messenger rna or the coding dna directly. A translation table can be specified to correctly deal with differing stop codons (see NCBI Genetic Codes):

    >>> messenger_rna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(table='Vertebrate Mitochondrial')
    Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
    

    Data for each table can be accessed via from Bio.Data import CodonTable.

    SeqRecord

    Class that allows the association of features and identifiers with a sequence. Has the following attributes:

    • seq – The sequence itself [Seq object].
    • id – The primary ID to identify the sequence [string].
    • name – A “common” name/id for the sequence [string].
    • description – A human readable description for the sequence [string].
    • letter_annotations – Per-letter-annotations, keys are the name and information is contained as Python sequence [dictionary].
    • annotations – Additional information about the sequence [dictionary].
    • features – List of SeqFeature objects with more structured information about the features on a sequence [list].
    • dbxrefs – List of database cross-references [list].

    Parse this information from files using SeqIO (GenBank files provide annotations and features, FASTA are more compact):

    &gt;&gt;&gt; from Bio import SeqIO
    &gt;&gt;&gt; record = SeqIO.read('NC_005816.gb', 'genbank')
    &gt;&gt;&gt; record
    SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
    IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
    description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
    dbxrefs=['Project:10638'])
    <ol>
    <li>Use &quot;SeqIO.parse()&quot; if the file has more than 1 entry.</li>
    </ol>
    
    
    

    SeqFeature

    Essential for describing aspects of a sequence. See here.

    Parsing

    Sequences from a taxon can be manually queried from the Nucleotide database, which collects data from several sources and results exported in different formats.

    SeqIO can parse results from GenBank and FASTA files returning an iterator. First argument is the file (or handle) and second argument is the file format (important!). In case you need to read a one entry file use Bio.SeqIO.read().

    from Bio import SeqIO
    for seq_record in SeqIO.parse('clypeasteroida.gb', 'genbank'):
        # Just print some info.
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)
    <ol>
    <li>Use &quot;SeqIO.parse('clypeasteroida.fasta', 'fasta')&quot; for FASTA files.</li>
    </ol>
    
    
    

    Load the whole records into memory with:

    from Bio import SeqIO
    records = list(SeqIO.parse('clypeasteroida.gb', 'genbank'))
    

    Get the organism scientific name from a record:

    >>> print first_record.annotations["organism"]
    Clypeaster subdepressus
    

    Sequence as dictionaries

    Bio.SeqIO.to_dict() – flexible and memory demanding. Keys can be specified if needed (for FASTA) using the argument key_function.

    from Bio import SeqIO
    orchid_dict = SeqIO.to_dict(SeqIO.parse('clypeasteroida.gb', 'genbank'))
    

    Bio.SeqIO.index() – middle ground, dictionary parsing on demand. Use only files, not handles.

    from Bio import SeqIO
    orchid_dict = SeqIO.index('clypeasteroida.gb', 'genbank')
    

    Bio.SeqIO.index_db() – low memory and bit slower; stores identifiers on a SQLite database. Three arguments, name of the database (preferably .idx), file (or list of files) to read, and format.

    from Bio import SeqIO
    indexed_dict = SeqIO.index_db('database.idx', files, 'genbank')
    

    Use get_raw() method with a record id to access the raw text entry.

    BLASTing

    Internet

    Use qblast() function from Bio.Blast.NCBIWWW with 3 arguments: blast program (blastn, blastp, blastx, tblast, tblastx; see descriptions), databases to be searched (see descriptions), and string containing the sequence (sequence itself, sequence in FASTA or identifier). Function returns XML.

    from Bio.Blast import NCBIWWW
    from Bio import SeqIO
    record = SeqIO.read(open('m_cold.fasta'), format='fasta')
    result_handle = NCBIWWW.qblast('blastn', 'nr', record.format('fasta'))
    <ol>
    <li>Also: result_handle = NCBIWWW.qblast('blastn', 'nr', record.seq)</li>
    </ol>
    
    
    

    It is better to save the results to a file to avoid reconnection to NCBI:

    <ol>
    <li>Save results to file.</li>
    </ol>
    
    
    save_file = open('my_blast.xml', 'w')
    save_file.write(result_handle.read())
    save_file.close()
    result_handle.close()
    <ol>
    <li>Open file to read results.</li>
    </ol>
    
    
    result_handle = open(&quot;my_blast.xml&quot;)
    

    Local

    BLAST+ is the standalone program and Bio.Blast.Applications has wrappers for it (except for makeblastdb, which could be useful in the future).

    This is a regular command line, assuming nr database is installed and configured:

    blastx -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001 -outfmt 5
    

    And this is the same command wrapped:

    >>> from Bio.Blast.Applications import NcbiblastxCommandline
    >>> blastx_cline = NcbiblastxCommandline(query='opuntia.fasta', db='nr', evalue=0.001,
    ...                                      outfmt=5, out='opuntia.xml')
    >>> print blastx_cline
    blastx -out opuntia.xml -outfmt 5 -query opuntia.fasta -db nr -evalue 0.001
    >>> stdout, stderr = blastx_cline()
    

    BLAST parsing

    XML is the more stable format for parsing results. It does not matter how the data is fetched, Biopython will be able to parse it if it is XML. To parse the handle directly or an opened .xml file:

    from Bio.Blast import NCBIXML
    blast_records = NCBIXML.parse(result_handle)
    for record in blast_records:
        <do something>
    

    BLAST class

    Example of information stored in a BLAST record:

    E_VALUE_THRESH = 0.04
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < E_VALUE_THRESH:
                print '****Alignment****'
                print 'sequence:', alignment.title
                print 'length:', alignment.length
                print 'e value:', hsp.expect
                print hsp.query[0:75] + '...'
                print hsp.match[0:75] + '...'
                print hsp.sbjct[0:75] + '...'
    

    Error handling

    Not easy to get a ValueError in the middle of a huge file, so use BlastErrorParser:

    <ol>
    <li>Define filenames.</li>
    </ol>
    
    
    import os
    blast_file = 'big_blast.out'
    error_file = 'big_blast.problems'
    <ol>
    <li>Create error handler.</li>
    </ol>
    
    
    from Bio.Blast import NCBIStandalone
    error_handle = open(error_file, 'w')
    blast_error_parser = NCBIStandalone.BlastErrorParser(error_handle)
    <ol>
    <li>Trigger iterator with the error parser.</li>
    </ol>
    
    
    result_handle = open(blast_file)
    iterator = NCBIStandalone.Iterator(result_handle, blast_error_parser)
    

    Entrez

    Still need to properly read Chapter 8 and take notes.

    Parse sequence(s) from id(s) using Entrez:

    from Bio import Entrez
    from Bio import SeqIO
    Entrez.email = 'A.N.Other@example.com'
    handle = Entrez.efetch(db='nucleotide', rettype='gb', \
                           id='6273291,6273290,6273289')
    for seq_record in SeqIO.parse(handle, 'gb'):
        print seq_record.id, seq_record.description[:50] + '...'
        print 'Sequence length %i,' % len(seq_record),
        print '%i features,' % len(seq_record.features),
        print 'from: %s' % seq_record.annotations['source']
    handle.close()
    
     
c
Compose new post
j
Next post/Next comment
k
Previous post/Previous comment
r
Reply
e
Edit
o
Show/Hide comments
t
Go to top
l
Go to login
h
Show/Hide help
shift + esc
Cancel