First dirty script

Following the previous entry I started to write a script to programatically BLAST sequences against a local database and GenBank database. It is a Python script using Biopython module and BLAST+.

As a test case I downloaded sequences from a Bugula neritina transcriptome (from this study(20540116)), selected a few candidate-genes from another B. neritina study(10.1186/2041-9139-2-13), and executed local BLASTs at the database.

Transcriptome sequences are in FASTA format and need to be converted to a database BLAST+ can understand. This is accomplished with the following command:

makeblastdb -in bugula.fasta -parse_seqids -dbtype nucl

Now that the local database is ready to be BLASTed I put GenBank (.gb) files from single candidate-genes into the candidates folder. Then run the script with: python blaster.py or ./blaster.py. It will pickup each candidate-gene from the folder, BLAST against the database, and write the results to the local_blasts folder.

I need to process (select, limit, etc) these local BLASTS results and run the obtained sequences against GenBank. This last part is still missing from the script, but should be ok to write.

Below is the zip file with the test case (except for the Bugula database) and the script itself:

BLAST Batch Script [zip]

#!/usr/bin/python
<ol>
<li>-*- coding: utf-8 -*-</li>
</ol>



'''BLASTing batch script.

Algorithm:
    - Take several genes of interest.
    - BLAST them locally against a transcriptome assembly.
    - Filter and process results to extract relevant sequences.
    - BLAST resulting sequences against GenBank.
    - Retrieve results for further analysis.

Notes:
    - Read files with sequences from a folder (FASTA or GenBank format). Could also be a list of IDs or
      an pickled object, etc...

Dependencies:
    - Python 2.7.1
    - Biopython 1.56
    - BLAST+ 2.2.25

Configuration (Ubuntu):
    - Regular packages for Python and Biopython.
    - Add BLAST+ commands to path putting this code to .bashrc:

        PATH=$PATH:/home/nelas/Downloads/ncbi-blast-2.2.25+/bin
        export PATH
'''

import os

from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio.Blast.Applications import NcbiblastnCommandline

<ol>
<li>Folder with candidate-genes.</li>
</ol>


CANDIDATE_FOLDER = 'candidates'

<ol>
<li>Folder with candidate-genes BLASTs.</li>
</ol>


CANDIDATE_BLASTS = 'local_blasts'

<ol>
<li>Folder with candidate-genes BLASTs.</li>
</ol>


GENBANK_BLASTS = 'genbank_blasts'

<ol>
<li>Database name.</li>
</ol>


DATABASE = 'bugula.fasta'
<ol>
<li>The source file from Bugula neritina transcriptome was downloaded from here:</li>
<li>http://www.ncbi.nlm.nih.gov/sra/SRX015704</li>
</ol>



<ol>
<li>Note: the original FASTA file needs to be converted to a BLASTable database.</li>
<li>This can be done via command line and preferably before the script is run:</li>
<li>makeblastdb -in bugula.fasta -parse_seqids -dbtype nucl</li>
</ol>



<ol>
<li>Deprecated way of doing it, just for reference:</li>
</ol>



<ol>
<li>Format a FASTA file to a readable database.</li>
<li>formatdb -i bugula.fasta -p F -o</li>
<li>BLAST against nucleotide with XML output:</li>
<li>blast2 -p blastn -d bugula.fasta -o blastbugula.xml -i candidates/BnFoxA.gb -m 7</li>
</ol>



<ol>
<li>Build list with path to candidate-genes.</li>
</ol>


candidates = os.listdir(CANDIDATE_FOLDER)
<ol>
<li>Only read GenBank files.</li>
</ol>


candidates = [file for file in candidates if file.endswith('.gb')]

print '\n%d genes to be BLASTed against %s database!' % (len(candidates),
        DATABASE)

for gene_file in candidates:
    # Define filepath.
    gene_filepath = os.path.join(CANDIDATE_FOLDER, gene_file)

    # Create record object.
    record = SeqIO.read(gene_filepath, 'genbank')

    # Create temporary FASTA file, if not available.
    try:
        fasta_file = open(gene_filepath.split('.')[0] + '.fasta', 'r')
    except IOError:
        fasta_file = open(gene_filepath.split('.')[0] + '.fasta', 'w')
        fasta_file.write(record.format('fasta'))
        fasta_file.close()
        fasta_file = open(gene_filepath.split('.')[0] + '.fasta', 'r')

    # Instantiate the BLASTn command.
    cline = NcbiblastnCommandline(
            query=fasta_file.name, db=DATABASE,
            out=os.path.join(
                CANDIDATE_BLASTS, gene_file.split('.')[0]),
                #CANDIDATE_BLASTS, gene_file.split('.')[0] + '.xml'),
            #outfmt=5 # Export in XML
            )

    # Execute BLAST.
    stdout, stderr = cline()

    # Close file.
    fasta_file.close()

    print '\t' + gene_file

#TODO Process local BLASTs

#TODO BLAST results against GenBank

#TODO Process remote BLASTs

print 'Done, bye!\n'

Bibliography