BLAST stands for Basic Local Alignment Search Tool. It finds regions of similarity between biological sequences. Biopython provides Bio.Blast module to deal with NCBI BLAST operation. You can run BLAST in either local connection or over Internet connection.
Let us understand these two connections in brief in the following section −
Biopython provides Bio.Blast.NCBIWWW module to call the online version of BLAST. To do this, we need to import the following module −
>>> from Bio.Blast import NCBIWWW
NCBIWW module provides qblast function to query the BLAST online version, https://blast.ncbi.nlm.nih.gov/Blast.cgi. qblast supports all the parameters supported by the online version.
To obtain any help about this module, use the below command and understand the features −
>>> help(NCBIWWW.qblast) Help on function qblast in module Bio.Blast.NCBIWWW: qblast( program, database, sequence, url_base = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi', auto_format = None, composition_based_statistics = None, db_genetic_code = None, endpoints = None, entrez_query = '(none)', expect = 10.0, filter = None, gapcosts = None, genetic_code = None, hitlist_size = 50, i_thresh = None, layout = None, lcase_mask = None, matrix_name = None, nucl_penalty = None, nucl_reward = None, other_advanced = None, perc_ident = None, phi_pattern = None, query_file = None, query_believe_defline = None, query_from = None, query_to = None, searchsp_eff = None, service = None, threshold = None, ungapped_alignment = None, word_size = None, alignments = 500, alignment_view = None, descriptions = 500, entrez_links_new_window = None, expect_low = None, expect_high = None, format_entrez_query = None, format_object = None, format_type = 'XML', ncbi_gi = None, results_file = None, show_overview = None, megablast = None, template_type = None, template_length = None ) BLAST search using NCBI's QBLAST server or a cloud service provider. Supports all parameters of the qblast API for Put and Get. Please note that BLAST on the cloud supports the NCBI-BLAST Common URL API (http://ncbi.github.io/blast-cloud/dev/api.html). To use this feature, please set url_base to 'http://host.my.cloud.service.provider.com/cgi-bin/blast.cgi' and format_object = 'Alignment'. For more details, please see 8. Biopython – Overview of BLAST https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE = BlastDocs&DOC_TYPE = CloudBlast Some useful parameters: - program blastn, blastp, blastx, tblastn, or tblastx (lower case) - database Which database to search against (e.g. "nr"). - sequence The sequence to search. - ncbi_gi TRUE/FALSE whether to give 'gi' identifier. - descriptions Number of descriptions to show. Def 500. - alignments Number of alignments to show. Def 500. - expect An expect value cutoff. Def 10.0. - matrix_name Specify an alt. matrix (PAM30, PAM70, BLOSUM80, BLOSUM45). - filter "none" turns off filtering. Default no filtering - format_type "HTML", "Text", "ASN.1", or "XML". Def. "XML". - entrez_query Entrez query to limit Blast search - hitlist_size Number of hits to return. Default 50 - megablast TRUE/FALSE whether to use MEga BLAST algorithm (blastn only) - service plain, psi, phi, rpsblast, megablast (lower case) This function does no checking of the validity of the parameters and passes the values to the server as is. More help is available at: https://ncbi.github.io/blast-cloud/dev/api.html
Usually, the arguments of the qblast function are basically analogous to different parameters that you can set on the BLAST web page. This makes the qblast function easy to understand as well as reduces the learning curve to use it.
To understand the process of connecting and searching BLAST online version, let us do a simple sequence search (available in our local sequence file) against online BLAST server through Biopython.
Step 1 − Create a file named blast_example.fasta in the Biopython directory and give the below sequence information as input
Example of a single sequence in FASTA/Pearson format: >sequence A ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatat tctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc >sequence B ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattca tattctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc
Step 2 − Import the NCBIWWW module.
>>> from Bio.Blast import NCBIWWW
Step 3 − Open the sequence file, blast_example.fasta using python IO module.
>>> sequence_data = open("blast_example.fasta").read() >>> sequence_data 'Example of a single sequence in FASTA/Pearson format:\n\n\n> sequence A\nggtaagtcctctagtacaaacacccccaatattgtgatataattaaaatt atattcatat\ntctgttgccagaaaaaacacttttaggctatattagagccatcttctttg aagcgttgtc\n\n'
Step 4 − Now, call the qblast function passing sequence data as main parameter. The other parameter represents the database (nt) and the internal program (blastn).
>>> result_handle = NCBIWWW.qblast("blastn", "nt", sequence_data) >>> result_handle <_io.StringIO object at 0x000001EC9FAA4558>
blast_results holds the result of our search. It can be saved to a file for later use and also, parsed to get the details. We will learn how to do it in the coming section.
Step 5 − The same functionality can be done using Seq object as well rather than using the whole fasta file as shown below −
>>> from Bio import SeqIO >>> seq_record = next(SeqIO.parse(open('blast_example.fasta'),'fasta')) >>> seq_record.id 'sequence' >>> seq_record.seq Seq('ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatat...gtc', SingleLetterAlphabet())
Now, call the qblast function passing Seq object, record.seq as main parameter.
>>> result_handle = NCBIWWW.qblast("blastn", "nt", seq_record.seq) >>> print(result_handle) <_io.StringIO object at 0x000001EC9FAA4558>
BLAST will assign an identifier for your sequence automatically.
Step 6 − result_handle object will have the entire result and can be saved into a file for later usage.
>>> with open('results.xml', 'w') as save_file: >>> blast_results = result_handle.read() >>> save_file.write(blast_results)
We will see how to parse the result file in the later section.
This section explains about how to run BLAST in local system. If you run BLAST in local system, it may be faster and also allows you to create your own database to search against sequences.
In general, running BLAST locally is not recommended due to its large size, extra effort needed to run the software, and the cost involved. Online BLAST is sufficient for basic and advanced purposes. Of course, sometime you may be required to install it locally.
Consider you are conducting frequent searches online which may require a lot of time and high network volume and if you have proprietary sequence data or IP related issues, then installing it locally is recommended.
To do this, we need to follow the below steps −
Step 1 − Download and install the latest blast binary using the given link − ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Step 2 − Download and unpack the latest and necessary database using the below link − ftp://ftp.ncbi.nlm.nih.gov/blast/db/
BLAST software provides lot of databases in their site. Let us download alu.n.gz file from the blast database site and unpack it into alu folder. This file is in FASTA format. To use this file in our blast application, we need to first convert the file from FASTA format into blast database format. BLAST provides makeblastdb application to do this conversion.
Use the below code snippet −
cd /path/to/alu makeblastdb -in alu.n -parse_seqids -dbtype nucl -out alun
Running the above code will parse the input file, alu.n and create BLAST database as multiple files alun.nsq, alun.nsi, etc. Now, we can query this database to find the sequence.
We have installed the BLAST in our local server and also have sample BLAST database, alun to query against it.
Step 3 − Let us create a sample sequence file to query the database. Create a file search.fsa and put the below data into it.
>gnl|alu|Z15030_HSAL001056 (Alu-J) AGGCTGGCACTGTGGCTCATGCTGAAATCCCAGCACGGCGGAGGACGGCGGAAGATTGCT TGAGCCTAGGAGTTTGCGACCAGCCTGGGTGACATAGGGAGATGCCTGTCTCTACGCAAA AGAAAAAAAAAATAGCTCTGCTGGTGGTGCATGCCTATAGTCTCAGCTATCAGGAGGCTG GGACAGGAGGATCACTTGGGCCCGGGAGTTGAGGCTGTGGTGAGCCACGATCACACCACT GCACTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTCAAAACAAACAAATAA >gnl|alu|D00596_HSAL003180 (Alu-Sx) AGCCAGGTGTGGTGGCTCACGCCTGTAATCCCACCGCTTTGGGAGGCTGAGTCAGATCAC CTGAGGTTAGGAATTTGGGACCAGCCTGGCCAACATGGCGACACCCCAGTCTCTACTAAT AACACAAAAAATTAGCCAGGTGTGCTGGTGCATGTCTGTAATCCCAGCTACTCAGGAGGC TGAGGCATGAGAATTGCTCACGAGGCGGAGGTTGTAGTGAGCTGAGATCGTGGCACTGTA CTCCAGCCTGGCGACAGAGGGAGAACCCATGTCAAAAACAAAAAAAGACACCACCAAAGG TCAAAGCATA >gnl|alu|X55502_HSAL000745 (Alu-J) TGCCTTCCCCATCTGTAATTCTGGCACTTGGGGAGTCCAAGGCAGGATGATCACTTATGC CCAAGGAATTTGAGTACCAAGCCTGGGCAATATAACAAGGCCCTGTTTCTACAAAAACTT TAAACAATTAGCCAGGTGTGGTGGTGCGTGCCTGTGTCCAGCTACTCAGGAAGCTGAGGC AAGAGCTTGAGGCTACAGTGAGCTGTGTTCCACCATGGTGCTCCAGCCTGGGTGACAGGG CAAGACCCTGTCAAAAGAAAGGAAGAAAGAACGGAAGGAAAGAAGGAAAGAAACAAGGAG AG
The sequence data are gathered from the alu.n file; hence, it matches with our database.
Step 4 − BLAST software provides many applications to search the database and we use blastn. blastn application requires minimum of three arguments, db, query and out. db refers to the database against to search; query is the sequence to match and out is the file to store results. Now, run the below command to perform this simple query −
blastn -db alun -query search.fsa -out results.xml -outfmt 5
Running the above command will search and give output in the results.xml file as given below (partially data) −
<?xml version = "1.0"?> <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd"> <BlastOutput> <BlastOutput_program>blastn</BlastOutput_program> <BlastOutput_version>BLASTN 2.7.1+</BlastOutput_version> <BlastOutput_reference>Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14. </BlastOutput_reference> <BlastOutput_db>alun</BlastOutput_db> <BlastOutput_query-ID>Query_1</BlastOutput_query-ID> <BlastOutput_query-def>gnl|alu|Z15030_HSAL001056 (Alu-J)</BlastOutput_query-def> <BlastOutput_query-len>292</BlastOutput_query-len> <BlastOutput_param> <Parameters> <Parameters_expect>10</Parameters_expect> <Parameters_sc-match>1</Parameters_sc-match> <Parameters_sc-mismatch>-2</Parameters_sc-mismatch> <Parameters_gap-open>0</Parameters_gap-open> <Parameters_gap-extend>0</Parameters_gap-extend> <Parameters_filter>L;m;</Parameters_filter> </Parameters> </BlastOutput_param> <BlastOutput_iterations> <Iteration> <Iteration_iter-num>1</Iteration_iter-num><Iteration_query-ID>Query_1</Iteration_query-ID> <Iteration_query-def>gnl|alu|Z15030_HSAL001056 (Alu-J)</Iteration_query-def> <Iteration_query-len>292</Iteration_query-len> <Iteration_hits> <Hit> <Hit_num>1</Hit_num> <Hit_id>gnl|alu|Z15030_HSAL001056</Hit_id> <Hit_def>(Alu-J)</Hit_def> <Hit_accession>Z15030_HSAL001056</Hit_accession> <Hit_len>292</Hit_len> <Hit_hsps> <Hsp> <Hsp_num>1</Hsp_num> <Hsp_bit-score>540.342</Hsp_bit-score> <Hsp_score>292</Hsp_score> <Hsp_evalue>4.55414e-156</Hsp_evalue> <Hsp_query-from>1</Hsp_query-from> <Hsp_query-to>292</Hsp_query-to> <Hsp_hit-from>1</Hsp_hit-from> <Hsp_hit-to>292</Hsp_hit-to> <Hsp_query-frame>1</Hsp_query-frame> <Hsp_hit-frame>1</Hsp_hit-frame> <Hsp_identity>292</Hsp_identity> <Hsp_positive>292</Hsp_positive> <Hsp_gaps>0</Hsp_gaps> <Hsp_align-len>292</Hsp_align-len> <Hsp_qseq> AGGCTGGCACTGTGGCTCATGCTGAAATCCCAGCACGGCGGAGGACGGCGGAAGATTGCTTGAGCCTAGGAGTTTG CGACCAGCCTGGGTGACATAGGGAGATGCCTGTCTCTACGCAAAAGAAAAAAAAAATAGCTCTGCTGGTGGTGCATG CCTATAGTCTCAGCTATCAGGAGGCTGGGACAGGAGGATCACTTGGGCCCGGGAGTTGAGGCTGTGGTGAGCC ACGATCACACCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTCAAAACAAACAAATAA </Hsp_qseq> <Hsp_hseq> AGGCTGGCACTGTGGCTCATGCTGAAATCCCAGCACGGCGGAGGACGGCGGAAGATTGCTTGAGCCTAGGA GTTTGCGACCAGCCTGGGTGACATAGGGAGATGCCTGTCTCTACGCAAAAGAAAAAAAAAATAGCTCTGCT GGTGGTGCATGCCTATAGTCTCAGCTATCAGGAGGCTGGGACAGGAGGATCACTTGGGCCCGGGAGTTGAGG CTGTGGTGAGCCACGATCACACCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTCAAAACAAAC AAATAA </Hsp_hseq> <Hsp_midline> ||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||| </Hsp_midline> </Hsp> </Hit_hsps> </Hit> ......................... ......................... ......................... </Iteration_hits> <Iteration_stat> <Statistics> <Statistics_db-num>327</Statistics_db-num> <Statistics_db-len>80506</Statistics_db-len> <Statistics_hsp-lenv16</Statistics_hsp-len> <Statistics_eff-space>21528364</Statistics_eff-space> <Statistics_kappa>0.46</Statistics_kappa> <Statistics_lambda>1.28</Statistics_lambda> <Statistics_entropy>0.85</Statistics_entropy> </Statistics> </Iteration_stat> </Iteration> </BlastOutput_iterations> </BlastOutput>
The above command can be run inside the python using the below code −
>>> from Bio.Blast.Applications import NcbiblastnCommandline >>> blastn_cline = NcbiblastnCommandline(query = "search.fasta", db = "alun", outfmt = 5, out = "results.xml") >>> stdout, stderr = blastn_cline()
Here, the first one is a handle to the blast output and second one is the possible error output generated by the blast command.
Since we have provided the output file as command line argument (out = “results.xml”) and sets the output format as XML (outfmt = 5), the output file will be saved in the current working directory.
Generally, BLAST output is parsed as XML format using the NCBIXML module. To do this, we need to import the following module −
>>> from Bio.Blast import NCBIXML
Now, open the file directly using python open method and use NCBIXML parse method as given below −
>>> E_VALUE_THRESH = 1e-20 >>> for record in NCBIXML.parse(open("results.xml")): >>> if record.alignments: >>> print("\n") >>> print("query: %s" % record.query[:100]) >>> for align in record.alignments: >>> for hsp in align.hsps: >>> if hsp.expect < E_VALUE_THRESH: >>> print("match: %s " % align.title[:100])
This will produce an output as follows −
query: gnl|alu|Z15030_HSAL001056 (Alu-J) match: gnl|alu|Z15030_HSAL001056 (Alu-J) match: gnl|alu|L12964_HSAL003860 (Alu-J) match: gnl|alu|L13042_HSAL003863 (Alu-FLA?) match: gnl|alu|M86249_HSAL001462 (Alu-FLA?) match: gnl|alu|M29484_HSAL002265 (Alu-J) query: gnl|alu|D00596_HSAL003180 (Alu-Sx) match: gnl|alu|D00596_HSAL003180 (Alu-Sx) match: gnl|alu|J03071_HSAL001860 (Alu-J) match: gnl|alu|X72409_HSAL005025 (Alu-Sx) query: gnl|alu|X55502_HSAL000745 (Alu-J) match: gnl|alu|X55502_HSAL000745 (Alu-J)