Biopython provides a module, Bio.SeqIO to read and write sequences from and to a file (any stream) respectively. It supports nearly all file formats available in bioinformatics. Most of the software provides different approach for different file formats. But, Biopython consciously follows a single approach to present the parsed sequence data to the user through its SeqRecord object.
Let us learn more about SeqRecord in the following section.
Bio.SeqRecord module provides SeqRecord to hold meta information of the sequence as well as the sequence data itself as given below −
seq − It is an actual sequence.
id − It is the primary identifier of the given sequence. The default type is string.
name − It is the Name of the sequence. The default type is string.
description − It displays human readable information about the sequence.
annotations − It is a dictionary of additional information about the sequence.
The SeqRecord can be imported as specified below
from Bio.SeqRecord import SeqRecord
Let us understand the nuances of parsing the sequence file using real sequence file in the coming sections.
This section explains about how to parse two of the most popular sequence file formats, FASTA and GenBank.
FASTA is the most basic file format for storing sequence data. Originally, FASTA is a software package for sequence alignment of DNA and protein developed during the early evolution of Bioinformatics and used mostly to search the sequence similarity.
Biopython provides an example FASTA file and it can be accessed at https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.fasta.
Download and save this file into your Biopython sample directory as ‘orchid.fasta’.
Bio.SeqIO module provides parse() method to process sequence files and can be imported as follows −
from Bio.SeqIO import parse
parse() method contains two arguments, first one is file handle and second is file format.
>>> file = open('path/to/biopython/sample/orchid.fasta') >>> for record in parse(file, "fasta"): ... print(record.id) ... gi|2765658|emb|Z78533.1|CIZ78533 gi|2765657|emb|Z78532.1|CCZ78532 .......... .......... gi|2765565|emb|Z78440.1|PPZ78440 gi|2765564|emb|Z78439.1|PBZ78439 >>>
Here, the parse() method returns an iterable object which returns SeqRecord on every iteration. Being iterable, it provides lot of sophisticated and easy methods and let us see some of the features.
next() method returns the next item available in the iterable object, which we can be used to get the first sequence as given below −
>>> first_seq_record = next(SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'),'fasta')) >>> first_seq_record.id 'gi|2765658|emb|Z78533.1|CIZ78533' >>> first_seq_record.name 'gi|2765658|emb|Z78533.1|CIZ78533' >>> first_seq_record.seq Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet()) >>> first_seq_record.description 'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA' >>> first_seq_record.annotations {} >>>
Here, seq_record.annotations is empty because the FASTA format does not support sequence annotations.
We can convert the iterable object into list using list comprehension as given below
>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'),'fasta') >>> all_seq = [seq_record for seq_record in seq_iter] >>> len(all_seq) 94 >>>
Here, we have used len method to get the total count. We can get sequence with maximum length as follows −
>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'),'fasta') >>> max_seq = max(len(seq_record.seq) for seq_record in seq_iter) >>> max_seq 789 >>>
We can filter the sequence as well using the below code −
>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'),'fasta') >>> seq_under_600 = [seq_record for seq_record in seq_iter if len(seq_record.seq) < 600] >>> for seq in seq_under_600: ... print(seq.id) ... gi|2765606|emb|Z78481.1|PIZ78481 gi|2765605|emb|Z78480.1|PGZ78480 gi|2765601|emb|Z78476.1|PGZ78476 gi|2765595|emb|Z78470.1|PPZ78470 gi|2765594|emb|Z78469.1|PHZ78469 gi|2765564|emb|Z78439.1|PBZ78439 >>>
Writing a collection of SqlRecord objects (parsed data) into file is as simple as calling the SeqIO.write method as below −
file = open("converted.fasta", "w) SeqIO.write(seq_record, file, "fasta")
This method can be effectively used to convert the format as specified below −
file = open("converted.gbk", "w) SeqIO.write(seq_record, file, "genbank")
It is a richer sequence format for genes and includes fields for various kinds of annotations. Biopython provides an example GenBank file and it can be accessed at https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.fasta.
Download and save file into your Biopython sample directory as ‘orchid.gbk’
Since, Biopython provides a single function, parse to parse all bioinformatics format. Parsing GenBank format is as simple as changing the format option in the parse method.
The code for the same has been given below −
>>> from Bio import SeqIO >>> from Bio.SeqIO import parse >>> seq_record = next(parse(open('path/to/biopython/sample/orchid.gbk'),'genbank')) >>> seq_record.id 'Z78533.1' >>> seq_record.name 'Z78533' >>> seq_record.seq Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA()) >>> seq_record.description 'C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA' >>> seq_record.annotations { 'molecule_type': 'DNA', 'topology': 'linear', 'data_file_division': 'PLN', 'date': '30-NOV-2006', 'accessions': ['Z78533'], 'sequence_version': 1, 'gi': '2765658', 'keywords': ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2'], 'source': 'Cypripedium irapeanum', 'organism': 'Cypripedium irapeanum', 'taxonomy': [ 'Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Cypripedium'], 'references': [ Reference(title = 'Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title = 'Direct Submission', ...) ] }