Biopython - BioSQL Module


Advertisements

BioSQL is a generic database schema designed mainly to store sequences and its related data for all RDBMS engine. It is designed in such a way that it holds the data from all popular bioinformatics databases like GenBank, Swissport, etc. It can be used to store in-house data as well.

BioSQL currently provides specific schema for the below databases −

  • MySQL (biosqldb-mysql.sql)
  • PostgreSQL (biosqldb-pg.sql)
  • Oracle (biosqldb-ora/*.sql)
  • SQLite (biosqldb-sqlite.sql)

It also provides minimal support for Java based HSQLDB and Derby databases.

BioPython provides very simple, easy and advanced ORM capabilities to work with BioSQL based database. BioPython provides a module, BioSQL to do the following functionality −

  • Create/remove a BioSQL database
  • Connect to a BioSQL database
  • Parse a sequence database like GenBank, Swisport, BLAST result, Entrez result, etc., and directly load it into the BioSQL database
  • Fetch the sequence data from the BioSQL database
  • Fetch taxonomy data from NCBI BLAST and store it in the BioSQL database
  • Run any SQL query against the BioSQL database

Overview of BioSQL Database Schema

Before going deep into the BioSQL, let us understand the basics of BioSQL schema. BioSQL schema provides 25+ tables to hold sequence data, sequence feature, sequence category/ontology and taxonomy information. Some of the important tables are as follows −

  • biodatabase
  • bioentry
  • biosequence
  • seqfeature
  • taxon
  • taxon_name
  • antology
  • term
  • dxref

Creating a BioSQL Database

In this section, let us create a sample BioSQL database, biosql using the schema provided by the BioSQL team. We shall work with SQLite database as it is really easy to get started and does not have complex setup.

Here, we shall create a SQLite based BioSQL database using the below steps.

Step 1 − Download the SQLite databse engine and install it.

Step 2 − Download the BioSQL project from the GitHub URL. https://github.com/biosql/biosql

Step 3 − Open a console and create a directory using mkdir and enter into it.

cd /path/to/your/biopython/sample 
mkdir sqlite-biosql 
cd sqlite-biosql

Step 4 − Run the below command to create a new SQLite database.

> sqlite3.exe mybiosql.db 
SQLite version 3.25.2 2018-09-25 19:08:10 
Enter ".help" for usage hints. 
sqlite>

Step 5 − Copy the biosqldb-sqlite.sql file from the BioSQL project (/sql/biosqldb-sqlite.sql`) and store it in the current directory.

Step 6 − Run the below command to create all the tables.

sqlite> .read biosqldb-sqlite.sql

Now, all tables are created in our new database.

Step 7 − Run the below command to see all the new tables in our database.

sqlite> .headers on 
sqlite> .mode column 
sqlite> .separator ROW "\n" 
sqlite> SELECT name FROM sqlite_master WHERE type = 'table'; 
biodatabase 
taxon 
taxon_name 
ontology 
term 
term_synonym 
term_dbxref 
term_relationship 
term_relationship_term 
term_path
bioentry 
bioentry_relationship 
bioentry_path 
biosequence 
dbxref 
dbxref_qualifier_value 
bioentry_dbxref 
reference 
bioentry_reference 
comment 
bioentry_qualifier_value 
seqfeature 
seqfeature_relationship 
seqfeature_path 
seqfeature_qualifier_value 
seqfeature_dbxref 
location 
location_qualifier_value 
sqlite>

The first three commands are configuration commands to configure SQLite to show the result in a formatted manner.

Step 8 − Copy the sample GenBank file, ls_orchid.gbk provided by BioPython team https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk into the current directory and save it as orchid.gbk.

Step 9 − Create a python script, load_orchid.py using the below code and execute it.

from Bio import SeqIO 
from BioSQL import BioSeqDatabase 
import os 

server = BioSeqDatabase.open_database(driver = 'sqlite3', db = "orchid.db") 

db = server.new_database("orchid") 
count = db.load(SeqIO.parse("orchid.gbk", "gb"), True) server.commit() 
server.close()

The above code parses the record in the file and converts it into python objects and inserts it into BioSQL database. We will analyze the code in later section.

Finally, we created a new BioSQL database and load some sample data into it. We shall discuss the important tables in the next chapter.

Simple ER Diagram

biodatabase table is in the top of the hierarchy and its main purpose is to organize a set of sequence data into a single group/virtual database. Every entry in the biodatabase refers to a separate database and it does not mingle with another database. All the related tables in the BioSQL database have references to biodatabase entry.

bioentry table holds all the details about a sequence except the sequence data. sequence data of a particular bioentry will be stored in biosequence table.

taxon and taxon_name are taxonomy details and every entry refers this table to specify its taxon information.

Simple ER Diagram

After understanding the schema, let us look into some queries in the next section.

BioSQL Queries

Let us delve into some SQL queries to better understand how the data are organized and the tables are related to each other. Before proceeding, let us open the database using the below command and set some formatting commands −

> sqlite3 orchid.db 
SQLite version 3.25.2 2018-09-25 19:08:10 
Enter ".help" for usage hints. 
sqlite> .header on 
sqlite> .mode columns

.header and .mode are formatting options to better visualize the data. You can also use any SQLite editor to run the query.

List the virtual sequence database available in the system as given below −

select 
   * 
from 
   biodatabase;
*** Result ***
sqlite> .width 15 15 15 15 
sqlite> select * from biodatabase; 
biodatabase_id       name        authority       description    
---------------  --------------- --------------- --------------- 
1                   orchid 
sqlite>

Here, we have only one database, orchid.

List the entries (top 3) available in the database orchid with the below given code

select 
   be.*, 
   bd.name 
from 
   bioentry be 
   inner join 
      biodatabase bd 
      on bd.biodatabase_id = be.biodatabase_id 
where 
   bd.name = 'orchid' Limit 1, 
   3;
*** Result ***
sqlite> .width 15 15 10 10 10 10 10 50 10 10 
sqlite> select be.*, bd.name from bioentry be inner join biodatabase bd on 
bd.biodatabase_id = be.biodatabase_id where bd.name = 'orchid' Limit 1,3; 
bioentry_id biodatabase_id taxon_id name accession identifier division description version name 
--------------- --------------- ---------- ---------- ---------- ---------- ---------- 
---------- ---------- ----------- ---------- --------- ---------- ---------- 
2                   1               19       Z78532     Z78532    2765657     PLN 
C.californicum  5.8S rRNA  gene    and      ITS1    and   ITS2 DN  1 
orchid 
3         1         20          Z78531          Z78531         2765656        PLN
C.fasciculatum  5.8S rRNA  gene    and      ITS1    and   ITS2 DN  1 
orchid 
4         1         21          Z78530          Z78530         2765655        PLN 
C.margaritaceum 5.8S rRNA  gene    and      ITS1    and   ITS2  D  1 
orchid 
sqlite>

List the sequence details associated with an entry (accession − Z78530, name − C. fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA) with the given code −

select 
   substr(cast(bs.seq as varchar), 0, 10) || '...' as seq, 
   bs.length, 
   be.accession, 
   be.description, 
   bd.name 
from 
   biosequence bs 
   inner join 
      bioentry be 
      on be.bioentry_id = bs.bioentry_id 
   inner join 
      biodatabase bd 
      on bd.biodatabase_id = be.biodatabase_id 
where 
   bd.name = 'orchid' 
   and be.accession = 'Z78532';
*** Result ***

sqlite> .width 15 5 10 50 10 
sqlite> select substr(cast(bs.seq as varchar), 0, 10) || '...' as seq, 
bs.length, be.accession, be.description, bd.name from biosequence bs inner 
join bioentry be on be.bioentry_id = bs.bioentry_id inner join biodatabase bd 
on bd.biodatabase_id = be.biodatabase_id where bd.name = 'orchid' and 
be.accession = 'Z78532'; 
seq           length    accession   description  name 
------------ ---------- ---------- ------------ ------------ ---------- ---------- ----------------- 
CGTAACAAG...    753    Z78532    C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA orchid 
sqlite>

Get the complete sequence associated with an entry (accession − Z78530, name − C. fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA) using the below code −

select 
   bs.seq 
from 
   biosequence bs 
   inner join 
      bioentry be 
      on be.bioentry_id = bs.bioentry_id 
   inner join 
      biodatabase bd 
      on bd.biodatabase_id = be.biodatabase_id 
where 
   bd.name = 'orchid' 
   and be.accession = 'Z78532';
*** Result ***

sqlite> .width 1000 
sqlite> select bs.seq from biosequence bs inner join bioentry be on 
be.bioentry_id = bs.bioentry_id inner join biodatabase bd on bd.biodatabase_id = 
be.biodatabase_id where bd.name = 'orchid' and be.accession = 'Z78532'; 
seq 
----------------------------------------------------------------------------------------
----------------------------
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTGAATCT
GGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGGGCCTCC
TCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTACGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGT
CAGAAAAAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGGATATCTTGGCTC
TAACATCGATGAAGAACGCAG 
sqlite>

List taxon associated with bio database, orchid

select distinct 
   tn.name 
from 
   biodatabase d 
   inner join 
      bioentry e 
      on e.biodatabase_id = d.biodatabase_id 
   inner join 
      taxon t 
      on t.taxon_id = e.taxon_id 
   inner join 
      taxon_name tn 
      on tn.taxon_id = t.taxon_id 
where 
   d.name = 'orchid' limit 10;
*** Result ***

sqlite> select distinct tn.name from biodatabase d inner join bioentry e on 
e.biodatabase_id = d.biodatabase_id inner join taxon t on t.taxon_id = 
e.taxon_id inner join taxon_name tn on tn.taxon_id = t.taxon_id where d.name = 
'orchid' limit 10; 
name 
------------------------------ 
Cypripedium irapeanum 
Cypripedium californicum 
Cypripedium fasciculatum 
Cypripedium margaritaceum 
Cypripedium lichiangense 
Cypripedium yatabeanum 
Cypripedium guttatum 
Cypripedium acaule 
pink lady's slipper 
Cypripedium formosanum 
sqlite>

Load Data into BioSQL Database

Let us learn how to load sequence data into the BioSQL database in this chapter. We already have the code to load data into the database in previous section and the code is as follows −

from Bio import SeqIO 
from BioSQL import BioSeqDatabase 
import os 

server = BioSeqDatabase.open_database(driver = 'sqlite3', db = "orchid.db") 
DBSCHEMA = "biosqldb-sqlite.sql" 
SQL_FILE = os.path.join(os.getcwd(), DBSCHEMA) 

server.load_database_sql(SQL_FILE) 
server.commit() 

db = server.new_database("orchid") 
count = db.load(SeqIO.parse("orchid.gbk", "gb"), True) server.commit() 
server.close()

We will have a deeper look at every line of the code and its purpose −

Line 1 − Loads the SeqIO module.

Line 2 − Loads the BioSeqDatabase module. This module provides all the functionality to interact with BioSQL database.

Line 3 − Loads os module.

Line 5 − open_database opens the specified database (db) with the configured driver (driver) and returns a handle to the BioSQL database (server). Biopython supports sqlite, mysql, postgresql and oracle databases.

Line 6-10 − load_database_sql method loads the sql from the external file and executes it. commit method commits the transaction. We can skip this step because we already created the database with schema.

Line 12 − new_database methods creates new virtual database, orchid and returns a handle db to execute the command against the orchid database.

Line 13 − load method loads the sequence entries (iterable SeqRecord) into the orchid database. SqlIO.parse parses the GenBank database and returns all the sequences in it as iterable SeqRecord. Second parameter (True) of the load method instructs it to fetch the taxonomy details of the sequence data from NCBI blast website, if it is not already available in the system.

Line 14 − commit commits the transaction.

Line 15 − close closes the database connection and destroys the server handle.

Fetch the Sequence Data

Let us fetch a sequence with identifier, 2765658 from the orchid database as below −

from BioSQL import BioSeqDatabase 

server = BioSeqDatabase.open_database(driver = 'sqlite3', db = "orchid.db") 

db = server["orchid"] 
seq_record = db.lookup(gi = 2765658) 
print(seq_record.id, seq_record.description[:50] + "...") 
print("Sequence length %i," % len(seq_record.seq))

Here, server["orchid"] returns the handle to fetch data from virtual databaseorchid. lookup method provides an option to select sequences based on criteria and we have selected the sequence with identifier, 2765658. lookup returns the sequence information as SeqRecordobject. Since, we already know how to work with SeqRecord`, it is easy to get data from it.

Remove a Database

Removing a database is as simple as calling remove_database method with proper database name and then committing it as specified below −

from BioSQL import BioSeqDatabase 
server = BioSeqDatabase.open_database(driver = 'sqlite3', db = "orchid.db") 
server.remove_database("orchids") 
server.commit()
Advertisements