Cluster analysis is the task of grouping a set of objects so that objects in the same group (cluster) are more similar to each other than to objects in other groups. It is widely used in data mining, statistical data analysis, machine learning, pattern recognition, image analysis and bioinformatics, and it can be carried out with a variety of algorithms.
In bioinformatics, cluster analysis is mainly used in gene expression data analysis to find groups of genes with similar expression profiles.
In this chapter, we will look at the important clustering algorithms in Biopython and use small example datasets to understand the fundamentals of clustering.
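Biopython implements all of these algorithms in the Bio.Cluster module. It supports hierarchical clustering, k-clustering (k-means, k-medians and k-medoids), self-organizing maps and principal component analysis.
Let us have a brief introduction to each of these algorithms.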
Hierarchical clustering links each node to its nearest neighbor according to a distance measure, building a tree of clusters step by step. A Bio.Cluster Node has three attributes: left, right and distance. Let us create a simple node as shown below:
>>> from Bio.Cluster import Node
>>> n = Node(1, 10)
>>> n.left = 11
>>> n.right = 0
>>> n.distance = 1
>>> print(n)
(11, 0): 1
To combine several nodes into a tree, pass a list of nodes to the Tree class:
>>> from Bio.Cluster import Node, Tree
>>> n1 = [Node(1, 2, 0.2), Node(0, -1, 0.5)]
>>> n1_tree = Tree(n1)
>>> print(n1_tree)
(1, 2): 0.2
(0, -1): 0.5
>>> print(n1_tree[0])
(1, 2): 0.2
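The numbering in the tree can be hard to read at first. The following sketch uses plain tuples instead of Bio.Cluster objects and assumes the convention described in the Biopython documentation: a non-negative number is an item, and a negative number -k refers to the node stored at position k - 1. The helper name members is made up for this illustration.

```python
# Each node is (left, right, distance), mirroring the Node attributes above.
# Assumed convention: an index >= 0 is an original item; an index -k refers
# to the node stored at position k - 1 in the list.
nodes = [(1, 2, 0.2), (0, -1, 0.5)]


def members(index, nodes):
    """Return the sorted item indices found at or below `index`."""
    if index >= 0:
        return [index]
    left, right, _distance = nodes[-index - 1]
    return sorted(members(left, nodes) + members(right, nodes))


for position, (left, right, distance) in enumerate(nodes):
    items = sorted(members(left, nodes) + members(right, nodes))
    print(f"node {position}: items {items} joined at distance {distance}")
# node 0: items [1, 2] joined at distance 0.2
# node 1: items [0, 1, 2] joined at distance 0.5
```

Read this way, the second node (0, -1) joins item 0 with the cluster formed by the first node.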
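Let us perform hierarchical clustering using the Bio.Cluster module.
Consider the array below. Although it is named distance, treecluster treats an array passed as its first argument as a data matrix with one item per row and computes the pairwise distances itself. A precomputed distance matrix would be passed through the distancematrix keyword instead.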
>>> from numpy import array
>>> distance = array([[1, 2, 3], [4, 5, 6], [3, 5, 7]])
Now pass the array to the treecluster function.
>>> from Bio.Cluster import treecluster
>>> cluster = treecluster(distance)
>>> print(cluster)
(2, 1): 0.666667
(-1, 0): 9.66667
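The above function returns a Tree object. The tree contains one node fewer than the number of items clustered (rows by default, or columns if transpose is set), so three rows give two nodes. In each node, a non-negative number refers to an item and a negative number refers to an earlier node: (-1, 0) joins the first node with item 0.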
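The two distances printed by treecluster can be reproduced by hand. The sketch below is plain Python, not Bio.Cluster code, and it assumes the defaults are the Euclidean measure (which Bio.Cluster defines as the mean squared difference) and pairwise maximum linkage. The printed values are consistent with that assumption.

```python
from itertools import combinations

data = [[1, 2, 3], [4, 5, 6], [3, 5, 7]]


def mean_squared_distance(a, b):
    """Mean of the squared coordinate differences between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Distance for every pair of rows, keyed by (smaller index, larger index).
pairwise = {
    (i, j): mean_squared_distance(data[i], data[j])
    for i, j in combinations(range(len(data)), 2)
}

# Step 1: the closest pair of items is merged first.
first_pair = min(pairwise, key=pairwise.get)
first_distance = pairwise[first_pair]

# Step 2: with maximum linkage, the distance from the remaining item to the
# merged pair is the larger of its distances to the two members.
remaining = ({0, 1, 2} - set(first_pair)).pop()
second_distance = max(
    pairwise[tuple(sorted((remaining, member)))] for member in first_pair
)

print(first_pair, round(first_distance, 6))  # (1, 2) 0.666667
print(remaining, round(second_distance, 5))  # 0 9.66667
```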
K-clustering is a type of partitioning algorithm and is classified into k-means, k-medians and k-medoids clustering. Let us look at each of them in brief.
K-means clustering is popular in data mining. The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K.
The algorithm works iteratively to assign each data point to one of the K groups based on the features that are provided. Data points are clustered based on feature similarity.
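>>> from Bio.Cluster import kcluster
>>> from numpy import array
>>> data = array([[1, 2], [3, 4], [5, 6]])
>>> clusterid, error, found = kcluster(data)
>>> print(clusterid)
[0 0 1]
>>> print(found)
1
Here, clusterid gives the cluster assigned to each row, error is the within-cluster sum of distances, and found is the number of times the best solution was found. Because kcluster starts from a random initial assignment, the cluster labels may differ from run to run.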
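The iterative assignment described above can be sketched in a few lines of plain Python. This is a minimal illustration of the general k-means idea, not the Bio.Cluster implementation: it starts from centroids supplied by the caller (a choice made here to keep the result deterministic), and the function names are made up for this example.

```python
def squared_distance(a, b):
    """Sum of squared coordinate differences between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, start_centroids, max_iterations=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = [list(c) for c in start_centroids]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for c in range(len(centroids)):
            assigned = [p for p, label in zip(points, labels) if label == c]
            if assigned:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(assigned) for col in zip(*assigned)]
    return labels, centroids


data = [[1, 2], [3, 4], [5, 6]]
labels, centroids = kmeans(data, start_centroids=[[1, 2], [5, 6]])
print(labels)     # [0, 0, 1]
print(centroids)  # [[2.0, 3.0], [5.0, 6.0]]
```

With these starting centroids the sketch arrives at the same grouping as the kcluster output shown above.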
K-medians clustering is another type of clustering algorithm, which calculates the median (rather than the mean) of each cluster to determine its centroid.
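The practical difference between the two centroid definitions shows up when a cluster contains an outlier. The sketch below uses a hypothetical three-point cluster and the standard library statistics module; it does not call Bio.Cluster.

```python
from statistics import mean, median

# Hypothetical cluster: the third point has an outlying first coordinate.
cluster = [[1, 2], [3, 4], [50, 6]]

# zip(*cluster) iterates over columns, so each centroid is computed per feature.
mean_centroid = [mean(column) for column in zip(*cluster)]
median_centroid = [median(column) for column in zip(*cluster)]

print(mean_centroid)    # [18, 4]
print(median_centroid)  # [3, 4]
```

The mean is pulled toward the outlier, while the median stays with the bulk of the points.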
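K-medoids clustering works on a given set of items, using a distance matrix and the number of clusters passed by the user. Each cluster is represented by one of its own items, the medoid.
Consider the distance matrix defined below. When a 2D array is passed, only its lower-left triangle is used: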
>>> distance = array([[1,2,3],[4,5,6],[3,5,7]])
We can run k-medoids clustering with the following command:
>>> from Bio.Cluster import kmedoids
>>> clusterid, error, found = kmedoids(distance)
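To see what a k-medoids solution looks like, the following sketch exhaustively tries every pair of medoids on a small, hypothetical symmetric distance matrix. This brute-force search is only practical for tiny inputs and is not how Bio.Cluster searches for a solution. Labelling each item with the index of its medoid follows the convention described in the Biopython documentation.

```python
from itertools import combinations

# Hypothetical symmetric distance matrix for four items.
distance = [
    [0, 1, 6, 7],
    [1, 0, 5, 6],
    [6, 5, 0, 2],
    [7, 6, 2, 0],
]
items = range(len(distance))


def total_cost(medoids):
    """Sum over all items of the distance to the nearest medoid."""
    return sum(min(distance[i][m] for m in medoids) for i in items)


# Try every pair of items as medoids and keep the cheapest pair.
best = min(combinations(items, 2), key=total_cost)

# Label each item with the index of its nearest medoid.
clusterid = [min(best, key=lambda m: distance[i][m]) for i in items]

print(best, total_cost(best))  # (0, 2) 3
print(clusterid)               # [0, 0, 2, 2]
```

Several medoid pairs tie at the same cost here, so the sketch simply keeps the first one it encounters.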
Let us consider an example.
The kcluster function takes a data matrix as input and not Seq instances. You need to convert your sequences to a matrix and provide that to the kcluster function.
One way of converting the sequences to a matrix containing numerical elements only is to apply the numpy.frombuffer function to each ASCII-encoded sequence. (The older numpy.fromstring call is deprecated for this use.) It translates each letter in a sequence to its ASCII code. All sequences must have the same length.
This creates a 2D array of encoded sequences that the kcluster function accepts and uses to cluster your sequences.
>>> from Bio.Cluster import kcluster
>>> import numpy as np
>>> sequence = ['AGCT', 'CGTA', 'AAGT', 'TCCG']
>>> matrix = np.asarray([np.frombuffer(s.encode('ascii'), dtype=np.uint8) for s in sequence])
>>> clusterid, error, found = kcluster(matrix)
>>> print(clusterid)
[1 0 0 1]
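To make the encoding step concrete, the sketch below performs the same letter-to-ASCII translation using only built-in Python, so the resulting numbers can be inspected directly. It is an illustration of the encoding only and does not call kcluster.

```python
sequences = ["AGCT", "CGTA", "AAGT", "TCCG"]

# Encoding a str to bytes and iterating over it yields one integer per letter.
matrix = [list(s.encode("ascii")) for s in sequences]

for s, row in zip(sequences, matrix):
    print(s, row)
# AGCT [65, 71, 67, 84]
# CGTA [67, 71, 84, 65]
# AAGT [65, 65, 71, 84]
# TCCG [84, 67, 67, 71]
```

Note that the numeric gaps between the codes (for example, 2 between A and C but 17 between C and T) are an artifact of the ASCII table, so distances computed on this matrix carry no biological meaning by themselves.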
A self-organizing map (SOM) is a type of artificial neural network. It was developed by Kohonen and is often called a Kohonen map. It organizes items into clusters arranged in a rectangular topology.
Let us create a simple cluster using the same data array as before:
>>> from Bio.Cluster import somcluster
>>> from numpy import array
>>> data = array([[1, 2], [3, 4], [5, 6]])
>>> clusterid, map = somcluster(data)
>>> print(map)
[[[-1.36032469 0.38667395]]
 [[-0.41170578 1.35295911]]]
>>> print(clusterid)
[[1 0]
 [1 0]
 [1 0]]
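Here, clusterid is an array with two columns, where the number of rows is equal to the number of items that were clustered; each row holds the x and y coordinates of the grid cell to which the item was assigned. The second array, map, holds the data vector of each grid cell. Its last dimension equals the number of columns of the data when rows are clustered, or the number of rows when columns are clustered.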
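Since each row of clusterid is a grid coordinate, items can be grouped by the cell they landed in. The sketch below copies the clusterid values from the output above into a plain list and uses only the standard library.

```python
from collections import defaultdict

# Values copied from the somcluster output shown above.
clusterid = [[1, 0], [1, 0], [1, 0]]

# Map each (x, y) grid cell to the list of items assigned to it.
cells = defaultdict(list)
for item, (x, y) in enumerate(clusterid):
    cells[(x, y)].append(item)

print(dict(cells))  # {(1, 0): [0, 1, 2]}
```

In this run all three items fall into the same cell; a different random start may distribute them differently.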
Principal Component Analysis is useful to visualize high-dimensional data. It is a method that uses simple matrix operations from linear algebra and statistics to calculate a projection of the original data into the same number or fewer dimensions.
The pca function in Bio.Cluster returns a tuple of four values: columnmean, coordinates, components and eigenvalues. Before using it, let us work through the basics of the method with NumPy.
>>> from numpy import array
>>> from numpy import mean
>>> from numpy import cov
>>> from numpy.linalg import eig
# define a matrix
>>> A = array([[1, 2], [3, 4], [5, 6]])
>>> print(A)
[[1 2]
 [3 4]
 [5 6]]
# calculate the mean of each column
>>> M = mean(A.T, axis=1)
>>> print(M)
[ 3. 4.]
# center columns by subtracting column means
>>> C = A - M
>>> print(C)
[[-2. -2.]
 [ 0. 0.]
 [ 2. 2.]]
# calculate covariance matrix of centered matrix
>>> V = cov(C.T)
>>> print(V)
[[ 4. 4.]
 [ 4. 4.]]
# eigendecomposition of covariance matrix
>>> values, vectors = eig(V)
>>> print(vectors)
[[ 0.70710678 -0.70710678]
 [ 0.70710678 0.70710678]]
>>> print(values)
[ 8. 0.]
Let us apply the pca function of the Bio.Cluster module to the same rectangular data matrix:
>>> from Bio.Cluster import pca
>>> from numpy import array
>>> data = array([[1, 2], [3, 4], [5, 6]])
>>> columnmean, coordinates, components, eigenvalues = pca(data)
>>> print(columnmean)
[ 3. 4.]
>>> print(coordinates)
[[-2.82842712 0. ]
 [ 0. 0. ]
 [ 2.82842712 0. ]]
>>> print(components)
[[ 0.70710678 0.70710678]
 [ 0.70710678 -0.70710678]]
>>> print(eigenvalues)
[ 4. 0.]
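The eigenvalues printed by pca ([4, 0]) differ from those returned by eig on the covariance matrix ([8, 0]). The sketch below, in plain Python, shows one way the two sets of numbers can be reconciled: projecting the centered data onto the first principal axis gives the printed coordinates, the square root of their sum of squares gives 4, and dividing the sum of squares by n - 1 gives 8. This is an inference from the printed values, not a description of Bio.Cluster internals.

```python
import math

data = [[1, 2], [3, 4], [5, 6]]
n = len(data)

# Center each column on its mean.
columnmean = [sum(column) / n for column in zip(*data)]
centered = [[x - m for x, m in zip(row, columnmean)] for row in data]

# First principal axis, taken from the components printed above.
component = [1 / math.sqrt(2), 1 / math.sqrt(2)]

# Project every centered row onto that axis.
coordinates = [sum(x * c for x, c in zip(row, component)) for row in centered]

# Length of the projected data, and the sample variance along the axis.
scale = math.sqrt(sum(t * t for t in coordinates))
variance = scale ** 2 / (n - 1)

print([round(t, 8) for t in coordinates])  # [-2.82842712, 0.0, 2.82842712]
print(round(scale, 8))                     # 4.0
print(round(variance, 8))                  # 8.0
```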