SCPS: an open-source spectral clustering tool for protein sequences

Clustering protein sequences based on their evolutionary relationships is important for sequence annotation, as structural and functional relationships can potentially be inferred from the clusters. Paccanaro et al. (2006) mapped this problem onto that of clustering the nodes of a weighted undirected graph in which each node corresponds to a protein sequence and the weight of each edge corresponds to a measure of distance between the two sequences it connects. SCPS is an improved, efficient, open-source implementation of this method.
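To make the graph formulation more concrete, here is a minimal sketch of the simplest spectral approach: splitting a weighted graph into two clusters using the second eigenvector of its normalized weight matrix. It is not SCPS code and not necessarily the exact procedure of the paper. The function name, the power-iteration solver and the toy matrix are assumptions made for illustration, and the weights are assumed to have already been converted into similarities (larger means more closely related).

```python
import math
import random


def spectral_bipartition(weights, iterations=500, seed=0):
    """Split the nodes of a weighted undirected graph into two clusters.

    `weights` is a symmetric matrix (list of lists) of non-negative
    similarities. The graph needs at least two nodes, and every node
    needs at least one positive weight. Hypothetical helper, not SCPS.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]

    # The leading eigenvector of M = D^-1/2 W D^-1/2 is proportional to
    # sqrt(degree) and carries no cluster information.
    norm = math.sqrt(sum(degree))
    top = [math.sqrt(d) / norm for d in degree]

    rng = random.Random(seed)
    vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]

    for _ in range(iterations):
        # Remove the component along the trivial leading eigenvector.
        proj = sum(v * t for v, t in zip(vec, top))
        vec = [v - proj * t for v, t in zip(vec, top)]
        # Multiply by (M + I) / 2, whose eigenvalues all lie in [0, 1],
        # so the iteration converges to the second eigenvector of M.
        new = []
        for i in range(n):
            mv = sum(inv_sqrt[i] * weights[i][j] * inv_sqrt[j] * vec[j]
                     for j in range(n))
            new.append(0.5 * (mv + vec[i]))
        length = math.sqrt(sum(v * v for v in new))
        vec = [v / length for v in new]

    # The sign of each entry of the second eigenvector picks the cluster.
    return [0 if v < 0 else 1 for v in vec]


# Toy similarity matrix: two tight groups of three "sequences" each,
# with only weak similarity between the groups.
similarity = [
    [0.0, 1.0, 1.0, 0.01, 0.01, 0.01],
    [1.0, 0.0, 1.0, 0.01, 0.01, 0.01],
    [1.0, 1.0, 0.0, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.0, 1.0, 1.0],
    [0.01, 0.01, 0.01, 1.0, 0.0, 1.0],
    [0.01, 0.01, 0.01, 1.0, 1.0, 0.0],
]
print(spectral_bipartition(similarity))
```

Which group receives label 0 and which receives label 1 depends on the random starting vector; only the grouping itself is meaningful. A real tool would use a linear algebra library for the eigenvectors and would handle more than two clusters.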

Download the software »

Gene Ontology parser in Python

The title says it all: this is a parser for the OBO ontology format that is used to maintain the Gene Ontology. It is far from being a complete OBO parser, but it can parse the Gene Ontology OBO file and make its contents readily available in Python scripts.
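For readers who have not seen the OBO format, the sketch below shows the general idea of reading it: the file is a sequence of stanzas such as [Term], each made of "tag: value" lines. This is not the code of the parser described above. The function name and the embedded sample are assumptions for illustration, and the sketch ignores escapes, quoted strings and trailing modifiers.

```python
def parse_obo_terms(lines):
    """Yield one dict per [Term] stanza, mapping each tag to a list of values.

    Hypothetical helper for illustration only; it understands just enough
    of the format to read simple tag-value pairs.
    """
    term = None  # tags of the [Term] stanza being read, or None outside one
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # blank line or whole-line comment
        if line.startswith("[") and line.endswith("]"):
            if term is not None:
                yield term
            # Only [Term] stanzas are collected; others are skipped.
            term = {} if line == "[Term]" else None
            continue
        if term is None:
            continue  # header line, or inside a stanza type we skip
        tag, _, value = line.partition(":")
        # Drop a trailing "! comment", such as the parent name after an
        # is_a identifier. This would also cut a quoted string that
        # happens to contain " ! ".
        value = value.split(" ! ")[0].strip()
        term.setdefault(tag.strip(), []).append(value)
    if term is not None:
        yield term


# A small hand-written sample in the style of the Gene Ontology OBO file.
SAMPLE = """\
format-version: 1.2

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Typedef]
id: part_of
name: part of
"""

for term in parse_obo_terms(SAMPLE.splitlines()):
    print(term["id"][0], term["name"][0], term["is_a"])
```

Every tag maps to a list because tags such as is_a may occur several times within one stanza.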

Check out the source code in Launchpad »

GFam — automatic annotation of gene families

A joint project with Rajkumar Sasidharan that aims to develop a bioinformatics tool for the large-scale automatic functional classification and annotation of genes in whole genomes, based on the domain architecture of genes and Gene Ontology annotations. We have successfully applied our technique to annotate the genomes of A. thaliana and A. lyrata.

Check out the source code on GitHub »
Read the documentation »

Network analysis in Python with igraph

My long-term plan is to write proper documentation for igraph's Python interface that follows the style of a textbook rather than the reference manual we have now. The problem with the reference manual is obvious: you won't know where to start reading it if you are just getting started with igraph. The tutorial, which will be one of the introductory chapters of this upcoming documentation, may be of some help to you if you find the reference manual somewhat useless.

Read the tutorial »

Power-law fitting to empirical data

This is a plain C implementation of the method for fitting power-law distributions to empirical data described in Clauset et al. (2009). The program reads samples from either the standard input or files on disk and tries to fit a (continuous or discrete) power-law distribution to them. The result includes the exponent and the lower cutoff of the fitted distribution (the power-law behaviour is assumed to hold only for samples larger than a given threshold, and this threshold is also determined automatically), the log-likelihood of the sample under the fitted distribution, and the p-value of the corresponding Kolmogorov-Smirnov test. Small p-values mean that the data is unlikely to have come from a power-law distribution with the fitted (or any other) parameter set. See the following paper for a concise description of the method:

  1. Clauset A., Shalizi C.R., Newman M.E.J.: Power-law distributions in empirical data. SIAM Review 51(4):661–703, 2009.
    Preprint available: arXiv:0706.1062.
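The core of the fitting procedure for the continuous case can be sketched in a few lines of Python. This is an illustration of the idea from the paper, not a port of the C program. The function names are hypothetical, the discrete case and the p-value computation are left out, and the search over cutoffs is deliberately simple rather than fast.

```python
import math
import random


def fit_alpha(samples, xmin):
    """Maximum-likelihood exponent for the samples at or above xmin."""
    tail = [x for x in samples if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    return 1.0 + len(tail) / log_sum


def ks_distance(samples, xmin, alpha):
    """Largest gap between the empirical and the fitted CDF of the tail."""
    tail = sorted(x for x in samples if x >= xmin)
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


def fit_power_law(samples, min_tail=10):
    """Return (alpha, xmin, ks) for the cutoff with the smallest KS distance.

    Every distinct sample value except the largest is tried as a cutoff,
    as long as at least `min_tail` samples remain in the tail. Returns
    None if no cutoff qualifies. Samples are assumed to be positive.
    """
    best = None
    for xmin in sorted(set(samples))[:-1]:
        tail_size = sum(1 for x in samples if x >= xmin)
        if tail_size < min_tail:
            break  # larger cutoffs leave even fewer samples
        alpha = fit_alpha(samples, xmin)
        ks = ks_distance(samples, xmin, alpha)
        if best is None or ks < best[2]:
            best = (alpha, xmin, ks)
    return best


# Synthetic test data: a continuous power law with exponent 2.5 and
# lower cutoff 1.0, generated by inverse transform sampling.
rng = random.Random(1)
data = [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(1000)]
alpha, xmin, ks = fit_power_law(data)
print("alpha = %.3f, xmin = %.3f, KS distance = %.3f" % (alpha, xmin, ks))
```

The KS distance computed here is only the test statistic. Turning it into a p-value requires the resampling procedure described in the paper, which this sketch omits.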

Check out the source code on GitHub »