An adaptive semantic similarity measure for Gene Ontology annotations

The Gene Ontology

The Gene Ontology is a vast controlled vocabulary (43594 terms as of August 2015) of species-neutral attributes used to describe the properties of genes and gene products. These attributes are connected by various binary relationships. The Gene Ontology can be subdivided into three domains:

  1. Biological Process — multi-step biological events occurring in an organism, with a definite beginning and end.

  2. Molecular Function — elemental, molecular-level events.

  3. Cellular Component — parts of a cell.

Similarity Measures

There exists a rich literature on similarity measures over the Gene Ontology (see pdf file for details). In brief, these measures estimate how similar two proteins or genes are, given their Gene Ontology annotations. This is useful, among other things, for predicting interactions between proteins.
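To make the idea concrete, here is a minimal sketch of a very simple annotation-based similarity: the Jaccard overlap of two annotation sets after each set is extended with all ancestor terms. It is only an illustrative baseline, not the measure proposed in this work and not any specific measure from the literature. The term names, the `PARENTS` table and the function names are hypothetical, and the tiny graph is a toy example rather than real Gene Ontology data.

```python
from typing import Dict, Set

# Toy ontology: each term maps to its direct parents.
# Hypothetical terms, NOT real GO identifiers.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A"},
    "D": {"A", "B"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the term itself plus every term reachable via parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def closure(annotations: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Extend an annotation set with the ancestors of all its terms."""
    result: Set[str] = set()
    for term in annotations:
        result |= ancestors(term, parents)
    return result


def jaccard_similarity(
    ann1: Set[str], ann2: Set[str], parents: Dict[str, Set[str]]
) -> float:
    """Jaccard index of the two ancestor-closed annotation sets."""
    s1 = closure(ann1, parents)
    s2 = closure(ann2, parents)
    union = s1 | s2
    if not union:
        return 0.0  # two unannotated proteins: define similarity as 0
    return len(s1 & s2) / len(union)


# Two proteins annotated with sibling-like terms share the ancestors A and root.
score = jaccard_similarity({"C"}, {"D"}, PARENTS)
```

In this toy case the two closed sets share 2 of 5 terms, so the score is 0.4. A measure like this weights every term equally, whereas an adaptive measure assigns different importance to different terms.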

In this work, I introduce and implement a novel adaptive similarity measure that outperforms common similarity measures in predicting interactions between proteins. The similarity measure has been tested on four species (S. cerevisiae, E. coli, M. musculus, H. sapiens). Furthermore, comparing the relative importance that the measure assigns to different GO terms in different species highlights similarities and differences between these species. For example, compare the relative importance of GO:0005829 (cytosol) for E. coli and for the other species in the following figure, and consider that in prokaryotic organisms (like E. coli) most biological processes occur directly in the cytosol:

Details and code

For more information about the definition of my similarity measure and how it compares with other measures, see the pdf file.

You can also download the code from GitHub and give it a try yourself (read the README first!).