Impact of the Continuous Evolution of Gene Ontology on Similarity Measures

Paul, Madhusudan; Anand, Ashish; Pyne, Saptarshi

doi:10.1007/978-3-030-34872-4_14

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11942)

Included in the following conference series:

International Conference on Pattern Recognition and Machine Intelligence

Abstract

Gene Ontology (GO) is a taxonomy of biological terms related to the properties of genes and gene products. It can be used to define a similarity measure between two gene products and to assign a confidence score to protein-protein interactions (PPIs). GO evolves regularly through the addition, deletion, and merging of terms. However, no study has evaluated the robustness of a particular similarity measure over the evolution of GO. By robustness, we mean that a similarity measure should either improve or maintain its performance as GO evolves. In this paper, we systematically study this robustness for the task of scoring confidence of PPIs using GO-based similarity measures. We observe that the performance of similarity measures is affected by the regular updates of GO. We find that similarity measures are not robust in all conditions; rather, they maintain similar performance over the evolution of GO only in certain conditions.

1 Introduction

Gene Ontology (GO) [1] is a taxonomy of biological terms that represents the properties of genes and/or gene products (e.g., proteins; see Note 1). It is organized as a directed acyclic graph (DAG) that describes the relationships among the terms. Gene products are annotated to pertinent GO terms through annotation corpora. There are three GOs: biological process (BP), cellular component (CC), and molecular function (MF). Lord et al. [7] pioneered the use of ontology-based semantic similarity measures (SSMs) in genomics. An SSM is a quantitative function, \( SSM (t_1,t_2)\), that measures the closeness between two terms \(t_1\) and \(t_2\) based on their semantic representations in a given ontology. Subsequently, a variety of GO-based SSMs have been proposed and successfully applied to different genomics applications [4, 9].

A high similarity score between two proteins indicates that they are annotated with similar cellular components (if the CC ontology is used) or with similar biological processes (if the BP ontology is used). This gives indirect evidence that the two proteins are more likely to interact than pairs with a low similarity score. Hence, several studies have used a GO-based SSM between the two gene products involved in a PPI as a confidence score for the interaction. However, GO is updated regularly through the addition, deletion, and merging of terms along with their annotations. This may affect the similarity score of a protein pair calculated over different versions of the ontology. To the best of our knowledge, no study has systematically examined the effect of the evolution of GO on SSMs. In this paper, we systematically study whether changes in GO affect the performance of GO-based SSMs. Further, we compare multiple GO-based SSMs under this setting for the task of scoring confidence of PPIs.

Section 2 briefly discusses the necessary background and terminology. In Sect. 3, we discuss the datasets and GO versions used, along with the evaluation metrics. Results are discussed and analyzed in Sect. 4.

2 Background

Semantic Similarity Measure (SSM). SSMs can be categorized mainly into two approaches: edge- and node-based [10]. The edge-based approach mainly considers the shared paths between two ontology terms and does not account for the annotation information of terms. Node-based SSMs compute the similarity between two terms by comparing their properties, common ancestors, and descendants. This approach is less sensitive to the topological structure of the ontology but more sensitive to changes in annotations. SSMs such as [2, 14] combine the node- and edge-based approaches and are commonly referred to as hybrid approaches. A few methods, such as TCSS [4], are designed around the complex structure of the GO DAG.

SSMs are defined for two individual terms, but a protein is annotated with a set of terms. So if two proteins \(p_1\) and \(p_2\) are annotated with sets of terms S and T, respectively, then \( SSM (p_1,p_2)\) is calculated as \( SSM (S,T)\), which requires combining the SSM scores of individual term pairs. Generally, three strategies are used in the literature: maximum (MAX), average (Avg), and best-match average (BMA). In the MAX and Avg strategies, the similarity between S and T is calculated as the maximum and the average, respectively, of the term-pair similarities over \(S \times T\). The term-pair similarities between the two sets can be arranged as a matrix with one row per term in S and one column per term in T. BMA is defined as the average of the maximum similarity scores of all rows and columns of this matrix.
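The three combination strategies can be illustrated with a minimal sketch. The function names and the toy matrix below are illustrative assumptions, not taken from the paper; BMA is implemented as the mean over all row maxima and column maxima pooled together, which is one reading of the definition above.

```python
from typing import List

Matrix = List[List[float]]  # rows: terms of S, columns: terms of T


def combine_max(sim: Matrix) -> float:
    """MAX: the largest term-pair similarity over S x T."""
    return max(max(row) for row in sim)


def combine_avg(sim: Matrix) -> float:
    """Avg: the mean of all term-pair similarities over S x T."""
    values = [v for row in sim for v in row]
    return sum(values) / len(values)


def combine_bma(sim: Matrix) -> float:
    """BMA: the mean of the best match of every row and every column."""
    row_best = [max(row) for row in sim]
    col_best = [max(col) for col in zip(*sim)]
    best = row_best + col_best
    return sum(best) / len(best)


# Hypothetical term-pair similarities for |S| = 2 and |T| = 3.
toy = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
]
print(round(combine_max(toy), 3))  # 0.9
print(round(combine_avg(toy), 3))  # 0.45
print(round(combine_bma(toy), 3))  # 0.76
```

On this toy matrix, MAX reports only the single best term pair, whereas BMA is pulled down by the middle column, whose best match is only 0.4.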

3 Experimental Design

GOs and SSMs. We consider BP and CC ontologies along with MAX and BMA in the evaluation. These ontologies and strategies are the most relevant for scoring confidence of PPIs [8]. We exclude electronically inferred annotations (IEA) as they are not verified by human experts. Further, we consider only those PPIs where both the interacting proteins are annotated to at least one GO term other than the root.

We select five different Bioconductor versions of GO and corresponding annotation corpora: 3.0 (2014-09-13), 3.1 (2015-03-13), 3.2 (2015-09-19), 3.3 (2016-03-05), and 3.4 (2016-09-21). We consider six state-of-the-art SSMs proposed by Resnik [12], Lin [6], Schlicker et al. [13], Jiang and Conrath [5], Wang et al. [14], and Jain and Bader [4], referred to as Resnik, Lin, Rel, Jiang, Wang, and TCSS, respectively, in the rest of the paper. Resnik and TCSS with MAX strategy have been considered to be the best SSMs for scoring confidence of PPIs by several studies [4, 9]. We also consider RDS, RNS, and RES, proposed recently by Paul and Anand [8]. The selected nine SSMs encompass all types of SSMs, as discussed in Sect. 2.

Datasets. We use the core subsets of the yeast PPIs from the DIP database (Database of Interacting Proteins) [15], downloaded on 29.10.2015, as positive instances. As in [4], an equal number of negative PPI instances is generated independently by randomly choosing protein pairs that are annotated in BP and CC and are not present in the iRefWeb database [11], a combined database of all known PPIs, accessed on 27.11.2015.
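The negative-set construction can be sketched as rejection sampling over unordered protein pairs. This is only an illustration under stated assumptions: the function name, the seed parameter, and the toy identifiers are hypothetical, and the sketch takes the list of annotated proteins and the set of known interactions as plain Python collections rather than reading DIP or iRefWeb.

```python
import random
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def sample_negative_pairs(
    proteins: Iterable[str],
    known_pairs: Iterable[Pair],
    n: int,
    seed: int = 0,
) -> List[Pair]:
    """Draw n distinct unordered protein pairs that are not known PPIs."""
    rng = random.Random(seed)
    pool = sorted(set(proteins))
    pool_set = set(pool)
    # Store interactions as unordered pairs so (A, B) equals (B, A).
    known = {frozenset(pair) for pair in known_pairs}
    total = len(pool) * (len(pool) - 1) // 2
    blocked = sum(1 for pair in known if len(pair) == 2 and pair <= pool_set)
    if n > total - blocked:
        raise ValueError("not enough non-interacting pairs to sample from")
    negatives = set()
    while len(negatives) < n:
        pair = frozenset(rng.sample(pool, 2))  # two distinct proteins
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Hypothetical identifiers: four annotated proteins, two known interactions.
annotated = ["A", "B", "C", "D"]
known = [("A", "B"), ("D", "C")]
print(sample_negative_pairs(annotated, known, n=2))
```

Rejection sampling is adequate here because known interactions are sparse relative to all possible pairs; the upfront count check guards against a request that could never be satisfied.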

Proteins involved in a pathway are more likely to interact among themselves and likely to be annotated to the same or similar GO terms and thus should show high similarity scores. A set of 11 yeast (S. cerevisiae) KEGG pathways is selected as in [8]. During the selection of pathways, the authors of [8] try to maintain a trade-off between functional diversity and computational time required for the experiment.

Evaluation Metrics. A similarity measure can classify a set of PPIs into two groups: positives and negatives, for a given cutoff similarity score. Hence an SSM can be treated as a binary classifier. We utilize the area under the ROC curve (AUC) as an evaluation metric for binary classifiers.
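As an illustration of treating an SSM as a binary classifier, the AUC can be computed directly from the similarity scores of positive and negative PPIs, without fixing a cutoff. The sketch below uses the standard rank interpretation of AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half); the function name and the toy scores are assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical similarity scores for three positive and three negative PPIs.
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.4]
print(round(auc(positives, negatives), 3))  # 0.833
```

Here 7.5 of the 9 positive-negative comparisons favor the positive pair, giving an AUC of about 0.833; a value of 1 would mean every positive PPI outscores every negative one.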

For each KEGG pathway, an intra-set average similarity is computed as the average of all pairwise similarities of proteins within the pathway. An inter-set average similarity for every two pathways is computed as the average of all pairwise cross-similarities of proteins between the two pathways. The discriminating power (DP) of a pathway is defined, as in [3], as the ratio between its intra-set average similarity and the average of all inter-set average similarities between that pathway and the other pathways. Thus, the DP quantifies the ability of an SSM to distinguish among functionally different sets of proteins (e.g., KEGG pathways).
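The DP definition can be sketched as follows. The function names, the toy pathways, and the toy similarity function are hypothetical; the sketch also assumes that the intra-set average runs over distinct protein pairs only (no self-pairs), which the text does not specify.

```python
from itertools import combinations, product
from typing import Callable, Dict, List

Sim = Callable[[str, str], float]


def intra_set_similarity(proteins: List[str], sim: Sim) -> float:
    """Average similarity over all distinct protein pairs in one pathway."""
    pairs = list(combinations(proteins, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def inter_set_similarity(set_a: List[str], set_b: List[str], sim: Sim) -> float:
    """Average cross-similarity between the proteins of two pathways."""
    pairs = list(product(set_a, set_b))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def discriminating_power(name: str, pathways: Dict[str, List[str]], sim: Sim) -> float:
    """Intra-set average divided by the mean inter-set average to all other pathways."""
    intra = intra_set_similarity(pathways[name], sim)
    inters = [
        inter_set_similarity(pathways[name], members, sim)
        for other, members in pathways.items()
        if other != name
    ]
    return intra / (sum(inters) / len(inters))


# Toy data: proteins sharing a first letter are treated as functionally similar.
toy_pathways = {"P1": ["a1", "a2", "a3"], "P2": ["b1", "b2"], "P3": ["c1", "c2"]}


def toy_sim(x: str, y: str) -> float:
    return 0.8 if x[0] == y[0] else 0.2


print(round(discriminating_power("P1", toy_pathways, toy_sim), 3))  # 4.0
```

A DP well above 1 means the measure scores proteins within the pathway as much more similar to each other than to proteins of other pathways.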

4 Results and Discussion

ROC curve analysis: Table 1 summarizes the AUCs of the top five SSMs for the different versions of the BP ontology. The AUC values of all SSMs change only insignificantly, which at first suggests that the evolution of GO has no impact on their classification performance. This can be explained easily. An AUC of 1 implies a perfect classifier, while an AUC of 0.5 indicates a random classifier, so the practical range of AUC for a reasonably good classifier is very limited (generally [0.7, 1]). Unless the majority of the PPIs are affected by the changes in GO, high variability in AUCs over the different versions of GO is not expected. By affected, we mean that, for a given PPI, an SSM produces different similarity scores for different GO versions. In fact, the majority of PPIs in the PPI dataset are not affected significantly by the changes in GO.

Table 1. The area under the curves (AUCs) of SSMs for the different GO-BP versions. The best AUC for each strategy is shown in bold.

To get a closer picture of the impact, we identify the PPIs whose similarity scores change over the versions of GO. For each SSM, we select the PPIs common to the five GO versions (more than \(99\%\) of PPIs are common). For each selected PPI, we calculate the standard deviation of its five similarity scores, one per GO version. We then sort the PPIs by standard deviation in descending order and select the top \(10\%\). These are the \(10\%\) of PPIs most affected by the changes in GO. An equal number of negative PPIs is selected from the already generated negative PPIs for the corresponding SSM. Finally, the AUC is computed on the selected positive and negative PPIs for each GO version. The resulting AUCs of the two best-performing SSMs for the different versions of GO-BP are shown in Table 2.
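The selection of the most affected PPIs can be sketched as below. The function name, the rounding rule for the cutoff, and the toy scores are assumptions. The sketch uses the population standard deviation; because every PPI has the same number of scores, the sample standard deviation would give the same ranking.

```python
from statistics import pstdev
from typing import Dict, List


def most_affected(scores_by_ppi: Dict[str, List[float]], fraction: float = 0.10) -> List[str]:
    """Return the PPIs whose scores vary most across GO versions.

    scores_by_ppi maps a PPI identifier to its similarity scores,
    one score per GO version, restricted to PPIs common to all versions.
    """
    ranked = sorted(
        scores_by_ppi,
        key=lambda ppi: pstdev(scores_by_ppi[ppi]),
        reverse=True,  # highest standard deviation first
    )
    # Assumption: round the cutoff to the nearest integer, keeping at least one PPI.
    k = max(1, round(fraction * len(ranked)))
    return ranked[:k]


# Toy data: nine PPIs with constant scores and one whose score drifts.
toy_scores = {f"ppi{i}": [0.5] * 5 for i in range(9)}
toy_scores["changed"] = [0.2, 0.4, 0.6, 0.8, 1.0]
print(most_affected(toy_scores))  # ['changed']
```

Passing other values of `fraction` (for example 0.5 or 1.0) yields the subsets for the remaining cutoffs of affected PPIs.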

Now the performance variations of SSMs among GO versions are clearly visible. For RES, we observe relative changes of approximately \(8\%\) and \(4\%\) with the MAX and BMA strategies, respectively. Similarly, for TCSS, we observe relative changes of approximately \(6\%\) and \(7\%\) with the MAX and BMA strategies, respectively. These changes are observed between versions 3.0 and 3.4. Similar observations are made for the other SSMs and with other ontologies. We also observe that, across all measures, the overall variability is higher in CC than in BP.

Table 2. The area under the curves (AUCs) of two best performing SSMs for the different GO-BP versions with top 10% most affected PPIs.

To find a general pattern of variability among SSMs, we repeat the aforementioned process for different cutoffs (\(100\%\) to the top \(10\%\)) of affected PPIs. Here a cutoff of \(100\%\) implies that all PPIs are considered and hence, the majority of them have no change in their similarity score. The mean AUCs (of five GO versions) achieved by SSMs in increasing order of variability of PPIs are shown in Fig. 1.

SSMs with the BMA strategy show greater robustness than those with the MAX strategy. Almost all SSMs with the BMA strategy either improve or maintain their initial performance as variability increases in both ontologies. In BP in particular, the improvement is smoother and more consistent. With the MAX strategy, however, the performance fluctuates considerably, and the irregularity is greater in CC. It therefore seems that the MAX strategy overestimates similarity in many cases, especially in CC.

All SSMs exhibit higher robustness in BP than in CC. Examining each SSM separately gives further insights (see Figs. 2 and 3). With all data considered (\(100\%\)), SSMs with the MAX strategy give better AUCs than those with BMA. However, as variability increases (by removing PPIs whose scores do not change over GO evolution), SSMs with BMA obtain higher AUCs. For TCSS, although BMA improves its performance continuously, it is unable to surpass the performance of MAX, particularly in BP. In fact, the performance gap between MAX and BMA for TCSS and Resnik narrows as variability increases, and the two strategies perform almost equally on highly variable PPIs (>50%).

RES-BMA consistently produces the highest AUCs as variability increases. In general, RES, RNS, and TCSS show comparatively high robustness. With the top \(10\%\) most variable PPIs, the highest mean AUC is 0.949/0.957 (BP/CC), produced by RES-BMA, while the second-highest mean AUC is 0.922/0.940 (BP/CC), produced by TCSS with MAX or BMA.

Set-discriminating power of KEGG pathways: For each GO version and SSM, we calculate the DP value of each pathway with respect to the other 10 pathways. We then take the mean DP value for each GO version. Table 3 shows the mean DP values of all the 11 pathways for each GO-BP version and SSM.

Table 3. The mean DP values of all the 11 pathways for each GO-BP version and SSM. The best DP values are shown in bold.

The majority of SSMs produce quite similar DP values over the evolution of GO, since few PPIs are affected by the changes in GO. RES almost always produces higher DP values in both ontologies, particularly with the BMA strategy. TCSS shows competitive performance in both ontologies, while Jiang achieves good DP values in BP only. Significant differences between the MAX and BMA strategies, in both BP and CC simultaneously, are observed only with RES, TCSS, and, to some extent, RNS.

RES-BMA shows continuous and significant improvement over the evolution of GO. We can assume that a newer GO version represents more accurate and complete information than an older one, and a robust SSM should reflect that positively. RES-BMA improves its DP value almost continuously over the evolution of the BP ontology (5.59, 5.76, 6.38, 6.58, and 6.50), except for the last version (Ver. 3.4), whereas the other SSMs keep their performance quite similar. In fact, the changes between the last two GO-BP versions (Ver. 3.3 to Ver. 3.4), particularly in edges, are very small (+0.30%) compared with those between other versions (the average successive change is +2.91%). Hence, the changes in GO are reflected better by RES-BMA than by the other SSMs.

5 Conclusion

In this paper, we systematically study how similarity measures are affected by the evolution of Gene Ontology for the task of scoring confidence of PPIs. We observe that the performance of each measure is affected by the regular updates of GO. All SSMs exhibit satisfactory robustness with the BMA strategy in the BP ontology only. SSMs with the MAX strategy tend to overestimate, particularly in CC. Although RES-BMA, TCSS-BMA, and RNS-BMA exhibit comparatively good robustness, the changes in GO are reflected better by RES-BMA than by the others.

Notes

1. Hereafter we refer to gene products only.

References

1. Ashburner, M., et al.: Gene ontology: tool for the unification of biology. Nature Genet. 25(1), 25–29 (2000)
2. Bandyopadhyay, S., Mallick, K.: A new path based hybrid measure for gene ontology similarity. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 11(1), 116–127 (2014)
3. Benabderrahmane, S., Smail-Tabbone, M., Poch, O., Napoli, A., Devignes, M.D.: IntelliGO: a new vector-based semantic similarity measure including annotation origin. BMC Bioinform. 11(1), 588 (2010)
4. Jain, S., Bader, G.D.: An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinform. 11(1), 562 (2010)
5. Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of 10th International Conference on Research In Computational Linguistics, ROCLING 1997 (1997)
6. Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the Fifteenth International Conference on Machine Learning, vol. 98, pp. 296–304. Morgan Kaufmann Publishers Inc., San Francisco (1998)
7. Lord, P.W., Stevens, R.D., Brass, A., Goble, C.A.: Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics 19(10), 1275–1283 (2003)
8. Paul, M., Anand, A.: A new family of similarity measures for scoring confidence of protein interactions using gene ontology, p. 459107. bioRxiv (2018)
9. Pesquita, C.: Semantic similarity in the gene ontology. In: Dessimoz, C., Škunca, N. (eds.) The Gene Ontology Handbook. MMB, vol. 1446, pp. 161–173. Springer, New York (2017). https://doi.org/10.1007/978-1-4939-3743-1_12
10. Pesquita, C., Faria, D., Falcao, A.O., Lord, P., Couto, F.M.: Semantic similarity in biomedical ontologies. PLoS Comput. Biol. 5(7), e1000443 (2009)
11. Razick, S., Magklaras, G., Donaldson, I.M.: iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinform. 9(1), 1 (2008)
12. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448–453. Morgan Kaufmann Publishers Inc., San Francisco (1995)
13. Schlicker, A., Domingues, F.S., Rahnenführer, J., Lengauer, T.: A new measure for functional similarity of gene products based on gene ontology. BMC Bioinform. 7(1), 302 (2006)
14. Wang, J.Z., Du, Z., Payattakool, R., Yu, P.S., Chen, C.F.: A new method to measure the semantic similarity of GO terms. Bioinformatics 23(10), 1274–1281 (2007)
15. Xenarios, I., Rice, D.W., Salwinski, L., Baron, M.K., Marcotte, E.M., Eisenberg, D.: DIP: the database of interacting proteins. Nucleic Acids Res. 28(1), 289–291 (2000)

Author information

Authors and Affiliations

Department of Computer Science and Engineering, IIT Guwahati, Guwahati, India
Madhusudan Paul, Ashish Anand & Saptarshi Pyne
Department of Computer and System Sciences, Visva-Bharati, Santiniketan, India
Madhusudan Paul

Corresponding author

Correspondence to Ashish Anand.

Editor information

Editors and Affiliations

Tezpur University, Tezpur, India
Bhabesh Deka
Indian Statistical Institute, Kolkata, India
Pradipta Maji
Indian Statistical Institute, Kolkata, India
Sushmita Mitra
Tezpur University, Tezpur, India
Dhruba Kumar Bhattacharyya
Indian Institute of Technology Guwahati, Guwahati, India
Prabin Kumar Bora
Indian Statistical Institute, Kolkata, India
Sankar Kumar Pal

Cite this paper

Paul, M., Anand, A., Pyne, S. (2019). Impact of the Continuous Evolution of Gene Ontology on Similarity Measures. In: Deka, B., Maji, P., Mitra, S., Bhattacharyya, D., Bora, P., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2019. Lecture Notes in Computer Science, vol 11942. Springer, Cham. https://doi.org/10.1007/978-3-030-34872-4_14

DOI: https://doi.org/10.1007/978-3-030-34872-4_14
Published: 25 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34871-7
Online ISBN: 978-3-030-34872-4
eBook Packages: Computer Science, Computer Science (R0)
