A Method for Similarity-Based Grouping of Biological Data

Jakonienė, Vaida; Rundqvist, David; Lambrix, Patrick

doi:10.1007/11799511_13

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4075)

Included in the following conference series:

International Workshop on Data Integration in the Life Sciences

Abstract

Similarity-based grouping of data entries in one or more data sources underlies many data management tasks, such as structuring search results, removing redundancy in databases, and data integration. In life science data sources this grouping is not trivial, because the stored data is complex, highly correlated, and represented at different levels of granularity. The contribution of this paper is two-fold: (1) we propose a method for similarity-based grouping, and (2) we show results from test cases. The main steps of the method are specification of grouping rules, pairwise grouping between entries, actual grouping of similar entries, and evaluation and analysis of the results. Often, different strategies can be used in the different steps. The method enables exploration of the influence of these choices and supports evaluation of the results with respect to given classifications. The grouping method is illustrated by test cases based on different strategies and classifications. The results show the complexity of similarity-based grouping tasks and give deeper insights into the selected grouping tasks, the analyzed data source, and the influence of different strategies on the results.
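
The four steps named in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the string similarity from `difflib` stands in for whatever similarity measure a grouping rule would use, the threshold of 0.8 is arbitrary, connected components is only one possible grouping strategy, purity is only one possible evaluation measure, and the entries and class labels are hypothetical toy data.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; an assumption, not the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


def pairwise_grouping(entries, threshold):
    """Pairwise step: index pairs whose similarity meets the rule's threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(entries)), 2)
        if similarity(entries[i], entries[j]) >= threshold
    ]


def group(n, pairs):
    """Grouping step: merge linked entries (connected components via union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def purity(groups, labels):
    """Evaluation step: share of entries that match their group's majority class."""
    majority = sum(Counter(labels[i] for i in g).most_common(1)[0][1] for g in groups)
    return majority / len(labels)


# Hypothetical toy entries and a hypothetical reference classification.
entries = ["kinase A", "kinase B", "transporter X", "transporter Y"]
labels = ["kinase", "kinase", "transporter", "transporter"]

# Rule specification step: here the rule is simply "similarity >= 0.8".
pairs = pairwise_grouping(entries, threshold=0.8)
groups = group(len(entries), pairs)
print(groups, purity(groups, labels))
```

Changing the threshold, the similarity function, or the grouping strategy and re-running the evaluation is the kind of exploration of choices that the abstract describes, although the strategies actually compared in the paper may differ from the ones sketched here.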

Author information

Authors and Affiliations

Department of Computer and Information Science, Linköpings universitet, SE-581 83, Linköping, Sweden
Vaida Jakonienė, David Rundqvist & Patrick Lambrix

Editor information

Editors and Affiliations

Humboldt-Universität zu Berlin,
Ulf Leser
Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
Felix Naumann
IBM Application and Integration Middleware, 1475 Phoenixville Pike, 19380, West Chester, PA, USA
Barbara Eckman

About this paper

Cite this paper

Jakonienė, V., Rundqvist, D., Lambrix, P. (2006). A Method for Similarity-Based Grouping of Biological Data. In: Leser, U., Naumann, F., Eckman, B. (eds) Data Integration in the Life Sciences. DILS 2006. Lecture Notes in Computer Science, vol 4075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11799511_13

DOI: https://doi.org/10.1007/11799511_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36593-8
Online ISBN: 978-3-540-36595-2
eBook Packages: Computer Science, Computer Science (R0)
