Similarity Search for an Extreme Application: Experience and Implementation

Mic, Vladimir; Raček, Tomáš; Křenek, Aleš; Zezula, Pavel

doi:10.1007/978-3-030-89657-7_20

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 13058)

Included in the following conference series:

International Conference on Similarity Search and Applications

Abstract

Contemporary challenges for efficient similarity search include complex similarity functions, the curse of dimensionality, and large sizes of descriptive features of data objects. This article reports our experience with a database of protein chains, which form an (almost) metric space and exhibit the following extreme properties. Evaluating the pairwise similarity of two protein chains can take up to tens of minutes, and evaluation times vary by six orders of magnitude. Minimising the number of similarity comparisons is thus crucial, so we propose a generic three-stage search engine to address this. We improve the median search time by a factor of 73 compared with the search engine currently used in practice for the protein database.
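
As a rough illustration of why a staged design can reduce the number of expensive comparisons, the following Python sketch shows a generic filter-and-refine k-NN search. It is not the authors' engine: the pivot-based lower bound, the function name `three_stage_knn`, and the parameters `pivot_ids`, `pivot_table`, and `cand_size` are assumptions made for illustration, and the toy distance is the absolute difference of numbers rather than a protein chain comparison. Distances evaluated earlier in the same query are re-used through a per-query cache.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple


def three_stage_knn(
    query: float,
    data: Sequence[float],
    pivot_ids: Sequence[int],            # hypothetical: indices of pivot objects in `data`
    pivot_table: Sequence[Sequence[float]],  # precomputed dist(data[i], data[pivot_ids[j]])
    dist: Callable[[float, float], float],   # the expensive distance function
    k: int,
    cand_size: int,
) -> List[Tuple[float, int]]:
    """Illustrative filter-and-refine k-NN search (assumes a metric distance)."""
    cache: Dict[int, float] = {}  # per-query cache: object index -> distance to query

    def d(i: int) -> float:
        # Evaluate the expensive distance at most once per object and query.
        if i not in cache:
            cache[i] = dist(query, data[i])
        return cache[i]

    # Stage 1: candidate generation. Only the query-to-pivot distances are
    # evaluated; every object is ranked by a triangle-inequality lower bound.
    q_piv = [d(p) for p in pivot_ids]
    lower = [max(abs(qp - row[j]) for j, qp in enumerate(q_piv)) for row in pivot_table]
    candidates = sorted(range(len(data)), key=lower.__getitem__)[:cand_size]

    best: List[Tuple[float, int]] = []  # max-heap emulated with negated distances
    for i in candidates:
        # Stage 2: pruning. Candidates are sorted by lower bound, so once one
        # cannot beat the current k-th distance, none of the remaining can.
        if len(best) == k and lower[i] >= -best[0][0]:
            break
        # Stage 3: refinement with the expensive distance (cached if known).
        di = d(i)
        if len(best) < k:
            heapq.heappush(best, (-di, i))
        elif di < -best[0][0]:
            heapq.heapreplace(best, (-di, i))
    return sorted((-nd, i) for nd, i in best)


# Toy usage: numbers on a line, with a counter for distance evaluations.
calls = {"n": 0}

def counted_dist(a: float, b: float) -> float:
    calls["n"] += 1
    return abs(a - b)

data = [0.0, 1.0, 2.0, 5.0, 9.0, 10.0]
pivot_ids = [0, 5]
# Built offline in a real system, so these evaluations are not counted.
pivot_table = [[abs(x - data[p]) for p in pivot_ids] for x in data]
result = three_stage_knn(4.2, data, pivot_ids, pivot_table, counted_dist, k=2, cand_size=6)
```

In this toy run the search evaluates the distance to the two pivots and to two candidates only, instead of to all six objects. With a distance that only approximately satisfies the triangle inequality, the pruning in stage 2 would make the result approximate rather than exact.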

V. Mic and P. Zezula—This research was supported by ERDF “CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence” (No. CZ.02.1.01/0.0/0.0/16_019/0000822). Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

Notes

1. This is an estimate extrapolated from other query executions. Our search engine evaluates approximately 1,000 distances per average query. The vast majority of distances are not stored, so we do not know the precise number of distances evaluated in less than a second.
2. No caching is used here except for re-using the distances evaluated in the previous phases of the same query execution.

Author information

Authors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Vladimir Mic & Pavel Zezula
Institute of Computer Science, Masaryk University, Brno, Czech Republic
Tomáš Raček & Aleš Křenek

Corresponding author

Correspondence to Vladimir Mic.

Editor information

Editors and Affiliations

National University of San Luis, San Luis, Argentina
Nora Reyes
University of St Andrews, St Andrews, UK
Richard Connor
University of Vienna, Vienna, Austria
Nils Kriege
Kiel University, Kiel, Germany
Daniyal Kazempour
University of Bologna, Bologna, Italy
Ilaria Bartolini
TU Dortmund University, Dortmund, Germany
Erich Schubert
TU Dortmund University, Dortmund, Germany
Jian-Jia Chen

About this paper

Cite this paper

Mic, V., Raček, T., Křenek, A., Zezula, P. (2021). Similarity Search for an Extreme Application: Experience and Implementation. In: Reyes, N., et al. Similarity Search and Applications. SISAP 2021. Lecture Notes in Computer Science, vol 13058. Springer, Cham. https://doi.org/10.1007/978-3-030-89657-7_20

DOI: https://doi.org/10.1007/978-3-030-89657-7_20
Published: 22 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89656-0
Online ISBN: 978-3-030-89657-7
eBook Packages: Computer Science, Computer Science (R0)
