Abstract
Contemporary challenges for efficient similarity search include complex similarity functions, the curse of dimensionality, and large sizes of descriptive features of data objects. This article reports our experience with a database of protein chains which form an (almost) metric space and demonstrate the following extreme properties. Evaluation of the pairwise similarity of protein chains can take tens of minutes, and has a variance of six orders of magnitude. Minimising the number of similarity comparisons is thus crucial, so we propose a generic three-stage search engine to address this problem. We improve the median search time 73 times in comparison with the search engine currently employed in practice for the protein database.
V. Mic and P. Zezula—This research was supported by ERDF “CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence” (No. CZ.02.1.01/0.0/0.0/16_019/0000822). Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.
Notes
1. This is an estimate extrapolated from other query executions. Our search engine evaluates approximately 1,000 distances per query on average. The vast majority of distances are not stored, so we do not know the precise number of distances evaluated in less than a second.
2. No caching is used here except for re-using the distances evaluated in the previous phases of the same query execution.
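The per-query re-use of distances mentioned in note 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the class `QueryDistanceCache`, the function `two_phase_search`, the toy distance on numbers, and the two illustrative phases are all hypothetical names and simplifications introduced here. The sketch only shows how a distance evaluated in an earlier phase of a query can be looked up instead of recomputed in a later phase, with the cache discarded once the query finishes.

```python
from typing import Callable, Dict, Hashable, List, Tuple


class QueryDistanceCache:
    """Memoises distances to one query object for a single query execution."""

    def __init__(self, query: float, distance: Callable[[float, float], float]):
        self.query = query
        self.distance = distance
        self.evaluations = 0  # number of real (expensive) distance computations
        self._known: Dict[Hashable, float] = {}

    def get(self, obj_id: Hashable, obj: float) -> float:
        # Compute the distance only if no earlier phase has evaluated it.
        if obj_id not in self._known:
            self._known[obj_id] = self.distance(self.query, obj)
            self.evaluations += 1
        return self._known[obj_id]


def toy_distance(a: float, b: float) -> float:
    # Stand-in for an expensive similarity function (assumption for the demo).
    return abs(a - b)


def two_phase_search(
    query: float, data: Dict[str, float], pivot_ids: List[str], k: int
) -> Tuple[List[str], int]:
    # The cache lives only as long as this query execution.
    cache = QueryDistanceCache(query, toy_distance)

    # Phase 1 (illustrative): evaluate distances to a few reference objects.
    for pid in pivot_ids:
        cache.get(pid, data[pid])

    # Phase 2 (illustrative): rank all objects; reference objects hit the cache.
    ranked = sorted(data, key=lambda oid: cache.get(oid, data[oid]))
    return ranked[:k], cache.evaluations


if __name__ == "__main__":
    data = {"a": 1.0, "b": 4.0, "c": 9.0, "d": 2.5}
    answer, evaluations = two_phase_search(2.0, data, ["a", "c"], k=2)
    print(answer, evaluations)
```

In this toy run, the four objects require four distance evaluations in total rather than six, because the two reference objects evaluated in the first phase are not recomputed in the second.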
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Mic, V., Raček, T., Křenek, A., Zezula, P. (2021). Similarity Search for an Extreme Application: Experience and Implementation. In: Reyes, N., et al. Similarity Search and Applications. SISAP 2021. Lecture Notes in Computer Science, vol 13058. Springer, Cham. https://doi.org/10.1007/978-3-030-89657-7_20
DOI: https://doi.org/10.1007/978-3-030-89657-7_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89656-0
Online ISBN: 978-3-030-89657-7
eBook Packages: Computer Science, Computer Science (R0)