Similarity Search for an Extreme Application: Experience and Implementation

  • Conference paper
Similarity Search and Applications (SISAP 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 13058)

Abstract

Contemporary challenges for efficient similarity search include complex similarity functions, the curse of dimensionality, and large sizes of descriptive features of data objects. This article reports our experience with a database of protein chains that form an (almost) metric space and exhibit the following extreme properties. Evaluating the pairwise similarity of protein chains can take tens of minutes, and its duration varies over six orders of magnitude. Minimising the number of similarity comparisons is thus crucial, so we propose a generic three-stage search engine for this purpose. Compared with the search engine currently used in practice for the protein database, we improve the median search time 73-fold.
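
The abstract does not say what the three stages do, so the following is only a minimal sketch of one common way to cut down expensive distance evaluations in a metric space: pivot-based filter-and-refine. It is not the authors' engine. The function name, its parameters, the `pivot_table` of precomputed object-to-pivot distances, and the toy 1-D data are all assumptions made for illustration. In a space that is only almost metric, the triangle-inequality bound used here would be approximate rather than exact.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def filter_and_refine(
    query: Hashable,
    objects: Sequence[Hashable],
    pivots: Sequence[Hashable],
    pivot_table: Dict[Tuple[Hashable, Hashable], float],
    distance: Callable[[Hashable, Hashable], float],
    k: int,
    shortlist_size: int,
) -> Tuple[List[Tuple[float, Hashable]], int]:
    """Hypothetical three-stage k-NN search; returns (answer, expensive calls)."""
    # Distances computed for this query are memoised so later stages can
    # re-use them instead of evaluating the expensive function again.
    cache: Dict[Hashable, float] = {}

    def d(obj: Hashable) -> float:
        if obj not in cache:
            cache[obj] = distance(query, obj)
        return cache[obj]

    # Stage 1 (assumed): a few expensive query-to-pivot distances.
    query_to_pivot = [d(p) for p in pivots]

    # Stage 2 (assumed): rank all objects by a cheap lower bound derived from
    # the triangle inequality, |d(q, p) - d(o, p)| <= d(q, o).
    def lower_bound(obj: Hashable) -> float:
        return max(
            abs(qp - pivot_table[(obj, p)])
            for qp, p in zip(query_to_pivot, pivots)
        )

    shortlist = sorted(objects, key=lower_bound)[:shortlist_size]

    # Stage 3 (assumed): refine only the shortlist with the expensive distance.
    refined = sorted(((d(o), o) for o in shortlist), key=lambda pair: pair[0])
    return refined[:k], len(cache)


# Toy usage: integers on a line stand in for protein chains.
objects = list(range(0, 100, 5))
pivots = [0, 95]
pivot_table = {(o, p): abs(o - p) for o in objects for p in pivots}
answer, evaluated = filter_and_refine(
    42, objects, pivots, pivot_table, lambda a, b: abs(a - b),
    k=3, shortlist_size=5,
)
print(answer, evaluated)
```

In this toy setting the search evaluates the expensive function for the two pivots and the five shortlisted objects only, rather than for all twenty objects. The shortlist size trades answer quality against the number of expensive evaluations.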

V. Mic and P. Zezula—This research was supported by ERDF “CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence” (No. CZ.02.1.01/0.0/0.0/16_019/0000822). Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

Notes

  1. This is an estimate extrapolated from other query executions. Our search engine evaluates approximately 1,000 distances per query on average. The vast majority of distances are not stored, so we do not know the precise number of distances evaluated in less than a second.

  2. No caching is used here, apart from re-using the distances evaluated in previous phases of the same query execution.

Correspondence to Vladimir Mic.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Mic, V., Raček, T., Křenek, A., Zezula, P. (2021). Similarity Search for an Extreme Application: Experience and Implementation. In: Reyes, N., et al. Similarity Search and Applications. SISAP 2021. Lecture Notes in Computer Science, vol 13058. Springer, Cham. https://doi.org/10.1007/978-3-030-89657-7_20

  • DOI: https://doi.org/10.1007/978-3-030-89657-7_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89656-0

  • Online ISBN: 978-3-030-89657-7

  • eBook Packages: Computer Science, Computer Science (R0)
