Efficiency and Scalability Issues in Metric Access Methods

Chapter in: Computational Intelligence in Medical Informatics

Part of the book series: Studies in Computational Intelligence (SCI, volume 85)

The metric space paradigm has recently received attention as an important model of similarity in Bioinformatics. Numerous techniques have been proposed to solve similarity (range or nearest-neighbor) queries on collections of data from metric domains. Although important representatives are outlined, this chapter does not try to replace existing comprehensive surveys. Its main objective is to explain, and to demonstrate experimentally, that similarity searching is typically an expensive process that does not scale easily to very large volumes of data, so distributed architectures able to exploit parallelism must be employed.
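To make the two query types named above concrete, here is a minimal sketch of a range query and a k-nearest-neighbor query answered by a plain sequential scan. It is an illustration only, not a method from the chapter: the function names and the choice of unit-cost edit distance as the metric are assumptions. Every query evaluates the distance function once per stored object, which is the baseline cost that metric access methods try to reduce.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

# Hypothetical type alias: any function obeying the metric postulates.
Metric = Callable[[str, str], int]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (a metric on strings)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def range_query(data: Sequence[str], q: str, radius: int,
                dist: Metric) -> List[str]:
    """Return all objects within `radius` of q; one distance call per object."""
    return [o for o in data if dist(q, o) <= radius]


def knn_query(data: Sequence[str], q: str, k: int,
              dist: Metric) -> List[Tuple[int, str]]:
    """Return the k closest objects as (distance, object), nearest first."""
    return heapq.nsmallest(k, ((dist(q, o), o) for o in data))
```

Both functions cost a number of distance computations linear in the collection size. With an expensive distance such as sequence alignment, that linear cost is what makes large collections hard to search on a single machine.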

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Dohnal, V., Gennaro, C., Zezula, P. (2008). Efficiency and Scalability Issues in Metric Access Methods. In: Kelemen, A., Abraham, A., Liang, Y. (eds) Computational Intelligence in Medical Informatics. Studies in Computational Intelligence, vol 85. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75767-2_12

  • DOI: https://doi.org/10.1007/978-3-540-75767-2_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75766-5

  • Online ISBN: 978-3-540-75767-2

  • eBook Packages: Engineering, Engineering (R0)
