Abstract
Similarity is a central notion throughout human life, and it will soon become the prevalent strategy for dealing with digital content in computer systems as well. But the exponential growth of data makes scalability and performance serious matters of concern. Contemporary decentralized mass-communication media, which allow cooperative and collaborative practices, enable users to contribute autonomously to the production of global media whose elements are in fact related by numerous multifaceted links of similarity. As an example, consider sites like Flickr, YouTube, or Facebook, which host user-contributed heterogeneous content for a variety of events. Accordingly, the core ability of future data processing systems is the similarity management of large and ever-growing volumes of data. Put simply, real-life performance is constrained from two points of view: (1) the query response time, and (2) the query execution throughput, i.e., the number of queries processed per unit of time. Typically, the query response time should be on-line, say less than one second, while large-scale web applications can be expected to require a query execution throughput in the hundreds or thousands.
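The two measures named above can be illustrated with a minimal sketch. It is not the authors' system: it assumes a brute-force range query under Euclidean distance as a stand-in for a metric index, and the names `range_query`, `throughput`, and `measure` are hypothetical, chosen only for this illustration.

```python
import math
import time
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance, used here as an example metric."""
    return math.dist(a, b)


def range_query(
    data: Sequence[Point],
    query: Point,
    radius: float,
    dist: Callable[[Point, Point], float] = euclidean,
) -> List[Point]:
    """Brute-force range query: every object within `radius` of `query`."""
    return [obj for obj in data if dist(query, obj) <= radius]


def throughput(num_queries: int, elapsed_seconds: float) -> float:
    """Query execution throughput: queries processed per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_queries / elapsed_seconds


def measure(
    data: Sequence[Point], queries: Sequence[Point], radius: float
) -> Tuple[float, float]:
    """Run the queries one after another.

    Returns (worst response time in seconds, throughput in queries/second).
    """
    response_times = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        range_query(data, q, radius)
        response_times.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return max(response_times), throughput(len(queries), elapsed)
```

With sequential execution, as in this sketch, throughput is roughly the inverse of the average response time. The two measures diverge once queries are processed concurrently, which is why they are treated as separate constraints.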
Index Terms
- Real-life performance of metric searching