An Efficient Similarity Search in Large Data Collections with MapReduce

Phan, Trong Nhan; Küng, Josef; Dang, Tran Khanh

doi:10.1007/978-3-319-12778-1_4

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8860)

Abstract

The era of big data calls for innovations in similarity search. The steadily growing volume of data strains both the processing capacity and the performance of existing information systems. To address this scalability challenge, we propose an efficient similarity search scheme for large data collections with MapReduce. We apply the scheme to common similarity search cases: pairwise similarity, search by example, range query, and k-Nearest Neighbor query. In addition, collaborative strategic refinements eliminate unnecessary computations and speed up the whole process. Finally, we evaluate our methods, together with a previous work, through experiments on large real datasets, and the results support the proposed methods.
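The four query types named in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' scheme: the abstract does not specify the similarity measure, the data representation, or the refinements, so the sketch assumes word-set documents and Jaccard similarity, and it emulates the map, shuffle, and reduce steps in memory rather than on a Hadoop cluster. All function names (`tokens`, `jaccard`, `map_phase`, `shuffle`, `candidate_pairs`, `pairwise_similarity`, `search_by_example`, `range_query`, `knn_query`) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Split a document into a set of lowercase word tokens (assumed representation)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed measure, not taken from the paper)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def map_phase(docs):
    """Map: emit one (term, doc_id) pair per distinct term of each document."""
    for doc_id, text in docs.items():
        for term in tokens(text):
            yield term, doc_id


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def candidate_pairs(docs):
    """Reduce: documents that share at least one term become candidate pairs."""
    candidates = set()
    for doc_ids in shuffle(map_phase(docs)).values():
        candidates.update(combinations(sorted(doc_ids), 2))
    return candidates


def pairwise_similarity(docs):
    """Score candidate pairs only; pairs without a shared term have similarity 0 and are skipped."""
    sets = {doc_id: tokens(text) for doc_id, text in docs.items()}
    return {(a, b): jaccard(sets[a], sets[b]) for a, b in candidate_pairs(docs)}


def search_by_example(docs, query):
    """Rank documents with non-zero similarity to the query, best first."""
    q = tokens(query)
    scored = [(doc_id, jaccard(q, tokens(text))) for doc_id, text in docs.items()]
    return sorted((p for p in scored if p[1] > 0), key=lambda p: (-p[1], p[0]))


def range_query(docs, query, threshold):
    """Keep documents whose similarity to the query is at least the threshold."""
    return [p for p in search_by_example(docs, query) if p[1] >= threshold]


def knn_query(docs, query, k):
    """Return the k documents most similar to the query."""
    return search_by_example(docs, query)[:k]


docs = {
    "d1": "similarity search with mapreduce",
    "d2": "similarity join with mapreduce",
    "d3": "index structure for multidimensional data",
}
print(pairwise_similarity(docs))
print(range_query(docs, "similarity search", 0.4))
print(knn_query(docs, "similarity search", 2))
```

Grouping by term before scoring is one generic way to avoid comparing documents that share nothing; the refinements the paper itself uses are not described in the abstract and are not reproduced here.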

Author information

Authors and Affiliations

FAW Institute, Johannes Kepler University Linz, Austria
Trong Nhan Phan & Josef Küng
HCMC University of Technology, Ho Chi Minh City, Vietnam
Tran Khanh Dang

Editor information

Editors and Affiliations

Ho Chi Minh City University of Technology, 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam
Tran Khanh Dang & Nam Thoai
Johannes Kepler University Linz, Altenberger Straße 69, 4040, Linz, Austria
Roland Wagner & Josef Küng
University of Vienna, Währinger Straße 29, 1190, Wien, Austria
Erich Neuhold
Hosei University, 3-7-2, Kajino-machi, 184-8584, Koganei-shi, Tokyo, Japan
Makoto Takizawa

About this paper

Cite this paper

Phan, T.N., Küng, J., Dang, T.K. (2014). An Efficient Similarity Search in Large Data Collections with MapReduce. In: Dang, T.K., Wagner, R., Neuhold, E., Takizawa, M., Küng, J., Thoai, N. (eds) Future Data and Security Engineering. FDSE 2014. Lecture Notes in Computer Science, vol 8860. Springer, Cham. https://doi.org/10.1007/978-3-319-12778-1_4

DOI: https://doi.org/10.1007/978-3-319-12778-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12777-4
Online ISBN: 978-3-319-12778-1
eBook Packages: Computer Science, Computer Science (R0)