An Efficient Document Indexing-Based Similarity Search in Large Datasets

Phan, Trong Nhan; Jäger, Markus; Nadschläger, Stefan; Küng, Josef; Dang, Tran Khanh

doi:10.1007/978-3-319-26135-5_2

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9446)

Included in the following conference series:

International Conference on Future Data and Security Engineering

Abstract

In this paper, we propose a novel MapReduce-based approach for efficient similarity search in big data. Specifically, we address the drawbacks of using an inverted index in similarity search with MapReduce and propose a simple yet efficient redundancy-free MapReduce scheme, which not only improves on the baseline inverted index-based procedures but also adapts to various similarity measures and similarity searches. Additionally, we present further strategies that help eliminate unnecessary data and computations. Finally, we conduct extensive empirical evaluations with real massive datasets on the Hadoop framework in a cluster of commodity machines to verify the proposed methods; the promising results show how beneficial they are when dealing with big data.
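
For orientation, the generic inverted index-based baseline that the abstract refers to can be sketched as two MapReduce-style phases simulated in plain Python. This is only an illustrative sketch under stated assumptions, not the scheme proposed in the paper: the function names, the whitespace tokenizer, the toy documents, and the use of a raw term-frequency dot product as the similarity score are all hypothetical choices made here.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(docs):
    """Phase 1: map each document to (term, (doc_id, tf)) pairs and
    group them by term, giving one posting list per term."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():  # assumed tokenizer: whitespace
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return index


def pairwise_dot_products(index):
    """Phase 2: every posting list emits a partial score for each pair of
    documents sharing that term; summing per pair plays the reducer role."""
    scores = defaultdict(int)
    for postings in index.values():
        # A posting list of length n yields n*(n-1)/2 pairs, so very
        # frequent terms dominate the cost of this baseline.
        for (d1, tf1), (d2, tf2) in combinations(sorted(postings), 2):
            scores[(d1, d2)] += tf1 * tf2
    return dict(scores)


# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "big data search",
    "d2": "big data index",
    "d3": "similarity search",
}
scores = pairwise_dot_products(build_inverted_index(docs))
```

In this toy run, only document pairs that share at least one term receive a score, and the same pair is touched once per shared term, which hints at why long posting lists make the baseline expensive on large collections.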

Acknowledgements

Our sincere thanks to Faruk Kujundžić, Scientific Computing, Information Management team, Johannes Kepler University Linz, for his kind support with the Alex Cluster.

Author information

Authors and Affiliations

Institute for Application Oriented Knowledge Processing, Johannes Kepler University Linz, Linz, Austria
Trong Nhan Phan, Markus Jäger, Stefan Nadschläger & Josef Küng
Faculty of Computer Science and Engineering, HCMC University of Technology, Ho Chi Minh City, Vietnam
Tran Khanh Dang

Corresponding author

Correspondence to Trong Nhan Phan.

Editor information

Editors and Affiliations

Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Tran Khanh Dang
Johannes Kepler University Linz, Linz, Austria
Roland Wagner
Johannes Kepler University Linz, Linz, Austria
Josef Küng
Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Nam Thoai
Hosei University, Tokyo, Japan
Makoto Takizawa
University of Vienna, Vienna, Austria
Erich Neuhold

About this paper

Cite this paper

Phan, T.N., Jäger, M., Nadschläger, S., Küng, J., Dang, T.K. (2015). An Efficient Document Indexing-Based Similarity Search in Large Datasets. In: Dang, T., Wagner, R., Küng, J., Thoai, N., Takizawa, M., Neuhold, E. (eds) Future Data and Security Engineering. FDSE 2015. Lecture Notes in Computer Science, vol 9446. Springer, Cham. https://doi.org/10.1007/978-3-319-26135-5_2

DOI: https://doi.org/10.1007/978-3-319-26135-5_2
Published: 08 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26134-8
Online ISBN: 978-3-319-26135-5
eBook Packages: Computer Science, Computer Science (R0)
