Abstract
In this paper, we propose a novel MapReduce-based approach for efficient similarity search in big data. Specifically, we address the drawbacks of using an inverted index for similarity search with MapReduce and then propose a simple yet efficient redundancy-free MapReduce scheme, which not only has advantages over the baseline inverted index-based procedures but also adapts to various similarity measures and types of similarity search. Additionally, we present further strategies that help eliminate unnecessary data and computations. Finally, we conduct extensive empirical evaluations on real massive datasets with the Hadoop framework on a cluster of commodity machines to verify the proposed methods; the promising results show how beneficial they are when dealing with big data.
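For orientation, the kind of baseline inverted index-based procedure mentioned above can be sketched as two map/reduce-style phases run in a single process. This is a minimal illustration under stated assumptions, not the scheme proposed in the paper: the function names (`map_index`, `map_pairs`, `pairwise_similarity`), the whitespace tokenizer, and the use of a raw term-frequency dot product as the similarity measure are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def map_index(doc_id, text):
    """Phase 1 map: emit (term, (doc_id, term_frequency)) postings."""
    # Assumption: documents are tokenized by lowercasing and splitting on whitespace.
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)


def map_pairs(postings):
    """Phase 2 map: for one term's posting list, emit a partial
    dot-product contribution for every pair of documents sharing the term."""
    for (d1, tf1), (d2, tf2) in combinations(sorted(postings), 2):
        yield (d1, d2), tf1 * tf2


def pairwise_similarity(docs):
    """Run both phases in-process and return {(d1, d2): dot product}."""
    index = defaultdict(list)  # stands in for the shuffle: group postings by term
    for doc_id, text in docs.items():
        for term, posting in map_index(doc_id, text):
            index[term].append(posting)

    scores = defaultdict(int)  # stands in for the reduce: sum contributions per pair
    for postings in index.values():
        for pair, partial in map_pairs(postings):
            scores[pair] += partial
    return dict(scores)


docs = {
    "d1": "big data similarity search",
    "d2": "similarity search with mapreduce",
    "d3": "hadoop cluster",
}
print(pairwise_similarity(docs))  # {('d1', 'd2'): 2}
```

In this sketch a document pair is emitted once for every term the two documents share, so the number of intermediate records grows with the shared vocabulary rather than with the number of distinct pairs.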
Acknowledgements
We sincerely thank Faruk Kujundžić of the Scientific Computing, Information Management team at Johannes Kepler University Linz for his kind support with the Alex Cluster.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Phan, T.N., Jäger, M., Nadschläger, S., Küng, J., Dang, T.K. (2015). An Efficient Document Indexing-Based Similarity Search in Large Datasets. In: Dang, T., Wagner, R., Küng, J., Thoai, N., Takizawa, M., Neuhold, E. (eds) Future Data and Security Engineering. FDSE 2015. Lecture Notes in Computer Science, vol 9446. Springer, Cham. https://doi.org/10.1007/978-3-319-26135-5_2
DOI: https://doi.org/10.1007/978-3-319-26135-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26134-8
Online ISBN: 978-3-319-26135-5
eBook Packages: Computer Science, Computer Science (R0)