An Efficient Document Indexing-Based Similarity Search in Large Datasets

  • Conference paper
Future Data and Security Engineering (FDSE 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9446)

Abstract

In this paper, we propose a novel MapReduce-based approach for efficient similarity search in big data. Specifically, we address the drawbacks of using an inverted index for similarity search with MapReduce and propose a simple yet efficient redundancy-free MapReduce scheme, which not only improves on the baseline inverted index-based procedures but also adapts to various similarity measures and types of similarity search. Additionally, we present further strategies that help eliminate unnecessary data and computations. Finally, we evaluate the proposed methods empirically on large real-world datasets using the Hadoop framework on a cluster of commodity machines; the promising results show how beneficial the methods are when dealing with big data.
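
For orientation, the kind of baseline the abstract refers to, pairwise similarity computed from an inverted index with MapReduce, can be sketched roughly as follows. This is an illustrative sketch only, not the scheme proposed in the paper: it simulates two MapReduce jobs in a single Python process rather than on Hadoop, it assumes whitespace tokenization and a dot product of raw term counts as the similarity measure, and every name in it (`run_mapreduce`, `index_mapper`, `pair_mapper`, `pairwise_similarity`, and so on) is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce job in-process: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Job 1: build the inverted index, term -> list of (doc_id, term_count).
def index_mapper(record):
    doc_id, text = record
    counts = defaultdict(int)
    for term in text.lower().split():  # assumed tokenization
        counts[term] += 1
    for term, count in counts.items():
        yield term, (doc_id, count)


def index_reducer(term, postings):
    yield term, sorted(postings)


# Job 2: each pair of documents sharing a term emits a partial score.
def pair_mapper(record):
    _term, postings = record
    for (doc_a, count_a), (doc_b, count_b) in combinations(postings, 2):
        yield (doc_a, doc_b), count_a * count_b


def pair_reducer(pair, partial_scores):
    # Sum the partial scores into the dot product of the two count vectors.
    yield pair, sum(partial_scores)


def pairwise_similarity(docs):
    """Return {(doc_a, doc_b): score} for document pairs sharing a term."""
    index = run_mapreduce(docs.items(), index_mapper, index_reducer)
    return dict(run_mapreduce(index, pair_mapper, pair_reducer))
```

In this sketch a term that occurs in n documents makes `pair_mapper` emit n(n-1)/2 intermediate records, and the same document pair is emitted once for every term the two documents share, so the intermediate data can grow much faster than the input. Calling `pairwise_similarity({"d1": "big data search", "d2": "big data index", "d3": "graph index"})` scores only the pairs `("d1", "d2")` and `("d2", "d3")`, because `d1` and `d3` share no term.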

Acknowledgements

Our sincere thanks go to Faruk Kujundžić, Scientific Computing, Information Management team, Johannes Kepler University Linz, for his kind support with the Alex Cluster.

Author information

Corresponding author

Correspondence to Trong Nhan Phan.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Phan, T.N., Jäger, M., Nadschläger, S., Küng, J., Dang, T.K. (2015). An Efficient Document Indexing-Based Similarity Search in Large Datasets. In: Dang, T., Wagner, R., Küng, J., Thoai, N., Takizawa, M., Neuhold, E. (eds) Future Data and Security Engineering. FDSE 2015. Lecture Notes in Computer Science, vol 9446. Springer, Cham. https://doi.org/10.1007/978-3-319-26135-5_2

  • DOI: https://doi.org/10.1007/978-3-319-26135-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26134-8

  • Online ISBN: 978-3-319-26135-5

  • eBook Packages: Computer Science (R0)
