Balanced Large Scale Knowledge Matching Using LSH Forest

Cochez, Michael; Terziyan, Vagan; Ermolayev, Vadim

doi:10.1007/978-3-319-27932-9_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9398)

Included in the following conference series:

International KEYSTONE Conference on Semantic Keyword-Based Search on Structured Data Sources

Abstract

Evolving Knowledge Ecosystems were recently proposed as an approach to the Big Data challenge, following the hypothesis that knowledge evolves in a way similar to biological systems. The inner workings of the knowledge ecosystem can therefore be modeled on natural evolution. An evolving knowledge ecosystem consists of Knowledge Organisms, which form a representation of the knowledge, and the environment in which they reside. The environment consists of contexts, which are composed of so-called knowledge tokens. These tokens are ontological fragments extracted from information tokens, which in turn originate from the streams of information flowing into the ecosystem. In this article we investigate the use of LSH Forest (a self-tuning indexing scheme based on locality-sensitive hashing) for solving the problem of placing new knowledge tokens in the right contexts of the environment. We argue and show experimentally that LSH Forest possesses the required properties and could be used for large distributed set-ups.
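To make the idea of placing a token in the right context more concrete, the following Python sketch shows one way an LSH Forest style lookup could work. It is an illustration under stated assumptions, not the authors' implementation: knowledge tokens and contexts are reduced to plain sets of strings, MinHash is assumed as the locality-sensitive hash family, the prefix trees are flattened into dictionaries keyed by label prefix, and all names (`LSHForest`, `add`, `query`, `ctx_vehicles`, and so on) are hypothetical.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class LSHForest:
    """Simplified LSH Forest: several trees, each indexing MinHash labels.

    Assumption: each tree is stored as a dict from label prefix to the
    keys sharing that prefix, instead of an explicit prefix tree.
    """

    def __init__(self, num_trees: int = 8, max_depth: int = 6) -> None:
        self.max_depth = max_depth
        # Every tree gets its own sequence of hash seeds.
        self.seeds: List[List[int]] = [
            [t * max_depth + d for d in range(max_depth)]
            for t in range(num_trees)
        ]
        self.tables: List[Dict[Tuple[int, ...], Set[str]]] = [
            {} for _ in range(num_trees)
        ]
        self.sets: Dict[str, Set[str]] = {}

    def _label(self, tree: int, items: Set[str]) -> Tuple[int, ...]:
        # One MinHash value per level of the tree; items must be non-empty.
        return tuple(min(_hash(s, it) for it in items) for s in self.seeds[tree])

    def add(self, key: str, items: Set[str]) -> None:
        """Index a context (a non-empty set of strings) under a key."""
        self.sets[key] = set(items)
        for t, table in enumerate(self.tables):
            label = self._label(t, items)
            for depth in range(1, len(label) + 1):
                table.setdefault(label[:depth], set()).add(key)

    def query(self, items: Set[str], m: int = 3) -> List[str]:
        """Return up to m stored keys that look most similar to items."""
        labels = [self._label(t, items) for t in range(len(self.tables))]
        candidates: Set[str] = set()
        depth = self.max_depth
        # Start from the longest prefixes and shorten them in all trees
        # at once until enough candidates have been collected.
        while depth >= 1 and len(candidates) < m:
            for table, label in zip(self.tables, labels):
                candidates |= table.get(label[:depth], set())
            depth -= 1
        query_set = set(items)
        ranked = sorted(
            candidates,
            key=lambda k: jaccard(query_set, self.sets[k]),
            reverse=True,
        )
        return ranked[:m]


forest = LSHForest()
forest.add("ctx_vehicles", {"car", "engine", "wheel", "road"})
forest.add("ctx_plants", {"tree", "leaf", "root", "soil"})
print(forest.query({"car", "engine", "wheel", "brake"}, m=1))
```

Because prefixes are shortened only as far as needed to collect `m` candidates, no fixed label length has to be chosen in advance, which is the self-tuning property the abstract refers to. Ranking the candidates by exact Jaccard similarity is an extra step chosen for this sketch.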

Acknowledgments

The authors would like to thank the Department of Mathematical Information Technology of the University of Jyväskylä for financially supporting this research. This research is also financed in part by the N4S SHOK organized by Digile Oy and financially supported by TEKES. The authors would further like to thank Steeri Oy for supporting the research and the members of the Industrial Ontologies Group (IOG) of the University of Jyväskylä for their support in the research. Further, the implementation of the software was greatly simplified by the Guava library by Google, the Apache Commons Math™ library, and the Rabin hash library by Bill Dwyer and Ian Brandt.

Author information

Authors and Affiliations

Department of Mathematical Information Technology, University of Jyväskylä, P.O. Box 35 (Agora), 40014, University of Jyväskylä, Finland
Michael Cochez & Vagan Terziyan
Department of IT, Zaporozhye National University, 66, Zhukovskogo Street, Zaporozhye, 69063, Ukraine
Vadim Ermolayev

Corresponding author

Correspondence to Michael Cochez.

Editor information

Editors and Affiliations

University of Coimbra, Coimbra, Portugal
Jorge Cardoso
Huawei European Research Center, Munich, Germany
Jorge Cardoso
Dipartimento di Ingegneria “Enzo Ferrari”, Università di Modena e Reggio Emilia, Modena, Italy
Francesco Guerra
Delft University of Technology, Delft, Zuid-Holland, The Netherlands
Geert-Jan Houben
University of Coimbra, Coimbra, Portugal
Alexandre Miguel Pinto
Università degli Studi di Trento, Trento, Italy
Yannis Velegrakis

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

Cochez, M., Terziyan, V., Ermolayev, V. (2015). Balanced Large Scale Knowledge Matching Using LSH Forest. In: Cardoso, J., Guerra, F., Houben, GJ., Pinto, A.M., Velegrakis, Y. (eds) Semantic Keyword-Based Search on Structured Data Sources. IKC 2015. Lecture Notes in Computer Science, vol 9398. Springer, Cham. https://doi.org/10.1007/978-3-319-27932-9_4

DOI: https://doi.org/10.1007/978-3-319-27932-9_4
Published: 07 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27931-2
Online ISBN: 978-3-319-27932-9
eBook Packages: Computer Science, Computer Science (R0)
