German Encyclopedia Alignment Based on Information Retrieval Techniques

  • Conference paper
Research and Advanced Technology for Digital Libraries (ECDL 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6273)

Abstract

Collaboratively created online encyclopedias have become increasingly popular and, particularly in terms of completeness, have begun to surpass their printed counterparts. Two German publishers of traditional encyclopedias have reacted to this challenge by deciding to merge their corpora into a single, more complete encyclopedia. The crucial step in this merge process is the alignment of articles. We have developed a system that identifies corresponding entries in different encyclopedic corpora. The core of the system is an alignment algorithm that incorporates various techniques developed in the field of information retrieval. We have evaluated the system on four real-world encyclopedias with a ground truth provided by domain experts. A combination of weighting and ranking techniques has been found to deliver satisfactory performance.
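To make the idea of aligning articles through weighting and ranking more concrete, the following is a minimal sketch, not the authors' algorithm. It assumes TF-IDF term weighting and cosine-similarity ranking, which are standard information retrieval techniques; the abstract does not state which weighting or ranking schemes the system actually uses. All function names, the `threshold` parameter, and the sample articles are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters (including German umlauts)."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def build_idf(token_lists):
    """Smoothed inverse document frequency over all articles of both corpora."""
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}


def weight(tokens, idf):
    """Weighting step: L2-normalised TF-IDF vector stored as a sparse dict."""
    tf = Counter(tokens)
    vec = {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two already normalised sparse vectors."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def align(corpus_a, corpus_b, threshold=0.2):
    """Ranking step: map each article id in corpus_a to its best match in corpus_b.

    Articles whose best score falls below the (assumed) threshold map to None.
    """
    tokens_a = {key: tokenize(text) for key, text in corpus_a.items()}
    tokens_b = {key: tokenize(text) for key, text in corpus_b.items()}
    idf = build_idf(list(tokens_a.values()) + list(tokens_b.values()))
    vecs_b = {key: weight(tokens, idf) for key, tokens in tokens_b.items()}

    alignment = {}
    for key_a, tokens in tokens_a.items():
        vec_a = weight(tokens, idf)
        ranked = sorted(
            ((cosine(vec_a, vec_b), key_b) for key_b, vec_b in vecs_b.items()),
            reverse=True,
        )
        best_score, best_key = ranked[0] if ranked else (0.0, None)
        alignment[key_a] = best_key if best_score >= threshold else None
    return alignment


# Made-up sample articles, for illustration only.
corpus_a = {
    "a1": "Berlin ist die Hauptstadt von Deutschland",
    "a2": "Die Donau ist ein Fluss in Europa",
}
corpus_b = {
    "b1": "Donau: Fluss, der durch Europa fließt",
    "b2": "Berlin, Hauptstadt und größte Stadt von Deutschland",
}
print(align(corpus_a, corpus_b))
```

A real system would additionally have to handle articles without any counterpart, several candidates per entry, and the differing article lengths and styles of the corpora; the fixed threshold here merely stands in for such decisions.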

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kern, R., Granitzer, M. (2010). German Encyclopedia Alignment Based on Information Retrieval Techniques. In: Lalmas, M., Jose, J., Rauber, A., Sebastiani, F., Frommholz, I. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2010. Lecture Notes in Computer Science, vol 6273. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15464-5_32

  • DOI: https://doi.org/10.1007/978-3-642-15464-5_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15463-8

  • Online ISBN: 978-3-642-15464-5

  • eBook Packages: Computer Science, Computer Science (R0)
