Abstract
Collaboratively created online encyclopedias have become increasingly popular and, especially in terms of completeness, have begun to surpass their printed counterparts. Two German publishers of traditional encyclopedias have reacted to this challenge by deciding to merge their corpora into a single, more complete encyclopedia. The crucial step in this merge is the alignment of articles. We have developed a system that identifies corresponding entries from different encyclopedic corpora. At its core is an alignment algorithm that incorporates various techniques developed in the field of information retrieval. We evaluated the system on four real-world encyclopedias against a ground truth provided by domain experts. A combination of weighting and ranking techniques was found to deliver satisfactory performance.
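The abstract does not spell out the alignment algorithm, so the following is only a minimal sketch of what aligning entries by a combination of term weighting and ranking could look like. It assumes TF-IDF weighting with cosine similarity, which is a generic information retrieval baseline and not necessarily the scheme used in the paper. All function names (`tokenize`, `build_idf`, `vectorize`, `align`) and the sample entries are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_idf(token_lists):
    """Smoothed inverse document frequency over all entries of both corpora."""
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}


def vectorize(tokens, idf):
    """Length-normalised TF-IDF vector with logarithmic term frequency."""
    tf = Counter(tokens)
    vec = {t: (1.0 + math.log(c)) * idf.get(t, 0.0) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def align(source, target, top_k=1):
    """Rank target entries for every source entry (dicts of id -> text)."""
    src_tokens = {k: tokenize(v) for k, v in source.items()}
    tgt_tokens = {k: tokenize(v) for k, v in target.items()}
    idf = build_idf(list(src_tokens.values()) + list(tgt_tokens.values()))
    tgt_vecs = {k: vectorize(t, idf) for k, t in tgt_tokens.items()}
    result = {}
    for sid, tokens in src_tokens.items():
        query = vectorize(tokens, idf)
        ranked = sorted(
            ((cosine(query, vec), tid) for tid, vec in tgt_vecs.items()),
            reverse=True,
        )
        result[sid] = [(tid, score) for score, tid in ranked[:top_k]]
    return result


# Hypothetical sample entries, invented for illustration only.
enc_a = {
    "a1": "Berlin Hauptstadt von Deutschland an der Spree",
    "a2": "Mozart Komponist aus Salzburg",
}
enc_b = {
    "b1": "Wolfgang Amadeus Mozart österreichischer Komponist geboren in Salzburg",
    "b2": "Berlin ist die Hauptstadt Deutschlands und liegt an der Spree",
}

for source_id, candidates in align(enc_a, enc_b).items():
    print(source_id, candidates)
```

In a sketch like this, the ranked candidate list for each source entry would still need a decision rule, for example a score threshold or a check against expert-provided ground truth, before an alignment is accepted.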
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kern, R., Granitzer, M. (2010). German Encyclopedia Alignment Based on Information Retrieval Techniques. In: Lalmas, M., Jose, J., Rauber, A., Sebastiani, F., Frommholz, I. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2010. Lecture Notes in Computer Science, vol 6273. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15464-5_32
DOI: https://doi.org/10.1007/978-3-642-15464-5_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15463-8
Online ISBN: 978-3-642-15464-5
eBook Packages: Computer Science (R0)