Abstract
In this paper we apply a data-compression IR method to the GIRT social science database, focusing on the monolingual task in German and English. For this purpose we use a recently proposed general scheme for context recognition and context classification of strings of characters (in particular texts) or other coded information. The key point of the method is the computation of a suitable measure of remoteness (or similarity) between two strings of characters. This measure reflects the distance between the structures present in the two strings, i.e. between the distributions of elements in the two compared sequences. Our hypothesis is that this information-theoretic measure of remoteness between two sequences can reflect their semantic distance. It is worth stressing the generality and versatility of the method, which applies to any corpus of character strings, whatever the coding (for instance, the language) used.
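To make the idea of a compression-based remoteness measure concrete, the following is a minimal sketch under stated assumptions. The abstract does not define the measure used in the paper, so the sketch substitutes the normalized compression distance computed with Python's standard `zlib` compressor. The function names `compressed_size`, `remoteness`, and `rank_documents` are hypothetical and chosen only for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of `data`."""
    return len(zlib.compress(data, 9))


def remoteness(a: str, b: str) -> float:
    """Normalized compression distance between two strings.

    Assumption: this is a stand-in for the paper's measure, which the
    abstract does not define. Values near 0 indicate that the two strings
    share structure; larger values indicate that they share little.
    """
    xa, xb = a.encode("utf-8"), b.encode("utf-8")
    ca, cb = compressed_size(xa), compressed_size(xb)
    # Compressing the concatenation lets the compressor reuse patterns
    # learned from the first string while encoding the second.
    cab = compressed_size(xa + xb)
    return (cab - min(ca, cb)) / max(ca, cb)


def rank_documents(query: str, documents: list[str]) -> list[int]:
    """Indices of `documents`, ordered from least to most remote from `query`."""
    return sorted(
        range(len(documents)),
        key=lambda i: remoteness(query, documents[i]),
    )
```

In a retrieval setting, a topic would play the role of `query` and the collection records the role of `documents`. Nothing in the sketch depends on the language of the text, which is the sense in which such a measure is agnostic. Note that `zlib` uses a 32 KB sliding window, so this particular compressor is only suitable for short strings.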
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alderuccio, D., Bordoni, L., Loreto, V. (2004). A Data-Compression Approach to the Monolingual GIRT Task: An Agnostic Point of View. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds) Comparative Evaluation of Multilingual Information Access Systems. CLEF 2003. Lecture Notes in Computer Science, vol 3237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30222-3_38
DOI: https://doi.org/10.1007/978-3-540-30222-3_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24017-4
Online ISBN: 978-3-540-30222-3
eBook Packages: Springer Book Archive