Abstract
In this paper we take a step towards understanding compression distances by analyzing the relevance of contextual information in compression-based text clustering. To do so, we explore two kinds of word removal: one that preserves part of the contextual information despite the removal, and one that does not. We show that removing words in a way that preserves contextual information improves the accuracy of compression-based text clustering, whereas removing words in a way that loses that contextual information makes the clustering results worse.
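The contrast between the two kinds of word removal can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not state: the compression distance is taken to be the normalized compression distance (NCD), zlib stands in for whatever compressor the paper uses, and "maintaining contextual information" is modelled by replacing each removed word with asterisks of the same length so that word positions and lengths survive. The names `ncd` and `remove_words` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    # Length of the zlib-compressed data; zlib is an assumed stand-in compressor.
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance:
    # (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def remove_words(text: str, words: set, keep_context: bool) -> str:
    # Remove the given words from a whitespace-tokenized text.
    # keep_context=True  : assumed variant that masks each removed word with
    #                      asterisks, so its position and length are preserved.
    # keep_context=False : the removed word is deleted outright.
    out = []
    for token in text.split():
        if token.lower() in words:
            if keep_context:
                out.append("*" * len(token))
        else:
            out.append(token)
    return " ".join(out)


sentence = "the cat sat on the mat"
masked = remove_words(sentence, {"the", "on"}, keep_context=True)
deleted = remove_words(sentence, {"the", "on"}, keep_context=False)
print(masked)
print(deleted)
```

In an experiment along these lines, each document would be distorted with one of the two variants, pairwise `ncd` values would be computed over the distorted documents, and the resulting distance matrix would be passed to a clustering method.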
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Granados, A., Martínez, R., Camacho, D., de Borja Rodríguez, F. (2010). Relevance of Contextual Information in Compression-Based Text Clustering. In: Fyfe, C., Tino, P., Charles, D., Garcia-Osorio, C., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2010. IDEAL 2010. Lecture Notes in Computer Science, vol 6283. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15381-5_32
DOI: https://doi.org/10.1007/978-3-642-15381-5_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15380-8
Online ISBN: 978-3-642-15381-5
eBook Packages: Computer Science, Computer Science (R0)