Relevance of Contextual Information in Compression-Based Text Clustering

Granados, Ana; Martínez, Rafael; Camacho, David; de Borja Rodríguez, Francisco

doi:10.1007/978-3-642-15381-5_32

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6283)

Abstract

In this paper we take a step towards understanding compression distances by analyzing the relevance of contextual information in compression-based text clustering. To do so, we explore two kinds of word removal: one that preserves part of the contextual information after the words are removed, and one that does not. We show that removing words in a way that preserves contextual information helps compression-based text clustering and improves its accuracy, whereas removing words in a way that loses that information makes the clustering results worse.
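The two kinds of word removal can be illustrated with a minimal Python sketch. It rests on several assumptions not stated in the abstract: the compression distance is taken to be the normalized compression distance (NCD), the compressor is Python's standard `lzma` module, context is preserved by overwriting each removed word with asterisks of the same length, and the set of words to remove is an arbitrary hypothetical list. The function names `ncd` and `remove_words` are illustrative only and are not the authors' implementation.

```python
import lzma
import re


def compressed_size(data: bytes) -> int:
    """Length in bytes of the LZMA-compressed input (assumed compressor)."""
    return len(lzma.compress(data))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def remove_words(text: str, words, keep_context: bool) -> str:
    """Remove the given words from the text.

    keep_context=True  : overwrite each removed word with asterisks, so word
                         lengths and positions survive (assumed scheme).
    keep_context=False : delete the words and collapse the leftover spaces,
                         so no trace of where they stood remains.
    """
    targets = {w.lower() for w in words}

    def substitute(match):
        word = match.group(0)
        if word.lower() not in targets:
            return word
        return "*" * len(word) if keep_context else ""

    result = re.sub(r"[A-Za-z]+", substitute, text)
    if not keep_context:
        result = re.sub(r" {2,}", " ", result).strip()
    return result


if __name__ == "__main__":
    sentence = "the cat sat on the mat"
    hypothetical_removals = ["the", "on"]  # illustrative word list only
    print(remove_words(sentence, hypothetical_removals, keep_context=True))
    print(remove_words(sentence, hypothetical_removals, keep_context=False))
```

In an experiment along these lines, each document would be distorted with one of the two schemes, pairwise `ncd` values would be collected into a distance matrix, and that matrix would be passed to a clustering algorithm; the clustering step is left out of this sketch.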

Author information

Authors and Affiliations

Escuela Politécnica Superior, Universidad Autónoma de Madrid, Spain
Ana Granados, Rafael Martínez, David Camacho & Francisco de Borja Rodríguez

Editor information

Editors and Affiliations

School of Computing, University of the West of Scotland, PA1 2BE, Paisley, UK
Colin Fyfe
University of Birmingham, B15 2TT, Birmingham, UK
Peter Tino
University of Ulster, Coleraine, UK
Darryl Charles
Universidad de Burgos, Burgos, Spain
Cesar Garcia-Osorio
School of Electrical and Electronic Engineering, University of Manchester, Sackville Street Building, Sackville Street, M60 1QD, Manchester, UK
Hujun Yin

About this paper

Cite this paper

Granados, A., Martínez, R., Camacho, D., de Borja Rodríguez, F. (2010). Relevance of Contextual Information in Compression-Based Text Clustering. In: Fyfe, C., Tino, P., Charles, D., Garcia-Osorio, C., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2010. IDEAL 2010. Lecture Notes in Computer Science, vol 6283. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15381-5_32

DOI: https://doi.org/10.1007/978-3-642-15381-5_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15380-8
Online ISBN: 978-3-642-15381-5
eBook Packages: Computer Science, Computer Science (R0)
