Relevance of Contextual Information in Compression-Based Text Clustering

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6283)

Abstract

In this paper we take a step towards understanding compression distances by analyzing the relevance of contextual information in compression-based text clustering. To do so, we explore two kinds of word removal: one that preserves part of the contextual information despite the removal, and one that does not. We show that removing words in a way that preserves the contextual information helps compression-based text clustering and improves its accuracy, whereas removing words in a way that loses that contextual information makes the clustering results worse.
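The two kinds of word removal and the compression distance can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not confirm: the distance is taken to be the normalized compression distance from the clustering-by-compression literature, Python's standard `lzma` module stands in for the compressor, context-preserving removal is modeled as overwriting each removed word with asterisks of the same length, and context-losing removal as deleting the word outright. The function names and the example word set are hypothetical.

```python
import lzma
import re

# Hypothetical tokenizer: a "word" is a run of ASCII letters.
WORD = re.compile(r"[A-Za-z]+")


def compressed_size(data: bytes) -> int:
    """Length in bytes of the LZMA-compressed input (assumed compressor)."""
    return len(lzma.compress(data))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def remove_keeping_context(text: str, words: set) -> str:
    """Assumed context-preserving removal: overwrite each removed word
    with asterisks, so word lengths and positions are still visible."""
    def mask(match):
        token = match.group()
        return "*" * len(token) if token.lower() in words else token
    return WORD.sub(mask, text)


def remove_losing_context(text: str, words: set) -> str:
    """Assumed context-losing removal: delete each removed word and
    collapse the whitespace left behind."""
    def drop(match):
        token = match.group()
        return "" if token.lower() in words else token
    return re.sub(r"\s+", " ", WORD.sub(drop, text)).strip()


if __name__ == "__main__":
    removed = {"the", "on"}  # hypothetical set of words to remove
    a = "The cat sat on the mat"
    b = "The dog sat on the rug"
    print(remove_keeping_context(a, removed))
    print(remove_losing_context(a, removed))
    print(ncd(remove_keeping_context(a, removed),
              remove_keeping_context(b, removed)))
```

Under these assumptions, a clustering experiment would compute `ncd` for every pair of distorted documents and pass the resulting distance matrix to a hierarchical clustering method. On texts as short as the ones above the compressor's fixed overhead dominates, so the distances only become meaningful on document-sized inputs.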

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Granados, A., Martínez, R., Camacho, D., de Borja Rodríguez, F. (2010). Relevance of Contextual Information in Compression-Based Text Clustering. In: Fyfe, C., Tino, P., Charles, D., Garcia-Osorio, C., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2010. IDEAL 2010. Lecture Notes in Computer Science, vol 6283. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15381-5_32

  • DOI: https://doi.org/10.1007/978-3-642-15381-5_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15380-8

  • Online ISBN: 978-3-642-15381-5

  • eBook Packages: Computer Science, Computer Science (R0)
