Topic Identification in Dynamical Text by Complexity Pursuit

Neural Processing Letters

Abstract

The problem of analysing dynamically evolving textual data has arisen within the last few years. An example of such data is the discussion appearing in Internet chat lines. In this Letter, a recently introduced source separation method, termed complexity pursuit, is applied to the problem of finding topics in dynamical text and is compared against several blind separation algorithms on the same problem. Complexity pursuit is a generalisation of projection pursuit to time series, and it is able to use both higher-order statistical measures and temporal dependency information in separating the topics. Experimental results on chat line and newsgroup data demonstrate that the minimum-complexity time series do indeed correspond to meaningful topics inherent in the dynamical text data, and also suggest that the method is applicable to query-based retrieval from a temporally changing text stream.

About this article

Cite this article

Bingham, E., Kabán, A. & Girolami, M. Topic Identification in Dynamical Text by Complexity Pursuit. Neural Processing Letters 17, 69–83 (2003). https://doi.org/10.1023/A:1022990829563

  • DOI: https://doi.org/10.1023/A:1022990829563