Finding structure in noisy text: topic classification and unsupervised clustering

  • Original Paper
International Journal of Document Analysis and Recognition (IJDAR)

Abstract

This paper addresses two types of classification of noisy, unstructured text such as newsgroup messages: (1) spotting messages containing topics of interest, and (2) automatic conceptual organization of messages without prior knowledge of topics of interest. In addition to applying our hidden Markov model methodology to spotting topics of interest in newsgroup messages, we present a robust methodology for rejecting messages which are off-topic. We describe a novel approach for automatically organizing a large, unstructured collection of messages. The approach applies an unsupervised topic clustering procedure to generate a hierarchical tree of topics.
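
To make the idea of topic spotting with off-topic rejection concrete, the following is a minimal sketch and not the paper's HMM-based method. It assumes smoothed unigram topic models, a pooled background model, and a fixed threshold on the average per-word log-likelihood ratio. The names `train_unigram`, `spot_topic`, the `threshold` value, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def train_unigram(docs, vocab, alpha=1.0):
    """Estimate an add-alpha smoothed unigram distribution over vocab."""
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def spot_topic(message, topic_models, background, threshold=0.0):
    """Return the best-scoring topic, or None if the message is rejected.

    The score is the average per-word log-likelihood ratio between a topic
    model and the background model (an assumed scoring rule, chosen for
    illustration). Messages whose best score does not exceed the threshold
    are treated as off-topic.
    """
    words = [w for w in message.split() if w in background]
    if not words:
        return None  # nothing in the message is covered by the models
    best_topic, best_score = None, -math.inf
    for topic, model in topic_models.items():
        score = sum(math.log(model[w] / background[w]) for w in words) / len(words)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic if best_score > threshold else None


# Toy training data (hypothetical, for illustration only).
training = {
    "space": ["rocket orbit launch", "orbit moon rocket"],
    "autos": ["engine car brake", "car wheel engine"],
}
all_docs = [d for docs in training.values() for d in docs]
vocab = {w for d in all_docs for w in d.split()}
background = train_unigram(all_docs, vocab)
topic_models = {t: train_unigram(docs, vocab) for t, docs in training.items()}

print(spot_topic("rocket orbit", topic_models, background))
print(spot_topic("rocket car", topic_models, background))
```

In this sketch a message that mixes vocabulary from both topics scores below the threshold for every topic and is rejected. In practice the threshold would be tuned on held-out data rather than fixed at zero.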

Abbreviations

HMM: Hidden Markov Model
TFIDF: Term frequency inverse document frequency
SVM: Support vector machines
UTD: Unsupervised topic discovery
ROC: Receiver operating characteristics

Author information

Correspondence to Prem Natarajan.

About this article

Cite this article

Natarajan, P., Prasad, R., Subramanian, K. et al. Finding structure in noisy text: topic classification and unsupervised clustering. IJDAR 10, 187–198 (2007). https://doi.org/10.1007/s10032-007-0057-x

  • DOI: https://doi.org/10.1007/s10032-007-0057-x
