Finding structure in noisy text: topic classification and unsupervised clustering

Natarajan, Prem; Prasad, Rohit; Subramanian, Krishna; Saleem, Shirin; Choi, Fred; Schwartz, Rich

doi:10.1007/s10032-007-0057-x

Original Paper
Published: 05 December 2007

Volume 10, pages 187–198, (2007)

International Journal of Document Analysis and Recognition (IJDAR)

Abstract

This paper addresses two types of classification of noisy, unstructured text such as newsgroup messages: (1) spotting messages containing topics of interest, and (2) automatic conceptual organization of messages without prior knowledge of topics of interest. In addition to applying our hidden Markov model methodology to spotting topics of interest in newsgroup messages, we present a robust methodology for rejecting messages which are off-topic. We describe a novel approach for automatically organizing a large, unstructured collection of messages. The approach applies an unsupervised topic clustering procedure to generate a hierarchical tree of topics.
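
The idea of spotting on-topic messages while rejecting off-topic ones can be illustrated with a small sketch. This is not the paper's hidden Markov model: it swaps in a much simpler smoothed unigram model per topic, scored against a background model of all text, and rejects a message when its best average log-likelihood ratio falls at or below a threshold. The function names, the toy training data, and the threshold of 0.0 are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_unigram(docs, vocab, alpha=1.0):
    """Return add-alpha smoothed log-probabilities for every word in vocab."""
    counts = Counter(w for d in docs for w in d.split() if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / total) for w in vocab}


def spot_topic(message, topic_models, background, threshold=0.0):
    """Return (topic, score), or (None, score) when the message is rejected.

    The score is the per-word average log-likelihood ratio between the
    best topic model and the background model (hypothetical scoring rule).
    """
    words = [w for w in message.split() if w in background]
    if not words:
        return None, float("-inf")
    scores = {
        topic: sum(model[w] - background[w] for w in words) / len(words)
        for topic, model in topic_models.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > threshold else None), scores[best]


# Toy data, invented for illustration only.
training = {
    "space": ["orbit launch rocket shuttle", "rocket orbit moon"],
    "hockey": ["goal puck ice team", "team puck playoff goal"],
}
other = ["please send email about the sale", "email me the price"]

all_docs = [d for docs in training.values() for d in docs] + other
vocab = {w for d in all_docs for w in d.split()}
background = train_unigram(all_docs, vocab)
topic_models = {t: train_unigram(docs, vocab) for t, docs in training.items()}

on_topic = spot_topic("rocket launch to orbit", topic_models, background)
off_topic = spot_topic("email me the sale price", topic_models, background)
```

In this sketch the first message is assigned to the "space" topic and the second is rejected, because none of its words are more likely under a topic model than under the background. In practice the threshold would need to be tuned on held-out data rather than fixed at zero.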

Abbreviations

HMM: Hidden Markov model
TFIDF: Term frequency-inverse document frequency
SVM: Support vector machines
UTD: Unsupervised topic discovery
ROC: Receiver operating characteristics

Author information

Authors and Affiliations

BBN Technologies, 10 Moulton Street, Cambridge, MA, 02138, USA
Prem Natarajan, Rohit Prasad, Krishna Subramanian, Shirin Saleem, Fred Choi & Rich Schwartz

Corresponding author

Correspondence to Prem Natarajan.

About this article

Cite this article

Natarajan, P., Prasad, R., Subramanian, K. et al. Finding structure in noisy text: topic classification and unsupervised clustering. IJDAR 10, 187–198 (2007). https://doi.org/10.1007/s10032-007-0057-x

Received: 20 March 2007
Revised: 21 October 2007
Accepted: 28 October 2007
Published: 05 December 2007
Issue Date: December 2007
DOI: https://doi.org/10.1007/s10032-007-0057-x
