Abstract
This paper addresses two types of classification of noisy, unstructured text such as newsgroup messages: (1) spotting messages that contain topics of interest, and (2) automatically organizing messages by concept without prior knowledge of the topics of interest. In addition to applying our hidden Markov model (HMM) methodology to spotting topics of interest in newsgroup messages, we present a robust methodology for rejecting messages that are off-topic. We also describe a novel approach for automatically organizing a large, unstructured collection of messages. The approach applies an unsupervised topic clustering procedure to generate a hierarchical tree of topics.
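To make the idea of topic spotting with off-topic rejection concrete, the following is a minimal sketch and not the paper's HMM methodology. It assumes a much simpler stand-in: smoothed unigram topic models scored by a per-word log-likelihood ratio against a background model, with a message rejected when its best score falls below a threshold. The function names (`unigram_model`, `spot_topic`), the threshold value, and the toy training data are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def unigram_model(docs, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(w for d in docs for w in d.split() if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def spot_topic(message, topic_models, background, threshold=0.0):
    """Return (topic, score), or (None, score) if the message is rejected.

    The score is the average per-word log-likelihood ratio between a
    topic model and the background model (an assumed scoring rule).
    """
    words = [w for w in message.split() if w in background]
    if not words:
        return None, float("-inf")
    scores = {}
    for topic, model in topic_models.items():
        llr = sum(math.log(model[w] / background[w]) for w in words)
        scores[topic] = llr / len(words)
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]  # off-topic: no topic beats the background
    return best, scores[best]

# Toy training data, invented for illustration only.
training = {
    "space": ["nasa launch orbit shuttle", "orbit moon launch"],
    "autos": ["car engine brake", "engine car dealer"],
}
all_docs = [d for docs in training.values() for d in docs]
vocab = {w for d in all_docs for w in d.split()}
background = unigram_model(all_docs, vocab)
topic_models = {t: unigram_model(docs, vocab) for t, docs in training.items()}

on_topic = spot_topic("orbit launch", topic_models, background)
mixed = spot_topic("brake orbit", topic_models, background)
unknown = spot_topic("recipe for soup", topic_models, background)
```

In this sketch the threshold is a single fixed value; in practice it would need to be tuned, for example by examining the trade-off between false rejections and false acceptances on held-out data.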
Abbreviations
- HMM: Hidden Markov Model
- TFIDF: Term frequency inverse document frequency
- SVM: Support vector machines
- UTD: Unsupervised topic discovery
- ROC: Receiver operating characteristics
Cite this article
Natarajan, P., Prasad, R., Subramanian, K. et al. Finding structure in noisy text: topic classification and unsupervised clustering. IJDAR 10, 187–198 (2007). https://doi.org/10.1007/s10032-007-0057-x
DOI: https://doi.org/10.1007/s10032-007-0057-x