A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections

Vinokourov, Alexei; Girolami, Mark

doi:10.1023/A:1013677411002

Published: March 2002

Volume 18, pages 153–172, (2002)

Journal of Intelligent Information Systems

Abstract

This paper presents a probabilistic mixture modeling framework for the hierarchic organisation of document collections. It is demonstrated that the probabilistic corpus model which emerges from the automatic or unsupervised hierarchical organisation of a document collection can be further exploited to create a kernel which boosts the performance of state-of-the-art support vector machine document classifiers. It is shown that the performance of such a classifier is further enhanced when employing the kernel derived from an appropriate hierarchic mixture model used for partitioning a document corpus rather than the kernel associated with a flat non-hierarchic mixture model. This has important implications for document classification when a hierarchic ordering of topics exists. This can be considered as the effective combination of documents with no topic or class labels (unlabeled data), labeled documents, and prior domain knowledge (in the form of the known hierarchic structure), in providing enhanced document classification performance.
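As a rough illustration of how a mixture model fitted without labels can yield a kernel for a classifier, the sketch below fits a flat (non-hierarchic) multinomial mixture to term counts with EM and builds a Gram matrix from the posterior component memberships. This is a minimal sketch under stated assumptions, not the paper's method: the abstract does not specify the form of the kernel or of the hierarchic model, so the kernel used here (an inner product of posterior membership vectors), the function names `fit_multinomial_mixture` and `posterior_kernel`, the smoothing constant `alpha`, and the toy count matrix are all hypothetical choices made for illustration.

```python
import math
import random


def fit_multinomial_mixture(counts, n_components, n_iter=50, seed=0, alpha=1e-2):
    """EM for a flat mixture of multinomials over term-count vectors.

    Returns the responsibilities P(k | d), one row per document.
    Illustrative only: a flat model, not the hierarchic one in the paper.
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])

    # Random soft initialisation of the responsibilities.
    resp = []
    for _ in range(n_docs):
        row = [rng.random() + 1e-3 for _ in range(n_components)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: mixing weights and smoothed per-component term distributions.
        weights = [sum(resp[d][k] for d in range(n_docs)) / n_docs
                   for k in range(n_components)]
        term_probs = []
        for k in range(n_components):
            raw = [alpha + sum(resp[d][k] * counts[d][t] for d in range(n_docs))
                   for t in range(n_terms)]
            total = sum(raw)
            term_probs.append([r / total for r in raw])

        # E-step: posterior over components, computed in log space for stability.
        for d in range(n_docs):
            logp = [math.log(weights[k] + 1e-12)
                    + sum(counts[d][t] * math.log(term_probs[k][t])
                          for t in range(n_terms))
                    for k in range(n_components)]
            peak = max(logp)
            unnorm = [math.exp(v - peak) for v in logp]
            total = sum(unnorm)
            resp[d] = [u / total for u in unnorm]
    return resp


def posterior_kernel(resp):
    """Gram matrix K[i][j] = sum_k P(k | d_i) * P(k | d_j).

    An assumed, illustrative kernel: it is an inner product of feature
    vectors, hence symmetric and positive semi-definite.
    """
    n = len(resp)
    return [[sum(a * b for a, b in zip(resp[i], resp[j])) for j in range(n)]
            for i in range(n)]


# Hypothetical toy corpus: rows are documents, columns are term counts.
docs = [[5, 4, 0, 0], [4, 6, 1, 0], [0, 1, 5, 6], [0, 0, 4, 5]]
resp = fit_multinomial_mixture(docs, n_components=2)
K = posterior_kernel(resp)
```

The mixture is fitted without any class labels, so unlabeled documents can contribute to the kernel. The resulting matrix `K` could then be handed to any support vector machine implementation that accepts a precomputed Gram matrix, which is where the labeled documents would enter.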

Author information

Authors and Affiliations

Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK
Alexei Vinokourov
School of Communication and Information Technologies, University of Paisley, High Street, Paisley, PA1 2BE, UK
Mark Girolami

About this article

Cite this article

Vinokourov, A., Girolami, M. A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections. Journal of Intelligent Information Systems 18, 153–172 (2002). https://doi.org/10.1023/A:1013677411002

Issue Date: March 2002
DOI: https://doi.org/10.1023/A:1013677411002
