Abstract
This paper presents a probabilistic mixture modeling framework for the hierarchic organisation of document collections. The probabilistic corpus model that emerges from the unsupervised hierarchical organisation of a document collection can be further exploited to create a kernel that boosts the performance of state-of-the-art support vector machine document classifiers. The performance of such a classifier improves further when the kernel is derived from an appropriate hierarchic mixture model of the document corpus rather than from a flat, non-hierarchic mixture model. This has important implications for document classification when a hierarchic ordering of topics exists. The approach can be viewed as an effective combination of unlabeled documents (those with no topic or class labels), labeled documents, and prior domain knowledge (in the form of the known hierarchic structure) to enhance document classification performance.
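To make the idea of a kernel derived from a mixture model concrete, the following is a minimal sketch and not the paper's method. The abstract does not specify the form of the kernel, so the sketch assumes a flat (non-hierarchic) mixture of multinomials fitted by EM and a simple kernel given by the inner product of the documents' component posteriors. The function names, the smoothing constants, and the toy term-count matrix are all hypothetical.

```python
import numpy as np


def fit_multinomial_mixture(counts, n_components, n_iter=50, seed=0):
    """Fit a flat mixture of multinomials by EM (illustrative assumption).

    counts: (n_docs, n_terms) array of term counts.
    Returns (priors, term_probs, posteriors); the posteriors come from
    the final E-step.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = counts.shape
    priors = np.full(n_components, 1.0 / n_components)
    # Random initial term distributions, one row per component.
    term_probs = rng.dirichlet(np.ones(n_terms), size=n_components)
    for _ in range(n_iter):
        # E-step: log P(k) + sum_t n(d, t) log P(t | k), up to a constant.
        log_joint = np.log(priors) + counts @ np.log(term_probs).T
        log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_joint)
        post /= post.sum(axis=1, keepdims=True)
        # M-step with small additive smoothing (hypothetical constants).
        priors = post.sum(axis=0) + 1e-3
        priors /= priors.sum()
        term_probs = post.T @ counts + 1e-2
        term_probs /= term_probs.sum(axis=1, keepdims=True)
    return priors, term_probs, post


def posterior_kernel(post):
    """Kernel K(d, d') = sum_k P(k | d) P(k | d'), an assumed stand-in."""
    return post @ post.T


# Toy corpus: six documents over four terms (made-up counts).
counts = np.array(
    [[5, 4, 0, 0], [6, 3, 1, 0], [4, 5, 0, 1],
     [0, 1, 5, 4], [1, 0, 4, 6], [0, 0, 6, 5]],
    dtype=float,
)
priors, term_probs, post = fit_multinomial_mixture(counts, n_components=2)
K = posterior_kernel(post)
```

Because `K` is a Gram matrix of posterior vectors, it is symmetric and positive semi-definite, so it could be passed to any support vector machine implementation that accepts a precomputed kernel. A hierarchic variant would replace the flat mixture with a tree-structured one; that step is not shown here.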
Cite this article
Vinokourov, A., Girolami, M. A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections. Journal of Intelligent Information Systems 18, 153–172 (2002). https://doi.org/10.1023/A:1013677411002