Abstract
Interest in automated text categorization has grown rapidly with the exponential growth of information and the ever-increasing need to organize it. A hierarchical structure over the categories captures the dependence relationships among them and provides a valuable source of information for categorization. Although considerable research has been conducted on hierarchical document categorization, little has been done on the automatic generation of topic hierarchies. In this paper, we propose using linear discriminant projection to generate meaningful intermediate levels of hierarchy in large flat sets of classes. The approach first projects all documents onto a low-dimensional space and then clusters the categories into hierarchies accordingly. We also investigate the effect of using the generated hierarchical structure for text classification. Our experiments show that the generated hierarchies improve classification performance in most cases.
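The two-step idea in the abstract (project documents with a linear discriminant, then cluster the categories) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a classic ridge-regularized Fisher discriminant and a simple centroid-merging agglomerative clustering, and the function names `discriminant_projection` and `build_hierarchy`, the `ridge` parameter, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np

def discriminant_projection(X, y, n_components, ridge=1e-3):
    """Return a (features x n_components) Fisher-style projection matrix.

    Assumption: plain Fisher discriminant with a ridge term on the
    within-class scatter, not necessarily the variant used in the paper.
    """
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        centered = Xc - mc
        Sw += centered.T @ centered
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Whiten with Sw^{-1/2}, then take the top eigenvectors of the
    # whitened between-class scatter (a symmetric eigenproblem).
    evals_w, evecs_w = np.linalg.eigh(Sw + ridge * np.eye(d))
    whiten = evecs_w / np.sqrt(evals_w)
    evals, evecs = np.linalg.eigh(whiten.T @ Sb @ whiten)
    top = evecs[:, np.argsort(evals)[::-1][:n_components]]
    return whiten @ top

def build_hierarchy(centroids, labels):
    """Repeatedly merge the two closest clusters; return nested tuples."""
    nodes = [(lab, np.asarray(c, dtype=float), 1)
             for lab, c in zip(labels, centroids)]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                dist = np.linalg.norm(nodes[i][1] - nodes[j][1])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        (ti, ci, ni), (tj, cj, nj) = nodes[i], nodes[j]
        # The merged cluster's centroid is the size-weighted mean.
        merged = ((ti, tj), (ni * ci + nj * cj) / (ni + nj), ni + nj)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]

# Toy data (hypothetical): four flat classes in 5 dimensions, where
# A and B lie close together and C and D lie close together.
rng = np.random.default_rng(0)
class_means = {"A": [0, 0], "B": [1, 0], "C": [10, 10], "D": [11, 10]}
X_parts, y_parts = [], []
for name, m in class_means.items():
    center = np.zeros(5)
    center[:2] = m
    X_parts.append(center + 0.1 * rng.standard_normal((30, 5)))
    y_parts += [name] * 30
X, y = np.vstack(X_parts), np.array(y_parts)

G = discriminant_projection(X, y, n_components=2)
classes = [str(c) for c in np.unique(y)]
centroids = [(X[y == c] @ G).mean(axis=0) for c in classes]
tree = build_hierarchy(centroids, classes)
print(tree)
```

On this toy input the nested tuple in `tree` should pair A with B and C with D, giving a two-level hierarchy over the originally flat label set. Such a tree could then drive a top-down classifier, with one classifier trained at each internal node.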
Cite this article
Li, T., Zhu, S. & Ogihara, M. Hierarchical document classification using automatically generated hierarchy. J Intell Inf Syst 29, 211–230 (2007). https://doi.org/10.1007/s10844-006-0019-7