Hierarchical document classification using automatically generated hierarchy

Li, Tao; Zhu, Shenghuo; Ogihara, Mitsunori

doi:10.1007/s10844-006-0019-7

Published: 01 February 2007

Volume 29, pages 211–230, (2007)

Journal of Intelligent Information Systems

Tao Li¹,
Shenghuo Zhu² &
Mitsunori Ogihara³

Abstract

Automated text categorization has attracted growing interest with the exponential growth of information and the ever-increasing needs for organizations. The underlying hierarchical structure identifies the dependence relationships between categories and provides a valuable source of information for categorization. Although considerable research has been conducted on hierarchical document categorization, little has been done on the automatic generation of topic hierarchies. In this paper, we propose using linear discriminant projection to generate more meaningful intermediate levels of hierarchy in large flat sets of classes. The linear discriminant projection approach first transforms all documents onto a low-dimensional space and then clusters the categories into hierarchies accordingly. The paper also investigates the effect of using the generated hierarchical structure for text classification. Our experiments show that generated hierarchies improve classification performance in most cases.
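
The second stage described in the abstract, clustering a flat set of categories into a hierarchy, can be illustrated with a minimal sketch. It assumes the documents have already been projected onto a low-dimensional space (the projection step is omitted here), and it uses greedy centroid-linkage merging, which is an illustrative choice and not necessarily the clustering criterion used in the paper. The function names and the toy coordinates are hypothetical.

```python
from itertools import combinations
from math import dist


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def build_hierarchy(projected_docs):
    """Greedily merge the two closest categories until one root remains.

    projected_docs maps a category label to its documents' coordinates in the
    already-projected low-dimensional space (the projection is assumed done).
    Returns a nested tuple of labels, e.g. (("a", "b"), "c").
    """
    # Each cluster is (tree, member_vectors); leaves are the flat categories.
    clusters = [(label, list(vecs)) for label, vecs in projected_docs.items()]
    while len(clusters) > 1:
        # Pick the pair whose centroids are closest (centroid linkage,
        # an assumption made for illustration only).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(
                centroid(clusters[p[0]][1]), centroid(clusters[p[1]][1])
            ),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Toy 2-D coordinates standing in for projected documents (made-up values).
toy = {
    "hockey": [(0.0, 0.1), (0.2, 0.0)],
    "baseball": [(0.3, 0.2), (0.1, 0.3)],
    "graphics": [(5.0, 5.1), (5.2, 4.9)],
    "hardware": [(4.8, 5.3), (5.1, 5.2)],
}
hierarchy = build_hierarchy(toy)
print(hierarchy)  # (('hockey', 'baseball'), ('graphics', 'hardware'))
```

Each inner tuple corresponds to a generated intermediate node. A hierarchical classifier would then be trained per node, first choosing between the intermediate groups and then among the original categories within the chosen group.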

Author information

Authors and Affiliations

School of Computer Science, Florida International University, Miami, FL, 33199, USA
Tao Li
NEC Labs America, Inc., Cupertino, CA, 95014, USA
Shenghuo Zhu
Department of Computer Science, University of Rochester, Rochester, NY, 14627-0226, USA
Mitsunori Ogihara

Corresponding author

Correspondence to Tao Li.

About this article

Cite this article

Li, T., Zhu, S. & Ogihara, M. Hierarchical document classification using automatically generated hierarchy. J Intell Inf Syst 29, 211–230 (2007). https://doi.org/10.1007/s10844-006-0019-7

Received: 28 March 2005
Revised: 12 August 2005
Accepted: 06 December 2005
Published: 01 February 2007
Issue Date: October 2007
DOI: https://doi.org/10.1007/s10844-006-0019-7
