
Hierarchical document classification using automatically generated hierarchy

Journal of Intelligent Information Systems

Abstract

Automated text categorization has attracted rapidly growing interest with the exponential growth of information and the ever-increasing needs of organizations. The hierarchical structure underlying a set of categories captures the dependence relationships among them and is a valuable source of information for categorization. Although considerable research has been conducted on hierarchical document categorization, little has been done on the automatic generation of topic hierarchies. In this paper, we propose using linear discriminant projection to generate more meaningful intermediate levels of hierarchy in large flat sets of classes. The linear discriminant projection approach first maps all documents onto a low-dimensional space and then clusters the categories into hierarchies accordingly. The paper also investigates the effect of using the generated hierarchical structure for text classification. Our experiments show that the generated hierarchies improve classification performance in most cases.
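
The two steps named in the abstract (project documents onto a low-dimensional discriminant space, then cluster the categories) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a textbook Fisher-style projection built from within-class and between-class scatter matrices, and a simple greedy merge of the closest category centroids as the clustering step. The function names `discriminant_projection` and `build_hierarchy`, the synthetic five-dimensional "documents", and the category names are all hypothetical.

```python
import numpy as np


def discriminant_projection(X, y, n_components):
    """Return a (d x n_components) projection matrix from scatter matrices."""
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for label in np.unique(y):
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)
        centered = X_c - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += len(X_c) * (diff @ diff.T)
    # The pseudo-inverse keeps the sketch usable when S_w is singular.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    top = np.argsort(-eigvals.real)[:n_components]
    return eigvecs[:, top].real


def build_hierarchy(centroids, names):
    """Greedily merge the two closest clusters; return nested tuples."""
    clusters = [(name, np.asarray(c, dtype=float), 1)
                for name, c in zip(names, centroids)]
    while len(clusters) > 1:
        pairs = [(np.linalg.norm(clusters[i][1] - clusters[j][1]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        (name_a, cen_a, n_a), (name_b, cen_b, n_b) = clusters[i], clusters[j]
        # The merged centroid is the size-weighted mean of the two clusters.
        merged = ((name_a, name_b),
                  (n_a * cen_a + n_b * cen_b) / (n_a + n_b),
                  n_a + n_b)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Synthetic stand-in for document vectors: four flat categories, where
# "a"/"b" lie close together and "c"/"d" lie close together.
rng = np.random.default_rng(0)
names = ["a", "b", "c", "d"]
centers = np.array([[0, 0, 0, 0, 0],
                    [1, 0, 0, 0, 0],
                    [10, 10, 0, 0, 0],
                    [11, 10, 0, 0, 0]], dtype=float)
X = np.vstack([center + 0.1 * rng.standard_normal((20, 5))
               for center in centers])
y = np.repeat(names, 20)

W = discriminant_projection(X, y, n_components=2)
Z = X @ W  # documents in the low-dimensional space
centroids = [Z[y == name].mean(axis=0) for name in names]
hierarchy = build_hierarchy(centroids, names)
print(hierarchy)
```

On this toy data the nested tuples place "a" with "b" and "c" with "d" under separate intermediate nodes, which is the kind of intermediate level the abstract describes for a flat class set. Real document vectors (for example term-weight vectors) are far higher-dimensional and sparser, so a production version would need a more careful treatment of the singular within-class scatter.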



Author information

Corresponding author

Correspondence to Tao Li.


About this article

Cite this article

Li, T., Zhu, S. & Ogihara, M. Hierarchical document classification using automatically generated hierarchy. J Intell Inf Syst 29, 211–230 (2007). https://doi.org/10.1007/s10844-006-0019-7


  • DOI: https://doi.org/10.1007/s10844-006-0019-7
