research-article

Refined experts: improving classification in large taxonomies

Authors:

Paul N. Bennett,

Nam Nguyen

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

Pages 11 - 18

https://doi.org/10.1145/1571941.1571946

Published: 19 July 2009

Abstract

While large-scale taxonomies, especially for web pages, have existed for some time, approaches to automatically classifying documents into these taxonomies have met with limited success compared with the more general progress made in text classification. We argue that this stems from three causes: increasing sparsity of training data at deeper nodes in the taxonomy, error propagation, where a mistake made high in the hierarchy cannot be recovered, and increasingly complex decision surfaces at higher nodes in the hierarchy. While prior research has focused on the first problem, we introduce methods that target the latter two: first, biasing the training distribution to reduce error propagation, and second, propagating "first-guess" expert information up the hierarchy in a bottom-up manner before making a refined top-down choice. Finally, we present an empirical study demonstrating that the suggested changes lead to 10-30% improvements in F1 scores over an accepted competitive baseline, hierarchical SVMs.
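
The second idea in the abstract, letting lower-level "first guesses" inform a top-down decision, can be illustrated with a toy sketch. This is not the authors' implementation: the taxonomy, the keyword-overlap scorer, and the additive combination in `refined_score` are all hypothetical stand-ins chosen to keep the example self-contained (the paper's baseline uses SVMs, which are not reproduced here). The sketch only shows how a mistake at the top level of a pure top-down pass cannot be undone, and how bottom-up information may avoid it.

```python
from typing import Dict, List, Set

# Hypothetical toy taxonomy (parent -> children); leaves have no entry.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": ["soccer", "tennis"],
}

# Hypothetical keyword "experts": one set of cue words per node.
KEYWORDS: Dict[str, Set[str]] = {
    "science": {"experiment", "theory"},
    "sports": {"match", "score", "field"},
    "physics": {"quantum", "field", "particle"},
    "biology": {"cell", "gene"},
    "soccer": {"goal", "match"},
    "tennis": {"serve", "set"},
}


def local_score(node: str, tokens: List[str]) -> float:
    """First-guess score: fraction of the node's cue words found in the document."""
    cues = KEYWORDS[node]
    return len(cues & set(tokens)) / len(cues)


def leaves_under(node: str) -> List[str]:
    """All leaf categories below a node (the node itself if it is a leaf)."""
    children = TAXONOMY.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves_under(child)]


def refined_score(node: str, tokens: List[str]) -> float:
    """Local score plus the best first guess among the leaves below the node.

    The additive combination is an assumption made for this sketch only.
    """
    best_leaf = max(local_score(leaf, tokens) for leaf in leaves_under(node))
    return local_score(node, tokens) + best_leaf


def top_down(tokens: List[str], scorer) -> List[str]:
    """Greedy top-down walk: at each level, commit to the highest-scoring child."""
    node, path = "root", []
    while node in TAXONOMY:
        node = max(TAXONOMY[node], key=lambda child: scorer(child, tokens))
        path.append(node)
    return path


doc = ["quantum", "particle", "field"]
print(top_down(doc, local_score))    # ['sports', 'soccer']: top-level error propagates
print(top_down(doc, refined_score))  # ['science', 'physics']
```

In this toy document the word "field" misleads the top-level sports scorer, so the plain top-down pass commits to the wrong branch and can never reach "physics". Once the strong leaf-level score for "physics" is passed upward, the top-level choice changes.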

Index Terms

Refined experts: improving classification in large taxonomies
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information systems applications

Published In

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

July 2009

896 pages

ISBN:9781605584836

DOI:10.1145/1571941

General Chairs: James Allan (University of Massachusetts Amherst, USA), Javed Aslam (Northeastern University, USA)

Program Chairs: Mark Sanderson (University of Sheffield, UK), ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA), Justin Zobel (University of Melbourne, Australia)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGIR '09

SIGIR '09: The 32nd International ACM SIGIR conference on research and development in Information Retrieval

July 19 - 23, 2009

Boston, MA, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
