Combining coregularization and consensus-based self-training for multilingual text categorization

Authors:

Massih R. Amini,

Nicolas Usunier

SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval

Pages 475 - 482

https://doi.org/10.1145/1835449.1835529

Published: 19 July 2010

Abstract

We investigate the problem of learning document classifiers in a multilingual setting, from collections where labels are only partially available. We address this problem in the framework of multiview learning, where different languages correspond to different views of the same document, combined with semi-supervised learning in order to benefit from unlabeled documents. We rely on two techniques, coregularization and consensus-based self-training, that combine multiview and semi-supervised learning in different ways. Our approach trains different monolingual classifiers on each of the views, such that the classifiers' decisions over a set of unlabeled examples are in agreement as much as possible, and iteratively labels new examples from another unlabeled training set based on a consensus across language-specific classifiers. We derive a boosting-based training algorithm for this task, and analyze the impact of the number of views on the semi-supervised learning results on a multilingual extension of the Reuters RCV1/RCV2 corpus using five different languages. Our experiments show that coregularization and consensus-based self-training are complementary and that their combination is especially effective in the interesting and very common situation where there are few views (languages) and few labeled documents available.
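
The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' boosting-based algorithm: the squared-difference disagreement measure, the vote-counting consensus rule, and the names `disagreement`, `consensus_labels`, and `min_votes` are all assumptions made for illustration. Each view (language) is assumed to supply one real-valued score per unlabeled document, with the sign giving the predicted class.

```python
from itertools import combinations
from typing import Dict, List


def disagreement(view_scores: List[List[float]]) -> float:
    """Mean squared score difference over all pairs of views.

    Illustrative stand-in for a coregularization penalty (assumption);
    the divergence used in the paper is not reproduced here.
    """
    if len(view_scores) < 2:
        raise ValueError("need at least two views")
    pairs = list(combinations(range(len(view_scores)), 2))
    n_examples = len(view_scores[0])
    total = 0.0
    for a, b in pairs:
        for i in range(n_examples):
            total += (view_scores[a][i] - view_scores[b][i]) ** 2
    return total / (len(pairs) * n_examples)


def consensus_labels(view_scores: List[List[float]], min_votes: int) -> Dict[int, int]:
    """Pseudo-label the examples on which at least `min_votes` views agree.

    Returns {example index: +1 or -1}. A score of exactly 0 counts as a
    negative vote. `min_votes` must be a strict majority so that both
    classes cannot qualify at once.
    """
    n_views = len(view_scores)
    if not n_views / 2 < min_votes <= n_views:
        raise ValueError("min_votes must be a strict majority of the views")
    labels: Dict[int, int] = {}
    for i in range(len(view_scores[0])):
        positive = sum(1 for v in range(n_views) if view_scores[v][i] > 0)
        negative = n_views - positive
        if positive >= min_votes:
            labels[i] = 1
        elif negative >= min_votes:
            labels[i] = -1
    return labels


# Hypothetical scores: three views (languages), three unlabeled documents.
scores = [
    [0.9, -0.4, 0.2],
    [0.7, -0.6, -0.1],
    [0.8, -0.2, 0.3],
]
print(consensus_labels(scores, min_votes=3))  # {0: 1, 1: -1}
print(disagreement(scores))
```

In an iterative scheme of the kind the abstract describes, the pseudo-labeled documents would be added to the labeled set and the per-view classifiers retrained, while a penalty like `disagreement` would be minimized on a separate unlabeled set. The third document above stays unlabeled under unanimous voting because the views disagree on its sign.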

Index Terms

Combining coregularization and consensus-based self-training for multilingual text categorization
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information storage systems
    1. Record storage systems
      1. Record layout alternatives

Published In

SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval

July 2010

944 pages

ISBN:9781450301534

DOI:10.1145/1835449

General Chairs: Fabio Crestani (University of Lugano, CH), Stéphane Marchand-Maillet (University of Geneva, CH)

Program Chairs: Hsin-Hsi Chen (National Taiwan University, TW), Efthimis N. Efthimiadis (University of Washington, USA), Jacques Savoy (University of Neuchatel, CH)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGIR '10

Sponsor:

SIGIR

SIGIR '10: The 33rd International ACM SIGIR conference on research and development in Information Retrieval

July 19 - 23, 2010

Geneva, Switzerland

Acceptance Rates

SIGIR '10 Paper Acceptance Rate 87 of 520 submissions, 17%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

Rehan M, Malik M, Jamjoom M (2023). Fine-Tuning Transformer Models Using Transfer Learning for Multilingual Threatening Text Identification. IEEE Access, 11, 106503-106515. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3320062
Malik M, Nazarova A, Jamjoom M, Ignatov D (2023). Multilingual hope speech detection: A Robust framework using transfer learning of fine-tuning RoBERTa model. Journal of King Saud University - Computer and Information Sciences, 35:8, 101736. Online publication date: Sep-2023.
https://doi.org/10.1016/j.jksuci.2023.101736
Tang T, Tang X, Yuan T (2020). Fine-Tuning BERT for Multi-Label Sentiment Analysis in Unbalanced Code-Switching Text. IEEE Access, 8, 193248-193256. Online publication date: 2020.
https://doi.org/10.1109/ACCESS.2020.3030468
Protasiewicz J, Mirończuk M, Dadas S (2017). Categorization of Multilingual Scientific Documents by a Compound Classification System. Artificial Intelligence and Soft Computing, 563-573. Online publication date: 24-May-2017.
https://doi.org/10.1007/978-3-319-59060-8_51
Fakeri-Tabrizi A, Amini M, Goutte C, Usunier N (2015). Multiview self-learning. Neurocomputing, 155:C, 117-127. Online publication date: 1-May-2015.
https://dl.acm.org/doi/10.1016/j.neucom.2014.12.041
Kovesi M, Goutte C, Amini M, Hersh W, Callan J, Maarek Y, Sanderson M (2012). Fast on-line learning for multilingual categorization. Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, 1071-1072. Online publication date: 12-Aug-2012.
https://dl.acm.org/doi/10.1145/2348283.2348474
Lu B, Tan C, Cardie C, Tsou B, Lin D (2011). Joint bilingual sentiment classification with unlabeled parallel corpora. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, 320-330. Online publication date: 19-Jun-2011.
https://dl.acm.org/doi/10.5555/2002472.2002514
Kiseleva J, Agichtein E, Billsus D (2011). Mining query structure from click data. Proceedings of the 20th ACM international conference on Information and knowledge management, 2217-2220. Online publication date: 24-Oct-2011.
https://dl.acm.org/doi/10.1145/2063576.2063930
