DOI: 10.1145/1835449.1835529
research-article

Combining coregularization and consensus-based self-training for multilingual text categorization

Published: 19 July 2010

Abstract

We investigate the problem of learning document classifiers in a multilingual setting, from collections where labels are only partially available. We address this problem in the framework of multiview learning, where different languages correspond to different views of the same document, combined with semi-supervised learning in order to benefit from unlabeled documents. We rely on two techniques, coregularization and consensus-based self-training, that combine multiview and semi-supervised learning in different ways. Our approach trains different monolingual classifiers on each of the views, such that the classifiers' decisions over a set of unlabeled examples are in agreement as much as possible, and iteratively labels new examples from another unlabeled training set based on a consensus across language-specific classifiers. We derive a boosting-based training algorithm for this task, and analyze the impact of the number of views on the semi-supervised learning results on a multilingual extension of the Reuters RCV1/RCV2 corpus using five different languages. Our experiments show that coregularization and consensus-based self-training are complementary and that their combination is especially effective in the interesting and very common situation where there are few views (languages) and few labeled documents available.
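
The interplay of the two techniques described in the abstract can be illustrated with a minimal Python sketch. It is not the boosting-based algorithm of the paper: the toy centroid learner `train_centroid`, the unanimous-vote consensus rule, and all function and parameter names are assumptions made for illustration. `disagreement_rate` only measures the cross-view disagreement that a coregularization term would penalize during training; it does not optimize it.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Scorer = Callable[[Vector], float]  # real-valued score; its sign is the predicted label


def sign(score: float) -> int:
    return 1 if score >= 0 else -1


def train_centroid(X: List[Vector], y: List[int]) -> Scorer:
    """Toy monolingual classifier (an assumption, not the paper's boosting learner).

    Scores a document by projecting it onto the difference of the class centroids.
    Requires at least one example of each class (+1 and -1).
    """
    dim = len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    mu_pos = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    mu_neg = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    # Bias places the decision boundary halfway between the two centroids.
    b = -sum(wj * (p + n) / 2 for wj, p, n in zip(w, mu_pos, mu_neg))
    return lambda x: sum(wj * xj for wj, xj in zip(w, x)) + b


def disagreement_rate(scorers: List[Scorer], views: List[List[Vector]]) -> float:
    """Fraction of documents on which the view-specific classifiers do not all agree."""
    n_docs = len(views[0])
    split = sum(
        1
        for i in range(n_docs)
        if len({sign(f(views[v][i])) for v, f in enumerate(scorers)}) > 1
    )
    return split / n_docs


def consensus_self_training(
    labeled_views: List[List[Vector]],
    labels: List[int],
    unlabeled_views: List[List[Vector]],
    max_rounds: int = 5,
) -> Tuple[List[Scorer], List[int], List[int]]:
    """Pseudo-label unlabeled documents on which all views vote for the same class.

    labeled_views[v][i] and unlabeled_views[v][i] hold document i in view (language) v.
    Returns the final per-view scorers, the extended label list, and the indices of
    unlabeled documents that never reached consensus.
    """
    views = [list(Xv) for Xv in labeled_views]  # copy so the inputs stay untouched
    y = list(labels)
    pool = list(range(len(unlabeled_views[0])))
    for _ in range(max_rounds):
        scorers = [train_centroid(Xv, y) for Xv in views]
        agreed = []
        for i in pool:
            votes = {sign(f(unlabeled_views[v][i])) for v, f in enumerate(scorers)}
            if len(votes) == 1:  # unanimous vote across languages
                agreed.append((i, votes.pop()))
        if not agreed:
            break
        for i, label in agreed:
            for v in range(len(views)):
                views[v].append(unlabeled_views[v][i])
            y.append(label)
        done = {i for i, _ in agreed}
        pool = [i for i in pool if i not in done]
    scorers = [train_centroid(Xv, y) for Xv in views]
    return scorers, y, pool


# Two views (languages), two labeled documents, three unlabeled documents.
demo_labeled = [[(1.0, 0.0), (-1.0, 0.0)], [(0.0, 1.0), (0.0, -1.0)]]
demo_unlabeled = [
    [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)],
    [(0.0, 2.0), (0.0, -2.0), (0.0, -0.5)],
]
demo_scorers, demo_labels, demo_pool = consensus_self_training(
    demo_labeled, [1, -1], demo_unlabeled
)
print(demo_labels, demo_pool, disagreement_rate(demo_scorers, demo_unlabeled))
```

In this toy run the first two unlabeled documents receive unanimous votes and are added with labels +1 and -1, while the third stays in the pool because the two views disagree on it. A coregularized learner would additionally be trained to reduce that kind of disagreement.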

      Information & Contributors

      Information

      Published In

      SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
      July 2010
      944 pages
      ISBN: 9781450301534
      DOI: 10.1145/1835449
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. learning from multiple views
      2. multilingual document classification
      3. semi-supervised learning

      Qualifiers

      • Research-article

      Conference

      SIGIR '10
      Acceptance Rates

      SIGIR '10 Paper Acceptance Rate 87 of 520 submissions, 17%;
      Overall Acceptance Rate 792 of 3,983 submissions, 20%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (last 12 months): 2
      • Downloads (last 6 weeks): 0
      Reflects downloads up to 20 Jan 2025

      Citations

      Cited By

      • (2023) Fine-Tuning Transformer Models Using Transfer Learning for Multilingual Threatening Text Identification. IEEE Access 11, 106503-106515. DOI: 10.1109/ACCESS.2023.3320062. Online publication date: 2023
      • (2023) Multilingual hope speech detection: A Robust framework using transfer learning of fine-tuning RoBERTa model. Journal of King Saud University - Computer and Information Sciences 35:8, 101736. DOI: 10.1016/j.jksuci.2023.101736. Online publication date: Sep-2023
      • (2020) Fine-Tuning BERT for Multi-Label Sentiment Analysis in Unbalanced Code-Switching Text. IEEE Access 8, 193248-193256. DOI: 10.1109/ACCESS.2020.3030468. Online publication date: 2020
      • (2017) Categorization of Multilingual Scientific Documents by a Compound Classification System. Artificial Intelligence and Soft Computing, 563-573. DOI: 10.1007/978-3-319-59060-8_51. Online publication date: 24-May-2017
      • (2015) Multiview self-learning. Neurocomputing 155:C, 117-127. DOI: 10.1016/j.neucom.2014.12.041. Online publication date: 1-May-2015
      • (2012) Fast on-line learning for multilingual categorization. Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, 1071-1072. DOI: 10.1145/2348283.2348474. Online publication date: 12-Aug-2012
      • (2011) Joint bilingual sentiment classification with unlabeled parallel corpora. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, 320-330. DOI: 10.5555/2002472.2002514. Online publication date: 19-Jun-2011
      • (2011) Mining query structure from click data. Proceedings of the 20th ACM international conference on Information and knowledge management, 2217-2220. DOI: 10.1145/2063576.2063930. Online publication date: 24-Oct-2011
