Abstract
The rapid pace of document production, and the resulting volume of unstructured data stored in Brazilian Government facilities, demands processes capable of classifying documents, as required by existing archival legislation. Natural Language Processing (NLP) is therefore a valuable asset for document classification, given that current document production yields a large number of unlabeled samples. Applying a Self-Learning approach to the BERT fine-tuning step delivers a model capable of classifying a partially labeled dataset according to the Requirements Model for Computerized Document Management Systems (e-ARQ Brazil). The developed model reached human-level performance, outperforming Active Learning and plain BERT across a series of defined confidence levels.
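The abstract describes self-learning (self-training): a classifier is fitted on the labeled subset, then high-confidence predictions on unlabeled samples are promoted to pseudo-labels and folded back into training. A minimal sketch of that loop is below; it uses a generic scikit-learn classifier as a stand-in for the fine-tuned BERT model, and the `confidence` threshold, round count, and toy data are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(clf, X_labeled, y_labeled, X_unlabeled,
               confidence=0.9, max_rounds=5):
    """Iteratively pseudo-label unlabeled samples the model is confident about."""
    X_l, y_l = np.asarray(X_labeled), np.asarray(y_labeled)
    X_u = np.asarray(X_unlabeled)
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        confident = proba.max(axis=1) >= confidence
        if not confident.any():
            break  # no unlabeled sample meets the confidence threshold
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        # Promote confident predictions to pseudo-labels and retrain on them.
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~confident]
    return clf

# Toy usage: two labeled points, forty unlabeled points in two clusters.
rng = np.random.default_rng(0)
X_lab = np.array([[0.0, 0.0], [5.0, 5.0]])
y_lab = np.array([0, 1])
X_unl = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
                   rng.normal(5.0, 0.3, (20, 2))])
model = self_train(LogisticRegression(), X_lab, y_lab, X_unl, confidence=0.8)
```

In the paper's setting, `clf` would be the BERTimbau model being fine-tuned, and the loop would be run at each of the defined confidence levels.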
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Joaquim, C.E.d.L., Faleiros, T.d.P. (2022). BERT Self-Learning Approach with Limited Labels for Document Classification. In: Simos, D.E., Rasskazova, V.A., Archetti, F., Kotsireas, I.S., Pardalos, P.M. (eds) Learning and Intelligent Optimization. LION 2022. Lecture Notes in Computer Science, vol 13621. Springer, Cham. https://doi.org/10.1007/978-3-031-24866-5_21
Print ISBN: 978-3-031-24865-8
Online ISBN: 978-3-031-24866-5
eBook Packages: Computer Science, Computer Science (R0)