Abstract
The rapid pace of document production, and the resulting volume of unstructured data stored in Brazilian Government facilities, demands processes capable of classifying documents, as required by existing archival legislation. Natural Language Processing (NLP) is therefore a valuable asset for document classification, given that current document production yields a large number of unlabeled samples. Applying a Self-Learning approach to the BERT fine-tuning step delivers a model capable of classifying a partially labeled dataset according to the Requirements Model for Computerized Document Management Systems (e-ARQ Brazil). The developed model reached human-level performance, outperforming Active Learning and plain BERT across a series of defined confidence levels.
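The abstract describes self-learning (self-training): a classifier is fitted on the labeled subset, then high-confidence predictions on unlabeled samples are promoted to pseudo-labels and folded back into training. A minimal sketch of that loop is below; it uses a generic scikit-learn classifier as a stand-in for the fine-tuned BERT model, and the `confidence` threshold, round count, and toy data are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(clf, X_labeled, y_labeled, X_unlabeled,
               confidence=0.9, max_rounds=5):
    """Iteratively pseudo-label unlabeled samples the model is confident about."""
    X_l, y_l = np.asarray(X_labeled), np.asarray(y_labeled)
    X_u = np.asarray(X_unlabeled)
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        confident = proba.max(axis=1) >= confidence
        if not confident.any():
            break  # no unlabeled sample meets the confidence threshold
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        # Promote confident predictions to pseudo-labels and retrain on them.
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~confident]
    return clf

# Toy usage: two labeled points, forty unlabeled points in two clusters.
rng = np.random.default_rng(0)
X_lab = np.array([[0.0, 0.0], [5.0, 5.0]])
y_lab = np.array([0, 1])
X_unl = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
                   rng.normal(5.0, 0.3, (20, 2))])
model = self_train(LogisticRegression(), X_lab, y_lab, X_unl, confidence=0.8)
```

In the paper's setting, `clf` would be the BERTimbau model being fine-tuned, and the loop would be run at each of the defined confidence levels.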
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Joaquim, C.E.d.L., Faleiros, T.d.P. (2022). BERT Self-Learning Approach with Limited Labels for Document Classification. In: Simos, D.E., Rasskazova, V.A., Archetti, F., Kotsireas, I.S., Pardalos, P.M. (eds) Learning and Intelligent Optimization. LION 2022. Lecture Notes in Computer Science, vol 13621. Springer, Cham. https://doi.org/10.1007/978-3-031-24866-5_21
Print ISBN: 978-3-031-24865-8
Online ISBN: 978-3-031-24866-5
eBook Packages: Computer Science, Computer Science (R0)