BERT Self-Learning Approach with Limited Labels for Document Classification

  • Conference paper
Learning and Intelligent Optimization (LION 2022)

Abstract

The speed at which documents are produced, and the resulting volume of unstructured data stored in Brazilian Government facilities, calls for processes capable of classifying documents, a requirement that also follows from existing archival legislation. Natural Language Processing (NLP) is an important asset for document classification, particularly because a considerable share of current document production is unlabeled. Applying the Self-Learning approach to the BERT fine-tuning step yields a model that can classify a partially labeled dataset according to the Requirements Model for Computerized Document Management Systems (e-ARQ Brazil). The resulting model reached human-level performance and outperformed Active Learning and BERT across a series of defined confidence levels.
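The abstract does not spell out the training procedure, so the following is only a minimal sketch of generic self-learning (self-training) with a confidence threshold, not the authors' implementation. To stay self-contained it swaps BERT fine-tuning for a toy nearest-centroid classifier; the function names (`fit_centroids`, `predict_proba`, `self_train`), the `threshold` value, and the `max_rounds` limit are all illustrative assumptions.

```python
import math


def fit_centroids(X, y):
    """Toy stand-in for the fine-tuning step: one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict_proba(centroids, x):
    """Softmax over negative squared distances to each class centroid."""
    scores = {
        c: -sum((a - b) ** 2 for a, b in zip(x, mu))
        for c, mu in centroids.items()
    }
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    """Grow the labeled set with confident pseudo-labels, then refit.

    `threshold` and `max_rounds` are hypothetical settings, not values
    taken from the paper.
    """
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    model = fit_centroids(X_lab, y_lab)
    for _ in range(max_rounds):
        still_unlabeled, added = [], 0
        for x in pool:
            proba = predict_proba(model, x)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                # Confident prediction: adopt it as a pseudo-label.
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                still_unlabeled.append(x)
        pool = still_unlabeled
        if added == 0:
            break  # nothing confident enough; stop iterating
        model = fit_centroids(X_lab, y_lab)  # retrain on the enlarged set
    return model, pool
```

In a BERT setting, the refit step would correspond to another fine-tuning pass and `predict_proba` to the softmax output of the classification head; samples that never clear the threshold remain in the returned pool for manual review.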

Author information

Corresponding author

Correspondence to Carlos Eduardo de Lima Joaquim.

Copyright information

© 2022 Springer Nature Switzerland AG

Cite this paper

Joaquim, C.E.d.L., Faleiros, T.d.P. (2022). BERT Self-Learning Approach with Limited Labels for Document Classification. In: Simos, D.E., Rasskazova, V.A., Archetti, F., Kotsireas, I.S., Pardalos, P.M. (eds) Learning and Intelligent Optimization. LION 2022. Lecture Notes in Computer Science, vol 13621. Springer, Cham. https://doi.org/10.1007/978-3-031-24866-5_21

  • DOI: https://doi.org/10.1007/978-3-031-24866-5_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24865-8

  • Online ISBN: 978-3-031-24866-5

  • eBook Packages: Computer Science, Computer Science (R0)
