Inferring the Source of Official Texts: Can SVM Beat ULMFiT?

Luz de Araujo, Pedro Henrique; de Campos, Teófilo Emidio; Magalhães Silva de Sousa, Marcelo

doi:10.1007/978-3-030-41505-1_8

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12037)

Included in the following conference series:

International Conference on Computational Processing of the Portuguese Language

Abstract

Official Gazettes are a rich source of information relevant to the public. Their careful examination may lead to the detection of fraud and irregularities, which in turn may help prevent the mismanagement of public funds. This paper presents a dataset composed of documents from the Official Gazette of the Federal District, containing both samples annotated with their document source and unlabeled samples. We train, evaluate and compare a transfer-learning-based model that uses ULMFiT against traditional bag-of-words models that use SVM and Naive Bayes as classifiers. We find the SVM to be competitive: its performance is only marginally worse than that of ULMFiT, while its training and inference are much faster and less computationally expensive. Finally, we conduct an ablation analysis to assess the performance impact of the individual ULMFiT components.
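
To make the idea of a traditional bag-of-words baseline concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier over word counts, written in plain Python. It is not the authors' implementation: the abstract does not state the features, preprocessing or hyperparameters used, so the whitespace tokenizer, the add-one smoothing, the class name `BagOfWordsNaiveBayes`, and the toy texts and source labels below are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class BagOfWordsNaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing.

    Hypothetical helper for illustration; not taken from the paper.
    """

    def fit(self, texts, labels):
        # Number of training documents per class, used for the class prior.
        self.class_doc_counts = Counter(labels)
        # Per-class word frequencies: the bag-of-words representation.
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Words never seen during training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / self.total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, invented examples standing in for gazette documents and their sources.
train_texts = [
    "ordinance appoints nurse to the public hospital",
    "hospital procurement notice for vaccines",
    "ordinance appoints teacher to the public school",
    "school enrollment notice for students",
]
train_labels = ["health", "health", "education", "education"]

model = BagOfWordsNaiveBayes().fit(train_texts, train_labels)
predicted = model.predict("notice on vaccines for the hospital")
print(predicted)
```

A classifier of this kind has no pre-training stage and only needs word counts at inference time, which is consistent with the abstract's remark that the traditional models are much cheaper to train and run than ULMFiT.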

Notes

1. Available at https://www.dodf.df.gov.br/.
2. Available at https://github.com/piegu/language-models/tree/master/models.

Acknowledgements

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. TdC received support from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), grant PQ 314154/2018-3. We are also grateful for the support from Fundação de Apoio à Pesquisa do Distrito Federal (FAPDF).

Author information

Authors and Affiliations

Departamento de Ciência da Computação (CiC), Universidade de Brasília (UnB), Brasilia, Brazil
Pedro Henrique Luz de Araujo & Teófilo Emidio de Campos
Tribunal de Contas do Distrito Federal, Zona Cívico-Administrativa, Brasília, DF, Brazil
Marcelo Magalhães Silva de Sousa

Corresponding authors

Correspondence to Pedro Henrique Luz de Araujo or Teófilo Emidio de Campos.

Editor information

Editors and Affiliations

University of Évora, Evora, Portugal
Paulo Quaresma
University of Évora, Evora, Portugal
Renata Vieira
University of São Paulo, São Carlos, Brazil
Sandra Aluísio
University of Lisbon, Lisbon, Portugal
Helena Moniz
INESC-ID/ISCTE-IUL, Lisbon, Portugal
Fernando Batista
University of Évora, Evora, Portugal
Teresa Gonçalves

About this paper

Cite this paper

Luz de Araujo, P.H., de Campos, T.E., Magalhães Silva de Sousa, M. (2020). Inferring the Source of Official Texts: Can SVM Beat ULMFiT? In: Quaresma, P., Vieira, R., Aluísio, S., Moniz, H., Batista, F., Gonçalves, T. (eds) Computational Processing of the Portuguese Language. PROPOR 2020. Lecture Notes in Computer Science, vol 12037. Springer, Cham. https://doi.org/10.1007/978-3-030-41505-1_8

DOI: https://doi.org/10.1007/978-3-030-41505-1_8
Published: 24 February 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-41504-4
Online ISBN: 978-3-030-41505-1
eBook Packages: Computer Science, Computer Science (R0)