Abstract
This study introduces a novel hybrid approach to text annotation that combines rule-based regular expressions with the pretrained neural network model DistilBERT. Given limited task-specific labeled data, regular expressions are first used to annotate sentences efficiently, providing a cost-effective alternative to manual labeling. The annotated dataset then serves as training data for DistilBERT, enabling the model to learn nuanced linguistic patterns and improve upon the rule-based annotations. Results demonstrate that this training strategy significantly enhances performance, surpassing state-of-the-art models, notably those that rely solely on prompt engineering, such as the large language model GPT-4. This study underscores the efficacy of integrating data-driven strategies with modern pretrained models, particularly for tasks where annotated data is scarce. The proposed method presents a promising direction for building robust and adaptable sentence annotation pipelines across diverse, resource-constrained natural language processing applications. By capitalizing on both manually crafted rules and learned representations, this hybrid approach can potentially generalize better than either technique alone.
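The pipeline the abstract outlines can be summarized in a short sketch: hand-written regular expressions produce weak labels, which then serve as training data for a DistilBERT classifier. The following Python sketch illustrates the idea using Hugging Face's transformers and datasets libraries; the spam-detection task, regex patterns, example sentences, and hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the hybrid annotation pipeline, assuming a binary
# spam/ham sentence-labeling task. The patterns, data, and hyperparameters
# below are illustrative, not taken from the paper.
import re

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Step 1: rule-based annotation. Hand-crafted regular expressions assign
# weak labels cheaply, replacing manual annotation.
SPAM_PATTERN = re.compile(r"\b(free|winner|prize|urgent)\b", re.IGNORECASE)

def annotate(sentence: str) -> int:
    """Weak label: 1 (spam) if a rule fires, otherwise 0 (ham)."""
    return int(bool(SPAM_PATTERN.search(sentence)))

sentences = [
    "Claim your free prize now, you are a winner!",
    "The quarterly report is attached for your review.",
    "Urgent: free tickets, reply immediately.",
    "Let's schedule the project meeting for Tuesday.",
]
weakly_labeled = Dataset.from_dict(
    {"text": sentences, "label": [annotate(s) for s in sentences]}
)

# Step 2: fine-tune DistilBERT on the regex-annotated data so the model
# can generalize beyond the literal patterns.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=64
    )

encoded = weakly_labeled.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilbert-weak-labels",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        report_to="none",  # disable experiment-tracking integrations
    ),
    train_dataset=encoded,
)
trainer.train()
```

At inference time the fine-tuned model replaces the rules, so sentences that match no pattern can still receive correct labels, which is the source of the generalization gain the abstract claims.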
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Sbei, A., ElBedoui, K., Barhoumi, W. (2024). Synergistic Text Annotation Based on Rule-Based Expressions and DistilBERT. In: Nguyen, N.T., et al. Intelligent Information and Database Systems. ACIIDS 2024. Lecture Notes in Computer Science, vol. 14796. Springer, Singapore. https://doi.org/10.1007/978-981-97-4985-0_32
DOI: https://doi.org/10.1007/978-981-97-4985-0_32
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-4984-3
Online ISBN: 978-981-97-4985-0