Synergistic Text Annotation Based on Rule-Based Expressions and DistilBERT

  • Conference paper
  • In: Intelligent Information and Database Systems (ACIIDS 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14796)


Abstract

This study introduces a novel hybrid approach to text annotation that combines rule-based regular expressions with the pretrained neural network model DistilBERT. Given limited task-specific labeled data, regular expressions are first leveraged to annotate sentences efficiently, providing a cost-effective alternative to manual labeling. The annotated dataset then serves as training data for DistilBERT, enabling the model to learn nuanced linguistic patterns and improve upon the rule-based annotations. Results demonstrate that this pretraining strategy significantly enhances performance, matching that of state-of-the-art models, notably those that rely solely on prompt engineering, such as the large language model GPT-4. This study underscores the efficacy of integrating data-driven strategies with modern pretrained models, particularly for tasks where annotated data is scarce. The proposed method presents a promising direction for building robust and adaptable sentence annotation pipelines across diverse and resource-constrained natural language processing applications. By capitalizing on both manually crafted rules and learned representations, this hybrid approach can potentially generalize better than relying on either technique alone.
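
The abstract describes the pipeline only at a high level; the sketch below illustrates the general idea under stated assumptions: a handful of hand-written regular expressions produce weak labels for sentences, and those weakly labeled sentences are then used to fine-tune DistilBERT with the Hugging Face transformers Trainer. The rule patterns, label scheme, model checkpoint (distilbert-base-uncased), and training hyperparameters are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch (not the authors' exact pipeline): regex-based weak labeling
# followed by DistilBERT fine-tuning. All patterns, labels, and hyperparameters
# below are illustrative assumptions.
import re
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# 1) Rule-based annotation: each regex maps a sentence to a tentative label.
RULES = [
    (re.compile(r"\b(meeting|schedule|agenda)\b", re.I), 1),    # e.g. "work-related"
    (re.compile(r"\b(lottery|winner|free offer)\b", re.I), 0),  # e.g. "spam-like"
]

def weak_label(sentence: str):
    """Return the first matching rule's label, or None if no rule fires."""
    for pattern, label in RULES:
        if pattern.search(sentence):
            return label
    return None

sentences = [
    "Please add the budget review to the meeting agenda.",
    "Congratulations, you are this week's lottery winner!",
]
pairs = [(s, weak_label(s)) for s in sentences]
pairs = [(s, y) for s, y in pairs if y is not None]  # keep only rule-covered sentences

# 2) Fine-tune DistilBERT on the weakly labeled sentences.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

class WeakDataset(torch.utils.data.Dataset):
    """Wraps (sentence, weak label) pairs as tokenized training examples."""
    def __init__(self, pairs):
        texts, labels = zip(*pairs)
        self.enc = tokenizer(list(texts), truncation=True, padding=True)
        self.labels = list(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="weak-distilbert", num_train_epochs=3,
                           per_device_train_batch_size=16, logging_steps=10),
    train_dataset=WeakDataset(pairs),
)
trainer.train()
# The fine-tuned model can now annotate sentences the regexes never covered,
# which is the generalization benefit the abstract points to.
```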


Author information


Corresponding author

Correspondence to Arafet Sbei.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Sbei, A., ElBedoui, K., Barhoumi, W. (2024). Synergistic Text Annotation Based on Rule-Based Expressions and DistilBERT. In: Nguyen, N.T., et al. (eds.) Intelligent Information and Database Systems. ACIIDS 2024. Lecture Notes in Computer Science (LNAI), vol. 14796. Springer, Singapore. https://doi.org/10.1007/978-981-97-4985-0_32

  • DOI: https://doi.org/10.1007/978-981-97-4985-0_32

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-4984-3

  • Online ISBN: 978-981-97-4985-0

  • eBook Packages: Computer Science, Computer Science (R0)
