FETD\(^{2}\): A Framework for Enabling Textual Data Denoising via Robust Contextual Embeddings

  • Conference paper
Linking Theory and Practice of Digital Libraries (TPDL 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12866)


Abstract

Efforts by national libraries, institutions, and (inter-)national projects have led to an increased commitment to preserving textual content, including non-digitally born data, for future generations. These activities have resulted in novel initiatives for preserving cultural heritage through digitization. However, a systematic approach toward Textual Data Denoising (TD\(^{2}\)) is still in its infancy and is commonly limited to a single dominant language (mostly English), whereas digital preservation requires a universal approach. To this end, we introduce a “Framework for Enabling Textual Data Denoising via robust contextual embeddings” (FETD\(^{2}\)). FETD\(^{2}\) improves data quality by training language-specific data denoising models on a small amount of language-specific training data. Our approach employs bi-directional language modeling to produce noise-resilient deep contextualized embeddings. In our experiments, we demonstrate its superiority over the state-of-the-art.
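The abstract's key technical claim is that bi-directional language modeling yields noise-resilient deep contextualized embeddings. As a rough illustration only (the actual architecture is specified in the full text, not on this page), the sketch below assembles an ELMo-style bi-directional LSTM language model in TensorFlow/Keras, a plausible reading given that the notes below point to TensorFlow and AllenAI's bilm-tf. All layer sizes, the vocabulary size, and all names are illustrative assumptions, not the authors' configuration.

```python
# Hypothetical ELMo-style bi-directional language model sketch.
# NOT the FETD^2 implementation; all sizes and names are assumptions.
import tensorflow as tf

VOCAB_SIZE = 10000  # assumed vocabulary size
EMBED_DIM = 128     # assumed token-embedding size
LSTM_DIM = 256      # assumed per-direction LSTM size

tokens = tf.keras.Input(shape=(None,), dtype="int32")  # token ids
emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)

# Two independent LSTMs: one reads left-to-right, one right-to-left.
fwd = tf.keras.layers.LSTM(LSTM_DIM, return_sequences=True)(emb)
bwd = tf.keras.layers.LSTM(LSTM_DIM, return_sequences=True,
                           go_backwards=True)(emb)
# go_backwards emits outputs in reversed order; flip them back so
# position t of `bwd` again corresponds to input token t.
bwd = tf.keras.layers.Lambda(lambda t: tf.reverse(t, axis=[1]))(bwd)

# Bi-LM pre-training objective: the forward LSTM predicts the next
# token and the backward LSTM the previous one, so neither direction
# ever sees the token it is asked to predict.
fwd_logits = tf.keras.layers.Dense(VOCAB_SIZE, name="next_token")(fwd)
bwd_logits = tf.keras.layers.Dense(VOCAB_SIZE, name="prev_token")(bwd)

# After pre-training, the concatenated hidden states serve as deep
# contextualized embeddings: each token's vector depends on its full
# left and right context, which dampens the effect of local noise
# (e.g. OCR errors or misspellings) at any single position.
embedding = tf.keras.layers.Concatenate(name="contextual")([fwd, bwd])

bilm = tf.keras.Model(tokens, [fwd_logits, bwd_logits, embedding])
bilm.summary()
```

Because each direction is conditioned only on its own past, such a model can, in principle, be pre-trained per language on modest corpora, which is consistent with the abstract's claim of needing only a small amount of language-specific training data.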


Notes

  1. Wikipedia Dumps https://dumps.wikimedia.org/.

  2. 20 Newsgroups dataset http://qwone.com/~jason/20Newsgroups/.

  3. L’Express https://www.lexpress.fr/.

  4. TensorFlow https://www.tensorflow.org/.

  5. AllenAI bilm-tf https://github.com/allenai/bilm-tf.

  6. FETD\(^2\) data https://spaniol.users.greyc.fr/research/FETD%5e2/.


Acknowledgements

This work was supported by the RIN RECHERCHE Normandie Digitale research project ASTURIAS contract no. 18E01661. We thank our colleagues for the inspiring discussions.

Author information


Corresponding author

Correspondence to Marc Spaniol.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Govind, Alec, C., Manguin, JL., Spaniol, M. (2021). FETD\(^{2}\): A Framework for Enabling Textual Data Denoising via Robust Contextual Embeddings. In: Berget, G., Hall, M.M., Brenn, D., Kumpulainen, S. (eds.) Linking Theory and Practice of Digital Libraries. TPDL 2021. Lecture Notes in Computer Science, vol. 12866. Springer, Cham. https://doi.org/10.1007/978-3-030-86324-1_1

  • DOI: https://doi.org/10.1007/978-3-030-86324-1_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86323-4

  • Online ISBN: 978-3-030-86324-1

  • eBook Packages: Computer Science, Computer Science (R0)
