
Unsupervised statistical text simplification using pre-trained language modeling for initialization

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

Unsupervised text simplification has attracted much attention due to the scarcity of high-quality parallel text simplification corpora. Recently, an unsupervised statistical text simplification method based on a phrase-based machine translation system (UnsupPBMT) achieved good performance; it initializes the phrase tables with similar words obtained from word embedding modeling. Since word embedding modeling captures only the relatedness between words, the phrase tables in UnsupPBMT contain many dissimilar words. In this paper, we propose an unsupervised statistical text simplification method that uses the pre-trained language model BERT for initialization. Specifically, we use BERT as a general linguistic knowledge base for predicting similar words. Experimental results show that our method outperforms state-of-the-art unsupervised text simplification methods on three benchmarks, and even outperforms some supervised baselines.
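As a rough illustration (not the authors' released implementation), the sketch below shows how a masked language model such as BERT can be queried for in-context similar-word candidates, the kind of output that could seed a phrase table in place of embedding-based nearest neighbours. The model name, example sentence, and top-k value are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): query BERT's masked-LM head
# for in-context similar-word candidates for a target word.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def similar_words(sentence: str, target: str, k: int = 10):
    """Mask `target` in `sentence` and return BERT's top-k candidates for the slot."""
    masked = sentence.replace(target, tokenizer.mask_token, 1)
    inputs = tokenizer(masked, return_tensors="pt")
    # Position of the [MASK] token in the input sequence.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    top_ids = logits[0, mask_pos[0]].topk(k).indices.tolist()
    return [tokenizer.decode([i]).strip() for i in top_ids]

# The top predictions typically mix the original word with simpler in-context
# substitutes (e.g., "end", "stop", "cancel"); shown here only as an illustration.
print(similar_words("The committee decided to terminate the project.", "terminate"))
```

In the pipeline described by the abstract, such context-aware predictions would take the place of the purely relatedness-based neighbours that UnsupPBMT draws from word embeddings when initializing its phrase tables.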



Acknowledgements

This research was partially supported by the National Natural Science Foundation of China (Grant Nos. 62076217 and 61906060) and the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education, China (IRT17R32).

Author information

Corresponding authors

Correspondence to Jipeng Qiang or Yun Li.

Additional information

Jipeng Qiang is currently an associate professor in the School of Information Engineering at Yangzhou University, China. He received his PhD degree in computer science and technology from Hefei University of Technology, China in 2016. He was a visiting PhD student in the Artificial Intelligence Lab at the University of Massachusetts Boston, USA from 2014 to 2016. He has published more than 50 papers in venues including AAAI, EMNLP, TKDE, TASLP, and TKDD. His research interests mainly include natural language processing and data mining.

Feng Zhang is currently working toward an MS degree in computer science at Yangzhou University, China. He received his BS degree in computer science from Huaiyin Institute of Technology, China. His research interest is text simplification.

Yun Li is currently a professor in the School of Information Engineering, Yangzhou University, China. He received the MS degree in computer science and technology from Hefei University of Technology, China in 1991, and the PhD degree in control theory and control engineering from Shanghai University, China in 2005. He has

Yunhao Yuan is currently an associate professor in the School of Information Engineering, Yangzhou University, China. He received the MEng degree in computer science and technology from Yangzhou University, China, in 2009, and the PhD degree in pattern recognition and intelligent systems from Nanjing University of Science and Technology, China, in 2013. His research interests include pattern recognition, data mining, and image processing.

Yi Zhu is currently an assistant professor in the School of Information Engineering, Yangzhou University, China. He received the BS degree from Anhui University, the MS degree from the University of Science and Technology of China, and the PhD degree from Hefei University of Technology, China. His research interests include data mining, knowledge engineering, and recommendation systems.

Xindong Wu is a professor in the School of Computer Science and Information Engineering at the Hefei University of Technology, China, the president of the Mininglamp Academy of Sciences, Mininglamp, China, and a Fellow of IEEE and AAAS. He received his BS and MS degrees in computer science from the Hefei University of Technology, China, and his PhD degree in artificial intelligence from the University of Edinburgh, UK. His research interests include data mining, big data analytics, and knowledge-based systems.



About this article


Cite this article

Qiang, J., Zhang, F., Li, Y. et al. Unsupervised statistical text simplification using pre-trained language modeling for initialization. Front. Comput. Sci. 17, 171303 (2023). https://doi.org/10.1007/s11704-022-1244-0

