LexDivPara: A Measure of Paraphrase Quality with Integrated Sentential Lexical Complexity

Thieu, Thanh; Do, Ha; Duong, Thanh; Pu, Shi; Aakur, Sathyanarayanan; Khan, Saad

doi:10.1007/978-3-030-82199-9_1

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 296)

Included in the following conference series:

Proceedings of SAI Intelligent Systems Conference

Abstract

We present a novel method that automatically measures the quality of sentential paraphrases. The method balances two conflicting criteria: semantic similarity and lexical diversity. Using a diverse annotated corpus, we built learning-to-rank models on edit distance, BLEU, ROUGE, and cosine similarity features. Extrinsic evaluation on the STS Benchmark and ParaBank Evaluation datasets resulted in a model ensemble with moderate to high quality. We applied our method to both small benchmark datasets and large-scale datasets, producing resources for the community.
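
To make the feature families named in the abstract concrete, the following is a minimal sketch of how such features might be computed for one sentence pair. It is not the authors' implementation: the function names are hypothetical, tokenization is plain whitespace splitting, unigram precision and unigram recall are simplified stand-ins for BLEU and ROUGE, and cosine similarity is taken over bag-of-words counts rather than over whatever sentence representation the paper uses.

```python
from collections import Counter
import math


def tokens(sentence):
    # Lowercased whitespace tokenization (an assumption made for brevity).
    return sentence.lower().split()


def edit_distance(a, b):
    # Token-level Levenshtein distance, computed row by row.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def clipped_overlap(a, b):
    # Shared tokens, each counted at most as often as it occurs in both lists.
    return sum((Counter(a) & Counter(b)).values())


def unigram_precision(candidate, reference):
    # Simplified stand-in for BLEU: overlap divided by candidate length.
    return clipped_overlap(candidate, reference) / len(candidate) if candidate else 0.0


def unigram_recall(candidate, reference):
    # Simplified stand-in for ROUGE: overlap divided by reference length.
    return clipped_overlap(candidate, reference) / len(reference) if reference else 0.0


def cosine_similarity(a, b):
    # Cosine similarity between bag-of-words count vectors.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def pair_features(source, paraphrase):
    # Hypothetical helper: one feature dictionary per (source, paraphrase) pair.
    s, p = tokens(source), tokens(paraphrase)
    return {
        "edit_distance": edit_distance(s, p),
        "unigram_precision": unigram_precision(p, s),
        "unigram_recall": unigram_recall(p, s),
        "cosine": cosine_similarity(s, p),
    }


features = pair_features("the cat sat on the mat", "a cat was sitting on the mat")
```

In this sketch the two criteria pull in opposite directions: a larger edit distance and lower n-gram overlap suggest more lexical diversity, while a higher cosine score suggests closer meaning. A learning-to-rank model would take feature vectors like these as input and learn how to weigh them against annotated quality judgments.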

Acknowledgment

The authors would like to thank ACT for assisting with collection of the original text and annotation on Amazon Mechanical Turk. This work is partly supported by the first author’s start-up fund, the first author’s OSU ASR FY22 summer program, NSF CISE/IIS 1838808 grant, and NSF OIA 1849213 grant.

Author information

Authors and Affiliations

Oklahoma State University, Stillwater, OK, 74078, USA
Thanh Thieu, Thanh Duong & Sathyanarayanan Aakur
University of Louisville, Louisville, KY, 40292, USA
Ha Do
FineTune Learning, Boston, MA, USA
Saad Khan
Educational Testing Service, Toronto, Ontario, Canada
Shi Pu

Corresponding author

Correspondence to Thanh Thieu.

Editor information

Editors and Affiliations

Faculty of Science and Engineering, Saga University, Saga, Japan
Kohei Arai

About this paper

Cite this paper

Thieu, T., Do, H., Duong, T., Pu, S., Aakur, S., Khan, S. (2022). LexDivPara: A Measure of Paraphrase Quality with Integrated Sentential Lexical Complexity. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2021. Lecture Notes in Networks and Systems, vol 296. Springer, Cham. https://doi.org/10.1007/978-3-030-82199-9_1

DOI: https://doi.org/10.1007/978-3-030-82199-9_1
Published: 07 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82198-2
Online ISBN: 978-3-030-82199-9
eBook Packages: Intelligent Technologies and Robotics (R0)
