Crowdsourcing Syntactically Diverse Paraphrases with Diversity-Aware Prompts and Workflows

  • Conference paper
  • First Online:
Advanced Information Systems Engineering (CAiSE 2022)

Abstract

Task-oriented bots (or simply bots) enable humans to perform tasks in natural language, for example, booking a restaurant or checking the weather. Crowdsourcing has become a prominent approach to building datasets for training and evaluating task-oriented bots: the crowd grows an initial seed of utterances through paraphrasing, i.e., reformulating a given seed into semantically equivalent sentences. In this context, diversity is a relevant dimension of high-quality datasets, as diverse paraphrases capture the many ways users may express an intent. Current techniques, however, either assume that crowd-powered paraphrases are naturally diverse or focus only on lexical diversity. In this paper, we address an overlooked aspect of diversity and introduce an approach for guiding the crowdsourcing process towards paraphrases that are syntactically diverse. We introduce a workflow and novel prompts, informed by syntax patterns, that elicit paraphrases avoiding or incorporating a desired syntax. Our empirical analysis indicates that our approach yields higher syntactic diversity, higher syntactic novelty, and a more uniform pattern distribution than state-of-the-art baselines, albeit at the cost of higher task effort.
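To make the notion of a "syntax pattern" concrete, the sketch below derives a shallow pattern from a bracketed constituency parse (Penn Treebank-style labels, as referenced in the notes) and compares two paraphrases of the same intent. The pattern definition here (the labels of the immediate children of the root clause) and the function name are illustrative assumptions, not the paper's exact formalization.

```python
# Illustrative sketch: characterize a paraphrase's syntax by the phrase
# labels directly under the root S node of its constituency parse.
# Two paraphrases with different patterns count as syntactically distinct.

def top_level_pattern(parse: str) -> list[str]:
    """Return the phrase labels of the immediate children of the root S node.

    `parse` is a bracketed parse string such as
    "(ROOT (S (NP ...) (VP ...) (. .)))".
    """
    start = parse.find("(S ")
    if start == -1:
        return []
    labels, depth = [], 0
    i = start + 2  # position just after "(S"
    while i < len(parse):
        ch = parse[i]
        if ch == "(":
            if depth == 0:
                # A new immediate child of S begins: read its label.
                j = i + 1
                while j < len(parse) and parse[j] not in " ()":
                    j += 1
                labels.append(parse[i + 1 : j])
            depth += 1
        elif ch == ")":
            if depth == 0:
                break  # this ")" closes the S node itself
            depth -= 1
        i += 1
    return labels

# Two paraphrases of the same booking intent, one declarative, one imperative.
p1 = "(ROOT (S (NP (PRP I)) (VP (VBP want) (NP (DT a) (NN flight))) (. .)))"
p2 = "(ROOT (S (VP (VB Book) (NP (DT a) (NN flight))) (. .)))"
print(top_level_pattern(p1))  # ['NP', 'VP', '.']
print(top_level_pattern(p2))  # ['VP', '.']
print(top_level_pattern(p1) != top_level_pattern(p2))  # True: distinct syntax
```

In practice the parses would come from a parser such as CoreNLP or Stanza (both cited in the references); a prompt could then ask workers to avoid, or to reproduce, a given pattern.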


Notes

  1.

    Refer to [26, 28] for other relevant quality aspects in crowdsourced paraphrases.

  2.

    Reference for bracket labels at https://gist.github.com/nlothian/9240750.

  3.

    We set \(k=2\) as prompts from prior art typically include two examples [17].

  4.

    Online supplementary material available at https://tinyurl.com/caise-2022-diversity.

  5.

    We used the dataset available at https://www.kaggle.com/siddhadev/atis-dataset-from-ms-cntk. The top-5 intents are those with the highest number of training items.

  6.

    We stress that BertScore was not designed specifically for assessing paraphrases, so it does not capture the full range of criteria of the more specific manual evaluation.

  7.

    The datasets can be found at https://github.com/jorgeramirez/syntactic-diversity.

References

  1. Bapat, R., Kucherbaev, P., Bozzon, A.: Effective crowdsourced generation of training data for chatbots natural language understanding. In: Mikkonen, T., Klamma, R., Hernández, J. (eds.) ICWE 2018. LNCS, vol. 10845, pp. 114–128. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91662-0_8

  2. Berro, A., Baez, M., Benatallah, B., Benabdeslem, K., Fard, M.-A.Y.Z.: Automated paraphrase generation with over-generation and pruning services. In: Hacid, H., Kao, O., Mecella, M., Moha, N., Paik, H. (eds.) ICSOC 2021. LNCS, vol. 13121, pp. 400–414. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-91431-8_25

  3. Berro, A., et al.: An extensible and reusable pipeline for automated utterance paraphrases. In: Proceedings of the VLDB Endowment (2021)

  4. Chen, M., et al.: Controllable paraphrase generation with a syntactic exemplar. In: ACL (2019)

  5. Coucke, A., et al.: Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. CoRR abs/1805.10190 (2018)

  6. Goyal, T., Durrett, G.: Neural syntactic preordering for controlled paraphrase generation. In: ACL (2020)

  7. Hemphill, C.T., et al.: The ATIS spoken language systems pilot corpus. In: Workshop Held at Hidden Valley, Pennsylvania, USA (1990)

  8. Huang, K.H., Chang, K.W.: Generating syntactically controlled paraphrases without using annotated parallel pairs. arXiv preprint arXiv:2101.10579 (2021)

  9. Iyyer, M., et al.: Adversarial example generation with syntactically controlled paraphrase networks. In: NAACL (2018)

  10. Jiang, Y., Kummerfeld, J.K., Lasecki, W.S.: Understanding task design trade-offs in crowdsourced paraphrase collection. In: ACL (2017)

  11. Kang, Y., et al.: Data collection for dialogue system: a startup perspective. In: Proceedings of the HLT, vol. 3, pp. 33–40 (2018)

  12. Larson, S., et al.: Outlier detection for improved data quality and diversity in dialog systems. In: NAACL-HLT (2019)

  13. Larson, S., et al.: Iterative feature mining for constraint-based data collection to increase data diversity and model robustness. In: EMNLP (2020)

  14. Lee, W., et al.: Effective quality assurance for data labels through crowdsourcing and domain expert collaboration. In: EDBT (2018)

  15. Liu, P., Liu, T.: Optimizing the design and cost for crowdsourced conversational utterances. In: KDD-DCCL (2019)

  16. Manning, C.D., et al.: The Stanford CoreNLP natural language processing toolkit. In: ACL (2014)

  17. Negri, M., et al.: Chinese whispers: cooperative paraphrase acquisition. In: LREC (2012)

  18. Park, S., et al.: Paraphrase diversification using counterfactual debiasing. In: AAAI (2019)

  19. Qi, P., et al.: Stanza: a Python natural language processing toolkit for many human languages. In: ACL (2020)

  20. Ravichander, A., et al.: How would you say it? Eliciting lexically diverse dialogue for supervised semantic parsing. In: SIGDIAL (2017)

  21. Su, Y., et al.: Building natural language interfaces to web APIs. In: CIKM (2017)

  22. Thompson, B., Post, M.: Paraphrase generation as zero-shot multilingual translation. arXiv:2008.04935 (2020)

  23. Wang, W.Y., et al.: Crowdsourcing the acquisition of natural language corpora: methods and observations. In: SLT (2012)

  24. Wasow, T., Perfors, A., Beaver, D.: The puzzle of ambiguity. Morphology and the web of grammar: essays in memory of Steven G. Lapointe, pp. 265–282 (2005)

  25. Xu, Q., et al.: D-page: Diverse paraphrase generation. arXiv:1808.04364 (2018)

  26. Yaghoub-Zadeh-Fard, M., et al.: A study of incorrect paraphrases in crowdsourced user utterances. In: NAACL-HLT (2019)

  27. Yaghoub-Zadeh-Fard, M., et al.: Dynamic word recommendation to obtain diverse crowdsourced paraphrases of user utterances. In: IUI (2020)

  28. Yaghoub-Zadeh-Fard, M., et al.: User utterance acquisition for training task-oriented bots: a review of challenges, techniques and opportunities. IEEE Internet Computing (2020)

  29. Zhang, T., et al.: BERTScore: evaluating text generation with BERT. arXiv:1904.09675 (2019)

Author information

Corresponding author

Correspondence to Jorge Ramírez.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Ramírez, J., Baez, M., Berro, A., Benatallah, B., Casati, F. (2022). Crowdsourcing Syntactically Diverse Paraphrases with Diversity-Aware Prompts and Workflows. In: Franch, X., Poels, G., Gailly, F., Snoeck, M. (eds) Advanced Information Systems Engineering. CAiSE 2022. Lecture Notes in Computer Science, vol 13295. Springer, Cham. https://doi.org/10.1007/978-3-031-07472-1_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-07472-1_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-07471-4

  • Online ISBN: 978-3-031-07472-1

  • eBook Packages: Computer Science, Computer Science (R0)