Abstract
Task-oriented bots (or simply bots) enable humans to perform tasks in natural language, for example, booking a restaurant or checking the weather. Crowdsourcing has become a prominent approach to building datasets for training and evaluating task-oriented bots, where the crowd grows an initial seed of utterances through paraphrasing, i.e., reformulating a given seed into semantically equivalent sentences. In this context, the resulting diversity is a relevant dimension of high-quality datasets, as diverse paraphrases capture the many ways users may express an intent. Current techniques, however, either assume that crowd-powered paraphrases are naturally diverse or focus only on lexical diversity. In this paper, we address an overlooked aspect of diversity and introduce an approach for guiding the crowdsourcing process towards paraphrases that are syntactically diverse. We introduce a workflow and novel prompts that are informed by syntax patterns to elicit paraphrases avoiding or incorporating desired syntax. Our empirical analysis indicates that our approach yields higher syntactic diversity, higher syntactic novelty, and a more uniform pattern distribution than state-of-the-art baselines, albeit at the cost of higher task effort.
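The abstract reports a "more uniform pattern distribution" as one outcome. The paper's exact metric is not reproduced here; as a hedged illustration only, uniformity over syntax-pattern counts can be sketched as normalized Shannon entropy, where the pattern strings (e.g., top-level constituency labels) are assumed inputs from a parser:

```python
from collections import Counter
from math import log2

def pattern_distribution_uniformity(patterns):
    """Normalized Shannon entropy of syntax-pattern counts.

    Returns 1.0 when every distinct pattern occurs equally often,
    and 0.0 when a single pattern dominates (or only one exists).
    """
    counts = Counter(patterns)
    n = sum(counts.values())
    if len(counts) < 2:
        return 0.0  # a single pattern carries no diversity
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    return entropy / log2(len(counts))  # normalize by max entropy

# Toy example: top-level patterns of five hypothetical paraphrases
patterns = ["VP", "SQ", "S(NP VP)", "SQ", "S(NP VP)"]
print(pattern_distribution_uniformity(patterns))
```

The pattern strings and the entropy-based measure are illustrative assumptions, not the paper's definitions; the paper's own metrics are described in its evaluation section.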
Notes
- 1.
- 2. Reference for bracket labels at https://gist.github.com/nlothian/9240750.
- 3. We set \(k=2\) as prompts from prior art typically include two examples [17].
- 4. Online supplementary material available at https://tinyurl.com/caise-2022-diversity.
- 5. We used the dataset available at https://www.kaggle.com/siddhadev/atis-dataset-from-ms-cntk. The top-5 intents are those with the highest number of training items.
- 6. We stress that BertScore was not designed specifically for assessing paraphrases, so it does not capture the full range of criteria of the more specific manual evaluation.
- 7. The datasets can be found at https://github.com/jorgeramirez/syntactic-diversity.
References
Bapat, R., Kucherbaev, P., Bozzon, A.: Effective crowdsourced generation of training data for chatbots natural language understanding. In: Mikkonen, T., Klamma, R., Hernández, J. (eds.) ICWE 2018. LNCS, vol. 10845, pp. 114–128. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91662-0_8
Berro, A., Baez, M., Benatallah, B., Benabdeslem, K., Fard, M.-A.Y.Z.: Automated paraphrase generation with over-generation and pruning services. In: Hacid, H., Kao, O., Mecella, M., Moha, N., Paik, H. (eds.) ICSOC 2021. LNCS, vol. 13121, pp. 400–414. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-91431-8_25
Berro, A., et al.: An extensible and reusable pipeline for automated utterance paraphrases. In: Proceedings of the VLDB Endowment (2021)
Chen, M., et al.: Controllable paraphrase generation with a syntactic exemplar. In: ACL (2019)
Coucke, A., et al.: Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. CoRR abs/1805.10190 (2018)
Goyal, T., Durrett, G.: Neural syntactic preordering for controlled paraphrase generation. In: ACL (2020)
Hemphill, C.T., et al.: The ATIS spoken language systems pilot corpus. In: Workshop Held at Hidden Valley, Pennsylvania, USA (1990)
Huang, K.H., Chang, K.W.: Generating syntactically controlled paraphrases without using annotated parallel pairs. arXiv preprint arXiv:2101.10579 (2021)
Iyyer, M., et al.: Adversarial example generation with syntactically controlled paraphrase networks. In: NAACL (2018)
Jiang, Y., Kummerfeld, J.K., Lasecki, W.S.: Understanding task design trade-offs in crowdsourced paraphrase collection. In: ACL (2017)
Kang, Y., et al.: Data collection for dialogue system: a startup perspective. In: Proceedings of the HLT, vol. 3, pp. 33–40 (2018)
Larson, S., et al.: Outlier detection for improved data quality and diversity in dialog systems. In: NAACL-HLT (2019)
Larson, S., et al.: Iterative feature mining for constraint-based data collection to increase data diversity and model robustness. In: EMNLP (2020)
Lee, W., et al.: Effective quality assurance for data labels through crowdsourcing and domain expert collaboration. In: EDBT (2018)
Liu, P., Liu, T.: Optimizing the design and cost for crowdsourced conversational utterances. In: KDD-DCCL (2019)
Manning, C.D., et al.: The Stanford CoreNLP natural language processing toolkit. In: ACL (2014)
Negri, M., et al.: Chinese whispers: cooperative paraphrase acquisition. In: LREC (2012)
Park, S., et al.: Paraphrase diversification using counterfactual debiasing. In: AAAI (2019)
Qi, P., et al.: Stanza: a Python natural language processing toolkit for many human languages. In: ACL (2020)
Ravichander, A., et al.: How would you say it? Eliciting lexically diverse dialogue for supervised semantic parsing. In: SIGDIAL (2017)
Su, Y., et al.: Building natural language interfaces to web APIs. In: CIKM (2017)
Thompson, B., Post, M.: Paraphrase generation as zero-shot multilingual translation. arXiv:2008.04935 (2020)
Wang, W.Y., et al.: Crowdsourcing the acquisition of natural language corpora: methods and observations. In: SLT (2012)
Wasow, T., Perfors, A., Beaver, D.: The puzzle of ambiguity. Morphology and the web of grammar: essays in memory of Steven G. Lapointe, pp. 265–282 (2005)
Xu, Q., et al.: D-PAGE: diverse paraphrase generation. arXiv:1808.04364 (2018)
Yaghoub-Zadeh-Fard, M., et al.: A study of incorrect paraphrases in crowdsourced user utterances. In: NAACL-HLT (2019)
Yaghoub-Zadeh-Fard, M., et al.: Dynamic word recommendation to obtain diverse crowdsourced paraphrases of user utterances. In: IUI (2020)
Yaghoub-Zadeh-Fard, M., et al.: User utterance acquisition for training task-oriented bots: a review of challenges, techniques and opportunities. IEEE Internet Computing (2020)
Zhang, T., et al.: BERTScore: evaluating text generation with BERT. arXiv:1904.09675 (2019)
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Ramírez, J., Baez, M., Berro, A., Benatallah, B., Casati, F. (2022). Crowdsourcing Syntactically Diverse Paraphrases with Diversity-Aware Prompts and Workflows. In: Franch, X., Poels, G., Gailly, F., Snoeck, M. (eds) Advanced Information Systems Engineering. CAiSE 2022. Lecture Notes in Computer Science, vol 13295. Springer, Cham. https://doi.org/10.1007/978-3-031-07472-1_15
Print ISBN: 978-3-031-07471-4
Online ISBN: 978-3-031-07472-1