Crowdsourcing Syntactically Diverse Paraphrases with Diversity-Aware Prompts and Workflows

  • Conference paper
  • First Online:
Advanced Information Systems Engineering (CAiSE 2022)

Abstract

Task-oriented bots (or simply bots) enable humans to perform tasks in natural language, for example, booking a restaurant or checking the weather. Crowdsourcing has become a prominent approach to building datasets for training and evaluating task-oriented bots: the crowd grows an initial seed of utterances through paraphrasing, i.e., reformulating a given seed into semantically equivalent sentences. In this context, diversity is a relevant dimension of high-quality datasets, as diverse paraphrases capture the many ways users may express an intent. Current techniques, however, either assume that crowd-powered paraphrases are naturally diverse or focus only on lexical diversity. In this paper, we address an overlooked aspect of diversity and introduce an approach for guiding the crowdsourcing process towards paraphrases that are syntactically diverse. We introduce a workflow and novel prompts, informed by syntax patterns, that elicit paraphrases avoiding or incorporating a desired syntax. Our empirical analysis indicates that our approach yields higher syntactic diversity, higher syntactic novelty, and a more uniform pattern distribution than state-of-the-art baselines, albeit at the cost of higher task effort.
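To make the notion of a "syntax pattern" concrete, the sketch below derives a shallow pattern from a bracketed constituency parse (Penn Treebank-style labels, as referenced in the notes) and compares two paraphrases of the same intent. The pattern definition here (the labels of the immediate children of the root clause) and the function name are illustrative assumptions, not the paper's exact formalization.

```python
# Illustrative sketch: characterize a paraphrase's syntax by the phrase
# labels directly under the root S node of its constituency parse.
# Two paraphrases with different patterns count as syntactically distinct.

def top_level_pattern(parse: str) -> list[str]:
    """Return the phrase labels of the immediate children of the root S node.

    `parse` is a bracketed parse string such as
    "(ROOT (S (NP ...) (VP ...) (. .)))".
    """
    start = parse.find("(S ")
    if start == -1:
        return []
    labels, depth = [], 0
    i = start + 2  # position just after "(S"
    while i < len(parse):
        ch = parse[i]
        if ch == "(":
            if depth == 0:
                # A new immediate child of S begins: read its label.
                j = i + 1
                while j < len(parse) and parse[j] not in " ()":
                    j += 1
                labels.append(parse[i + 1 : j])
            depth += 1
        elif ch == ")":
            if depth == 0:
                break  # this ")" closes the S node itself
            depth -= 1
        i += 1
    return labels

# Two paraphrases of the same booking intent, one declarative, one imperative.
p1 = "(ROOT (S (NP (PRP I)) (VP (VBP want) (NP (DT a) (NN flight))) (. .)))"
p2 = "(ROOT (S (VP (VB Book) (NP (DT a) (NN flight))) (. .)))"
print(top_level_pattern(p1))  # ['NP', 'VP', '.']
print(top_level_pattern(p2))  # ['VP', '.']
print(top_level_pattern(p1) != top_level_pattern(p2))  # True: distinct syntax
```

In practice the parses would come from a parser such as CoreNLP or Stanza (both cited in the references); a prompt could then ask workers to avoid, or to reproduce, a given pattern.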


Notes

  1.

    Refer to [26, 28] for other relevant quality aspects in crowdsourced paraphrases.

  2.

    Reference for bracket labels at https://gist.github.com/nlothian/9240750.

  3.

    We set \(k=2\) as prompts from prior art typically include two examples [17].

  4.

    Online supplementary material available at https://tinyurl.com/caise-2022-diversity.

  5.

    We used the dataset available at https://www.kaggle.com/siddhadev/atis-dataset-from-ms-cntk. The top-5 intents are those with the highest number of training items.

  6.

    We stress that BertScore was not designed specifically for assessing paraphrases, so it does not capture the full range of criteria of the more specific manual evaluation.

  7.

    The datasets can be found at https://github.com/jorgeramirez/syntactic-diversity.

References

  1. Bapat, R., Kucherbaev, P., Bozzon, A.: Effective crowdsourced generation of training data for chatbots natural language understanding. In: Mikkonen, T., Klamma, R., Hernández, J. (eds.) ICWE 2018. LNCS, vol. 10845, pp. 114–128. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91662-0_8

  2. Berro, A., Baez, M., Benatallah, B., Benabdeslem, K., Fard, M.-A.Y.Z.: Automated paraphrase generation with over-generation and pruning services. In: Hacid, H., Kao, O., Mecella, M., Moha, N., Paik, H. (eds.) ICSOC 2021. LNCS, vol. 13121, pp. 400–414. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-91431-8_25

  3. Berro, A., et al.: An extensible and reusable pipeline for automated utterance paraphrases. In: Proceedings of the VLDB Endowment (2021)

  4. Chen, M., et al.: Controllable paraphrase generation with a syntactic exemplar. In: ACL (2019)

  5. Coucke, A., et al.: Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. CoRR abs/1805.10190 (2018)

  6. Goyal, T., Durrett, G.: Neural syntactic preordering for controlled paraphrase generation. In: ACL (2020)

  7. Hemphill, C.T., et al.: The ATIS spoken language systems pilot corpus. In: Workshop Held at Hidden Valley, Pennsylvania, USA (1990)

  8. Huang, K.H., Chang, K.W.: Generating syntactically controlled paraphrases without using annotated parallel pairs. arXiv preprint arXiv:2101.10579 (2021)

  9. Iyyer, M., et al.: Adversarial example generation with syntactically controlled paraphrase networks. In: NAACL (2018)

  10. Jiang, Y., Kummerfeld, J.K., Lasecki, W.S.: Understanding task design trade-offs in crowdsourced paraphrase collection. In: ACL (2017)

  11. Kang, Y., et al.: Data collection for dialogue system: a startup perspective. In: Proceedings of the HLT, vol. 3, pp. 33–40 (2018)

  12. Larson, S., et al.: Outlier detection for improved data quality and diversity in dialog systems. In: NAACL-HLT (2019)

  13. Larson, S., et al.: Iterative feature mining for constraint-based data collection to increase data diversity and model robustness. In: EMNLP (2020)

  14. Lee, W., et al.: Effective quality assurance for data labels through crowdsourcing and domain expert collaboration. In: EDBT (2018)

  15. Liu, P., Liu, T.: Optimizing the design and cost for crowdsourced conversational utterances. In: KDD-DCCL (2019)

  16. Manning, C.D., et al.: The Stanford CoreNLP natural language processing toolkit. In: ACL (2014)

  17. Negri, M., et al.: Chinese whispers: cooperative paraphrase acquisition. In: LREC (2012)

  18. Park, S., et al.: Paraphrase diversification using counterfactual debiasing. In: AAAI (2019)

  19. Qi, P., et al.: Stanza: a Python natural language processing toolkit for many human languages. In: ACL (2020)

  20. Ravichander, A., et al.: How would you say it? Eliciting lexically diverse dialogue for supervised semantic parsing. In: SIGDIAL (2017)

  21. Su, Y., et al.: Building natural language interfaces to web APIs. In: CIKM (2017)

  22. Thompson, B., Post, M.: Paraphrase generation as zero-shot multilingual translation. arXiv:2008.04935 (2020)

  23. Wang, W.Y., et al.: Crowdsourcing the acquisition of natural language corpora: methods and observations. In: SLT (2012)

  24. Wasow, T., Perfors, A., Beaver, D.: The puzzle of ambiguity. Morphology and the web of grammar: essays in memory of Steven G. Lapointe, pp. 265–282 (2005)

  25. Xu, Q., et al.: D-page: Diverse paraphrase generation. arXiv:1808.04364 (2018)

  26. Yaghoub-Zadeh-Fard, M., et al.: A study of incorrect paraphrases in crowdsourced user utterances. In: NAACL-HLT (2019)

  27. Yaghoub-Zadeh-Fard, M., et al.: Dynamic word recommendation to obtain diverse crowdsourced paraphrases of user utterances. In: IUI (2020)

  28. Yaghoub-Zadeh-Fard, M., et al.: User utterance acquisition for training task-oriented bots: a review of challenges, techniques and opportunities. IEEE Internet Computing (2020)

  29. Zhang, T., et al.: BERTScore: evaluating text generation with BERT. arXiv:1904.09675 (2019)

Author information

Corresponding author

Correspondence to Jorge Ramírez.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Ramírez, J., Baez, M., Berro, A., Benatallah, B., Casati, F. (2022). Crowdsourcing Syntactically Diverse Paraphrases with Diversity-Aware Prompts and Workflows. In: Franch, X., Poels, G., Gailly, F., Snoeck, M. (eds) Advanced Information Systems Engineering. CAiSE 2022. Lecture Notes in Computer Science, vol 13295. Springer, Cham. https://doi.org/10.1007/978-3-031-07472-1_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-07472-1_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-07471-4

  • Online ISBN: 978-3-031-07472-1

  • eBook Packages: Computer Science, Computer Science (R0)