Sentence extraction with topic modeling for question–answer pair generation

  • Focus
  • Soft Computing

Abstract

Automatic QA pair generation has recently become an essential technique for reducing human involvement in the construction of QA systems. In the era of big data, a huge amount of information is produced every day, so it is important for QA systems to respond to users with up-to-date information, e.g., to answer questions about recent blog posts. The major problem in building such systems is capturing relevant text sources for a specific QA domain efficiently. In this study, topic modeling is used to determine efficiently whether an article belongs to the same topic as a specific domain of interest; the health domain serves as the example in this paper. QA pairs are then generated from the selected articles using the proposed sentence extraction method. Experimental results show that the proposed method with topic modeling improves the acceptance rate of the generated questions by 7.3%.
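
The topic-based filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the preview does not give their decision rule, so the collapsed Gibbs sampler, the hyperparameters (`alpha`, `beta`, `iters`), the cosine comparison against a domain centroid, and the helper names `lda_gibbs`, `cosine`, and `domain_similarity` are all assumptions made for illustration. Articles are assumed to be already tokenized into word lists.

```python
import math
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one topic distribution per document.

    Hyperparameter defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]                 # count of topic k in doc d
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # count of word v in topic k
    topic_total = [0] * num_topics                               # total words assigned to topic k
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][vocab[w]] += 1
            topic_total[k] += 1
        assign.append(zs)

    # Resample each token's topic from its conditional distribution.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = vocab[w], assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    thetas = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        thetas.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return thetas


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def domain_similarity(domain_docs, candidate, num_topics=2, **lda_kwargs):
    """Similarity between a candidate's topic mix and the mean topic mix of the domain docs.

    Assumption: the candidate is fitted jointly with the domain documents,
    which keeps the sketch short but is not how a deployed system would score new posts.
    """
    thetas = lda_gibbs(domain_docs + [candidate], num_topics, **lda_kwargs)
    domain_thetas, candidate_theta = thetas[:-1], thetas[-1]
    centroid = [sum(col) / len(domain_thetas) for col in zip(*domain_thetas)]
    return cosine(candidate_theta, centroid)
```

An article would then be kept for QA pair generation when `domain_similarity(health_docs, article_tokens)` exceeds a threshold tuned on held-out data. Both the threshold and the variable names in that call are hypothetical. On a toy corpus the sampler may not separate topics cleanly, so the score is only meaningful with a realistically sized domain collection.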

Author information

Corresponding author

Correspondence to Chung-Hsien Wu.

Additional information

Communicated by L. Xie.

Cite this article

Wu, CH., Liu, CH. & Su, PH. Sentence extraction with topic modeling for question–answer pair generation. Soft Comput 19, 39–46 (2015). https://doi.org/10.1007/s00500-014-1386-6

  • DOI: https://doi.org/10.1007/s00500-014-1386-6
