Abstract
A significant amount of business and scientific data is collected via question-and-answer reports, but these reports often suffer from data quality issues. Questionnaires frequently contain questions that require multiple answers, which we argue is a potential source of poor-quality answers. This paper introduces the notion of multi-focal questions and proposes a model for identifying them. The model consists of three phases: question pre-processing, feature engineering and question classification. We use six types of features: lexical/surface features; part-of-speech features; readability features; question structure, wording and placement features; question response type and format features; and question focus features. A comparative study of three machine learning algorithms (Bayes Net, Decision Tree and Support Vector Machine) is performed on a dataset of 150 questions obtained from the Carbon Disclosure Project, achieving an accuracy of 91%.
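To make the feature engineering phase more concrete, the following is a minimal sketch of how a few surface and readability features might be computed for a single question. It is an illustration under stated assumptions, not the authors' implementation: the cue lists `WH_WORDS` and `CONJUNCTIONS`, the vowel-group syllable heuristic, and the function names are all hypothetical, and the paper's actual feature set is richer. The readability score follows the standard Flesch Reading Ease formula.

```python
import re

# Hypothetical cue lists chosen for illustration; they are not taken from the paper.
WH_WORDS = {"what", "which", "who", "whom", "whose", "when", "where", "why", "how"}
CONJUNCTIONS = {"and", "or"}


def count_syllables(word: str) -> int:
    """Crude syllable estimate (an assumption): count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Returns 0.0 for empty input to avoid division by zero.
    """
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def extract_features(question: str) -> dict:
    """Compute a small, illustrative feature vector for one question."""
    tokens = re.findall(r"[A-Za-z']+", question.lower())
    return {
        "n_tokens": len(tokens),
        "n_question_marks": question.count("?"),
        "n_wh_words": sum(t in WH_WORDS for t in tokens),
        "n_conjunctions": sum(t in CONJUNCTIONS for t in tokens),
        "flesch_reading_ease": flesch_reading_ease(question),
    }


if __name__ == "__main__":
    example = "What are your emission targets, and how do you monitor progress?"
    print(extract_features(example))
```

Feature dictionaries of this kind could then be passed to any standard classifier. Counts of wh-words and conjunctions are used here only as plausible surface cues that a question may ask for more than one answer.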
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Zaki Ali, M.M., Nenadic, G., Theodoulidis, B. (2014). Identification of Multi-Focal Questions in Question and Answer Reports. In: Métais, E., Roche, M., Teisseire, M. (eds) Natural Language Processing and Information Systems. NLDB 2014. Lecture Notes in Computer Science, vol 8455. Springer, Cham. https://doi.org/10.1007/978-3-319-07983-7_17
DOI: https://doi.org/10.1007/978-3-319-07983-7_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07982-0
Online ISBN: 978-3-319-07983-7
eBook Packages: Computer Science, Computer Science (R0)