Abstract
Question classification, the task of automatically assigning unlabeled questions to predefined categories (or topics), and the effective retrieval of similar questions are crucial aspects of an effective community question answering (cQA) service. For question classification, we first address the problems associated with estimating and utilizing the distribution of words in each category as word weights. We then apply an automatic expansion-word generation technique, based on our proposed weighting method and pseudo relevance feedback, to question classification. Second, to address the lexical gap problem in question retrieval, the case frame of a sentence is first defined using components extracted from the sentence, and a similarity measure based on the case frame and word embeddings is then derived to determine the similarity between two sentences. These similarities are then used to reorder the results of the first retrieval model. Consequently, the proposed methods significantly improve the performance of question classification and retrieval.
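The reordering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the case frame is not modeled at all, the word vectors are hand-made toy values rather than trained embeddings, and the names `TOY_VECTORS`, `rerank`, and the interpolation parameter `weight` are hypothetical. The sketch only shows the general idea of rescoring first-pass retrieval results with an embedding-based sentence similarity, assuming first-pass scores are already normalized to the range 0 to 1.

```python
import math

# Toy 3-dimensional word vectors (assumption: illustrative values, not trained embeddings).
TOY_VECTORS = {
    "laptop": [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.0],
    "buy": [0.1, 0.9, 0.0],
    "purchase": [0.2, 0.8, 0.0],
    "cook": [0.0, 0.1, 0.9],
    "pasta": [0.0, 0.0, 1.0],
}


def sentence_vector(tokens):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(query_tokens, candidates, weight=0.5):
    """Reorder (tokens, first_pass_score) pairs by a combined score.

    The combined score linearly interpolates the first-pass score with the
    embedding similarity; `weight` is a hypothetical tuning parameter.
    """
    q = sentence_vector(query_tokens)
    rescored = []
    for tokens, base_score in candidates:
        c = sentence_vector(tokens)
        sim = cosine(q, c) if q is not None and c is not None else 0.0
        rescored.append((tokens, (1 - weight) * base_score + weight * sim))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# The first-pass model ranks the lexically unrelated question higher (0.6 vs 0.5).
query = ["buy", "laptop"]
candidates = [
    (["cook", "pasta"], 0.6),
    (["purchase", "notebook"], 0.5),
]
ranked = rerank(query, candidates)
```

In this toy setup the query shares no words with either candidate, so a purely lexical model cannot separate them; the embedding similarity moves the paraphrase ("purchase notebook") above the unrelated question, which is the lexical gap effect the abstract refers to.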
Notes
It is similar to the spacing words
Acknowledgments
This work was supported by an Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2013-2-00131, Development of Knowledge Evolutionary WiseQA Platform Technology for Human Knowledge Augmented Services).
About this article
Cite this article
Bae, K., Ko, Y. Efficient question classification and retrieval using category information and word embedding on cQA services. J Intell Inf Syst 53, 27–49 (2019). https://doi.org/10.1007/s10844-019-00556-x
DOI: https://doi.org/10.1007/s10844-019-00556-x