Abstract
In a community question answering (CQA) system, new questions arrive continuously without tags, and each of them must be assigned labels. Question classification is therefore very important for CQA. Traditional question classification requires a large number of labeled questions. In practice, unlabeled question samples are easy to obtain in large numbers, whereas large sets of labeled question samples are expensive to produce. How to exploit unlabeled samples to improve classification accuracy has therefore become a core problem in question classification. This paper proposes a semi-supervised question classification method based on ensemble learning. First, several classifiers are combined into a single ensemble classifier, which is trained on a small number of labeled question samples. Second, this preliminary classifier assigns a pseudo label to each unlabeled question sample. The ensemble classifier is then retrained on the labeled question samples together with the large number of pseudo-labeled question samples. Finally, the effectiveness of the method is verified through experiments on question samples from 15 classes extracted from a community question answering system. The experiments demonstrate that the method can effectively use a large number of unlabeled question samples to improve question classification accuracy.
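The three-step procedure described in the abstract (train an ensemble on the labeled questions, pseudo-label the unlabeled ones, retrain on both) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the base learners (three naive Bayes classifiers over unigram, bigram, and word-prefix features), the majority vote, the single pseudo-labeling round, and the toy questions and class names.

```python
import math
from collections import Counter, defaultdict

# Three hypothetical feature views; the paper's actual features are not given here.
def unigrams(text):
    return text.lower().split()

def bigrams(text):
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def prefixes(text):
    # Crude stemming: keep the first four characters of every word.
    return [word[:4] for word in text.lower().split()]

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over one feature view."""

    def __init__(self, featurize):
        self.featurize = featurize

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(self.featurize(text))
        self.vocab = set().union(*self.feature_counts.values())
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())

        def log_score(label):
            counts = self.feature_counts[label]
            # The extra +1 reserves probability mass for unseen features.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(self.class_counts[label] / total)
            for feature in self.featurize(text):
                score += math.log((counts[feature] + 1) / denom)
            return score

        # Sorting makes tie-breaking deterministic (alphabetical order).
        return max(sorted(self.class_counts), key=log_score)

class Ensemble:
    """Combines several classifiers into one by majority vote (an assumed rule)."""

    def __init__(self, members):
        self.members = members

    def fit(self, texts, labels):
        for member in self.members:
            member.fit(texts, labels)
        return self

    def predict(self, text):
        votes = Counter(member.predict(text) for member in self.members)
        return votes.most_common(1)[0][0]

def self_train(ensemble, labeled, unlabeled):
    """One round of pseudo-labeling; returns the (question, pseudo label) pairs."""
    texts, labels = zip(*labeled)
    ensemble.fit(texts, labels)                               # 1. labeled data only
    pseudo = [(q, ensemble.predict(q)) for q in unlabeled]    # 2. pseudo labels
    ensemble.fit(list(texts) + [q for q, _ in pseudo],        # 3. retrain on both
                 list(labels) + [y for _, y in pseudo])
    return pseudo

# Toy data, invented for this sketch.
labeled = [("how to fix python syntax error", "programming"),
           ("best way to cook pasta sauce", "cooking")]
unlabeled = ["python error in my loop", "best way to cook rice",
             "pasta sauce recipe with garlic", "fix syntax error in loop"]

ensemble = Ensemble([NaiveBayes(unigrams), NaiveBayes(bigrams), NaiveBayes(prefixes)])
pseudo = self_train(ensemble, labeled, unlabeled)
print(pseudo)
print(ensemble.predict("my loop never ends"))
```

In this toy run the word "loop" occurs only in the unlabeled questions, so the retrained ensemble can use it as evidence for the "programming" class, which an ensemble trained on the two labeled questions alone cannot do. A practical system would likely also filter pseudo labels by confidence and repeat the procedure for several rounds; neither refinement is shown here.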




Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 61365010).
Cite this article
Li, Y., Su, L., Chen, J. et al. Semi-supervised learning for question classification in CQA. Nat Comput 16, 567–577 (2017). https://doi.org/10.1007/s11047-016-9554-5
DOI: https://doi.org/10.1007/s11047-016-9554-5