Semi-supervised learning for question classification in CQA

Natural Computing

Abstract

In a community question answering (CQA) system, new questions arrive continuously without tags, and each must be assigned a label. Question classification is therefore very important for CQA. Traditional question classification requires a large number of labeled questions. In practice, unlabeled question samples are easy to obtain in large numbers, whereas labeled question samples are fairly expensive to obtain. How to use unlabeled samples to improve question classification accuracy has therefore become a core problem in question classification. This paper proposes a semi-supervised question classification method based on ensemble learning. First, several classifiers are combined into a single ensemble classifier, which is trained on a small number of labeled question samples. Second, this preliminary trained classifier assigns a pseudo label to each unlabeled question sample. The ensemble classifier is then retrained on the labeled question samples together with the large number of pseudo-labeled question samples. Finally, the effectiveness of the method is verified through experiments on question samples from 15 classes extracted from a community question answering system. The experiments demonstrate that the method can effectively use a large number of unlabeled question samples to improve question classification accuracy.
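The three training steps in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the base classifiers, the voting rule, or any confidence filtering of pseudo labels. The names `NaiveBayes`, `OverlapClassifier`, `VotingEnsemble`, and `pseudo_label_train`, the majority vote, and the toy questions are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes over tokens (assumed base learner)."""

    def __init__(self, alpha):
        self.alpha = alpha  # additive smoothing constant

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_one(self, text):
        total = sum(self.class_counts.values())
        known = [w for w in tokenize(text) if w in self.vocab]
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            scores[label] = math.log(n / total) + sum(
                math.log((counts[w] + self.alpha) / denom) for w in known
            )
        return max(sorted(scores), key=scores.get)


class OverlapClassifier:
    """Picks the class whose training vocabulary shares the most tokens."""

    def fit(self, texts, labels):
        self.vocab = defaultdict(set)
        for text, label in zip(texts, labels):
            self.vocab[label].update(tokenize(text))
        return self

    def predict_one(self, text):
        tokens = set(tokenize(text))
        return max(sorted(self.vocab), key=lambda c: len(tokens & self.vocab[c]))


class VotingEnsemble:
    """Combines several classifiers into one by majority vote (assumed rule)."""

    def __init__(self, members):
        self.members = members

    def fit(self, texts, labels):
        for member in self.members:
            member.fit(texts, labels)
        return self

    def predict_one(self, text):
        votes = Counter(m.predict_one(text) for m in self.members)
        return votes.most_common(1)[0][0]


def pseudo_label_train(ensemble, labeled_texts, labels, unlabeled_texts):
    # Step 1: train the ensemble on the small labeled set.
    ensemble.fit(labeled_texts, labels)
    # Step 2: the preliminary ensemble assigns a pseudo label to each unlabeled question.
    pseudo = [ensemble.predict_one(t) for t in unlabeled_texts]
    # Step 3: retrain on the labeled plus pseudo-labeled questions.
    ensemble.fit(labeled_texts + unlabeled_texts, labels + pseudo)
    return ensemble, pseudo


# Toy data, invented for illustration only.
labeled_texts = ["how do i install python on windows", "best pasta recipe for dinner"]
labels = ["computers", "food"]
unlabeled_texts = ["python script will not install", "quick dinner recipe with pasta"]

model, pseudo = pseudo_label_train(
    VotingEnsemble([NaiveBayes(1.0), NaiveBayes(0.1), OverlapClassifier()]),
    labeled_texts, labels, unlabeled_texts,
)
print(pseudo, model.predict_one("python install error"))
```

In this sketch every pseudo label is accepted as-is. A practical variant might keep only the unlabeled questions on which the ensemble members agree, but whether the paper does so cannot be determined from the abstract.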

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61365010).

Author information

Corresponding author

Correspondence to Lei Su.

About this article

Cite this article

Li, Y., Su, L., Chen, J. et al. Semi-supervised learning for question classification in CQA. Nat Comput 16, 567–577 (2017). https://doi.org/10.1007/s11047-016-9554-5

  • DOI: https://doi.org/10.1007/s11047-016-9554-5
