Semi-supervised learning for question classification in CQA

Li, Yiyang; Su, Lei; Chen, Jun; Yuan, Liwei

doi:10.1007/s11047-016-9554-5

Published: 05 May 2016

Volume 16, pages 567–577 (2017)

Natural Computing

Abstract

In a community question answering (CQA) system, new untagged questions appear continuously, and each must be assigned a label. Question classification is therefore very important for CQA. Traditional question classification requires a large number of labeled questions. In practice, unlabeled question samples are easy to obtain in large numbers, whereas labeled question samples are expensive to obtain. How to use unlabeled samples to improve question classification accuracy has therefore become a core problem in question classification. In this paper, a semi-supervised question classification method based on ensemble learning is proposed. First, several classifiers are combined into a single ensemble classifier, which is trained on a small number of labeled question samples. Second, the trained preliminary classifier assigns a pseudo label to each unlabeled question sample. Then, the ensemble classifier is trained again on the labeled question samples together with the large number of pseudo-labeled question samples. Finally, the effectiveness of the method is verified through experiments on question samples of 15 classes extracted from a community question answering system. The experiments demonstrate that the method can effectively use a large number of unlabeled question samples to improve question classification accuracy.
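The train, pseudo-label, retrain loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which base classifiers, features, or voting rule the paper uses, so the naive Bayes members (differing only in smoothing), the majority vote, and every name below (`NaiveBayes`, `VotingEnsemble`, `self_train`, the toy questions) are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes over whitespace tokens (assumed base learner)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing strength, must be > 0

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_one(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.class_counts[label] / total)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class VotingEnsemble:
    """Combines several classifiers into one by majority vote (assumed rule)."""

    def __init__(self, members):
        self.members = members

    def fit(self, texts, labels):
        for member in self.members:
            member.fit(texts, labels)
        return self

    def predict(self, texts):
        predictions = []
        for text in texts:
            votes = Counter(m.predict_one(text) for m in self.members)
            predictions.append(votes.most_common(1)[0][0])
        return predictions


def self_train(ensemble, labeled_texts, labels, unlabeled_texts):
    # Step 1: train the ensemble on the small labeled set.
    ensemble.fit(labeled_texts, labels)
    # Step 2: give every unlabeled question a pseudo label.
    pseudo_labels = ensemble.predict(unlabeled_texts)
    # Step 3: retrain on labeled plus pseudo-labeled questions.
    ensemble.fit(labeled_texts + unlabeled_texts, labels + pseudo_labels)
    return ensemble, pseudo_labels


# Toy data, invented for illustration only.
labeled = ["how to cook rice", "best recipe for soup",
           "how to fix python error", "install linux driver"]
labels = ["food", "food", "tech", "tech"]
unlabeled = ["recipe for rice soup", "python driver error on linux kernel"]

ensemble = VotingEnsemble([NaiveBayes(a) for a in (0.5, 1.0, 2.0)])
ensemble, pseudo = self_train(ensemble, labeled, labels, unlabeled)
print(pseudo)
print(ensemble.predict(["kernel panic", "soup recipe with rice"]))
```

In this toy run the word "kernel" occurs only in an unlabeled question, so the retrained ensemble can use it as evidence while the preliminary one cannot. A single pseudo-labeling pass without confidence filtering is the simplest reading of the abstract; whether the paper iterates or filters pseudo labels is not stated there.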

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61365010).

Author information

Authors and Affiliations

School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650051, China
Yiyang Li, Lei Su, Jun Chen & Liwei Yuan

Corresponding author

Correspondence to Lei Su.

About this article

Cite this article

Li, Y., Su, L., Chen, J. et al. Semi-supervised learning for question classification in CQA. Nat Comput 16, 567–577 (2017). https://doi.org/10.1007/s11047-016-9554-5

Issue Date: December 2017
