On Semi-supervised Learning with Sparse Data Handling for Educational Data Classification

Chau, Vo Thi Ngoc; Phung, Nguyen Hua

doi:10.1007/978-3-319-70004-5_11


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10646)

Included in the following conference series:

International Conference on Future Data and Security Engineering


Abstract

This paper investigates an educational data classification task at the program level. The task is to predict the final study status of each student from the second year to the fourth year of their study path, so that in-trouble students can be identified as early as possible. The task faces two main problems. The first is incomplete data, which arises when prediction is conducted early; the second is the lack of labeled data for supervised learning. To overcome these difficulties, our work proposes a robust semi-supervised learning method with sparse data handling in either a sequential or an iterative approach. The sparse data handling process imputes missing values with a k-nearest neighbors-based method, and the semi-supervised learning process, with a random forest model as the base learner, exploits the larger set of unlabeled data available in the task. These two processes can be conducted in sequence or integrated into each other for robustness and effectiveness in educational data classification. The experimental results show that our resulting robust random forest-based self-training algorithm with the iterative approach to sparse data handling outperforms the other algorithms that use sequential and traditional approaches. This algorithm provides a more effective classifier as a practical solution for educational data over time.
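
The sequential variant described in the abstract (impute first, then self-train) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the function names, the confidence threshold, the labels "ok" and "at_risk", and the toy grade vectors are all hypothetical, and a simple k-nearest-neighbor vote stands in for the random forest base learner so that the sketch needs nothing beyond the Python standard library. Missing grades are represented as None.

```python
import math
from collections import Counter


def partial_distance(a, b):
    """Euclidean distance over features observed in both rows, rescaled to full width."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=3):
    """Replace each None with the mean of that feature over the k nearest donor rows."""
    filled = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = sorted(
                (partial_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            donors = [v for d, v in candidates[:k] if d < math.inf]
            if not donors:  # fallback: column mean (assumes the column is not empty)
                donors = [r[j] for r in rows if r[j] is not None]
            new_row[j] = sum(donors) / len(donors)
        filled.append(new_row)
    return filled


def predict_with_confidence(train_X, train_y, x, k=3):
    """Stand-in base learner: k-NN vote; confidence is the winning vote share."""
    nearest = sorted(zip((math.dist(x, t) for t in train_X), train_y))[:k]
    label, count = Counter(lbl for _, lbl in nearest).most_common(1)[0]
    return label, count / len(nearest)


def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10, k=3):
    """Self-training: repeatedly move confidently predicted rows into the labeled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        if not pool:
            break
        preds = [predict_with_confidence(X_lab, y_lab, x, k) for x in pool]
        confident = [(x, lbl) for x, (lbl, conf) in zip(pool, preds) if conf >= threshold]
        if not confident:
            break  # nothing passes the threshold, so stop early
        X_lab += [x for x, _ in confident]
        y_lab += [lbl for _, lbl in confident]
        pool = [x for x, (_, conf) in zip(pool, preds) if conf < threshold]
    return X_lab, y_lab


# Hypothetical toy data: three course grades per student, None = not yet taken.
labeled = [[8.0, 7.5, 9.0], [7.5, 8.0, 8.5], [8.5, 8.0, 9.0],
           [3.0, 4.0, 2.5], [2.5, 3.5, 3.0], [3.5, 3.0, 2.0]]
labels = ["ok", "ok", "ok", "at_risk", "at_risk", "at_risk"]
unlabeled = [[8.0, None, 8.5], [3.0, 3.5, None]]

# Sequential approach: impute labeled and unlabeled rows together, then self-train.
imputed = knn_impute(labeled + unlabeled, k=3)
X_final, y_final = self_train(imputed[:len(labeled)], labels, imputed[len(labeled):])
print(y_final)
```

The abstract does not spell out how the iterative variant interleaves imputation with self-training, so the sketch covers only the sequential order; an iterative version would presumably re-run the imputation step inside the self-training loop.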



Acknowledgments

This research is funded by Vietnam National University Ho Chi Minh City, Vietnam, under grant number C2017-20-18.

Author information

Authors and Affiliations

Ho Chi Minh City University of Technology, Vietnam National University – HCMC, Ho Chi Minh City, Vietnam
Vo Thi Ngoc Chau & Nguyen Hua Phung


Corresponding author

Correspondence to Vo Thi Ngoc Chau.

Editor information

Editors and Affiliations

HCMC University of Technology, Ho Chi Minh City, Vietnam
Tran Khanh Dang
Johannes Kepler University Linz, Linz, Austria
Roland Wagner
Johannes Kepler University Linz, Linz, Austria
Josef Küng
Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Nam Thoai
Hosei University, Koganei, Tokyo, Japan
Makoto Takizawa
University of Vienna, Vienna, Austria
Erich J. Neuhold


About this paper

Cite this paper

Chau, V.T.N., Phung, N.H. (2017). On Semi-supervised Learning with Sparse Data Handling for Educational Data Classification. In: Dang, T., Wagner, R., Küng, J., Thoai, N., Takizawa, M., Neuhold, E. (eds) Future Data and Security Engineering. FDSE 2017. Lecture Notes in Computer Science, vol 10646. Springer, Cham. https://doi.org/10.1007/978-3-319-70004-5_11


DOI: https://doi.org/10.1007/978-3-319-70004-5_11
Published: 01 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70003-8
Online ISBN: 978-3-319-70004-5
eBook Packages: Computer Science, Computer Science (R0)
