Abstract
While scientific applications can gather consistent data from the natural world, psychological, sociological, and even economic applications rely on data provided by people. Since the majority of machine learning is aimed at improving the lives of people, human input is essential for useful results. In this paper, we explore datasets in which both input and target attributes are provided by people taking surveys. Each of these survey datasets, generated from human input, is reliable and self-consistent according to Cronbach’s alpha. One would expect a reliable questionnaire to provide effective data for learning. Our analysis finds this expectation false when applied to supervised learning. We use both statistical analysis and several supervised learning architectures, with a focus on neural networks, to provide insight into data gathered through human input.
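As a rough illustration of the reliability measure named in the abstract, the sketch below computes Cronbach's alpha with the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of survey items. The function name `cronbach_alpha` and the toy responses are hypothetical and are not taken from the paper; the authors' own computation may differ.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents.

    `responses` holds one row per respondent; each row is a list of k
    numeric item scores. Sample variance (n - 1 denominator) is used.
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Transpose rows (respondents) into columns (items).
    items = list(zip(*responses))
    item_variance_sum = sum(variance(column) for column in items)
    # Variance of each respondent's total score across all items.
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical toy data: three respondents answering two items.
consistent = [[1, 2], [2, 3], [3, 4]]   # items move together
mixed = [[1, 2], [2, 1], [3, 3]]        # items agree only partly

alpha_consistent = cronbach_alpha(consistent)
alpha_mixed = cronbach_alpha(mixed)
```

A higher alpha indicates that the items vary together across respondents. It measures internal consistency only, which is why a high value does not by itself guarantee that the answers predict a target attribute well.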

Cite this article
Lovinger, J., Valova, I. The effect of human thought on data: an analysis of self-reported data in supervised learning and neural networks. Prog Artif Intell 6, 221–234 (2017). https://doi.org/10.1007/s13748-017-0118-4