Dataset Weighting via Intrinsic Data Characteristics for Pairwise Statistical Comparisons in Classification

Sáez, José A.; Villacorta, Pablo; Corchado, Emilio

doi:10.1007/978-3-030-29859-3_6


Conference paper
First Online: 26 August 2019


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11734)

Abstract

In supervised learning, some data characteristics (e.g. the presence of errors or the degree of overlap) may negatively influence classifier performance. Many methods are designed to overcome the undesirable effects of these issues. When comparing one of those techniques with existing ones, datasets must be selected properly, based on how well each dataset reflects the characteristic specifically addressed by the proposed algorithm. In this setting, statistical tests are necessary to check the significance of the differences found when comparing methods. Wilcoxon’s signed-ranks test is one of the best-known statistical tests for pairwise comparisons between classifiers. However, it gives the same importance to every dataset, disregarding how representative each of them is of the concrete issue addressed by the methods compared. This research proposes a hybrid approach that combines data characterization measures with statistical tests for decision making in data mining. Each dataset is thus weighted according to its representativeness of the property of interest before Wilcoxon’s test is applied. Our proposal has been successfully compared with the standard Wilcoxon’s test in two scenarios related to the noisy data problem. As a result, this approach makes it easier to highlight properties of the algorithms that may otherwise remain hidden if data characteristics are not considered in the comparison.
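The idea of weighting each dataset before applying a signed-ranks test can be illustrated with a minimal sketch. The abstract does not specify how the weights are computed or combined with the ranks, so everything below is an assumption made for illustration: the function name `weighted_wilcoxon`, the choice to multiply each rank by a normalized dataset weight, and the exact sign-flip enumeration used to obtain a p-value are hypothetical and should not be read as the authors' procedure.

```python
import itertools

import numpy as np
from scipy.stats import rankdata


def weighted_wilcoxon(acc_a, acc_b, weights):
    """Two-sided sign-flip test on weighted signed ranks (illustrative only).

    acc_a, acc_b: per-dataset performance of the two classifiers.
    weights: non-negative representativeness of each dataset (assumed given).
    """
    d = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    w = np.asarray(weights, dtype=float)

    # Discard datasets where both methods tie (one common convention).
    keep = d != 0
    d, w = d[keep], w[keep]
    if d.size == 0 or w.sum() <= 0:
        raise ValueError("need at least one non-tied dataset with positive weight")

    # Rank absolute differences, then scale each rank by its normalized weight.
    scores = rankdata(np.abs(d)) * w / w.sum()
    observed = float(np.sum(np.sign(d) * scores))

    # Null hypothesis: each difference is equally likely to be + or -.
    # Enumerate all 2^n sign assignments (practical only for small n).
    extreme = 0
    total = 0
    for signs in itertools.product((-1.0, 1.0), repeat=d.size):
        stat = float(np.dot(signs, scores))
        if abs(stat) >= abs(observed) - 1e-12:
            extreme += 1
        total += 1
    return observed, extreme / total


if __name__ == "__main__":
    # Made-up accuracies and weights, purely to show the call.
    acc_a = [0.81, 0.77, 0.90, 0.68, 0.85]
    acc_b = [0.79, 0.78, 0.84, 0.61, 0.80]
    noise_weight = [0.05, 0.10, 0.30, 0.40, 0.15]
    print(weighted_wilcoxon(acc_a, acc_b, noise_weight))
```

With equal weights the scores are just the usual ranks up to a constant factor, so this sketch gives the same p-value as an exact two-sided signed-ranks test. Unequal weights let the datasets that are more representative of the property of interest dominate the statistic.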



Acknowledgment

José A. Sáez holds a Juan de la Cierva-formación fellowship (Ref. FJCI-2015-25547) from the Spanish Ministry of Economy, Industry and Competitiveness.

Author information

Authors and Affiliations

Department of Computer Science and Automatics, University of Salamanca, Plaza de los Caídos s/n, 37008, Salamanca, Spain
José A. Sáez & Emilio Corchado
Department of Computer Science and Artificial Intelligence, CITIC-UGR, University of Granada, 18071, Granada, Spain
Pablo Villacorta

Authors

José A. Sáez
Pablo Villacorta
Emilio Corchado

Corresponding author

Correspondence to José A. Sáez.

Editor information

Editors and Affiliations

University of León, León, Spain
Hilde Pérez García
University of León, León, Spain
Lidia Sánchez González
University of León, León, Spain
Manuel Castejón Limas
University of A Coruña, Ferrol, Spain
Héctor Quintián Pardo
University of Salamanca, Salamanca, Spain
Emilio Corchado Rodríguez

About this paper

Cite this paper

Sáez, J.A., Villacorta, P., Corchado, E. (2019). Dataset Weighting via Intrinsic Data Characteristics for Pairwise Statistical Comparisons in Classification. In: Pérez García, H., Sánchez González, L., Castejón Limas, M., Quintián Pardo, H., Corchado Rodríguez, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2019. Lecture Notes in Computer Science, vol 11734. Springer, Cham. https://doi.org/10.1007/978-3-030-29859-3_6


DOI: https://doi.org/10.1007/978-3-030-29859-3_6
Published: 26 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29858-6
Online ISBN: 978-3-030-29859-3
eBook Packages: Computer Science, Computer Science (R0)
