A Set of Experiments to Consider Data Quality Criteria in Classification Techniques for Data Mining

Espinosa, Roberto; Zubcoff, José; Mazón, Jose-Norberto

doi:10.1007/978-3-642-21887-3_51

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6783)

Included in the following conference series:

International Conference on Computational Science and Its Applications

Abstract

A successful data mining process depends on the quality of the source data in order to obtain reliable knowledge. Data must therefore be preprocessed to address data quality criteria. However, preprocessing has traditionally been seen as a time-consuming and non-trivial task, since data quality criteria have to be considered without any guidance on how they affect the data mining process. To overcome this situation, in this paper we propose to analyze data mining techniques in order to understand how different data quality criteria behave on the sources and how they affect the results of the algorithms. To this end, we have conducted a set of experiments to assess three data quality criteria: completeness, correlation and balance of data. This work is a first step towards considering data quality criteria in a systematic and structured manner, in order to support and guide data miners in obtaining reliable knowledge.
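
As a rough illustration of the three criteria named in the abstract, the sketch below computes one simple indicator for each on a small in-memory table. The abstract does not say how the authors measure these criteria, so the definitions here are assumptions chosen for illustration: completeness as the fraction of non-missing cells, correlation as the Pearson coefficient between two numeric attributes, and balance as the ratio of the rarest to the most frequent class. The function names are hypothetical as well.

```python
from collections import Counter
from math import sqrt


def completeness(rows):
    """Fraction of cells that are not missing (None marks a missing value)."""
    cells = [value for row in rows for value in row]
    return sum(value is not None for value in cells) / len(cells)


def pearson(xs, ys):
    """Pearson correlation of two attributes, skipping pairs with a missing value.

    Assumes at least two complete pairs and non-constant attributes;
    otherwise the denominator is zero.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


def balance(labels):
    """Ratio of the rarest class count to the most frequent one (1.0 = balanced)."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# Toy data, invented for this example only.
rows = [[1.0, 2.0], [2.0, None], [3.0, 6.0], [4.0, 8.0]]
labels = ["a", "a", "a", "b"]

col0 = [row[0] for row in rows]
col1 = [row[1] for row in rows]
print(completeness(rows), pearson(col0, col1), balance(labels))
```

Indicators of this kind could be computed before training a classifier, so that a change in accuracy can be related to a known level of missing values, attribute correlation, or class imbalance.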

Author information

Authors and Affiliations

University of Matanzas, Cuba
Roberto Espinosa
Dept. of Sea Sciences and Applied Biology, University of Alicante, Spain
José Zubcoff
Dept. of Software and Computing Systems, University of Alicante, Spain
Jose-Norberto Mazón

Editor information

Editors and Affiliations

Basilicata University Potenza, L.I.S.U.T. - D.A.P.I.T., 10, Viale dell’Ateneo Lucano, 85100, Potenza, Italy
Beniamino Murgante
Department of Mathematics and Computer Science, University of Perugia, Via Vanvitelli, 1, 06123, Perugia, Italy
Osvaldo Gervasi
Department of Applied Mathematics and Computational Sciences, University of Cantabria, Avda. de los Castros, s/n, Santander, C.P. 39005, Spain
Andrés Iglesias
School of Business Systems, Monash University, VIC 3800, Clayton, Australia
David Taniar
Department of Intelligent Informatics, Kyushu Sangyo University, 2-3-1 Matsukadai, 813-8503, Higashi-ku, Fukuoka, Japan
Bernady O. Apduhan

About this paper

Cite this paper

Espinosa, R., Zubcoff, J., Mazón, JN. (2011). A Set of Experiments to Consider Data Quality Criteria in Classification Techniques for Data Mining. In: Murgante, B., Gervasi, O., Iglesias, A., Taniar, D., Apduhan, B.O. (eds) Computational Science and Its Applications - ICCSA 2011. ICCSA 2011. Lecture Notes in Computer Science, vol 6783. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21887-3_51

DOI: https://doi.org/10.1007/978-3-642-21887-3_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21886-6
Online ISBN: 978-3-642-21887-3
eBook Packages: Computer Science, Computer Science (R0)