Abstract
A successful data mining process depends on the quality of its data sources to yield reliable knowledge. Data must therefore be preprocessed to address data quality criteria. However, preprocessing has traditionally been seen as a time-consuming and non-trivial task, since data quality criteria have to be considered without any guidance on how they affect the data mining process. To overcome this situation, in this paper we propose analyzing data mining techniques to learn how different data quality criteria behave in the sources and how they affect the results of the algorithms. To this end, we have conducted a set of experiments to assess three data quality criteria: completeness, correlation and balance of data. This work is a first step towards considering data quality criteria in a systematic and structured manner, in order to support and guide data miners in obtaining reliable knowledge.
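As an illustration only, the three criteria named above could be quantified on a small dataset along the following lines. The abstract does not say how the authors measure them, so the definitions here are assumptions: completeness as the fraction of non-missing cells, correlation as the Pearson coefficient between two attributes, and balance as the ratio of the smallest to the largest class count. The function names and the sample data are hypothetical.

```python
from collections import Counter
from math import sqrt


def completeness(rows):
    """Fraction of cells that are present (not None). Assumed definition."""
    cells = [value for row in rows for value in row]
    return sum(value is not None for value in cells) / len(cells)


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present.

    Raises ZeroDivisionError if either attribute is constant.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    dev_x = sqrt(sum((x - mean_x) ** 2 for x, _ in pairs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for _, y in pairs))
    return cov / (dev_x * dev_y)


def balance(labels):
    """Smallest class count divided by largest (1.0 means perfectly balanced)."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# Hypothetical sample: two numeric attributes (one missing cell) and a class label.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, None), (4.0, 8.0)]
labels = ["a", "a", "a", "b"]

print("completeness:", completeness(rows))
print("correlation: ", pearson([r[0] for r in rows], [r[1] for r in rows]))
print("balance:     ", balance(labels))
```

Scores like these could be computed before training a classifier, so that changes in its accuracy can be related to each criterion separately.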
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Espinosa, R., Zubcoff, J., Mazón, JN. (2011). A Set of Experiments to Consider Data Quality Criteria in Classification Techniques for Data Mining. In: Murgante, B., Gervasi, O., Iglesias, A., Taniar, D., Apduhan, B.O. (eds) Computational Science and Its Applications - ICCSA 2011. ICCSA 2011. Lecture Notes in Computer Science, vol 6783. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21887-3_51
DOI: https://doi.org/10.1007/978-3-642-21887-3_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21886-6
Online ISBN: 978-3-642-21887-3
eBook Packages: Computer Science (R0)