A Set of Experiments to Consider Data Quality Criteria in Classification Techniques for Data Mining

  • Conference paper
Computational Science and Its Applications - ICCSA 2011 (ICCSA 2011)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6783)

Abstract

A successful data mining process depends on the quality of its data sources in order to obtain reliable knowledge. Preprocessing is therefore required to deal with data quality criteria. However, preprocessing has traditionally been seen as a time-consuming and non-trivial task, since data quality criteria have to be considered without any guidance on how they affect the data mining process. To overcome this situation, in this paper we propose to analyze data mining techniques in order to understand how different data quality criteria behave in the sources and how they affect the results of the algorithms. To this end, we have conducted a set of experiments to assess three data quality criteria: completeness, correlation, and balance of data. This work is a first step towards considering data quality criteria in a systematic and structured manner, so as to support and guide data miners in obtaining reliable knowledge.
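
As an illustration of what the three criteria named above might look like in practice, the following minimal Python sketch computes one simple measure for each. The measures are assumptions chosen for illustration (fraction of non-missing cells, Pearson correlation between two attributes, and minority-to-majority class ratio); the abstract does not state how the paper quantifies them, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def completeness(rows):
    """Fraction of cells that are not None (assumed measure of completeness)."""
    cells = [value for row in rows for value in row]
    return sum(value is not None for value in cells) / len(cells)


def pearson(xs, ys):
    """Pearson correlation of two attributes, skipping pairs with a missing value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


def balance(labels):
    """Minority-to-majority class count ratio; 1.0 means perfectly balanced."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# Toy data set, invented for this sketch: two numeric attributes and a class label.
rows = [(1.0, 2.0), (2.0, None), (3.0, 6.0), (4.0, 8.0)]
labels = ["a", "a", "a", "b"]

col0 = [row[0] for row in rows]
col1 = [row[1] for row in rows]

print("completeness:", completeness(rows))
print("correlation: ", pearson(col0, col1))
print("balance:     ", balance(labels))
```

Measures of this kind could be computed on a source before training a classifier, so that differences in classifier results can be related to the quality profile of the input data.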



Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Espinosa, R., Zubcoff, J., Mazón, JN. (2011). A Set of Experiments to Consider Data Quality Criteria in Classification Techniques for Data Mining. In: Murgante, B., Gervasi, O., Iglesias, A., Taniar, D., Apduhan, B.O. (eds) Computational Science and Its Applications - ICCSA 2011. ICCSA 2011. Lecture Notes in Computer Science, vol 6783. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21887-3_51

  • DOI: https://doi.org/10.1007/978-3-642-21887-3_51

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21886-6

  • Online ISBN: 978-3-642-21887-3

  • eBook Packages: Computer Science, Computer Science (R0)
