Data quality awareness: a case study for cost optimal association rule mining

  • Regular Paper
  • Knowledge and Information Systems

Abstract

The quality of discovered association rules is commonly evaluated with interestingness measures (typically support and confidence), which give the user indicators for understanding and using the newly discovered knowledge. Low-quality datasets strongly degrade the quality of the discovered association rules, and one might legitimately wonder whether a so-called “interesting” rule, noted LHS → RHS, is meaningful when 30% of the LHS data are no longer up to date, 20% of the RHS data are inaccurate, and 15% of the LHS data come from a data source well known for its poor credibility. This paper presents an overview of data quality characterization and management techniques that can be advantageously employed to improve the quality awareness of knowledge discovery and data mining processes. We propose integrating data quality indicators into quality-aware association rule mining, together with a cost-based probabilistic model for selecting legitimately interesting rules. Experiments on the challenging KDD-Cup-98 datasets show that variations in data quality have a great impact on the cost and quality of discovered association rules, and they confirm our approach of integrating the management of data quality indicators into the KDD process to ensure the quality of data mining results.
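As a reminder of the two interestingness measures named in the abstract, the following minimal sketch computes support and confidence for a rule LHS → RHS using their standard textbook definitions. The function names and the toy transactions are hypothetical illustrations, not taken from the paper, and the sketch does not attempt to reproduce the paper's cost-based probabilistic model.

```python
from typing import Iterable, List, FrozenSet


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    lhs: Iterable[str],
    rhs: Iterable[str],
) -> float:
    """support(LHS and RHS together) divided by support(LHS); 0.0 if LHS never occurs."""
    lhs_items = frozenset(lhs)
    lhs_support = support(transactions, lhs_items)
    if lhs_support == 0.0:
        return 0.0
    return support(transactions, lhs_items | frozenset(rhs)) / lhs_support


# Hypothetical toy data, for illustration only.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

Both measures are computed from raw counts only, so neither reflects whether the underlying LHS or RHS values are fresh, accurate, or credible. That gap is what the abstract's question about a so-called “interesting” rule points at.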

Author information

Correspondence to Laure Berti-Équille.

About this article

Cite this article

Berti-Équille, L. Data quality awareness: a case study for cost optimal association rule mining. Knowl Inf Syst 11, 191–215 (2007). https://doi.org/10.1007/s10115-006-0006-x

  • DOI: https://doi.org/10.1007/s10115-006-0006-x
