Measuring and Modelling Data Quality for Quality-Awareness in Data Mining

  • Chapter
Quality Measures in Data Mining

Part of the book series: Studies in Computational Intelligence (SCI, volume 43)

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Berti-Équille, L. (2007). Measuring and Modelling Data Quality for Quality-Awareness in Data Mining. In: Guillet, F.J., Hamilton, H.J. (eds) Quality Measures in Data Mining. Studies in Computational Intelligence, vol 43. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-44918-8_5

  • DOI: https://doi.org/10.1007/978-3-540-44918-8_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44911-9

  • Online ISBN: 978-3-540-44918-8

  • eBook Packages: Engineering, Engineering (R0)
