Data Warehouse Quality: Summary and Outlook


Abstract

Data warehouses correlate data from various sources to enable reporting, data mining, and decision support. Some of the unique features of data warehouses (as compared to transactional databases) include data integration from multiple sources and emphasis on temporal, historical, and multidimensional data. In this chapter, we survey data warehouse quality problems and solutions, including data freshness (ensuring that materialized views are up to date as new data arrive over time), data completeness (capturing all the required history), data correctness (as defined by various types of integrity constraints, including those which govern how data may evolve over time), consistency, error detection and profiling, and distributed data quality.


Notes

  1. Recall that a functional dependency (FD) X → Y asserts that any two tuples that agree on the left-hand-side attributes (X) must also agree on the right-hand-side attributes (Y).
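As a rough illustration of this definition, the sketch below checks whether a small in-memory table satisfies an FD X → Y by grouping rows on their X values and flagging any group that contains more than one distinct Y value. The function name `fd_violations`, the dictionary-based row format, and the sample `zip`/`city` data are assumptions made for this example; they do not come from the chapter.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {X-value: set of Y-values} for every X-value that maps to
    more than one Y-value, i.e. every group that violates lhs -> rhs.

    rows: iterable of dicts (hypothetical row format for this sketch)
    lhs, rhs: lists of attribute names forming X and Y
    """
    seen = defaultdict(set)
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        seen[x].add(y)
    # The FD holds exactly when every X-value has a single Y-value.
    return {x: ys for x, ys in seen.items() if len(ys) > 1}


# Made-up sample data: the last two rows share a zip but disagree on city.
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "60601", "city": "Chicago", "name": "Cy"},
    {"zip": "60601", "city": "Evanston", "name": "Di"},
]

violations = fd_violations(rows, ["zip"], ["city"])
fd_holds = not violations
```

Here `violations` contains a single entry for zip `60601`, so the FD zip → city does not hold on this sample. Grouping by X takes one pass over the data, which is why this style of check is often preferred to comparing all pairs of tuples.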

Author information

Corresponding author

Correspondence to Lukasz Golab.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Golab, L. (2013). Data Warehouse Quality: Summary and Outlook. In: Sadiq, S. (ed.) Handbook of Data Quality. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36257-6_6

  • DOI: https://doi.org/10.1007/978-3-642-36257-6_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36256-9

  • Online ISBN: 978-3-642-36257-6

  • eBook Packages: Computer Science (R0)
