Abstract
Data warehouses correlate data from various sources to enable reporting, data mining, and decision support. Some of the unique features of data warehouses (as compared to transactional databases) include data integration from multiple sources and emphasis on temporal, historical, and multidimensional data. In this chapter, we survey data warehouse quality problems and solutions, including data freshness (ensuring that materialized views are up to date as new data arrive over time), data completeness (capturing all the required history), data correctness (as defined by various types of integrity constraints, including those which govern how data may evolve over time), consistency, error detection and profiling, and distributed data quality.
Notes
1. Recall that a functional dependency (FD) X → Y asserts that two tuples having the same value of the left-hand-side attributes (X) must also agree on the right-hand-side attributes (Y).
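The FD definition in this note can be illustrated with a minimal sketch, which is not taken from the chapter. It groups tuples by their X values and reports any group whose Y values disagree. The function name fd_violations, the attribute names (zip, city, name), and the sample rows are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {X-value: set of Y-values} for every X-value that maps to
    more than one distinct Y-value, i.e. every violation of X -> Y.

    rows: list of dicts (one dict per tuple)
    lhs:  list of attribute names forming X
    rhs:  list of attribute names forming Y
    """
    seen = defaultdict(set)
    for row in rows:
        x_value = tuple(row[a] for a in lhs)
        y_value = tuple(row[a] for a in rhs)
        seen[x_value].add(y_value)
    # An FD holds only if each X-value is paired with exactly one Y-value.
    return {x: ys for x, ys in seen.items() if len(ys) > 1}


# Hypothetical sample relation: the last two tuples share a zip code
# but disagree on the city, so zip -> city is violated.
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "60601", "city": "Chicago", "name": "Cy"},
    {"zip": "60601", "city": "Boston", "name": "Di"},
]

violations = fd_violations(rows, ["zip"], ["city"])
for x_value, y_values in violations.items():
    print(x_value, "maps to", sorted(y_values))
```

An empty result means the FD holds on the given tuples. It says nothing about tuples that have not yet arrived, which matters in a warehouse that is loaded incrementally.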
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Golab, L. (2013). Data Warehouse Quality: Summary and Outlook. In: Sadiq, S. (ed.) Handbook of Data Quality. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36257-6_6
DOI: https://doi.org/10.1007/978-3-642-36257-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36256-9
Online ISBN: 978-3-642-36257-6
eBook Packages: Computer Science (R0)