Managing Data Quality of the Data Warehouse: A Chance-Constrained Programming Approach

Information Systems Frontiers

Abstract

To make informed decisions, managers establish data warehouses that integrate multiple data sources. However, the outcomes of data warehouse-based decisions are not always satisfactory because of low data quality. Although many studies have focused on data quality management, little effort has been made to explore effective data quality control strategies for the data warehouse. In this study, we propose a chance-constrained programming model that determines the optimal strategy for allocating control resources to mitigate the data quality problems of the data warehouse. We develop a modified Artificial Bee Colony algorithm to solve the model. Our work contributes to the literature on evaluating how data quality problems propagate through the data integration process and on controlling data quality at the data sources that make up the data warehouse. We use a data warehouse in a healthcare organization to illustrate the model and the effectiveness of the algorithm.
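
As a rough illustration of what a chance constraint expresses (a generic textbook form, not the model developed in the article): a requirement P(ξ ≤ t) ≥ α with ξ ~ N(μ, σ²) is equivalent to the deterministic requirement t ≥ μ + σ·Φ⁻¹(α), where Φ⁻¹ is the standard normal quantile function. The Python sketch below computes that threshold. The function name, its parameters, and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def chance_threshold(mean: float, std: float, confidence: float) -> float:
    """Smallest t such that P(xi <= t) >= confidence for xi ~ N(mean, std**2).

    This is the deterministic equivalent of a single chance constraint
    with a normally distributed quantity (an illustrative assumption).
    """
    if std <= 0:
        raise ValueError("std must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    # Standard normal quantile, scaled and shifted to N(mean, std**2).
    return mean + std * NormalDist().inv_cdf(confidence)


if __name__ == "__main__":
    # Hypothetical inputs: a higher confidence level demands a larger threshold.
    for alpha in (0.75, 0.90):
        print(alpha, chance_threshold(0.2, 0.1, alpha))
```

Raising the confidence level tightens the constraint, which is why a stricter reliability requirement generally calls for more control resources in models of this kind.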

Acknowledgements

The research presented in this paper is supported by the National Natural Science Foundation Project of China (71572145).

Author information

Corresponding author

Correspondence to Gengzhong Feng.

Appendixes

Appendix A: Previous Relevant Research on Data Quality Improvement of Data Warehouse

Table 7 Summary of the Literature Review on Some Relevant Research Papers

Appendix B: Dimension Level Optimal Control Resource Allocation

Table 8 Control Allocation in the Case of DQPR = N(0.15, 0.01), ρ = 0.75
Table 9 Control Allocation in the Case of DQPR = N(0.15, 0.01), ρ = 0.90
Table 10 Control Allocation in the Case of DQPR = N(0.20, 0.01), ρ = 0.75
Table 11 Control Allocation in the Case of DQPR = N(0.20, 0.01), ρ = 0.90

About this article

Cite this article

Liu, Q., Feng, G., Tayi, G.K. et al. Managing Data Quality of the Data Warehouse: A Chance-Constrained Programming Approach. Inf Syst Front 23, 375–389 (2021). https://doi.org/10.1007/s10796-019-09963-5

  • DOI: https://doi.org/10.1007/s10796-019-09963-5
