Abstract
To make informed decisions, managers establish data warehouses that integrate multiple data sources. However, the outcomes of the data warehouse-based decisions are not always satisfactory due to low data quality. Although many studies focused on data quality management, little effort has been made to explore effective data quality control strategies for the data warehouse. In this study, we propose a chance-constrained programming model that determines the optimal strategy for allocating the control resources to mitigate the data quality problems of the data warehouse. We develop a modified Artificial Bee Colony algorithm to solve the model. Our work contributes to the literature on evaluation of data quality problem propagation in data integration process and data quality control on the data sources that make up the data warehouse. We use a data warehouse in the healthcare organization to illustrate the model and the effectiveness of the algorithm.
Similar content being viewed by others
References
Afshang, M., & Dhillon, H. S. (2018). Poisson cluster process based analysis of hetnets with correlated user and base station locations. IEEE Transactions on Wireless Communications, 17(4), 2417–2431.
Akay, B., & Karaboga, D. (2012). A modified artificial bee colony algorithm for real-parameter optimization. Information Sciences, 192, 120–142.
Allam, A., Skiadopoulos, S., & Kalnis, P. (2018). Improved suffix blocking for record linkage and entity resolution. Data & Knowledge Engineering, 117, 98–113.
Aquilani, B., Silvestri, C., Ruggieri, A., & Gatti, C. (2017). A systematic literature review on total quality management critical success factors and the identification of new avenues of research. The TQM Journal, 29(1), 184–213.
Arora, R., Pahwa, P., & Gupta, D. (2017). Data quality improvement in data warehouse: A framework. International Journal of Data Analysis Techniques & Strategies, 9(1), 17–33.
Bai, X., Krishnan, R., Padman, R., & Wang, H. J. (2013). On risk management with information flows in business processes. Information Systems Research, 24(3), 731–749.
Ballou, D. P., & Tayi, G. K. (1999). Enhancing data quality in data warehouse environments. Communications of the ACM, 42(1), 73–78.
Ballou, D. P., Chengalur-Smith, I. S. N., & Wang, R. Y. (2006). Sample-based quality estimation of query results in relational database environments. IEEE Transactions on Knowledge and Data Engineering, 18(5), 639–650.
Batini, C., & Scannapieco, M. (2016). Data and information quality: Dimensions, principles and techniques. Berlin: Springer.
Cannella, S., Framinan, J. M., Bruccoleri, M., Barbosa-Póvoa, A. P., & Relvas, S. (2015). The effect of inventory record inaccuracy in information exchange supply chains. European Journal of Operational Research, 243(1), 120–129.
Charnes, A., & Cooper, W. (1959). Chance-constrained programming. Management Science, 6(1), 73–79.
Chen, C. Y., Chi, Y. L., & Wolfe, P. (2005). An object-oriented quality framework with optimization models for managing data quality in data warehouse applications. International Journal of Operations Research, 2(2), 1–81.
Chen, L., Zhou, C., Li, X., & Dai, G. (2017). An improved differential evolution algorithm based on suboptimal solution mutation. International Journal of Computing Science and Mathematics, 8(1), 28–34.
Conforti, R., Dumas, M., García-Bañuelos, L., & La Rosa, M. (2016). Bpmn miner: Automated discovery of bpmn process models with hierarchical structure. Information Systems, 56, 284–303.
Dakrory, S. B., Mahmoud, T. M., & Ali, A. A. (2015). Automated etl testing on the data quality of a data warehouse. International Journal of Computer Applications, 131(16), 9–16.
Davidson, I., & Tayi, G. (2009). Data preparation using data quality matrices for classification mining. European Journal of Operational Research, 197(2), 764–772.
DeWitt, J. G., & Hampton, P. M. (2005). Development of a data warehouse at an academic health system: Knowing a place for the first time. Academic Medicine, 80(11), 1019–1025.
Dey, D., & Kumar, S. (2010). Reassessing data quality for information products. Management Science, 56(12), 2316–2322.
Dey, D., & Kumar, S. (2013). Data quality of query results with generalized selection conditions. Operations Research, 61(1), 17–31.
Even, A., Shankaranarayanan, G., & Berger, P. D. (2010). Evaluating a model for cost-effective data quality management in a real-world crm setting. Decision Support Systems, 50(1), 152–163.
Experian. (2016). The 2016 global data management benchmark report. Retrieved from Boston: https://www.edq.com/globalassets/white-papers/2016-global-data-management-benchmark-report.pdf
Experian. (2017). The 2017 global data management benchmark report. Retrieved from https://www.edq.com/globalassets/white-papers/2017-global-data-management-benchmark-report.pdf
Garcia-Bernardo, J., & Takes, F. W. (2018). The effects of data quality on the analysis of corporate board interlock networks. Information Systems, 78, 164–172.
Harkany, T., & Hagnermcwhirter, A. (2015). Quantitative western blotting: Improving your data quality and reproducibility. Science, 347(6225), 1022.
Hartzema, A. G., Reich, C. G., Ryan, P. B., Stang, P. E., Madigan, D., Welebob, E., & Overhage, J. M. (2013). Managing data quality for a drug safety surveillance system. Drug Safety, 36(1), 49–58.
Heinrich, B., Hristova, D., Klier, M., Schiller, A., & Szubartowicz, M. (2018). Requirements for data quality metrics. Journal of Data and Information Quality (JDIQ), 9(2), 12.
Jannot, A.-S., Zapletal, E., Avillach, P., Mamzer, M.-F., Burgun, A., & Degoulet, P. (2017). The georges pompidou university hospital clinical data warehouse: A 8-years follow-up experience. International Journal of Medical Informatics, 102, 21–28.
Jiang, Z., Sarkar, S., De, P., & Dey, D. (2007). A framework for reconciling attribute values from multiple data sources. Management Science, 53(12), 1946–1963.
Jones-Farmer, L. A., Ezell, J. D., & Hazen, B. T. (2014). Applying control chart methods to enhance data quality. Technometrics, 56(1), 29–41.
Lee, Y. W. (2006). Journey to data quality. Cambridge, MA: MIT Press.
Liu, X., Heller, A., & Nielsen, P. S. (2017). Citiesdata: A smart city data management framework. Knowledge and Information Systems, 53(3), 699–722.
Liu, Q., Feng, G., Wang, N., & Tayi, G. K. (2018). A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge. Information Systems Frontiers, 20(2), 401–416.
Lu, J., Feng, G., Lai, K. K., & Wang, N. (2017). The bullwhip effect on inventory: A perspective on information quality. Applied Economics, 49(24), 2322–2338.
Lukyanenko, R., Wiggins, A., & Rosser, H. K. (2019). Citizen science: An information quality research frontier. Information Systems Frontiers, 1–23. https://doi.org/10.1007/s10796-019-09915-z.
Manogaran, G., & Lopez, D. (2018). A gaussian process based big data processing framework in cluster computing environment. Cluster Computing, 21(1), 189–204.
Mohammed, A., & Talab, S. A. (2015). Enhanced extraction clinical data technique to improve data quality in clinical data warehouse. International Journal of Database Theory and Application, 8(3), 333–342.
Parssian, A., Sarkar, S., & Jacob, V. S. (2004). Assessing data quality for information products: Impact of selection, projection, and cartesian product. Management Science, 50(7), 967–982.
Parssian, A., Sarkar, S., & Jacob, V. S. (2009). Impact of the union and difference operations on the quality of information products. Information Systems Research, 20(1), 99–120.
Pittet, D., & Donaldson, L. (2006). Challenging the world: Patient safety and health care-associated infection. International Journal for Quality in Health Care, 18(1), 4–8.
Poojari, C. A., & Varghese, B. (2008). Genetic algorithm based technique for solving chance constrained problems. European Journal of Operational Research, 185(3), 1128–1154.
Qin, X., & Huang, G. (2009). An inexact chance-constrained quadratic programming model for stream water quality management. Water Resources Management, 23(4), 661–695.
Sagi, T., Gal, A., Barkol, O., Bergman, R., & Avram, A. (2017). Multi-source uncertain entity resolution: Transforming holocaust victim reports into people. Information Systems, 65, 124–136.
Sakalli, Ü. S. (2013). A simulated annealing approach for reliability-based chance-constrained programming. Applied Stochastic Models in Business & Industry, 30(4), 497–508.
Sebaa, A., Chikh, F., Nouicer, A., & Tari, A. (2018). Medical big data warehouse: Architecture and system design, a case study: Improving healthcare resources distribution. Journal of Medical Systems, 42(4), 59.
Subramanian, G. H., & Wang, K. (2017). Systems dynamics-based modeling of data warehouse quality. Journal of Computer Information Systems, 1–8. https://doi.org/10.1080/08874417.2017.1383863.
Szeto, W., Wu, Y., & Ho, S. C. (2011). An artificial bee colony algorithm for the capacitated vehicle routing problem. European Journal of Operational Research, 215(1), 126–135.
Wahyudi, A., Kuk, G., & Janssen, M. (2018). A process pattern model for tackling and improving big data quality. Information Systems Frontiers, 20(3), 457–469.
Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5–33.
Wang, Y. Y., Huang, G. H., Wang, S., Li, W., & Guan, P. B. (2016). A risk-based interactive multi-stage stochastic programming approach for water resources planning under dual uncertainties. Advances in Water Resources, 94, 217–230.
Watson, H. J., Fuller, C., & Ariyachandra, T. (2004). Data warehouse governance: Best practices at blue cross and blue shield of North Carolina. Decision Support Systems, 38(3), 435–450.
Watts, S., Shankaranarayanan, G., & Even, A. (2009). Data quality assessment in context: A cognitive perspective. Decision Support Systems, 48(1), 202–211.
Xu, Y., Wang, L., Xu, B., Jiang, W., Deng, C., Ji, F., & Xu, X. (2019). An information integration and transmission model of multi-source data for product quality and safety. Information Systems Frontiers, 21(1), 191–212.
Zak, Y., & Even, A. (2017). Development and evaluation of a continuous-time markov chain model for detecting and handling data currency declines. Decision Support Systems, 103, 82–93.
Zhu, H.-J., Jiang, T.-H., Wang, Y., Cheng, L., Ma, B., & Zhao, F. (2019). A data cleaning method for heterogeneous attribute fusion and record linkage. International Journal of Computational Science and Engineering, 19(3), 311–324.
Zong, W., Wu, F., & Feng, P. (2019). Improving data quality during erp implementation based on information product map. Enterprise Information Systems, 1–17. https://doi.org/10.1080/17517575.2019.1644669.
Acknowledgements
The research presented in this paper is supported by the National Natural Science Foundation Project of China (71572145).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendixes
Appendixes
1.1 Appendix A: Previous Relevant Research on Data Quality Improvement of Data Warehouse
1.2 Appendix B: Dimension Level Optimal Control Resource Allocation
Rights and permissions
About this article
Cite this article
Liu, Q., Feng, G., Tayi, G.K. et al. Managing Data Quality of the Data Warehouse: A Chance-Constrained Programming Approach. Inf Syst Front 23, 375–389 (2021). https://doi.org/10.1007/s10796-019-09963-5
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10796-019-09963-5