Managing Data Quality of the Data Warehouse: A Chance-Constrained Programming Approach

Liu, Qi; Feng, Gengzhong; Tayi, Giri Kumar; Tian, Jun

doi:10.1007/s10796-019-09963-5

Published: 02 December 2019

Information Systems Frontiers, Volume 23, pages 375–389 (2021)

Qi Liu^1,2, Gengzhong Feng^1,2, Giri Kumar Tayi^3 & Jun Tian^1,2

Abstract

To make informed decisions, managers establish data warehouses that integrate multiple data sources. However, the outcomes of decisions based on the data warehouse are not always satisfactory because of low data quality. Although many studies have focused on data quality management, little effort has been made to explore effective data quality control strategies for the data warehouse. In this study, we propose a chance-constrained programming model that determines the optimal strategy for allocating control resources to mitigate the data quality problems of the data warehouse. We develop a modified Artificial Bee Colony algorithm to solve the model. Our work contributes to the literature on evaluating how data quality problems propagate in the data integration process and on data quality control at the data sources that make up the data warehouse. We use a data warehouse in a healthcare organization to illustrate the model and the effectiveness of the algorithm.
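
For readers unfamiliar with chance constraints, the following minimal Python sketch shows the general idea only; it is not the model proposed in the article, whose formulation is not reproduced on this page. It assumes a single constraint of the form Pr{x ≤ R} ≥ ρ, where x is a hypothetical residual data quality problem level, R is a normally distributed tolerance, and ρ is the required confidence level. The function names, the direction of the inequality, and the reading of the second parameter of N(0.15, 0.01) as a variance (standard deviation 0.1) are all assumptions made for illustration.

```python
from statistics import NormalDist


def deterministic_bound(mean: float, std: float, rho: float) -> float:
    """Largest x with Pr{x <= R} >= rho when R ~ N(mean, std**2).

    Pr{R >= x} = 1 - Phi((x - mean) / std) >= rho
    is equivalent to x <= mean + std * Phi^{-1}(1 - rho).
    """
    return mean + std * NormalDist().inv_cdf(1.0 - rho)


def satisfies_chance_constraint(x: float, mean: float, std: float, rho: float) -> bool:
    """Check the (assumed) chance constraint through its deterministic equivalent."""
    return x <= deterministic_bound(mean, std, rho)


if __name__ == "__main__":
    # Illustrative values only: tolerance R ~ N(0.15, 0.1**2) (assumed variance 0.01).
    for rho in (0.75, 0.90):
        bound = deterministic_bound(0.15, 0.1, rho)
        print(f"rho={rho:.2f}: x must not exceed {bound:.4f}")
```

Under these assumptions, raising ρ tightens the bound on x, which is why a higher confidence level generally calls for more control resources. A search heuristic such as an Artificial Bee Colony algorithm could use a check like `satisfies_chance_constraint` to reject or penalize infeasible candidate allocations.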

Acknowledgements

The research presented in this paper is supported by the National Natural Science Foundation Project of China (71572145).

Author information

Authors and Affiliations

School of Management, Xi’an JiaoTong University, No. 28 Xianning Road, Xi’an, 710049, Shaanxi, China
Qi Liu, Gengzhong Feng & Jun Tian
The Key Lab of the Ministry of Education for Process Control and Efficiency Engineering, No. 28 Xianning Road, Xi’an, 710049, Shaanxi, China
Qi Liu, Gengzhong Feng & Jun Tian
School of Business, SUNY at Albany, Albany, NY, 12222, USA
Giri Kumar Tayi

Corresponding author

Correspondence to Gengzhong Feng.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendixes

1.1 Appendix A: Previous Relevant Research on Data Quality Improvement of Data Warehouse

Table 7 Summary of the Literature Review on Some Relevant Research Papers

1.2 Appendix B: Dimension Level Optimal Control Resource Allocation

Table 8 Control Allocation in the Case of DQP_R = N(0.15, 0.01), ρ = 0.75

Table 9 Control Allocation in the Case of DQP_R = N(0.15, 0.01), ρ = 0.90

Table 10 Control Allocation in the Case of DQP_R = N(0.20, 0.01), ρ = 0.75

Table 11 Control Allocation in the Case of DQP_R = N(0.20, 0.01), ρ = 0.90

About this article

Cite this article

Liu, Q., Feng, G., Tayi, G.K. et al. Managing Data Quality of the Data Warehouse: A Chance-Constrained Programming Approach. Inf Syst Front 23, 375–389 (2021). https://doi.org/10.1007/s10796-019-09963-5

Issue Date: April 2021