Abstract
Master data sets are an important asset for organizations, and their quality must be high to ensure organizational success. At the same time, data migrations are complex projects that often result in impaired data sets of lower quality. In particular, data quality issues that involve multiple attributes are difficult to identify and can only be resolved with manual data quality checks. In this paper, we investigate a real-world migration of material master data. Our goal is to ensure data quality by mining the target data set for data quality rules. In a data migration, incoming data sets must comply with these rules to be migrated. To generate data quality rules, we used a support vector machine (SVM) for rules at the schema level and Association Rule Learning for rules at the instance level. We found that both methods produce valuable rules and are suitable for ensuring quality in data migrations. As an ensemble, the two methods are adequate to manage common real-world data characteristics such as sparsity or mixed values.
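To illustrate the instance-level idea described in the abstract, the following is a minimal sketch, not the paper's implementation. It mines single-antecedent rules of the form (attribute = value) implies (attribute = value) from a target data set using support and confidence, then checks an incoming record against them. The attribute names (`material_type`, `unit`, `plant`), the sample records, and the thresholds are hypothetical, and the pairwise counting is a simplified stand-in for full association rule learning algorithms such as Apriori or FP-growth.

```python
from collections import Counter
from itertools import permutations


def mine_rules(records, min_support=0.5, min_confidence=0.9):
    """Mine rules (attr_a, val_a) -> (attr_b, val_b) from the target data set."""
    n = len(records)
    item_counts = Counter()
    pair_counts = Counter()
    for record in records:
        # Skip missing values so that sparse attributes do not produce rules.
        items = [(k, v) for k, v in record.items() if v is not None]
        item_counts.update(items)
        # Ordered pairs: each pair is a candidate rule lhs -> rhs.
        pair_counts.update(permutations(items, 2))
    rules = []
    for (lhs, rhs), count in pair_counts.items():
        support = count / n                    # share of records containing both items
        confidence = count / item_counts[lhs]  # share of lhs records that also contain rhs
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs))
    return rules


def violations(record, rules):
    """Return the rules whose antecedent matches the record but whose consequent does not."""
    return [
        (lhs, rhs)
        for lhs, rhs in rules
        if record.get(lhs[0]) == lhs[1] and record.get(rhs[0]) != rhs[1]
    ]


# Hypothetical target data set (already migrated, assumed to be of good quality).
target = [
    {"material_type": "RAW", "unit": "KG", "plant": "P1"},
    {"material_type": "RAW", "unit": "KG", "plant": "P2"},
    {"material_type": "RAW", "unit": "KG", "plant": None},
    {"material_type": "FIN", "unit": "PC", "plant": "P1"},
]

rules = mine_rules(target, min_support=0.5, min_confidence=1.0)

# Hypothetical incoming record that contradicts the rule RAW -> KG.
incoming = {"material_type": "RAW", "unit": "PC", "plant": "P1"}
broken = violations(incoming, rules)
print(broken)
```

In a migration setting, a record with a non-empty violation list would be held back for review rather than migrated. Restricting rules to a single antecedent keeps the sketch short; rules over larger attribute combinations would need a proper frequent itemset miner.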
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Altendeitering, M. (2021). Mining Data Quality Rules for Data Migrations: A Case Study on Material Master Data. In: Margaria, T., Steffen, B. (eds) Leveraging Applications of Formal Methods, Verification and Validation. ISoLA 2021. Lecture Notes in Computer Science, vol 13036. Springer, Cham. https://doi.org/10.1007/978-3-030-89159-6_12
DOI: https://doi.org/10.1007/978-3-030-89159-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89158-9
Online ISBN: 978-3-030-89159-6
eBook Packages: Computer Science, Computer Science (R0)