Abstract
Assessing the quality of information system schemas is crucial, because an unoptimized or erroneous schema design has a strong impact on the quality of the stored data, e.g., it may lead to inconsistencies and anomalies at the data level. Even if the initial schema had an ideal design, changes during the life cycle can negatively affect schema quality and must be addressed. Big Data environments in particular pose two major challenges: large schemas, for which manual verification of schema and data quality is very arduous, and the integration of heterogeneous schemas from different data models, whose quality cannot be compared directly. Thus, we present a domain-independent approach for automatically measuring the quality of large and heterogeneous (logical) schemas. In contrast to existing approaches, we provide a fully automatable workflow that also enables regular reassessment. Our implementation measures the quality dimensions correctness, completeness, pertinence, minimality, readability, and normalization.
Notes
- 1. https://www.w3.org/OWL [December, 2018].
- 2. https://www.w3c.org/TR/turtle [December, 2018].
- 3. https://dev.mysql.com/doc/employee/en [December, 2018].
- 4. https://dev.mysql.com/doc/sakila/en [December, 2018].
- 5. https://archive.codeplex.com/?p=chinookdatabase [December, 2018].
- 6.
- 7. https://www.alphavantage.co [December, 2018].
- 8. http://dqm.faw.jku.at [December, 2018].
Acknowledgments
The research reported in this paper has been supported by the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry for Digital and Economic Affairs, and the Province of Upper Austria in the frame of the COMET center SCCH.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Ehrlinger, L., Wöß, W. (2019). Automated Schema Quality Measurement in Large-Scale Information Systems. In: Hacid, H., Sheng, Q., Yoshida, T., Sarkheyli, A., Zhou, R. (eds) Data Quality and Trust in Big Data. QUAT 2018. Lecture Notes in Computer Science, vol 11235. Springer, Cham. https://doi.org/10.1007/978-3-030-19143-6_2
DOI: https://doi.org/10.1007/978-3-030-19143-6_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-19142-9
Online ISBN: 978-3-030-19143-6
eBook Packages: Computer Science (R0)