Abstract
Existing methodologies for identifying data quality problems are typically user-centric: data quality requirements are first determined in a top-down manner following well-established design guidelines, organizational structures and data governance frameworks. In the current data landscape, however, users are often confronted with new, unexplored datasets that they may not own, but that they perceive as relevant and potentially valuable. Such repurposed datasets can be found in government open data portals, data markets and several publicly available data repositories. In such scenarios, applying top-down data quality checking approaches is not feasible, as the consumers of the data have no control over its creation and governance. Hence, data consumers (data scientists and analysts) need to be empowered with data exploration capabilities that allow them to investigate and understand the quality of such datasets and so make well-informed decisions on their use. This research aims to develop such an approach for discovering data quality problems using generic exploratory methods that can be effectively applied in settings where data creation and use are separated. The approach, named LANG, is developed through a Design Science approach on the basis of semiotics theory and data quality dimensions. LANG is empirically validated in terms of the soundness of the approach, its repeatability and its generalizability.

Source: Sadiq and Indulska (2017), data available from https://catalog.data.gov

Source: Peffers et al. (2007)

Source: Authors, informed by Peffers et al. (2007)

Source: Authors
Notes
The researchers named the approach ‘LANG’. ‘Lang’ conveys the meaning of ‘becoming clear’ in Chinese, which fits the aim of the approach: to make clear the data quality requirements of a dataset.
The mapping is omitted due to length considerations but is available from the authors upon request.
The datasets were downloaded between June and August 2016. We note that the datasets are frequently updated in the respective open data portals, including changes to metadata such as adding or removing columns, as well as adding or removing other documentation related to the dataset. Hence, the current versions of the datasets may not have the same data quality problems as those identified in our study.
In this paper we demonstrate the application of LANG using a relational database (MySQL). We present the overall approach in the body of the paper and the SQL instantiation of the method in Appendix A.
Some detail is abstracted in this figure for visual simplicity, in particular the sequencing between some of the individual checks, which may result in certain checks or stages being skipped (depending on the analysis results).
“The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modelling, data visualization, machine learning, and much more.” (jupyter.org).
References
Abedjan Z, Golab L, Naumann F (2015) Profiling relational data: a survey. VLDB J Int J Very Large Data Bases 24(4):557–581
Almars A (2016) Automated data quality discovery tool. Master Thesis, The University of Queensland
Batini C, Scannapieco M (2006) Data quality—concepts, methodologies and techniques. Springer, Heidelberg
Batini C, Francalanci C, Cappiello C, Maurino A (2009) Methodologies for data quality assessment and improvement. ACM Comput Surv 41(3):1–52
Belkin R, Patil D (2013) Everything we wish we’d known about building data products. http://firstround.com/review/everything-we-wish-wed-known-about-building-data-products/. Accessed 14 Nov 2018
Bohannon P, Fan W, Geerts F, Jia X, Kementsietsidis A (2007) Conditional functional dependencies for data cleaning. In: IEEE 23rd international conference on data engineering, pp 746–755
Byrne B, Kling J, Mccarty D, Sauter G, Smith H, Worcester P (2008) The information perspective of SOA design, part 6: the value of applying the data quality analysis pattern in SOA. IBM Corporation
Caballero I, Verbo E, Calero C, Piattini M (2007) A data quality measurement information model based on ISO/IEC 15939. In: Proceedings of the 12th international conference on information quality, pp 393–408
Caballero I, Verbo E, Calero C, Piattini M (2008) MMPRO: a methodology based on ISO/IEC 15939 to draw up data quality measurement processes. In: Proceedings of the 13th international conference on information quality, pp 326–340
Chakraborti S, Dey S (2019) Analysis of competitor intelligence in the era of big data. Bus Inf Syst Eng 61(3):345–355
Clarke R (2016) Big data, big risks. Inf Syst J 26(1):77–90
Corsar D, Edwards P (2017) Challenges of open data quality: more than just license, format, and customer support. ACM J Data Inf Qual 9(1):3:1–3:4
Dasu T, Johnson T (2003) Exploratory data mining and data cleaning. Wiley, New York
Duus R, Cooray M (2016) The future will be built on open data—here’s why. http://theconversation.com/the-future-will-be-built-on-open-data-heres-why-52785. Accessed 14 Nov 2018
Ehling M, Körner T (2007) Handbook on data quality assessment methods and tools. European Commission, Eurostat
Elbaz G (2012) Data markets: the emerging data economy. http://techcrunch.com/2012/09/30/data-markets-the-emerging-data-economy/. Accessed 14 Nov 2018
English LP (1999) Improving data warehouse and business information quality. Wiley
English LP (2009) Information quality applied. Best practices for improving business information, processes and systems. Wiley, New York
Eppler MJ (2001) The concept of information quality. Stud Commun Sci 1(2):167–182
Fan W, Geerts F (2012) Foundations of data quality management. Synth Lect Data Manag 4(5):1–217
Fisher T (2009) The data asset: how smart companies govern their data for business success. Wiley, New York
Gatling GCBR, Champlin R, Stefani H, Weigel G (2007) Enterprise information management with SAP. Galileo, Boston
Gregor S, Jones D (2007) The anatomy of a design theory. J Assoc Inf Syst 8(5):312–335
Hernández MA, Stolfo SJ (1998) Real-world data is dirty. Data cleansing and the merge/purge problem. Data Min Knowl Discov 2(1):9–37
Hevner AR, March ST, Park J, Ram S (2004) Design science in information systems research. MIS Q 28(1):75–105
Hey AJG, Trefethen AE (2003) The data deluge. An e-science perspective. https://eprints.soton.ac.uk/257648/1/The_Data_Deluge.pdf. Accessed 3 July 2019, pp 809–824
HIQA (2011) International review of data quality. Health Information and Quality Authority (HIQA), Ireland. http://www.hiqa.ie/press-release/2011-04-28-international-review-data-quality. Accessed 2 Oct 2017
ISO (2011) ISO/TS 8000-1 data quality part 1: overview. ISO
ISO (2012) ISO 8000-2 data quality-part 2-vocabulary. ISO
Jayawardene V, Sadiq S, Indulska M (2013a) An analysis of data quality dimensions. School of Information Technology and Electrical Engineering, The University of Queensland, ITEE Technical Report
Jayawardene V, Sadiq S, Indulska M (2013b) The curse of dimensionality in data quality. In: 24th Australasian conference on information systems. RMIT University, pp 1–11
Judah S, Friedman T (2015) Magic quadrant for data quality tools. Gartner
Kenett RS, Shmueli G (2014) On information quality. J R Stat Soc Ser A 177(1):3–38
Kim J, Hausenblas M (2012) 5 * Open Data. https://5stardata.info/en/. Accessed 14 Nov 2018
Köhler H, Leck U, Link S (2013) Possible and certain SQL keys. Department of Computer Science, The University of Auckland
Köhler H, Link S, Zhou X (2015) Possible and certain SQL keys. Proc VLDB Endow 8(11):1118–1129
Krogstie J (2002) A semiotic approach to quality in requirements specifications. In: Proceedings of the IFIP TC8/WG8 (1), pp 231–249
Krogstie J, Lindland OI, Sindre G (1995a) Defining quality aspects for conceptual models. In: Falkenberg ED, Hesse W, Olivé A (eds) Information system concepts. Springer, Boston, pp 216–231
Krogstie J, Lindland OI, Sindre G (1995b) Towards a deeper understanding of quality in requirements engineering. In: International conference on advanced information systems engineering. Springer, Heidelberg, pp 82–95
Krueger R, Casey M (1994) Focus groups. A practical guide for applied research. Sage Publications, Thousand Oaks
Lee YW, Strong DM, Kahn BK, Wang RY (2002) AIMQ: a methodology for information quality assessment. Inf Manag 40(2):133–146
Lindland OI, Sindre G, Solvberg A (1994) Understanding quality in conceptual modeling. IEEE Softw 11(2):42–49
Loshin D (2001) Enterprise knowledge management. The data quality approach. Morgan Kaufmann, Burlington
Loshin D (2006) Monitoring data quality performance using data quality metrics. Informatica Corporation, Redwood City
Maydanchik A (2007) Data quality assessment. Technics Publications, New Jersey
McGilvray D (2008) Executing data quality projects: ten steps to quality data and trusted information. Morgan Kaufmann, Burlington
Morgan DL (ed) (1993) Sage focus editions. Successful focus groups: advancing the state of the art, vol 156. Sage Publications, Thousand Oaks
Morris CW (1938) Foundations of the theory of signs. In: Langford CH (ed) International encyclopedia of unified science. University of Chicago Press, London
Naumann F, Rolker C (2000) Assessment methods for information quality criteria. Humboldt-Universität zu Berlin, Informatik-Berichte, Berlin
OMB U (2002) Guidelines for ensuring and maximizing the quality, objectivity, utility, and integrity of information disseminated by federal agencies, part IX. Office of Management and Budget
Peffers K, Tuunanen T, Rothenberger MA, Chatterjee S (2007) A design science research methodology for information systems research. J Manag Inf Syst 24(3):45–77
Peirce CS (1931–1935) Collected papers. Harvard University Press, Cambridge
Pipino L, Lee YW, Wang RY (2002) Data quality assessment. Commun ACM 45(4):211–218
Powell RA, Single HM (1996) Focus groups. Int J Qual Health Care 8:499–504. https://doi.org/10.1093/intqhc/8.5.499
Prat N (2019) Augmented analytics. Bus Inf Syst Eng 61(3):375–380
Price R, Shanks G (2004) A semiotic information quality framework. In: Proceedings of the international conference on decision support systems, pp 658–672
Price R, Shanks G (2005a) A semiotic information quality framework: development and comparative analysis. J Inf Technol 20(2):88–102
Price RJ, Shanks G (2005b) Empirical refinement of a semiotic information quality framework. In: Proceedings of the 38th annual Hawaii international conference on system sciences, Big Island, pp 216a
Raman V, Hellerstein JM (2001) Potter’s wheel: an interactive data cleaning system. In: Proceedings of the 27th VLDB conference, Rome, pp 381–390
Rosemann M, Vessey I (2008) Toward improving the relevance of information systems research to practice: the role of applicability checks. MIS Q 32(1):1–22
Sadiq S, Indulska M (2017) Open data: quality over quantity. Int J Inf Manag 37(3):150–154
Sadiq S, Yeganeh NK, Indulska M (2011) 20 years of data quality research: themes, trends and synergies. In: 22nd Australasian database conference, Perth, pp 153–162
Scannapieco M, Virgillito A, Marchetti C, Mecella M, Baldoni R (2004) The DaQuinCIS architecture: a platform for exchanging and improving data quality in cooperative information systems. Inf Syst 29(7):551–582
Selvage M, Saul J, Jain A (2017) Magic quadrant for data quality tools. Gartner
Shanks GG, Darke P (1998) Understanding data quality and data warehousing: a semiotic approach. IQ, pp 292–309
Shanks G, Tansley E (2002) Data quality tagging and decision outcomes. An experimental study. IFIP Working Group, pp 399–410
Sismanis Y, Brown P, Haas PJ, Reinwald B (2006) Gordian: efficient and scalable discovery of composite keys. In: Proceedings of the 32nd international conference on very large data bases, VLDB Endowment, pp 691–702
Song S, Chen L (2011) Differential dependencies: reasoning and discovery. ACM Trans Database Syst 36(3):16
Sonnenberg C, vom Brocke J (2012) Evaluations in the science of the artificial. Reconsidering the build-evaluate pattern in design science research. In: Peffers K, Rothenberger M, Kuechler B (eds) Design science research in information systems, vol 7286. Advances in theory and practice. DESRIST. Lecture notes in computer science. Springer, Heidelberg
Stamper RK (1992) Review of Andersen “Theory of Computer Semiotics”. Comput J 1
Stamper R (1993) A semiotic theory of information and information systems/applied semiotics. In: Invited Papers for the ICL/University of Newcastle Seminar on “Information”, September 6–10
Storey V, Wang R (2001) Extending the ER model to represent data quality requirements. Kluwer, Dordrecht
Sturm B, Sunyaev A (2019) Design principles for systematic search systems. Bus Inf Syst Eng 61(1):91–111
Stvilia B, Gasser L, Twidale MB, Smith LC (2007) A framework for information quality assessment. J Am Soc Inf Sci Technol 58(12):1720–1733
Tu SY, Wang Y-YR (1993) Modeling data quality and context through extension of the ER model. Total Data Quality Management Research Program, Sloan School of Management, Massachusetts Institute of Technology, Cambridge
Venable J, Pries-Heje J, Baskerville R (2012) A comprehensive framework for evaluation in design science research. In: Peffers K, Rothenberger M, Kuechler B (eds) Design science research in information systems, vol 7286. Advances in theory and practice. Springer, Heidelberg, pp 423–438
Venable J, Pries-Heje J, Baskerville R (2016) FEDS: a framework for evaluation in design science research. Eur J Inf Syst 25(1):77–89
Wand Y, Wang RY (1996) Anchoring data quality dimensions in ontological foundations. Commun ACM 39(11):86–95
Wang R (1998) A product perspective on total data quality management. Commun ACM 41(2):58–65
Wang RY, Strong DM (1996) Beyond accuracy: what data quality means to data consumers. J Manag Inf Syst 12(4):5–33
Wang R, Ziad M, Lee Y (2001) Data quality. Kluwer, Dordrecht
Zhang R, Jayawardene V, Indulska M, Sadiq S, Zhou X (2014) A data driven approach for discovering data quality requirements. In: 35th international conference on information systems, Auckland
Additional information
Accepted after two revisions by Matthias Jarke.
Cite this article
Zhang, R., Indulska, M. & Sadiq, S. Discovering Data Quality Problems. Bus Inf Syst Eng 61, 575–593 (2019). https://doi.org/10.1007/s12599-019-00608-0
DOI: https://doi.org/10.1007/s12599-019-00608-0