
The Effects and Interactions of Data Quality and Problem Complexity on Classification

Published: 01 February 2011

Abstract

Data quality remains a persistent problem in practice and a challenge for research. In this study we focus on the four dimensions of data quality noted as the most important to information consumers, namely accuracy, completeness, consistency, and timeliness. These dimensions are of particular concern for operational systems, and most importantly for data warehouses, which are often used as the primary data source for analyses such as classification, a general type of data mining. However, the definitions and conceptual models of these dimensions have not been collectively considered with respect to data mining in general or classification in particular. Nor have they been considered for problem complexity. Conversely, these four dimensions of data quality have only been indirectly addressed by data mining research. Using definitions and constructs of data quality dimensions, our research evaluates the effects of both data quality and problem complexity on generated data and tests the results in a real-world case. Six different classification outcomes selected from the spectrum of classification algorithms show that data quality and problem complexity have significant main and interaction effects. From the findings of significant effects, the economics of higher data quality are evaluated for a frequent application of classification and illustrated by the real-world case.



Published In

Journal of Data and Information Quality  Volume 2, Issue 2
February 2011
102 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/1891879

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 February 2011
Accepted: 01 November 2010
Revised: 01 September 2010
Received: 01 December 2008
Published in JDIQ Volume 2, Issue 2


Author Tags

  1. Data quality
  2. data mining
  3. data quality metrics and measurements
  4. information quality

Qualifiers

  • Research-article
  • Research
  • Refereed
