research-article

Estimating Degradation of Machine Learning Data Assets

Authors:

Ernesto Damiani

ACM Journal of Data and Information Quality (JDIQ), Volume 14, Issue 2

Article No.: 9, Pages 1 - 15

https://doi.org/10.1145/3446331

Published: 11 December 2021

Abstract

Large-scale adoption of Artificial Intelligence and Machine Learning (AI-ML) models fed by heterogeneous, possibly untrustworthy data sources has spurred interest in estimating the degradation of such models caused by spurious, adversarial, or low-quality data assets. We propose a quantitative estimate of the severity of the degradation of classifiers’ training sets: an index expressing the deformation of the convex hulls of the classes, computed on a held-out dataset generated via an unsupervised technique. We show that our index is computationally light, can be calculated incrementally, and complements existing quality measures for ML data assets. As an experiment, we present the computation of our index on a benchmark convolutional image classifier.
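The abstract does not give the formula of the index, so the sketch below is only an illustration of the general idea of measuring how much a class's convex hull deforms. It assumes 2-D feature points and uses a hypothetical stand-in measure, the relative change in hull area between a reference set and an observed set. The function names (`convex_hull`, `hull_area`, `deformation_index`) and the area-based measure are assumptions made for this example, not the author's definition.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> float:
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def hull_area(points: List[Point]) -> float:
    """Area of the convex hull of the points (shoelace formula)."""
    hull = convex_hull(points)
    n = len(hull)
    if n < 3:
        return 0.0
    twice_area = sum(
        hull[i][0] * hull[(i + 1) % n][1] - hull[(i + 1) % n][0] * hull[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def deformation_index(reference: List[Point], observed: List[Point]) -> float:
    """Hypothetical index (an assumption): |A_observed - A_reference| / A_reference."""
    ref_area = hull_area(reference)
    if ref_area == 0.0:
        raise ValueError("reference hull is degenerate (zero area)")
    return abs(hull_area(observed) - ref_area) / ref_area


# Toy data for one class: a clean point set, then the same set with one spurious point.
clean = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0)]
degraded = clean + [(4.0, 1.0)]

print(deformation_index(clean, clean))     # 0.0
print(deformation_index(clean, degraded))  # 0.5
```

In this toy case a single outlying point stretches the hull from area 4 to area 6, so the stand-in index moves from 0.0 to 0.5. A real implementation would work in the classifier's feature space, which is typically high-dimensional, and would follow the definition given in the full paper.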

Index Terms

Estimating Degradation of Machine Learning Data Assets
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems

Published In

Journal of Data and Information Quality Volume 14, Issue 2

June 2022

150 pages

ISSN:1936-1955

EISSN:1936-1963

DOI:10.1145/3505186

Editor:
Tiziana Catarci
Sapienza University of Rome, Rome, Italy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 December 2021

Accepted: 01 December 2020

Received: 01 December 2020

Published in JDIQ Volume 14, Issue 2

Qualifiers

Research-article
Refereed
