Estimating Degradation of Machine Learning Data Assets

Published: 11 December 2021

Abstract

Large-scale adoption of Artificial Intelligence and Machine Learning (AI-ML) models fed by heterogeneous, possibly untrustworthy data sources has spurred interest in estimating degradation of such models due to spurious, adversarial, or low-quality data assets. We propose a quantitative estimate of the severity of classifiers’ training set degradation: an index expressing the deformation of the convex hulls of the classes computed on a held-out dataset generated via an unsupervised technique. We show that our index is computationally light, can be calculated incrementally, and complements existing quality measures for ML data assets well. As an experiment, we present the computation of our index on a benchmark convolutional image classifier.
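The abstract does not give the formula for the index, so the following is only an illustrative sketch of the general idea of measuring how much a class's convex hull deforms. It assumes 2-D feature points for a single class and uses the relative change in hull area as a stand-in score; the function names `convex_hull`, `hull_area`, and `hull_deformation` are hypothetical and this is not the authors' index.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> float:
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def hull_area(points: List[Point]) -> float:
    """Area of the convex hull of the points (shoelace formula)."""
    hull = convex_hull(points)
    n = len(hull)
    if n < 3:
        return 0.0
    twice_area = 0.0
    for i in range(n):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def hull_deformation(reference: List[Point], observed: List[Point]) -> float:
    """Assumed proxy score: |A_observed - A_reference| / A_reference."""
    ref_area = hull_area(reference)
    if ref_area == 0.0:
        raise ValueError("reference hull has zero area")
    return abs(hull_area(observed) - ref_area) / ref_area


# Hypothetical class points on a held-out set, before and after degradation.
clean = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
degraded = clean + [(2.0, 0.0), (2.0, 1.0)]  # two spurious points stretch the hull
score = hull_deformation(clean, degraded)
```

In this toy case the spurious points double the hull area, so the proxy score is 1.0. A real feature space would be higher-dimensional, where a hull volume from a library such as Qhull would replace the planar area used here.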

Published In

Journal of Data and Information Quality, Volume 14, Issue 2
June 2022
150 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3505186

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 December 2021
Accepted: 01 December 2020
Received: 01 December 2020
Published in JDIQ Volume 14, Issue 2

Author Tags

  1. Data assets
  2. ML models

Qualifiers

  • Research-article
  • Refereed
