Abstract:
Data drift, understood as a shift in the distribution or structure of data over time, can adversely affect the performance of machine learning models and data-driven decisions. This study examines two data drift metrics, denoted $d_{E,PCA}$ and $d_{E,AE}$, that are derived from the reconstruction error of two unsupervised ML models: Principal Component Analysis (PCA) and Autoencoders (AE). To investigate the robustness of these metrics, we systematically accessed time-series datasets from the European Data Portal. Our experiments examined data versioning through three basic events: creation, update, and deletion. The results are summarised and aggregated over all datasets, and an unsupervised analysis based on Robust PCA and AE was performed to examine how dataset characteristics affect data drift detection and computational efficiency. Our results indicate that the two metrics performed similarly when new records were created, suggesting consistent drift detection under normal conditions with FAIR compliance. However, high-dimensional datasets posed challenges for both PCA and AE models. Update events revealed discrepancies between the two metrics, suggesting that non-linear shifts affected the AE-based metric more than the PCA-based one. Deletion events demonstrated the resilience of these metrics against data loss, but also revealed variability in the reliability of the PCA model. Overall, data drift metrics derived from PCA and AE can be effective but are sensitive to certain dataset characteristics.
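As an illustration of the general idea behind a reconstruction error-based drift metric, the following minimal sketch fits PCA on a reference version of a dataset and compares the reconstruction error of a later version against it. The abstract does not give the exact definition of the PCA-based metric, so the score used here (the increase in mean squared reconstruction error) is an assumption, as are the function names `fit_pca`, `reconstruction_error`, and `pca_drift` and the synthetic data.

```python
import numpy as np


def fit_pca(reference, n_components):
    """Fit PCA on the reference version; return (mean, components)."""
    mean = reference.mean(axis=0)
    # Rows of vt are the principal directions of the centred reference data.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]


def reconstruction_error(data, mean, components):
    """Mean squared error after projecting onto the components and back."""
    centred = data - mean
    reconstructed = centred @ components.T @ components
    return float(np.mean((centred - reconstructed) ** 2))


def pca_drift(reference, new_version, n_components=2):
    """Hypothetical drift score (an assumption, not the paper's definition):
    increase in reconstruction error of the new version over the reference."""
    mean, components = fit_pca(reference, n_components)
    e_ref = reconstruction_error(reference, mean, components)
    e_new = reconstruction_error(new_version, mean, components)
    return e_new - e_ref


# Synthetic data: 5 features generated from a 2-dimensional latent structure.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(2, 5))
reference = rng.normal(size=(500, 2)) @ mixing + 0.01 * rng.normal(size=(500, 5))

# A new version drawn from the same structure (no drift expected).
same = rng.normal(size=(500, 2)) @ mixing + 0.01 * rng.normal(size=(500, 5))

# A new version whose feature structure has changed (drift expected).
new_mixing = rng.normal(size=(2, 5))
drifted = rng.normal(size=(500, 2)) @ new_mixing + 0.01 * rng.normal(size=(500, 5))

drift_same = pca_drift(reference, same)
drift_drifted = pca_drift(reference, drifted)
print(f"score without drift: {drift_same:.6f}")
print(f"score with drift:    {drift_drifted:.6f}")
```

An autoencoder-based variant would follow the same pattern, with the linear projection replaced by a trained encoder and decoder; that substitution is what would allow such a metric to respond to non-linear shifts.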
Date of Conference: 16-20 September 2024
Date Added to IEEE Xplore: 20 September 2024