Abstract:
Machine learning methods rely on data to uncover relationships between inputs and outputs of complex systems, making it crucial to have sufficient amounts of representative data. Therefore, recent research has focused on choosing informative input-output pairs, i.e., labeled data, to facilitate the adoption of machine learning in science and engineering applications. Despite these efforts, estimating the test error with a limited amount of labeled data still needs to be explored. Hence, this paper investigates a novel framework for selecting informative labeled samples from a set of unlabeled testing instances to evaluate regression models with the quadratic loss function. Key contributions of this work include the design of nonuniform sampling distributions over candidate testing points and the deployment of an unbiased estimator to achieve desirable tradeoffs between estimation accuracy and testing data size. Comprehensive experimental results corroborate the impressive performance and flexibility of the proposed approach in real-world applications, such as reducing the standard deviation of the resulting estimator by almost a factor of two compared to uniform sampling. The paper concludes with practical advice for researchers and practitioners who encounter difficulties related to limited labeled data.
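The idea of pairing a nonuniform sampling distribution with an unbiased estimator can be illustrated with a standard importance-weighted estimate of the mean squared error. The sketch below is an assumption-laden illustration, not the paper's method: the abstract does not specify how the sampling distribution is designed, so the function names `estimate_mse` and `sample_and_estimate` and the argument `probs` are hypothetical. It assumes testing points are drawn independently with replacement and that every point has a strictly positive sampling probability.

```python
import numpy as np


def estimate_mse(losses_sampled, probs_sampled, n_total):
    """Importance-weighted estimate of the mean squared loss over n_total points.

    losses_sampled: squared errors of the sampled (labeled) points.
    probs_sampled:  probability with which each sampled point was drawn.
    Assumes i.i.d. draws with replacement and strictly positive probabilities.
    """
    weights = 1.0 / (n_total * np.asarray(probs_sampled, dtype=float))
    return float(np.mean(weights * np.asarray(losses_sampled, dtype=float)))


def sample_and_estimate(losses, probs, m, rng):
    """Draw m indices according to probs, then estimate the mean loss.

    In practice only the sampled points would be labeled; the full `losses`
    array is used here purely to simulate the labeling step.
    """
    idx = rng.choice(len(losses), size=m, replace=True, p=probs)
    return estimate_mse(losses[idx], probs[idx], len(losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic squared errors standing in for a regression model's test losses.
    losses = rng.exponential(scale=1.0, size=1000)
    uniform = np.full(len(losses), 1.0 / len(losses))
    # Hypothetical proxy for the loss: a noisy score that would, in a real
    # setting, have to be computed without access to the labels.
    proxy = losses * rng.uniform(0.5, 1.5, size=len(losses)) + 1e-3
    nonuniform = proxy / proxy.sum()

    for name, p in [("uniform", uniform), ("nonuniform", nonuniform)]:
        estimates = [sample_and_estimate(losses, p, 50, rng) for _ in range(2000)]
        print(name, "mean:", np.mean(estimates), "std:", np.std(estimates))
    print("true mean loss:", losses.mean())
```

Each term has expectation equal to the true mean loss, since a point drawn with probability p_i contributes its loss divided by n times p_i, which is why the estimate is unbiased for any strictly positive distribution. How much the variance drops relative to uniform sampling depends entirely on how well the sampling probabilities track the unknown losses, and the synthetic proxy used here says nothing about the gains reported in the paper.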
Published in: 2023 9th International Conference on Control, Decision and Information Technologies (CoDIT)
Date of Conference: 03-06 July 2023
Date Added to IEEE Xplore: 24 October 2023