Abstract:
Machine learning systems require massive amounts of labeled training data in order to achieve high accuracy. Active learning uses feedback to label the most informative data points and can significantly reduce the training set size. Many heuristics for selecting data points have been developed in recent years, but they are usually tailored to a specific task, and a general unified framework is lacking. In this work, the individual setting is considered and an active learning criterion is proposed. Motivated by universal source coding, the proposed criterion seeks data points that minimize the Predictive Normalized Maximum Likelihood (pNML) regret on an unlabeled test set. It is shown that for binary classification and linear regression, the resulting criterion coincides with well-known active learning criteria and thus represents a unified information-theoretic active learning approach for general hypothesis classes. Finally, it is shown using real data that the proposed criterion outperforms other active learning criteria in terms of sample complexity.
Published in: IEEE Transactions on Information Theory (Volume: 70, Issue: 8, August 2024)
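For concreteness, the sketch below illustrates the kind of selection rule the abstract describes; it is not code from the paper. It assumes a binary logistic-regression learner, approximates the pNML regret of a test point by refitting the model once per hypothetical test label (the log of the resulting normalization factor), and queries the pool point whose hypothetical addition minimizes the average regret over an unlabeled evaluation set, weighted by the current model's predictive probabilities of the candidate's label. The function names `pnml_regret` and `select_query`, and the use of scikit-learn, are illustrative assumptions.

```python
# Illustrative sketch (assumptions noted above), not the authors' implementation:
# pNML-regret-driven query selection for binary classification.
import numpy as np
from sklearn.linear_model import LogisticRegression


def pnml_regret(train_X, train_y, x_test):
    """Approximate pNML regret of a single test point: log of the
    normalization factor, refitting the model once per hypothetical label."""
    genie_probs = []
    for y in (0, 1):
        X_aug = np.vstack([train_X, x_test])
        y_aug = np.append(train_y, y)
        clf = LogisticRegression().fit(X_aug, y_aug)
        # Probability the "genie" (trained with the hypothetical label) assigns to it.
        genie_probs.append(clf.predict_proba(x_test.reshape(1, -1))[0, y])
    return np.log(sum(genie_probs))


def select_query(train_X, train_y, pool_X, eval_X):
    """Pick the pool point whose (hypothetically labeled) addition yields the
    smallest expected average pNML regret on the unlabeled evaluation set."""
    base_clf = LogisticRegression().fit(train_X, train_y)
    best_idx, best_score = None, np.inf
    for i, x_c in enumerate(pool_X):
        score = 0.0
        for y_c in (0, 1):
            # Weight each hypothetical label by the current predictive probability.
            w = base_clf.predict_proba(x_c.reshape(1, -1))[0, y_c]
            X_aug = np.vstack([train_X, x_c])
            y_aug = np.append(train_y, y_c)
            regrets = [pnml_regret(X_aug, y_aug, x_e) for x_e in eval_X]
            score += w * np.mean(regrets)
        if score < best_score:
            best_idx, best_score = i, score
    return best_idx
```

This naive version refits the model for every candidate, hypothetical label, and evaluation point, so it is only practical for small pools; the paper's point is that, for binary classification and linear regression, such a regret-minimizing rule reduces to well-known active learning criteria, which avoids this brute-force computation.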