Open access
Author
Date
2017
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
The massive growth of modern datasets from sources such as videos, social networks, and sensors, coupled with limited resources in terms of time and space, raises challenging questions for existing machine learning algorithms. From the statistical point of view, having access to more data may be viewed as a blessing: it provides a better view of the underlying (possibly stochastic) processes generating the data. At the same time, more data greatly increases the cost of storing, communicating, and processing it. This interplay between computational and statistical aspects is one of the key challenges in large-scale machine learning.

In this dissertation we propose a general approach for addressing these challenges. We study coresets: succinct, small summaries of large datasets, constructed so that solutions computed on the summary are provably competitive with solutions computed on the full dataset. Such coresets can be constructed for a variety of important machine learning problems, including k-means clustering, maximum likelihood estimation in mixture models, and principal component analysis. In most cases the resulting coresets are small: their size is independent of the original dataset size and only polynomial in other relevant quantities. Furthermore, due to their strong composability properties, coresets admit both streaming and embarrassingly parallel constructions, which lead to practical implementations for large datasets. Finally, coresets can be efficiently computed for a wide range of non-convex optimization problems.

We first derive a practical coreset construction framework for a variety of machine learning problems and provide a survey of existing results. We then prove that small coresets can be efficiently constructed for a wide range of density estimation problems in regular exponential family mixture models, and demonstrate that in practice the coreset-based approach improves the running time by several orders of magnitude while introducing a negligible approximation error.

We then investigate the resulting computational and statistical tradeoffs: how can data available beyond the sample complexity of the learning task be used as a computational resource? Instead of ignoring the excess data, we propose a data weakening mechanism that allows one to navigate these tradeoffs. Using k-means clustering as a prototypical unsupervised learning problem, we show how to strategically summarize the data in order to trade off risk and time when the data is generated by a probabilistic model. Specifically, we show that for a fixed risk the running time decreases as the data size increases, and for a fixed data size it decreases as the allowed risk increases. We then propose a theoretical setting and a tradeoff navigation algorithm that achieves such tradeoffs.
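As a rough illustration of the coreset idea and the composability property discussed above (a minimal sketch, not the constructions or guarantees developed in the thesis), the snippet below builds a weighted subsample of a synthetic point set, applies a merge-and-reduce pattern of the kind that underlies streaming constructions, and compares the k-means cost of a fixed candidate solution on the full data and on the summaries. The sampling distribution, helper names, and size budgets are illustrative assumptions.

```python
import numpy as np

def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: (weighted) sum of squared distances to the nearest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return float(d2 @ (np.ones(len(points)) if weights is None else weights))

def simple_coreset(points, m, rng):
    """Toy importance-sampling coreset: sample m points with probability proportional to
    squared distance from the data mean (a crude stand-in for sensitivity scores, mixed
    with the uniform distribution) and attach inverse-probability weights."""
    dist2 = ((points - points.mean(axis=0)) ** 2).sum(axis=1)
    p = 0.5 * dist2 / dist2.sum() + 0.5 / len(points)
    idx = rng.choice(len(points), size=m, replace=True, p=p)
    return points[idx], 1.0 / (m * p[idx])

def reduce_weighted(points, weights, m, rng):
    """Re-compress a weighted summary by resampling proportionally to the weights."""
    p = weights / weights.sum()
    idx = rng.choice(len(points), size=m, replace=True, p=p)
    return points[idx], np.full(m, weights.sum() / m)

def streaming_coreset(chunks, m, rng):
    """Merge-and-reduce: summarize each incoming chunk, append it to a buffer, and
    re-compress the buffer whenever it exceeds a small size budget (composability)."""
    buf_pts, buf_w = np.empty((0, chunks[0].shape[1])), np.empty(0)
    for chunk in chunks:
        pts, w = simple_coreset(chunk, m, rng)
        buf_pts, buf_w = np.vstack([buf_pts, pts]), np.concatenate([buf_w, w])
        if len(buf_pts) > 4 * m:
            buf_pts, buf_w = reduce_weighted(buf_pts, buf_w, 2 * m, rng)
    return buf_pts, buf_w

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(c, 0.7, size=(20000, 2)) for c in ((0, 0), (5, 5), (0, 5))])
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])        # a fixed candidate solution

core_pts, core_w = simple_coreset(data, m=1000, rng=rng)         # batch summary
stream_pts, stream_w = streaming_coreset(np.array_split(data, 30), m=300, rng=rng)

print(kmeans_cost(data, centers))                    # cost on the full dataset
print(kmeans_cost(core_pts, centers, core_w))        # estimate from the batch coreset
print(kmeans_cost(stream_pts, centers, stream_w))    # estimate from the streaming coreset
```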
Finally, we consider the practically relevant problem of outlier detection in large datasets. Due to noise, uncertainty, and adversarial behavior, outlying observations are inherent to many real-world problems such as fraud detection, intrusion detection, and activity monitoring. Scaling outlier detection techniques to massive datasets without sacrificing accuracy is a challenging task. We propose a novel distance-based outlier detection algorithm based on the intuition that outliers have a significant influence on the quality of distance-based clustering solutions. In an extensive experimental evaluation, we show that the proposed approach outperforms other popular distance-based approaches while being several orders of magnitude faster.
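The intuition behind the proposed outlier detector, namely that outlying points strongly influence distance-based clustering solutions, can be sketched with a simple proxy: fit a clustering and rank points by their contribution to its cost. This is only an illustrative stand-in on assumed synthetic data, not the algorithm developed in the thesis.

```python
import numpy as np

def lloyd_kmeans(points, k, iters=25, rng=None):
    """Plain Lloyd's algorithm; returns the k cluster centers."""
    rng = rng or np.random.default_rng(0)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):            # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return centers

rng = np.random.default_rng(1)
inliers = rng.normal(0.0, 1.0, size=(5000, 2))                   # the bulk of the data
anomalies = rng.uniform(6.0, 10.0, size=(20, 2)) * rng.choice([-1.0, 1.0], size=(20, 2))
data = np.vstack([inliers, anomalies])

centers = lloyd_kmeans(data, k=3, rng=rng)
# Proxy outlier score: each point's contribution to the clustering cost,
# i.e. its squared distance to the nearest center.
scores = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
top = np.argsort(scores)[-20:]
# Most of the highest-scoring indices should be >= 5000, i.e. the planted anomalies.
print(np.sort(top))
```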
Permanent link
https://doi.org/10.3929/ethz-b-000220255
Publication status
published
External links
Search print copy at ETH Library
Publisher
ETH Zurich
Subject
Machine Learning; Coresets; Large-scale Machine Learning; Outlier Detection; Mixture Models
Organisational unit
03908 - Krause, Andreas / Krause, Andreas