Open access
Author
Date
2017
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
The massive growth of modern datasets from sources such as videos, social networks, and sensors, coupled with limited resources in terms of time and space, raises challenging questions for existing machine learning algorithms. From the statistical point of view, having access to more data may be viewed as a blessing: it provides a better view of the underlying (possibly stochastic) processes generating the data. At the same time, more data greatly increases the cost of storing, communicating, and processing it. This interplay between computational and statistical aspects is one of the key challenges in large-scale machine learning.

In this dissertation we propose a general approach for addressing these challenges. We study coresets: succinct, small summaries of large datasets, constructed so that solutions computed on the summary are provably competitive with solutions computed on the full dataset. Such coresets can be constructed for a variety of important machine learning problems, including k-means clustering, maximum likelihood estimation in mixture models, and principal component analysis. In most cases the resulting coresets are small: their size is independent of the original dataset size and only polynomial in other relevant quantities. Furthermore, due to their strong composability properties, coresets admit both streaming and embarrassingly parallel constructions, which lead to practical implementations for large datasets. Finally, coresets can be efficiently computed for a wide range of non-convex optimization problems.

We first derive a practical coreset construction framework for a variety of machine learning problems and provide a survey of existing results. We then prove that small coresets can be efficiently constructed for a wide range of density estimation problems in regular exponential family mixture models, and demonstrate that in practice the coreset-based approach improves the running time by several orders of magnitude while introducing a negligible approximation error.

We then investigate the resulting computational and statistical tradeoffs: how can data available beyond the sample complexity of the learning task be used as a computational resource? Instead of ignoring the excess data, we propose a data weakening mechanism that allows one to navigate these tradeoffs. Using k-means clustering as a prototypical unsupervised learning problem, we show how to strategically summarize the data in order to trade off risk and time when the data is generated by a probabilistic model. Specifically, we show that for a fixed risk the running time decreases as the data size increases, and for a fixed data size it decreases as the allowed risk increases. We then propose a theoretical setting and a tradeoff navigation algorithm that achieves such tradeoffs.
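As a rough illustration of the coreset idea and the composability property discussed above (a minimal sketch, not the constructions or guarantees developed in the thesis), the snippet below builds a weighted subsample of a synthetic point set, applies a merge-and-reduce pattern of the kind that underlies streaming constructions, and compares the k-means cost of a fixed candidate solution on the full data and on the summaries. The sampling distribution, helper names, and size budgets are illustrative assumptions.

```python
import numpy as np

def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: (weighted) sum of squared distances to the nearest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return float(d2 @ (np.ones(len(points)) if weights is None else weights))

def simple_coreset(points, m, rng):
    """Toy importance-sampling coreset: sample m points with probability proportional to
    squared distance from the data mean (a crude stand-in for sensitivity scores, mixed
    with the uniform distribution) and attach inverse-probability weights."""
    dist2 = ((points - points.mean(axis=0)) ** 2).sum(axis=1)
    p = 0.5 * dist2 / dist2.sum() + 0.5 / len(points)
    idx = rng.choice(len(points), size=m, replace=True, p=p)
    return points[idx], 1.0 / (m * p[idx])

def reduce_weighted(points, weights, m, rng):
    """Re-compress a weighted summary by resampling proportionally to the weights."""
    p = weights / weights.sum()
    idx = rng.choice(len(points), size=m, replace=True, p=p)
    return points[idx], np.full(m, weights.sum() / m)

def streaming_coreset(chunks, m, rng):
    """Merge-and-reduce: summarize each incoming chunk, append it to a buffer, and
    re-compress the buffer whenever it exceeds a small size budget (composability)."""
    buf_pts, buf_w = np.empty((0, chunks[0].shape[1])), np.empty(0)
    for chunk in chunks:
        pts, w = simple_coreset(chunk, m, rng)
        buf_pts, buf_w = np.vstack([buf_pts, pts]), np.concatenate([buf_w, w])
        if len(buf_pts) > 4 * m:
            buf_pts, buf_w = reduce_weighted(buf_pts, buf_w, 2 * m, rng)
    return buf_pts, buf_w

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(c, 0.7, size=(20000, 2)) for c in ((0, 0), (5, 5), (0, 5))])
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])        # a fixed candidate solution

core_pts, core_w = simple_coreset(data, m=1000, rng=rng)         # batch summary
stream_pts, stream_w = streaming_coreset(np.array_split(data, 30), m=300, rng=rng)

print(kmeans_cost(data, centers))                    # cost on the full dataset
print(kmeans_cost(core_pts, centers, core_w))        # estimate from the batch coreset
print(kmeans_cost(stream_pts, centers, stream_w))    # estimate from the streaming coreset
```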
Finally, we consider the practically relevant problem of outlier detection in large datasets. Due to noise, uncertainty, and adversarial behavior, outlying observations are inherent to many real-world problems such as fraud detection, intrusion detection, and activity monitoring. Scaling outlier detection techniques to massive datasets without sacrificing accuracy is a challenging task. We propose a novel distance-based outlier detection algorithm based on the intuition that outliers have a significant influence on the quality of distance-based clustering solutions. In an extensive experimental evaluation, we show that the proposed approach outperforms other popular distance-based approaches while being several orders of magnitude faster.
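The intuition behind the proposed outlier detector, namely that outlying points strongly influence distance-based clustering solutions, can be sketched with a simple proxy: fit a clustering and rank points by their contribution to its cost. This is only an illustrative stand-in on assumed synthetic data, not the algorithm developed in the thesis.

```python
import numpy as np

def lloyd_kmeans(points, k, iters=25, rng=None):
    """Plain Lloyd's algorithm; returns the k cluster centers."""
    rng = rng or np.random.default_rng(0)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):            # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return centers

rng = np.random.default_rng(1)
inliers = rng.normal(0.0, 1.0, size=(5000, 2))                   # the bulk of the data
anomalies = rng.uniform(6.0, 10.0, size=(20, 2)) * rng.choice([-1.0, 1.0], size=(20, 2))
data = np.vstack([inliers, anomalies])

centers = lloyd_kmeans(data, k=3, rng=rng)
# Proxy outlier score: each point's contribution to the clustering cost,
# i.e. its squared distance to the nearest center.
scores = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
top = np.argsort(scores)[-20:]
# Most of the highest-scoring indices should be >= 5000, i.e. the planted anomalies.
print(np.sort(top))
```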
Permanent link
https://doi.org/10.3929/ethz-b-000220255
Publication status
published
External links
Search print copy at ETH Library
Publisher
ETH Zurich
Subject
Machine Learning; Coresets; Large-scale Machine Learning; Outlier Detection; Mixture Models
Organisational unit
03908 - Krause, Andreas / Krause, Andreas