One-Pass Additive-Error Subset Selection for \(\ell _{p}\) Subspace Approximation and \((k, p)\)-Clustering

Abstract

We consider the problem of subset selection for \(\ell _{p}\) subspace approximation and \((k, p)\)-clustering. Our aim is to efficiently find a small subset of data points such that solving the problem optimally for this subset gives a good approximation to solving the problem optimally for the original input. For \(\ell _{p}\) subspace approximation, the previously known subset selection algorithms based on volume sampling and adaptive sampling, proposed in Deshpande and Varadarajan (STOC’07, 2007) for the general case of \(p \in [1, \infty )\), require multiple passes over the data. In this paper, we give a one-pass subset selection with an additive approximation guarantee for \(\ell _{p}\) subspace approximation, for any \(p \in [1, \infty )\). Earlier subset selection algorithms that give a one-pass multiplicative \((1+\epsilon )\) approximation work only in special cases: Cohen et al. (SODA’17, 2017) give a one-pass subset selection with a multiplicative \((1+\epsilon )\) approximation guarantee for the special case of \(\ell _{2}\) subspace approximation, and Mahabadi et al. (STOC’20, 2020) give a one-pass noisy subset selection with a \((1+\epsilon )\) approximation guarantee for \(\ell _{p}\) subspace approximation when \(p \in \{1, 2\}\). Our subset selection algorithm gives a weaker, additive approximation guarantee, but it works for any \(p \in [1, \infty )\). We also consider \((k, p)\)-clustering, where the task is to group the data points into k clusters such that the sum of the p-th powers of the distances from points to their cluster centers is minimized, for \(p \in [1, \infty )\). The known subset selection algorithms for this problem are based on \(D^p\) sampling due to Wei (NIPS’16, 2016), an extension of the \(D^2\) sampling proposed in Arthur and Vassilvitskii (SODA’07, 2007). Due to the inherently adaptive nature of \(D^p\) sampling, these algorithms require multiple passes over the input. In this work, we give a one-pass subset selection for \((k, p)\)-clustering whose output yields a constant-factor approximation to the optimal solution, up to an additive error. Bachem et al. (NIPS’16, 2016) also give a one-pass subset selection for k-means, i.e., the case \(p = 2\); our result applies to the more general problem for any \(p \in [1, \infty )\). At the core, our contribution is a one-pass MCMC-based subset selection algorithm whose cost on the sampled points closely approximates the corresponding optimal cost, with high probability.
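To make the adaptivity bottleneck concrete, the sketch below contrasts plain multi-pass \(D^p\) sampling with a Metropolis-style MCMC approximation of it, in the spirit of the MCMC seeding of Bachem et al. (NIPS’16, 2016). This is a minimal illustration, not the authors’ algorithm: the function names are hypothetical, the uniform proposal and fixed chain length are placeholder choices, and the random-access indexing glosses over the bookkeeping a genuinely one-pass streaming algorithm needs. (Here a multiplicative \((1+\epsilon )\) guarantee bounds the subset’s cost by \((1+\epsilon )\) times the optimum, whereas an additive guarantee allows an extra error term proportional to a norm of the input.)

```python
import numpy as np

def dp_sampling(X, k, p=2, seed=None):
    """Multi-pass D^p sampling (hypothetical name): k-means++-style
    seeding generalized to the p-th power of distances. Each round
    rescans the whole dataset to update distances, which is why the
    scheme is inherently adaptive and needs roughly k passes."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sample = [rng.integers(n)]                      # first point: uniform at random
    d = np.linalg.norm(X - X[sample[0]], axis=1) ** p
    for _ in range(k - 1):
        idx = rng.choice(n, p=d / d.sum())          # sample from the D^p distribution
        sample.append(idx)
        # full pass: distance of every point to its nearest sampled point
        d = np.minimum(d, np.linalg.norm(X - X[idx], axis=1) ** p)
    return sample

def mcmc_dp_sampling(X, k, p=2, chain_len=200, seed=None):
    """Metropolis-style approximation of D^p sampling (hypothetical
    name). Instead of rescanning all points after every pick, run a
    short chain whose stationary distribution is the current D^p
    distribution; each step touches only a few points."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sample = [rng.integers(n)]

    def cost(i):  # D^p cost of point i w.r.t. the points sampled so far
        return min(np.linalg.norm(X[i] - X[j]) ** p for j in sample)

    for _ in range(k - 1):
        cur = rng.integers(n)                       # arbitrary chain start
        cur_cost = cost(cur)
        for _ in range(chain_len):
            cand = rng.integers(n)                  # uniform proposal
            cand_cost = cost(cand)
            # Metropolis acceptance for target probability proportional to cost
            if cur_cost == 0.0 or rng.random() < min(1.0, cand_cost / cur_cost):
                cur, cur_cost = cand, cand_cost
        sample.append(cur)
    return sample
```

For instance, `mcmc_dp_sampling(X, k=10, p=1, chain_len=200)` would emulate \(D^1\) sampling with 200 Metropolis steps per sampled point; a longer chain brings the sampling distribution closer to the exact \(D^p\) distribution at the price of more time, while still avoiding a full rescan of the data per sample.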

Availability of data

Not applicable.

References

  1. Feldman, D.: Introduction to Core-sets: an Updated Survey (2020)

  2. Frieze, A.M., Kannan, R., Vempala, S.S.: Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM 51(6), 1025–1041 (2004). https://doi.org/10.1145/1039488.1039494

  3. Wei, K., Iyer, R., Bilmes, J.: Submodularity in data subset selection and active learning. In: Proceedings of the 32nd International Conference on Machine Learning, vol. 37, pp. 1954–1963 (2015)

  4. Boutsidis, C., Mahoney, M.W., Drineas, P.: An improved approximation algorithm for the column subset selection problem. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 968–977 (2009). SIAM

  5. Drineas, P., Mahoney, M.W., Muthukrishnan, S.: Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl. 30(2), 844–881 (2008). https://doi.org/10.1137/07070471X

  6. Mahoney, M.W., Drineas, P.: CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci. USA 106(3), 697–702 (2009). https://doi.org/10.1073/pnas.0803205106

  7. Wang, S., Zhang, Z.: Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. J. Mach. Learn. Res. 14(1), 2729–2769 (2013)

  8. Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011). https://doi.org/10.1561/2200000035

  9. Ida, Y., Kanai, S., Fujiwara, Y., Iwata, T., Takeuchi, K., Kashima, H.: Fast deterministic CUR matrix decomposition with accuracy assurance. In: International Conference on Machine Learning, pp. 4594–4603 (2020). PMLR

  10. Sun, J., Xie, Y., Zhang, H., Faloutsos, C.: Less is more: Compact matrix decomposition for large sparse graphs. In: Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 366–377 (2007). SIAM

  11. Mahoney, M.W., Maggioni, M., Drineas, P.: Tensor-CUR decompositions for tensor-based data. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 327–336 (2006)

  12. Drineas, P., Kerenidis, I., Raghavan, P.: Competitive recommendation systems. In: Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, pp. 82–90 (2002)

  13. Deshpande, A., Vempala, S.: Adaptive sampling and fast low-rank matrix approximation. In: Proceedings of the 9th International Conference on Approximation Algorithms for Combinatorial Optimization Problems, and 10th International Conference on Randomization and Computation. APPROX’06/RANDOM’06, pp. 292–303. Springer, Berlin, Heidelberg (2006). https://doi.org/10.1007/11830924_28

  14. Deshpande, A., Rademacher, L., Vempala, S.S., Wang, G.: Matrix approximation and projective clustering via volume sampling. Theory Comput. 2(12), 225–247 (2006). https://doi.org/10.4086/toc.2006.v002a012

  15. Guruswami, V., Sinop, A.K.: Optimal column-based low-rank matrix reconstruction. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’12, pp. 1207–1214. Society for Industrial and Applied Mathematics, USA (2012)

  16. Derezinski, M., Warmuth, M.K.: Unbiased estimates for linear regression via volume sampling. Advances in Neural Information Processing Systems 30 (2017)

  17. Broadbent, M.E., Brown, M., Penner, K., Ipsen, I., Rehman, R.: Subset selection algorithms: Randomized vs. deterministic. SIAM Undergraduate Research Online 3(01) (2010)

  18. Dan, C., Wang, H., Zhang, H., Zhou, Y., Ravikumar, P.K.: Optimal analysis of subset-selection based \(\ell _p\) low-rank approximation. Advances in Neural Information Processing Systems 32 (2019)

  19. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982). https://doi.org/10.1109/TIT.1982.1056489

  20. Arthur, D., Vassilvitskii, S.: K-means++: The advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’07, pp. 1027–1035. Society for Industrial and Applied Mathematics, USA (2007)

  21. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable k-means++. Proc. VLDB Endow. 5(7), 622–633 (2012). https://doi.org/10.14778/2180912.2180915

  22. Bachem, O., Lucic, M., Hassani, S.H., Krause, A.: Approximate k-means++ in sublinear time. Proceedings of the AAAI Conference on Artificial Intelligence 30(1) (2016). https://doi.org/10.1609/aaai.v30i1.10259

  23. Bachem, O., Lucic, M., Hassani, S.H., Krause, A.: Fast and provably good seedings for k-means. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16, pp. 55–63. Curran Associates Inc., Red Hook (2016)

  24. Deshpande, A., Varadarajan, K.: Sampling-based dimension reduction for subspace approximation. In: Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing. STOC ’07, pp. 641–650. Association for Computing Machinery, New York (2007). https://doi.org/10.1145/1250790.1250884

  25. Aloise, D., Deshpande, A., Hansen, P., Popat, P.: NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn. 75(2), 245–248 (2009). https://doi.org/10.1007/s10994-009-5103-0

  26. Mahajan, M., Nimbhorkar, P., Varadarajan, K.R.: The planar k-means problem is NP-hard. Theor. Comput. Sci. 442, 13–21 (2012). https://doi.org/10.1016/j.tcs.2010.05.034

  27. Aggarwal, A., Deshpande, A., Kannan, R.: Adaptive sampling for k-means clustering. In: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2009), pp. 15–28 (2009). https://doi.org/10.1007/978-3-642-03685-9_2

  28. Wei, D.: A constant-factor bi-criteria approximation guarantee for k-means++. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16, pp. 604–612. Curran Associates Inc., Red Hook (2016)

  29. Jaiswal, R., Kumar, M., Yadav, P.: Improved analysis of \(D^2\)-sampling based PTAS for k-means and other clustering problems. Inf. Process. Lett. 115(2), 100–103 (2015). https://doi.org/10.1016/j.ipl.2014.07.009

  30. Kumar, A., Sabharwal, Y., Sen, S.: A simple linear time \((1+\epsilon )\)-approximation algorithm for \(k\)-means clustering in any dimensions. In: 45th Annual IEEE Symposium on Foundations of Computer Science, pp. 454–462 (2004). IEEE

  31. Cohen, M.B., Musco, C., Musco, C.: Input sparsity time low-rank approximation via ridge leverage score sampling. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’17, pp. 1758–1777. Society for Industrial and Applied Mathematics, USA (2017)

  32. Mahabadi, S., Razenshteyn, I., Woodruff, D.P., Zhou, S.: Non-adaptive adaptive sampling on turnstile streams. In: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. STOC 2020, pp. 1251–1264. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3357713.3384331

  33. Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985). https://doi.org/10.1145/3147.3165

  34. Efraimidis, P.S., Spirakis, P.P.: Weighted random sampling. In: Encyclopedia of Algorithms, pp. 2365–2367 (2016). https://doi.org/10.1007/978-1-4939-2864-4_478

  35. Ghashami, M., Liberty, E., Phillips, J.M., Woodruff, D.P.: Frequent directions: Simple and deterministic matrix sketching. CoRR abs/1501.01711 (2015). arXiv:1501.01711

  36. Braverman, V., Drineas, P., Musco, C., Musco, C., Upadhyay, J., Woodruff, D.P., Zhou, S.: Near optimal linear algebra in the online and sliding window models. In: 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pp. 517–528 (2020). IEEE

  37. Ban, F., Bhattiprolu, V., Bringmann, K., Kolev, P., Lee, E., Woodruff, D.P.: A PTAS for \(\ell _p\)-low rank approximation. In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 747–766 (2019). SIAM

  38. Chierichetti, F., Gollapudi, S., Kumar, R., Lattanzi, S., Panigrahy, R., Woodruff, D.P.: Algorithms for \(\ell _p\) low-rank approximation. In: International Conference on Machine Learning, pp. 806–814 (2017). PMLR

  39. Liberty, E.: Simple and deterministic matrix sketching. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’13, pp. 581–588. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2487575.2487623

  40. Ghashami, M., Phillips, J.M.: Relative errors for deterministic low-rank matrix approximations. In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’14, pp. 707–717. Society for Industrial and Applied Mathematics, USA (2014)

  41. Ghashami, M., Liberty, E., Phillips, J.M., Woodruff, D.P.: Frequent directions: Simple and deterministic matrix sketching. SIAM J. Comput. 45(5), 1762–1792 (2016). https://doi.org/10.1137/15M1009718

  42. Cormode, G., Dickens, C., Woodruff, D.: Leveraging well-conditioned bases: Streaming and distributed summaries in Minkowski p-norms. In: International Conference on Machine Learning, pp. 1048–1056 (2018)

  43. Clarkson, K.L., Woodruff, D.P.: Low-rank approximation and regression in input sparsity time. J. ACM 63(6) (2017). https://doi.org/10.1145/3019134

  44. Feldman, D., Monemizadeh, M., Sohler, C., Woodruff, D.P.: Coresets and sketches for high dimensional subspace approximation problems. In: Proceedings of the Twenty-first Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 630–649 (2010). SIAM

  45. Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: Advances in Neural Information Processing Systems 22 (2009)

  46. Jaiswal, R., Kumar, A., Sen, S.: A simple \(D^2\)-sampling based PTAS for k-means and other clustering problems. Algorithmica 70(1), 22–46 (2014). https://doi.org/10.1007/s00453-013-9833-9

  47. Anari, N., Gharan, S.O., Rezaei, A.: Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In: Conference on Learning Theory, pp. 103–115 (2016). PMLR

  48. Chao, M.T.: A general purpose unequal probability sampling plan. Biometrika 69(3), 653–656 (1982). https://doi.org/10.1093/biomet/69.3.653

  49. Cai, H.: Exact bound for the convergence of Metropolis chains. Stochastic Anal. Appl. 18(1), 63–71 (2000). https://doi.org/10.1080/07362990008809654

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

RP: Conceptualization, Methodology, Formal analysis, Validation, Writing (original draft and editing). AD: Conceptualization, Methodology, Formal analysis, Validation, Writing (original draft and editing).

Corresponding author

Correspondence to Rameshwar Pratap.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

All authors who participated in this study give the publisher permission to publish this work.

Code availability

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Deshpande, A., Pratap, R. One-Pass Additive-Error Subset Selection for \(\ell _{p}\) Subspace Approximation and \((k, p)\)-Clustering. Algorithmica 85, 3144–3167 (2023). https://doi.org/10.1007/s00453-023-01124-0
