research-article

Clustered subset selection and its applications on IT service metrics

Authors:

Christos Boutsidis,

Nikos Anerousis

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

Pages 599 - 608

https://doi.org/10.1145/1458082.1458162

Published: 26 October 2008

Abstract

Motivated by the enormous amounts of data collected in a large IT service provider organization, this paper presents a method for quickly and automatically summarizing and extracting meaningful insights from the data. Termed Clustered Subset Selection (CSS), our method enables program-guided data explorations of high-dimensional data matrices. CSS combines clustering and subset selection into a coherent and intuitive method for data analysis. In addition to a general framework, we introduce a family of CSS algorithms with different clustering components such as k-means and Close-to-Rank-One (CRO) clustering, and Subset Selection components such as best rank-one approximation and Rank-Revealing QR (RRQR) decomposition.
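The combination of clustering and subset selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the columns of the data matrix are the items being clustered, uses a plain Lloyd-style k-means written here for illustration, and stands in for RRQR with SciPy's column-pivoted QR, keeping only the first pivot of each cluster. The function names `kmeans_columns` and `clustered_subset_selection` are hypothetical.

```python
import numpy as np
from scipy.linalg import qr


def kmeans_columns(A, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on the columns of A; returns one label per column."""
    rng = np.random.default_rng(seed)
    X = np.asarray(A, dtype=float).T  # one row per column of A
    centers = X[rng.choice(X.shape[0], size=k, replace=False)]
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(n_iter):
        # Squared distance from every column to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous center
                centers[j] = members.mean(axis=0)
    return labels


def clustered_subset_selection(A, k, seed=0):
    """Cluster the columns of A, then pick one representative column per cluster."""
    A = np.asarray(A, dtype=float)
    labels = kmeans_columns(A, k, seed=seed)
    chosen = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if idx.size == 0:
            continue
        # Column-pivoted QR (an assumption standing in for RRQR):
        # the first pivot is the largest-norm column of the cluster.
        _, _, piv = qr(A[:, idx], mode="economic", pivoting=True)
        chosen.append(int(idx[piv[0]]))
    return sorted(chosen), labels


# Illustrative data: three groups of four noisy, rescaled copies of a base column.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 3))
A = np.hstack([
    base[:, [j]] * rng.uniform(0.5, 2.0, size=4) + 0.01 * rng.normal(size=(6, 4))
    for j in range(3)
])
reps, labels = clustered_subset_selection(A, k=3)
print("representative columns:", reps)
print("cluster labels:", labels)
```

Selecting within each cluster, rather than across the whole matrix, is what yields both a column subset and a grouping of the remaining columns around their representatives.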

From an empirical perspective, we show that CSS achieves significant improvements over existing Subset Selection methods in terms of approximation error. Compared to existing Subset Selection techniques, CSS also provides additional insight about clusters and cluster representatives. Finally, we present a case study of program-guided data exploration using CSS on a large collection of IT service delivery data.
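The abstract does not state which error measure is used. One common choice in the column subset selection literature, assumed here purely for illustration, is the Frobenius norm of the residual left after projecting the matrix onto the span of the selected columns. The helper name `residual_error` is hypothetical.

```python
import numpy as np


def residual_error(A, cols):
    """Frobenius norm of A minus its projection onto the selected columns."""
    A = np.asarray(A, dtype=float)
    C = A[:, cols]
    # C @ pinv(C) is the orthogonal projector onto the column space of C.
    return np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, ord="fro")


A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
print(residual_error(A, [0, 1]))  # both basis directions selected
print(residual_error(A, [0]))     # only the first direction selected
```

A smaller value means the chosen columns reconstruct the full matrix more faithfully, so two selection methods can be compared by evaluating this quantity for the same number of columns.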


Published In

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

October 2008

1562 pages

ISBN: 9781595939913

DOI: 10.1145/1458082

General Chair:
James G. Shanahan, Church and Duncan Group Inc, USA

Program Chairs:
Sihem Amer-Yahia, Yahoo! Research, USA
Ioana Manolescu, INRIA, France
Yi Zhang, University of California, Santa Cruz, USA
David A. Evans, JustSystems Evans Research, USA
Alek Kolcz, Microsoft Live Labs, USA
Key-Sun Choi, KAIST, Korea
Abdur Chowdury, Twitter, USA

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

CIKM08: Conference on Information and Knowledge Management

October 26 - 30, 2008

Napa Valley, California, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

Amato A, Di Lecce V (2023). Data preprocessing impact on machine learning algorithm performance. Open Computer Science, 13:1. Online publication date: 17-Jul-2023.
https://doi.org/10.1515/comp-2022-0278
Ordozgoiti B, Pai S, Kołczyńska M (2021). Insightful Dimensionality Reduction with Very Low Rank Variable Subsets. Proceedings of the Web Conference 2021, pages 3066-3075. Online publication date: 19-Apr-2021.
https://dl.acm.org/doi/10.1145/3442381.3450067
Chen W, Ishibuchi H, Shang K (2021). Clustering-Based Subset Selection in Evolutionary Multiobjective Optimization. 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 468-475. Online publication date: 17-Oct-2021.
https://doi.org/10.1109/SMC52423.2021.9658582
Bermanis A, Salhov M, Averbuch A (2021). Geometric component analysis and its applications to data analysis. Applied and Computational Harmonic Analysis, 54, pages 20-43. Online publication date: Sep-2021.
https://doi.org/10.1016/j.acha.2021.02.005
Alhusain L, Hafez A (2018). Nonparametric approaches for population structure analysis. Human Genomics, 12:1. Online publication date: 9-May-2018.
https://doi.org/10.1186/s40246-018-0156-4
Rahmani M, Atia G (2017). Spatial Random Sampling: A Structure-Preserving Data Sketching Tool. IEEE Signal Processing Letters, 24:9, pages 1398-1402. Online publication date: Sep-2017.
https://doi.org/10.1109/LSP.2017.2723472
Rahmani M, Atia G (2017). Robust and Scalable Column/Row Sampling from Corrupted Big Data. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 1818-1826. Online publication date: Oct-2017.
https://doi.org/10.1109/ICCVW.2017.215
Farahat A, Elgohary A, Ghodsi A, Kamel M (2015). Greedy column subset selection for large-scale data sets. Knowledge and Information Systems, 45:1, pages 1-34. Online publication date: 1-Oct-2015.
https://dl.acm.org/doi/10.1007/s10115-014-0801-8
Farahat A, Elgohary A, Ghodsi A, Kamel M (2013). Distributed Column Subset Selection on MapReduce. 2013 IEEE 13th International Conference on Data Mining, pages 171-180. Online publication date: Dec-2013.
https://doi.org/10.1109/ICDM.2013.155
