DOI: 10.1145/1458082.1458162
research-article

Clustered subset selection and its applications on IT service metrics

Published: 26 October 2008

Abstract

Motivated by the enormous amounts of data collected in a large IT service provider organization, this paper presents a method for quickly and automatically summarizing the data and extracting meaningful insights from it. Termed Clustered Subset Selection (CSS), our method enables program-guided exploration of high-dimensional data matrices. CSS combines clustering and subset selection into a coherent and intuitive method for data analysis. In addition to a general framework, we introduce a family of CSS algorithms with different clustering components, such as k-means and Close-to-Rank-One (CRO) clustering, and different subset selection components, such as best rank-one approximation and Rank-Revealing QR (RRQR) decomposition.
Empirically, we show that CSS achieves significant improvements over existing subset selection methods in terms of approximation error. Unlike those methods, CSS also provides additional insight into clusters and cluster representatives. Finally, we present a case study of program-guided data exploration using CSS on a large collection of IT service delivery data.
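
To make the cluster-then-select idea concrete, the following is a minimal sketch and not the paper's algorithm. It assumes k-means over the columns of the data matrix as the clustering component, and it reads "best rank-one approximation" as picking, in each cluster, the column best aligned with that cluster's leading left singular vector. All function names, the farthest-first seeding, and the projection-based error measure are illustrative assumptions; CRO clustering and RRQR are not shown.

```python
import numpy as np


def farthest_first_centers(X, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    largest-norm row, then repeatedly add the row farthest from all
    centers chosen so far."""
    centers = [X[np.linalg.norm(X, axis=1).argmax()]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    return np.array(centers, dtype=float)


def kmeans_columns(A, k, n_iter=50):
    """Lloyd's k-means over the columns of A; returns one label per column."""
    X = A.T  # one row per column of A
    centers = farthest_first_centers(X, k)
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(n_iter):
        # squared distance from every column to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def rank_one_representative(A, cols):
    """Column index (into A) whose direction is closest to the leading
    left singular vector of the cluster A[:, cols]."""
    sub = A[:, cols]
    u = np.linalg.svd(sub, full_matrices=False)[0][:, 0]
    norms = np.linalg.norm(sub, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero columns
    cosines = np.abs(u @ sub) / norms
    return int(cols[int(cosines.argmax())])


def clustered_subset_selection(A, k):
    """Cluster the columns of A, then keep one representative per cluster."""
    A = np.asarray(A, dtype=float)
    labels = kmeans_columns(A, k)
    reps = []
    for j in range(k):
        cols = np.flatnonzero(labels == j)
        if cols.size:
            reps.append(rank_one_representative(A, cols))
    return sorted(reps), labels


def relative_error(A, selected):
    """||A - C C^+ A||_F / ||A||_F, where C = A[:, selected]."""
    A = np.asarray(A, dtype=float)
    C = A[:, selected]
    residual = A - C @ np.linalg.pinv(C) @ A
    return np.linalg.norm(residual) / np.linalg.norm(A)
```

Calling `clustered_subset_selection(A, k)` returns the selected column indices together with the cluster label of every column, so each representative can be traced back to the group of columns it summarizes. `relative_error(A, reps)` then gives one way to compare the selection against another subset of the same size.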

Published In

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
October 2008
1562 pages
ISBN:9781595939913
DOI:10.1145/1458082

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. low rank matrix approximation
  3. service delivery
  4. service provider
  5. subset selection

Qualifiers

  • Research-article

Conference

CIKM08: Conference on Information and Knowledge Management
October 26 - 30, 2008
Napa Valley, California, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2023) Data preprocessing impact on machine learning algorithm performance. Open Computer Science 13:1. DOI: 10.1515/comp-2022-0278. Online publication date: 17-Jul-2023
  • (2021) Insightful Dimensionality Reduction with Very Low Rank Variable Subsets. Proceedings of the Web Conference 2021, pages 3066-3075. DOI: 10.1145/3442381.3450067. Online publication date: 19-Apr-2021
  • (2021) Clustering-Based Subset Selection in Evolutionary Multiobjective Optimization. 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 468-475. DOI: 10.1109/SMC52423.2021.9658582. Online publication date: 17-Oct-2021
  • (2021) Geometric component analysis and its applications to data analysis. Applied and Computational Harmonic Analysis 54, pages 20-43. DOI: 10.1016/j.acha.2021.02.005. Online publication date: Sep-2021
  • (2018) Nonparametric approaches for population structure analysis. Human Genomics 12:1. DOI: 10.1186/s40246-018-0156-4. Online publication date: 9-May-2018
  • (2017) Spatial Random Sampling: A Structure-Preserving Data Sketching Tool. IEEE Signal Processing Letters 24:9, pages 1398-1402. DOI: 10.1109/LSP.2017.2723472. Online publication date: Sep-2017
  • (2017) Robust and Scalable Column/Row Sampling from Corrupted Big Data. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 1818-1826. DOI: 10.1109/ICCVW.2017.215. Online publication date: Oct-2017
  • (2015) Greedy column subset selection for large-scale data sets. Knowledge and Information Systems 45:1, pages 1-34. DOI: 10.1007/s10115-014-0801-8. Online publication date: 1-Oct-2015
  • (2013) Distributed Column Subset Selection on MapReduce. 2013 IEEE 13th International Conference on Data Mining, pages 171-180. DOI: 10.1109/ICDM.2013.155. Online publication date: Dec-2013
