Abstract
An algorithm is introduced that distinguishes relevant data points from randomly distributed noise. The algorithm is related to subspace clustering based on axis-parallel projections, but it considers membership in any projected cluster of a given side length rather than membership in one particular cluster. An aggregate measure is introduced that is based on the total number of points that are close to a given point in all 2^d possible projections of a d-dimensional hypercube. Evaluating this measure requires no explicit summation over subspaces. Attribute values are normalized by rank order, which avoids assumptions about the distribution of random data. The effectiveness of the algorithm is demonstrated through comparison with conventional outlier detection on a real microarray data set and on time series subsequence data.
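The claim that no explicit summation over subspaces is needed can be illustrated with a short sketch. It is not the article's implementation. It rests on one counting fact: if a neighbour lies within the window of a given point in m of the d attributes, it is close to that point in exactly 2^m - 1 non-empty axis-parallel projections. The function names, the `side` parameter, the window test `|x - y| <= side / 2` on rank-normalized values, and the exclusion of the empty projection are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def rank_normalize(data):
    """Replace each attribute value by its rank, scaled into [0, 1).

    Ties are broken by position; this is an illustrative choice.
    """
    n = data.shape[0]
    ranks = np.argsort(np.argsort(data, axis=0), axis=0)
    return ranks / n


def _close_mask(norm, i, side):
    """Boolean (n-1, d) mask: is each other point within the window of point i?"""
    close = np.abs(norm - norm[i]) <= side / 2
    return np.delete(close, i, axis=0)


def subspace_sum(norm, i, side):
    """Closed form: a neighbour matching in m attributes counts 2**m - 1 times."""
    m = _close_mask(norm, i, side).sum(axis=1)
    return int(np.sum(2 ** m - 1))


def subspace_sum_brute(norm, i, side):
    """Reference version that enumerates every non-empty subset of attributes."""
    close = _close_mask(norm, i, side)
    d = close.shape[1]
    total = 0
    for k in range(1, d + 1):
        for dims in combinations(range(d), k):
            total += int(np.all(close[:, list(dims)], axis=1).sum())
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    norm = rank_normalize(rng.normal(size=(30, 5)))
    print(subspace_sum(norm, 0, 0.2), subspace_sum_brute(norm, 0, 0.2))
```

The closed form costs O(n d) per point, whereas the enumeration grows with 2^d, which is why the brute-force version is only practical as a check for small d.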
Cite this article
Denton, A.M. Subspace sums for extracting non-random data from massive noise. Knowl Inf Syst 20, 35–62 (2009). https://doi.org/10.1007/s10115-008-0176-9
DOI: https://doi.org/10.1007/s10115-008-0176-9