Subspace sums for extracting non-random data from massive noise

Denton, Anne M.

doi:10.1007/s10115-008-0176-9

Regular Paper
Published: 18 October 2008

Volume 20, pages 35–62 (2009)

Knowledge and Information Systems

Anne M. Denton¹

Abstract

An algorithm is introduced that distinguishes relevant data points from randomly distributed noise. The algorithm is related to subspace clustering based on axis-parallel projections, but considers membership in any projected cluster of a given side length, as opposed to a particular cluster. An aggregate measure is introduced that is based on the total number of points that are close to the given point in all possible 2^d projections of a d-dimensional hypercube. No explicit summation over subspaces is required for evaluating this measure. Attribute values are normalized based on rank order to avoid making assumptions on the distribution of random data. Effectiveness of the algorithm is demonstrated through comparison with conventional outlier detection on a real microarray data set as well as on time series subsequence data.
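
The abstract's claim that no explicit summation over subspaces is needed can be illustrated with a short sketch. It is not taken from the paper: the function names, the half-side parameter, the tie handling in the rank normalization, and the choice to leave out the empty subspace are all assumptions made here. The sketch relies on a standard identity: if another point lies within the given half side length of the query point in m of the d attributes, then it is a neighbor in exactly 2^m - 1 non-empty axis-parallel subspaces, so the sum over subspaces reduces to one pass over the attributes of each point.

```python
from itertools import combinations


def rank_normalize(rows):
    """Replace each attribute value by rank / (n - 1), giving values in [0, 1].

    Assumption: ties are broken by input order (sorted() is stable).
    """
    n, d = len(rows), len(rows[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        order = sorted(range(n), key=lambda i: rows[i][j])
        for rank, i in enumerate(order):
            out[i][j] = rank / (n - 1) if n > 1 else 0.0
    return out


def subspace_sum(points, q, half_side):
    """Total neighbor count of points[q] over all non-empty subspaces.

    Uses the product identity: a point that is close in m attributes
    contributes 2**m - 1, so no subspace is enumerated explicitly.
    """
    total = 0
    for k, p in enumerate(points):
        if k == q:
            continue
        product = 1
        for a, b in zip(p, points[q]):
            if abs(a - b) <= half_side:
                product *= 2
        total += product - 1  # drop the empty subspace
    return total


def subspace_sum_brute(points, q, half_side):
    """Same quantity by explicit enumeration of subspaces (check only)."""
    d = len(points[q])
    total = 0
    for size in range(1, d + 1):
        for dims in combinations(range(d), size):
            for k, p in enumerate(points):
                if k != q and all(
                    abs(p[j] - points[q][j]) <= half_side for j in dims
                ):
                    total += 1
    return total


if __name__ == "__main__":
    raw = [[10.0, 3.0, 7.0], [12.0, 2.5, 90.0], [11.0, 40.0, 85.0], [95.0, 50.0, 99.0]]
    pts = rank_normalize(raw)
    print(subspace_sum(pts, 0, 0.4), subspace_sum_brute(pts, 0, 0.4))
```

The brute-force function is included only to check the identity on small inputs; its cost grows as 2^d, while the product form costs one pass over the d attributes per point. Counting the empty subspace as well would add the constant n - 1 to every point's score and leave the ranking of points unchanged.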

Author information

Authors and Affiliations

Department of Computer Science and Operations Research, North Dakota State University, Fargo, ND, 58108-6050, USA
Anne M. Denton

Corresponding author

Correspondence to Anne M. Denton.

About this article

Cite this article

Denton, A.M. Subspace sums for extracting non-random data from massive noise. Knowl Inf Syst 20, 35–62 (2009). https://doi.org/10.1007/s10115-008-0176-9

Received: 31 May 2007
Revised: 22 December 2007
Accepted: 19 September 2008
Published: 18 October 2008
Issue Date: July 2009
DOI: https://doi.org/10.1007/s10115-008-0176-9
