Subspace sums for extracting non-random data from massive noise

  • Regular Paper
Knowledge and Information Systems

Abstract

An algorithm is introduced that distinguishes relevant data points from randomly distributed noise. The algorithm is related to subspace clustering based on axis-parallel projections, but considers membership in any projected cluster of a given side length rather than membership in one particular cluster. An aggregate measure is introduced that is based on the total number of points that are close to the given point in all possible 2^d projections of a d-dimensional hypercube. Evaluating this measure requires no explicit summation over subspaces. Attribute values are normalized by rank order, which avoids assumptions about the distribution of random data. The effectiveness of the algorithm is demonstrated through comparison with conventional outlier detection on a real microarray data set as well as on time series subsequence data.
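
The abstract does not give the formula for the aggregate measure, so the following is only a minimal sketch of one way such a measure could be evaluated without enumerating subspaces. It assumes that "close" means within a hypothetical half_width in every attribute of a projection, and that only non-empty projections are counted. Under those assumptions, a point that is close in m attributes is a neighbor in exactly 2^m - 1 projections, so the sum over all projections collapses to a sum over points. The function names and the half_width parameter are illustrative and are not taken from the paper.

```python
import itertools

import numpy as np


def rank_normalize(data):
    """Replace each attribute value by its rank within the column, scaled to [0, 1)."""
    n = data.shape[0]
    # argsort of argsort yields the rank of every value within its column
    ranks = np.argsort(np.argsort(data, axis=0, kind="stable"), axis=0)
    return ranks / n


def subspace_sum(normalized, index, half_width):
    """Count (neighbor, projection) pairs for one point without enumerating projections.

    Assumption (not from the paper): a neighbor is any other point within
    half_width of the query point in every attribute of the projection.
    """
    close = np.abs(normalized - normalized[index]) <= half_width  # shape (n, d)
    m = close.sum(axis=1)        # attributes in which each point is close
    counts = 2.0 ** m - 1.0      # non-empty subsets of those m attributes
    counts[index] = 0.0          # exclude the query point itself
    return counts.sum()


def subspace_sum_brute_force(normalized, index, half_width):
    """Same quantity by explicit enumeration of all non-empty projections (for checking)."""
    d = normalized.shape[1]
    close = np.abs(normalized - normalized[index]) <= half_width
    total = 0
    for size in range(1, d + 1):
        for dims in itertools.combinations(range(d), size):
            in_box = close[:, list(dims)].all(axis=1)
            in_box[index] = False
            total += int(in_box.sum())
    return total
```

The brute-force version visits every non-empty projection and is included only to check the closed form on small inputs; the closed form costs one pass over the n points and d attributes per query point.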

Author information

Corresponding author

Correspondence to Anne M. Denton.

About this article

Cite this article

Denton, A.M. Subspace sums for extracting non-random data from massive noise. Knowl Inf Syst 20, 35–62 (2009). https://doi.org/10.1007/s10115-008-0176-9

  • DOI: https://doi.org/10.1007/s10115-008-0176-9
