Sleeved co-clustering of lagged data

  • Regular paper
Knowledge and Information Systems

Abstract

The paper focuses on mining clusters that are characterized by a lagged relationship between the data objects. We call such clusters lagged co-clusters. A lagged co-cluster of a matrix is a submatrix determined by a subset of rows and their corresponding lags over a subset of columns. Extracting such subsets may reveal an underlying governing regulatory mechanism. Such regulatory mechanisms are quite common in real-life settings. They appear in a variety of fields: meteorology, seismic activity, stock market behavior, neuronal brain activity, river flow, and navigation, to name but a few examples. Mining such lagged co-clusters not only helps in understanding the relationship between objects in the domain, but also assists in forecasting their future behavior. For most interesting variants of this problem, finding an optimal lagged co-cluster is an NP-complete problem. We present a polynomial-time Monte-Carlo algorithm for mining lagged co-clusters. We prove that, with fixed probability, the algorithm mines a lagged co-cluster that encompasses the optimal lagged co-cluster with a column overhead of at most a factor of 2 and no row overhead at all. Moreover, the algorithm handles noise, anti-correlations, missing values, and overlapping patterns. The algorithm is extensively evaluated using both artificial and real-world test environments. The former enable the evaluation of specific, isolated properties of the algorithm. The latter (river flow and topographic data) enable the evaluation of the algorithm's ability to efficiently mine relevant and coherent lagged co-clusters in environments that are temporal (i.e., time-reading data) and non-temporal.
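
To make the definition of a lagged co-cluster concrete, the following minimal sketch extracts the submatrix selected by a subset of rows, one lag per row, and a subset of columns. It is an illustration only, not the paper's algorithm: the function names `lagged_submatrix` and `is_coherent`, the toy matrix, and the coherence test (rows equal up to a per-row additive offset, with no noise) are all assumptions introduced here, since the abstract does not state the exact coherence model.

```python
import numpy as np

def lagged_submatrix(X, rows, lags, cols):
    """Return the block whose entry (r, c) is X[rows[r], cols[c] + lags[r]].

    Hypothetical helper: each selected row is read over the same column
    subset, shifted by that row's own lag.
    """
    X = np.asarray(X, dtype=float)
    rows, lags, cols = np.asarray(rows), np.asarray(lags), np.asarray(cols)
    shifted = cols[None, :] + lags[:, None]  # one shifted column set per row
    if shifted.min() < 0 or shifted.max() >= X.shape[1]:
        raise IndexError("a lagged column falls outside the matrix")
    return X[rows[:, None], shifted]

def is_coherent(block, tol=1e-9):
    """Assumed toy criterion: all rows agree up to a per-row additive offset."""
    centered = block - block.mean(axis=1, keepdims=True)
    return bool(np.all(np.abs(centered - centered[0]) <= tol))

# Toy data: the same pattern planted in three rows at different lags.
pattern = np.array([1.0, 4.0, 2.0, 5.0])
X = np.zeros((3, 8))
X[0, 0:4] = pattern + 10.0  # lag 0, offset +10
X[1, 2:6] = pattern         # lag 2, no offset
X[2, 1:5] = pattern - 3.0   # lag 1, offset -3

aligned = lagged_submatrix(X, rows=[0, 1, 2], lags=[0, 2, 1], cols=[0, 1, 2, 3])
unaligned = lagged_submatrix(X, rows=[0, 1, 2], lags=[0, 0, 0], cols=[0, 1, 2, 3])
```

With the correct lags the three rows line up and `is_coherent(aligned)` holds, whereas reading the same rows without lags (`unaligned`) does not. The mining problem described in the abstract is the harder inverse task of finding the rows, lags, and columns when they are not known in advance.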

Author information

Correspondence to Eran Shaham.

About this article

Cite this article

Shaham, E., Sarne, D. & Ben-Moshe, B. Sleeved co-clustering of lagged data. Knowl Inf Syst 31, 251–279 (2012). https://doi.org/10.1007/s10115-011-0420-6

  • DOI: https://doi.org/10.1007/s10115-011-0420-6
