Abstract
The paper focuses on mining clusters that are characterized by a lagged relationship between the data objects. We call such clusters lagged co-clusters. A lagged co-cluster of a matrix is a submatrix determined by a subset of rows and their corresponding lags over a subset of columns. Extracting such subsets may reveal an underlying governing regulatory mechanism. Such mechanisms are common in real-life settings and appear in a variety of fields; meteorology, seismic activity, stock market behavior, neuronal brain activity, river flow, and navigation are only a few examples. Mining such lagged co-clusters not only helps in understanding the relationship between objects in the domain, but also assists in forecasting their future behavior. For most interesting variants of this problem, finding an optimal lagged co-cluster is an NP-complete problem. We present a polynomial-time Monte-Carlo algorithm for mining lagged co-clusters. We prove that, with fixed probability, the algorithm mines a lagged co-cluster that contains the optimal lagged co-cluster with a column overhead ratio of at most 2 and no row overhead. Moreover, the algorithm handles noise, anti-correlations, missing values, and overlapping patterns. The algorithm is extensively evaluated using both artificial and real-world test environments. The former enable the evaluation of specific, isolated properties of the algorithm. The latter (river flow and topographic data) enable the evaluation of the algorithm's ability to efficiently mine relevant and coherent lagged co-clusters in environments that are both temporal (i.e., time-series readings) and non-temporal.
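To make the definition of a lagged co-cluster concrete, the following is a minimal sketch of how a submatrix could be read out of a matrix given a subset of rows, one lag per row, and a subset of columns. It only illustrates the indexing idea; it is not the paper's mining algorithm, and the function name `lagged_submatrix`, the sign convention for lags, and the toy data are assumptions made for illustration.

```python
import numpy as np

def lagged_submatrix(X, rows, lags, cols):
    """Read row rows[i] at columns cols + lags[i] (hypothetical helper).

    Assumes a positive lag shifts the selected columns to the right.
    """
    X = np.asarray(X)
    rows = np.asarray(rows)
    lags = np.asarray(lags)
    cols = np.asarray(cols)
    if rows.shape != lags.shape:
        raise ValueError("need exactly one lag per selected row")
    # shifted[i, j] is the column actually read for row i and base column j
    shifted = cols[None, :] + lags[:, None]
    if shifted.min() < 0 or shifted.max() >= X.shape[1]:
        raise IndexError("a lagged column falls outside the matrix")
    return X[rows[:, None], shifted]

# Toy data (made up): the same pattern planted in three rows at different lags
pattern = np.array([5.0, 1.0, 4.0])
X = np.zeros((4, 8))
X[0, [1, 3, 4]] = pattern  # lag 0
X[2, [3, 5, 6]] = pattern  # lag 2
X[3, [2, 4, 5]] = pattern  # lag 1

sub = lagged_submatrix(X, rows=[0, 2, 3], lags=[0, 2, 1], cols=[1, 3, 4])
```

Once each row is aligned by its lag, the rows of `sub` line up column by column, which is the kind of regularity a lagged co-cluster is meant to capture. How coherence is scored, and how noise or missing values are tolerated, is left to the paper itself.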
Cite this article
Shaham, E., Sarne, D. & Ben-Moshe, B. Sleeved co-clustering of lagged data. Knowl Inf Syst 31, 251–279 (2012). https://doi.org/10.1007/s10115-011-0420-6