Mining gene–sample–time microarray data: a coherent gene cluster discovery approach

  • Regular Paper
Knowledge and Information Systems

Abstract

Extensive studies have shown that mining microarray data sets is important in bioinformatics research and biomedical applications. In this paper, we explore a novel type of data set, gene–sample–time microarray data, which records the expression levels of various genes under a set of samples over a series of time points. In particular, we propose the mining of coherent gene clusters from such data sets. Each cluster contains a subset of genes and a subset of samples such that the genes are coherent on the samples along the time series. The coherent gene clusters may identify the samples corresponding to some phenotypes (e.g., diseases) and suggest candidate genes correlated with those phenotypes. We present two efficient algorithms, the Sample-Gene Search and the Gene-Sample Search, to mine the complete set of coherent gene clusters. We empirically evaluate the performance of our approaches on both a real microarray data set and synthetic data sets. The test results show that our approaches are both efficient and effective in finding meaningful coherent gene clusters.
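To make the notion of a coherent gene cluster concrete, the following is a minimal sketch rather than the algorithms presented in the paper. The abstract does not say how coherence is measured, so the sketch assumes that a gene is coherent on a set of samples when the Pearson correlation of its time series is at least a threshold for every pair of those samples. The names `is_coherent`, `data`, and `delta`, and the toy values, are hypothetical.

```python
import numpy as np

def is_coherent(data, genes, samples, delta=0.8):
    """Check whether (genes, samples) forms a coherent cluster.

    data: array of shape (n_genes, n_samples, n_times).
    Assumption (not stated in the abstract): a gene is coherent on the
    samples if the Pearson correlation between its time series on every
    pair of samples is at least `delta`.
    """
    for g in genes:
        series = data[g, samples, :]      # shape (len(samples), n_times)
        corr = np.corrcoef(series)        # pairwise Pearson correlations
        # NaN (e.g. a constant series) compares False, so it is rejected.
        if not np.all(corr >= delta):
            return False
    return True

# Toy data: 3 genes x 3 samples x 4 time points (illustrative values only).
data = np.array([
    [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]],
    [[1, 3, 2, 5], [2, 6, 4, 10], [5, 2, 3, 1]],
    [[1, 2, 1, 2], [2, 1, 2, 1], [1, 2, 3, 4]],
], dtype=float)

ok_pair = is_coherent(data, genes=[0, 1], samples=[0, 1])
ok_all = is_coherent(data, genes=[0, 1], samples=[0, 1, 2])
```

Here genes 0 and 1 follow the same temporal shape on samples 0 and 1, so that pair of subsets passes the check, while adding sample 2 (where both genes trend in the opposite direction) breaks coherence. Mining the complete set of such clusters is a separate search problem that this check alone does not address.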

Author information

Corresponding author

Correspondence to Daxin Jiang.

Additional information

Daxin Jiang received the Ph.D. degree in computer science and engineering from the State University of New York at Buffalo in 2005. He received the B.S. degree in computer science from the University of Science and Technology of China. From 1998 to 2000, he was an M.S. student in the Software Institute, Chinese Academy of Sciences. He is currently an assistant professor at the School of Computer Engineering, Nanyang Technological University, Singapore. His research interests include data mining, bioinformatics, machine learning, and information retrieval.

Jian Pei received the Ph.D. degree in computing science from Simon Fraser University, Canada, in 2002, under Dr. Jiawei Han's supervision. He also received the B.Eng. and the M.Eng. degrees from Shanghai Jiao Tong University, China, in 1991 and 1993, respectively, both in computer science. He is currently an assistant professor of computing science at Simon Fraser University. His research interests include developing effective and efficient data analysis techniques for novel data-intensive applications. He is currently interested in various techniques of data mining, data warehousing, online analytical processing, and database systems, as well as their applications in bioinformatics. His current research is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the National Science Foundation (NSF) of the United States. Since 2000, he has published over 70 research papers in refereed journals, conferences, and workshops, has served on the organization committees and the program committees of over 60 international conferences and workshops, and has been a reviewer for some leading academic journals. He is a member of the ACM, the ACM SIGMOD, and the ACM SIGKDD.

Murali Ramanathan is an associate professor of pharmaceutical sciences and neurology. He received the B.Tech. (Honors) in chemical engineering from the Indian Institute of Technology, India, in 1983. After a 4-year stint in the chemical industry, he obtained the M.S. degree in chemical engineering from Iowa State University, Ames, IA, in 1987, and the Ph.D. degree in bioengineering from the University of California-San Francisco and University of California-Berkeley Joint Program in Bioengineering in 1994. Dr. Ramanathan's research interests are primarily focused on the treatment of multiple sclerosis (MS), an inflammatory-demyelinating disease of the central nervous system that affects over 1 million patients worldwide. MS is a complex, variable disease that causes physical and cognitive disability, and nearly 50% of patients diagnosed with MS are unable to walk after 15 years. The etiology and pathogenesis of MS remain poorly understood. Dr. Ramanathan's research interests also include stochastic modeling of pharmaceutical systems and novel approaches to analyzing and using genetic and genomic data for improving patient care and optimizing therapy.

Chuan Lin is currently a Ph.D. student in the Department of Computer Science and Engineering, State University of New York at Buffalo. She received the B.E. and the M.S. degrees in computer science and technology from Tsinghua University in China. Her research interests include bioinformatics, data mining, and machine learning.

Chun Tang received the B.S. and M.S. degrees from Peking University, China, in 1996 and 1999, respectively, and the Ph.D. degree from the State University of New York at Buffalo, USA, in 2005, all in computer science. Currently, she is a postdoctoral associate at the Center for Medical Informatics, Yale University. Her research interests include bioinformatics, data mining, machine learning, databases, and information retrieval.

Aidong Zhang received the Ph.D. degree in computer science from Purdue University, West Lafayette, Indiana, in 1994. She was an assistant professor from 1994 to 1999, an associate professor from 1999 to 2002, and has been a professor since 2002 in the Department of Computer Science and Engineering at the State University of New York at Buffalo. Her research interests include multimedia systems, content-based image retrieval, bioinformatics, and data mining. She is an author of over 140 research publications in these areas. Dr. Zhang's research has been funded by NSF, NIH, NIMA, and Xerox. Dr. Zhang serves on the editorial boards of International Journal of Bioinformatics Research and Applications (IJBRA), ACM Multimedia Systems, International Journal of Multimedia Tools and Applications, and International Journal of Distributed and Parallel Databases. She was the editor for ACM SIGMOD DiSC (Digital Symposium Collection) from 2001 to 2003. She was co-chair of the technical program committee for ACM Multimedia in 2001. She has also served on various conference program committees. Dr. Zhang is a recipient of the National Science Foundation CAREER award and the SUNY Chancellor's Research Recognition award.

About this article

Cite this article

Jiang, D., Pei, J., Ramanathan, M. et al. Mining gene–sample–time microarray data: a coherent gene cluster discovery approach. Knowl Inf Syst 13, 305–335 (2007). https://doi.org/10.1007/s10115-006-0031-9

  • DOI: https://doi.org/10.1007/s10115-006-0031-9
