Microarray gene expression data association rules mining based on BSC-tree and FIS-tree

https://doi.org/10.1016/j.datak.2004.06.011

Abstract

In this paper we propose using association rules to mine the association relationships among different genes under the same experimental conditions. Such relationships may also hold across many experiments with different experimental conditions. We present a new approach for mining microarray data, called FIS-tree mining, which uses two new data structures, the BSC-tree and the FIS-tree, together with a data partition format for gene expression level data. These two data structures make it possible to mine association rules from a gene expression database efficiently. Our algorithm was tested on two real-life gene expression databases available at Stanford University and Harvard Medical School and was shown to perform better than two existing algorithms, Apriori and FP-Growth.

Introduction

DNA (deoxyribonucleic acid) microarrays [25], [27] enable scientists to study an entire genome’s expression under a variety of conditions. The advent of DNA microarrays has facilitated a fundamental shift from gene-centric science to genome-centric science [5], [6]. With several eukaryotic genomes completed and the draft human genome published [30], we are now entering the post-genomic age. The main focus of genomic research is shifting from sequencing to using genome sequences to understand how genomes function. Some of the questions we would like to answer are the following:

  • What are the functional roles of different genes?

  • In what cellular processes do genes participate?

  • How are genes regulated?

  • How do genes and gene products interact and what are these interaction networks?

  • How does gene expression level differ in various cell types and states?

  • How is gene expression changed by various diseases or compound treatments?


With the tremendous increase in gene expression data collected by microarray technology, it is becoming possible to answer these questions. However, this raises the question of how to analyze the data quickly and efficiently, because the traditional methods that biologists have used to process and interpret their data are not suited to the huge volume of DNA microarray data. As Brown wrote in [7]: “Perhaps the greatest challenge now is to develop efficient methods for organizing, distributing, interpreting, and extracting insights from the large volumes of data these experiments will provide”.

With the development of data mining methods and software, it is possible to analyze the DNA microarray data. Data mining is an information extraction activity, the goal of which is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. The knowledge found from DNA microarray data by data mining may answer the questions mentioned above.

So far, many data mining methods have been used to mine gene expression data, such as clustering [4], [8] and classification [18], [9]. However, these methods mainly focus on gene expression profiles, that is, the sets of expression values for a single gene across many experimental conditions. An example is the clustering of gene expression data, which groups unknown genes with known genes in the same cluster and so provides clues to their functions [16], [3], [10], [19], [17], [26]. This is based on the hypothesis that genes with similar expression trends under various conditions may have similar functions.

In this paper, we propose using association rules to mine association relationships, such as one gene being the regulator of another gene, among different genes under the same experimental condition. For example, in the database in [29], we may find that gene TUP1 is the regulator of gene YDR533C. These kinds of relations may also exist across many different experiments with various experimental conditions. This type of information is very important to answer the questions posed previously. It will help us understand gene regulation, metabolic and signaling pathways, and gene regulatory networks.
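To make the idea of an association rule between genes concrete, here is a minimal sketch of the two standard measures used to rank such rules, support and confidence. It assumes each experiment has already been reduced to the set of genes considered "expressed" in it. The toy transactions, the placeholder names GENE_A and GENE_B, and the helper names are hypothetical and are not taken from the database in [29].

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy data: one transaction per experiment, listing the genes
# treated as "expressed" in that experiment. GENE_A and GENE_B are made-up
# placeholders; none of these values come from a real database.
transactions: List[FrozenSet[str]] = [
    frozenset({"TUP1", "YDR533C", "GENE_A"}),
    frozenset({"TUP1", "YDR533C"}),
    frozenset({"TUP1", "GENE_B"}),
    frozenset({"GENE_A", "GENE_B"}),
]


def support(itemset: Iterable[str], data: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every gene in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in data if wanted <= t)
    return hits / len(data)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               data: List[FrozenSet[str]]) -> float:
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, data)
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, data) / base if base else 0.0


# Rule under test: TUP1 -> YDR533C
rule_support = support({"TUP1", "YDR533C"}, transactions)
rule_confidence = confidence({"TUP1"}, {"YDR533C"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data the rule holds in two of the four experiments, and in two of the three experiments that contain TUP1.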

When applying an association rule mining algorithm on the microarray gene expression data, the following characteristics must be taken into consideration:

  • The large search space. A microarray gene expression database consists of the data obtained from many microarray slides under various experimental conditions. Each microarray slide can be considered as one database transaction containing the values of genes in one experimental condition, and each gene can be considered as one data item. For human beings there are 50,000–100,000 genes, so an association rule mining algorithm would have to examine a tremendous number of candidate itemsets. To work effectively, such an algorithm must deal robustly with the dimensionality of this feature space.

  • Uninteresting genes. Not all genes are interesting to biologists, who may care about only a particular set of genes. In that case they want to mine the association rules among those genes only, without spending time on the possible association rules of all the others.

  • Data normalization. Due to technical limitations, the constant of proportionality between the actual number of mRNA [20] samples per cell and the relative amount measured by a microarray experiment is unknown, and varies across microarray experiments. This variance introduces noise into experiments and requires that we normalize microarray data by an appropriate factor.
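The three characteristics above can be illustrated with a small preprocessing sketch that turns slides into transactions. It rests on several assumptions that are not from the paper: the per-slide median is used as the normalization factor, a gene becomes an item when its normalized value reaches a fixed threshold, and the gene names, values, and function names are hypothetical.

```python
from statistics import median
from typing import Dict, Set

# Hypothetical raw intensities: slide name -> {gene: measured value}.
# slide_2 has the same pattern as slide_1 on a five times smaller scale,
# mimicking an unknown proportionality constant that differs per slide.
raw: Dict[str, Dict[str, float]] = {
    "slide_1": {"g1": 200.0, "g2": 50.0, "g3": 100.0},
    "slide_2": {"g1": 40.0, "g2": 10.0, "g3": 20.0},
}


def normalize(slide: Dict[str, float]) -> Dict[str, float]:
    """Divide each value by the slide median (an assumed choice of factor).

    Assumes strictly positive intensities, so the median is never zero.
    """
    factor = median(slide.values())
    return {gene: value / factor for gene, value in slide.items()}


def to_transaction(slide: Dict[str, float], genes_of_interest: Set[str],
                   threshold: float = 1.5) -> Set[str]:
    """One slide becomes one transaction; genes outside the selection are skipped."""
    norm = normalize(slide)
    return {g for g in genes_of_interest if g in norm and norm[g] >= threshold}


genes_of_interest = {"g1", "g2"}  # restricting the items shrinks the search space
transactions = {name: to_transaction(values, genes_of_interest)
                for name, values in raw.items()}
print(transactions)
```

Because each slide is divided by its own median, the two slides yield the same transaction even though their raw scales differ.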


Existing work on association rule mining [1], [28], [11], [21], [22] has not been evaluated on datasets similar to microarray gene expression data and does not consider all of the characteristics listed above, even though these algorithms perform well on other kinds of data. The objective of this research is to develop an efficient association rule mining algorithm for microarray gene expression data that takes all of these characteristics into consideration. This paper presents the proposed algorithm, called FIS-tree mining, and a performance evaluation comparing FIS-tree mining with existing algorithms using two real-life microarray gene expression databases from Stanford University [29] and Harvard Medical School [12]. The rest of the paper is organized as follows. Section 2 provides background information on association rule mining and reviews three existing association rule mining algorithms: Apriori [1], [28], FP-Growth [11] and P-tree [21], [22]. Section 3 describes the proposed algorithm, FIS-tree mining. Section 4 presents the performance evaluation. Finally, Section 5 concludes the paper and discusses future research.

Related work

Association rule mining (ARM) is a widely used technique for large-scale data mining. Originally proposed for market basket data to study consumer purchasing patterns in retail stores, it has potential applications in many areas, and microarray data is one of the most promising. Very complex and highly interlinked data, such as a spot on a microarray slide, provides information not only about the intensity of a gene’s expression but also about its interaction with other genes. Extracting

The proposed FIS-tree mining algorithm

In this section, a new algorithm, FIS-tree (Frequent ItemSet-tree) mining, is proposed for mining the microarray data. It considers the characteristics of microarray gene expression data discussed in Section 1. It attempts to incorporate all advantages of the three association rule mining algorithms reviewed in Section 2 and, at the same time, remove their disadvantages. It uses a data format for gene expression data where each value can be represented by a sign bit, fraction bits and exponent

Performance evaluation

In this section we introduce two real-life microarray gene expression datasets, which we use to measure the execution time of our proposed algorithm and of two existing association rule mining algorithms, Apriori and FP-Growth. We then present our comparison results. We did not implement the P-tree association rule mining algorithm because its details are not publicly available (the algorithm has been patented by its authors).

Conclusions and future research

In this paper, we proposed a new association rule mining algorithm called the FIS-tree mining algorithm that makes use of a bit string data partition format and two new data structures, the BSC-tree and FIS-tree. The FIS-tree mining algorithm takes microarray gene expression data’s characteristics into consideration. A BSC-tree is a compression tree. It can be built on the fly for each gene to compute frequent 1-itemsets. Frequent 2 to n-itemsets are computed by performing the logical AND
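The snippet above is cut off before it gives the details, so the following is only a generic sketch of the idea it names: counting frequent itemsets by a logical AND over per-gene bit strings. It uses plain Python integers as uncompressed bitmaps and a simple level-wise extension. It is not the BSC-tree or the FIS-tree, and the data, the `min_count` threshold, and all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet

# Hypothetical toy data: bit i of each integer is 1 when the gene counts as
# "expressed" in experiment i (4 experiments, so 4 bits per gene).
gene_bits: Dict[str, int] = {
    "g1": 0b1111,
    "g2": 0b0111,
    "g3": 0b0011,
    "g4": 0b1000,
}
min_count = 2  # assumed minimum number of supporting experiments


def count(bits: int) -> int:
    """Number of experiments (1-bits) in a bitmap."""
    return bin(bits).count("1")


# Frequent 1-itemsets: genes whose own bitmap has enough 1-bits.
level: Dict[FrozenSet[str], int] = {
    frozenset([gene]): bits
    for gene, bits in gene_bits.items()
    if count(bits) >= min_count
}
all_frequent: Dict[FrozenSet[str], int] = dict(level)

# Frequent k-itemsets for k >= 2: join two frequent (k-1)-itemsets that differ
# in one gene and AND their bitmaps; the result marks the experiments that
# contain every gene in the union.
while level:
    next_level: Dict[FrozenSet[str], int] = {}
    for (a, bits_a), (b, bits_b) in combinations(level.items(), 2):
        union = a | b
        if len(union) != len(a) + 1:
            continue
        combined = bits_a & bits_b
        if count(combined) >= min_count:
            next_level[union] = combined
    all_frequent.update(next_level)
    level = next_level

for itemset, bits in sorted(all_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), format(bits, "04b"), count(bits))
```

The appeal of the bitmap representation is that the support count of a larger itemset comes from one AND and one bit count, with no rescan of the original transactions.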

Xiang-Rong Jiang received his Ph.D. degree in Synthetic Organic Chemistry from the Shanghai Institute of Organic Chemistry, Shanghai, China, in 1995. He is a Senior Research Associate at the College of Pharmacy, University of South Carolina. His current research interests include the design and synthesis of selective estrogen receptor modulators.

References

  • A. Brazma et al., Gene expression data analysis, FEBS Letters (2000)
  • R. Agrawal et al., Fast algorithms for mining association rules, 20th VLDB (1994)
  • R. Agrawal, T. Imielinski, A.N. Swami, Mining association rules between sets of items in large databases, in: Proc. ACM...
  • K. Alsabti, S. Ranka, V. Singh, An efficient K-means clustering algorithm, in: Proc. IPPS/SPDP Workshop on High...
  • M. Chee, Accessing genetic information with high-density DNA, Science (1996)
  • J. DeRisi, Use of a cDNA microarray to analyze gene expression patterns in human cancer, Nature Genet. (1996)
  • J.L. DeRisi et al., Exploring the metabolic and genetic control of gene expression on a genomic scale, Science (1997)
  • M. Eisen, P.T. Spellman, D. Botstein, P.O. Brown, Cluster analysis and display of genome-wide expression patterns, in:...
  • J. Fridlyand, S. Dudoit, Comparison of supervised learning methods for the classification of tumors using gene...
  • B. Fritzke, A growing neural gas network learns topologies
  • J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in: Proc. ACM-SIGMOD Int. Conf....
  • Harvard Medical School, Available from http://arep.med.harvard.edu/ExpressDB/ November...
  • J. Hipp, C. Guntzen, et al., Mining association rules: deriving a superior algorithm by analyzing today’s approaches,...
  • H.D. Huang, H.L. Chang, T.S. Tsou, B.J. Liu, C.Y. Kao, J.T. Horng, A data mining method to predict transcriptional...
  • X.R. Jiang, Y.Z. Wu, The project of a Method of the Real Time Data Compression for Bit...
    Le Gruenwald is a Professor in the School of Computer Science at University of Oklahoma. She received her Ph.D. in Computer Science from Southern Methodist University in 1990. She was a Software Engineer at White River Technologies, a Lecturer in the Computer Science and Engineering Department at Southern Methodist University, and a Member of Technical Staff in the Database Management Group at the Advanced Switching Laboratory of NEC, America. Her major research interests include Web-enabled Databases, Mobile Databases, Real-Time Main Memory Databases, Multimedia Databases, Data Warehouse and Data Mining. She is a member of ACM, SIGMOD, and IEEE Computer Society.
