Abstract
The rapid development of technologies that generate arrays of gene data enables a global view of the transcription levels of hundreds of thousands of genes simultaneously. Outlier detection for gene data is important, but it is complicated by high dimensionality. Because data are sparse in high-dimensional space, almost every point looks like a reasonably good outlier under traditional distance-based definitions, which makes finding genuine outliers in high-dimensional data more complex. In this paper, some basic outlier analysis algorithms are discussed and a new genetic algorithm is presented. The algorithm searches for the best dimension projections using a revised cell-based algorithm and provides explanations for the solutions it finds. It can solve the outlier detection problem for gene expression data and for other high-dimensional data as well.
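The sparsity claim in the abstract can be illustrated with a small experiment. The sketch below is not the paper's algorithm; it assumes uniformly random points in the unit hypercube and Euclidean distance, and the function name `relative_contrast` is hypothetical. It measures how much farther the farthest neighbor of a point is than its nearest neighbor. When that contrast shrinks toward zero, all points are roughly equally far away, so a distance threshold can no longer separate outliers from ordinary points.

```python
import math
import random


def relative_contrast(num_points, num_dims, seed=0):
    """Return (d_max - d_min) / d_min for distances from one point to all others.

    Points are drawn uniformly from the unit hypercube (an assumption made
    for illustration only; real gene expression data is not uniform).
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(num_dims)] for _ in range(num_points)]
    query, others = points[0], points[1:]
    # Euclidean distance from the query point to every other point.
    distances = [math.dist(query, p) for p in others]
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The contrast typically drops sharply as the number of dimensions grows.
    for dims in (2, 20, 200, 2000):
        print(dims, round(relative_contrast(200, dims), 3))
```

In low dimensions the nearest neighbor is much closer than the farthest one, while in hundreds of dimensions the two distances become comparable. This is one motivation for searching low-dimensional projections instead of working in the full space.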
Author information
Additional information
This work was supported by the National High-Tech Development 863 Program of China under Grant Nos. 2001AA111041 and 2002AA104560, and by the High-Standard Universities Construction Project in CAS.
Chao Yan is a Ph.D. candidate in the Dept. of Computer Sci. & Tech., University of Science and Technology of China. His major interests include cluster, bioinformatics, and algorithms.
Guo-Liang Chen is a professor and an academician of the Chinese Academy of Sciences. He works in the Dept. of Computer Sci. & Tech., University of Science and Technology of China. His major research areas include parallel theory and algorithms.
Yi-Fei Shen is a Ph.D. candidate in the Dept. of Computer Sci. & Tech., University of Science and Technology of China. His major interests include bioinformatics and algorithms.
Cite this article
Yan, C., Chen, GL. & Shen, YF. Outlier analysis for gene expression data. J. Comput. Sci. & Technol. 19, 13–21 (2004). https://doi.org/10.1007/BF02944782