Outlier analysis for gene expression data

Yan, Chao; Chen, Guo-Liang; Shen, Yi-Fei

doi:10.1007/BF02944782

Published: January 2004

Volume 19, pages 13–21, (2004)

Journal of Computer Science and Technology

Abstract

The rapid development of technologies that generate arrays of gene data enables a global view of the transcription levels of hundreds of thousands of genes simultaneously. Outlier detection for gene data is important, but it is made difficult by high dimensionality. The sparsity of data in high-dimensional space makes each point a relatively good outlier under traditional distance-based definitions, so finding outliers in high-dimensional data is more complex. In this paper, some basic outlier analysis algorithms are discussed and a new genetic algorithm is presented. This algorithm finds the best dimension projections based on a revised cell-based algorithm and gives explanations for its solutions. It can solve the outlier detection problem for gene expression data and for other high-dimensional data as well.
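
The sparsity claim in the abstract can be illustrated with a small experiment. The sketch below is not the authors' algorithm. It only assumes uniformly random points in a unit hypercube and measures how the gap between the nearest and farthest neighbour of a query point shrinks, relative to the nearest distance, as the number of dimensions grows. The function name relative_contrast and its parameters are hypothetical, chosen for this illustration.

```python
import math
import random


def relative_contrast(n_points: int, dim: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points uniform random points in [0, 1]^dim.

    Assumption for illustration only: uniform synthetic data, not gene
    expression measurements.
    """
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    # The contrast is expected to fall as the dimension grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(200, dim), 3))
```

When the contrast is small, every point is almost as far from the query as every other point, so a distance threshold cannot separate a few unusual points from the rest. This is one common way to motivate searching low-dimensional projections instead of the full space.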

Author information

Authors and Affiliations

National High Performance Computational Center, University of Science and Technology of China, 230027, Hefei, P.R. China
Chao Yan, Guo-Liang Chen & Yi-Fei Shen

Additional information

This work was supported by the National High-Tech Development 863 Program of China under Grant Nos. 2001AA111041 and 2002AA104560, and by the High-Standard Universities Construction Project in CAS.

Chao Yan is a Ph.D. candidate in the Dept. of Computer Sci. & Tech., University of Science and Technology of China. His major interests include cluster, bioinformatics, and algorithms.

Guo-Liang Chen is a professor and an academician of the Chinese Academy of Sciences. He works in the Dept. of Computer Sci. & Tech., University of Science and Technology of China. His major research areas include parallel theory and algorithms.

Yi-Fei Shen is a Ph.D. candidate in the Dept. of Computer Sci. & Tech., University of Science and Technology of China. His major interests include bioinformatics and algorithms.

About this article

Cite this article

Yan, C., Chen, GL. & Shen, YF. Outlier analysis for gene expression data. J. Comput. Sci. & Technol. 19, 13–21 (2004). https://doi.org/10.1007/BF02944782

Received: 26 May 2003
Revised: 13 August 2003
Issue Date: January 2004
DOI: https://doi.org/10.1007/BF02944782
