Outlier analysis for gene expression data

Journal of Computer Science and Technology

Abstract

The rapid development of technologies that generate arrays of gene data enables a global view of the transcription levels of hundreds of thousands of genes simultaneously. Outlier detection for gene data is important, but it is made difficult by high dimensionality. The sparsity of data in high-dimensional space makes each point a relatively good outlier under traditional distance-based definitions, so finding outliers in high-dimensional data is more complex. In this paper, some basic outlier analysis algorithms are discussed and a new genetic algorithm is presented. The algorithm finds the best dimension projections based on a revised cell-based algorithm and gives explanations for its solutions. It can solve the outlier detection problem for gene expression data and for other high-dimensional data as well.
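The sparsity effect described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: it only measures how the distances from one query point to a set of uniformly random points concentrate as the number of dimensions grows, which is why distance-based definitions lose discriminating power. The function name `relative_contrast`, the sample size, the dimensions, and the seed are all assumptions chosen for illustration.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points uniform random points in the unit cube.

    A small value means the nearest and farthest points are almost equally
    far away, so distance alone barely separates "normal" points from
    outliers.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# Hypothetical settings: 200 points, compared in 2 and in 200 dimensions.
low = relative_contrast(200, 2)
high = relative_contrast(200, 200)
print(f"relative contrast in 2 dims:   {low:.2f}")
print(f"relative contrast in 200 dims: {high:.2f}")
```

In low dimensions the contrast is large, so a point far from its neighbours stands out. In high dimensions the contrast shrinks, which is one motivation for searching lower-dimensional projections rather than measuring distance in the full space.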

Additional information

This work was supported by the National High-Tech Development 863 Program of China under Grant Nos. 2001AA111041 and 2002AA104560, and by the High-Standard Universities Construction Project in CAS.

Chao Yan is a Ph.D. candidate in the Dept. of Computer Sci. & Tech., University of Science and Technology of China. His major interests include cluster, bioinformatics, and algorithms.

Guo-Liang Chen is a professor and an academician of the Chinese Academy of Sciences. He works in the Dept. of Computer Sci. & Tech., University of Science and Technology of China. His major research areas include parallel theory and algorithms.

Yi-Fei Shen is a Ph.D. candidate in the Dept. of Computer Sci. & Tech., University of Science and Technology of China. His major interests include bioinformatics and algorithms.

About this article

Cite this article

Yan, C., Chen, GL. & Shen, YF. Outlier analysis for gene expression data. J. Comput. Sci. & Technol. 19, 13–21 (2004). https://doi.org/10.1007/BF02944782
