Abstract
Because biological experiments are inherently imprecise, biological data often contain redundant and noisy instances, which usually arise from errors in data collection, such as contamination of laboratory samples. Gene expression data are one example of noisy biological data affected by this problem. Machine Learning algorithms have been used successfully in gene expression analysis. Although many Machine Learning algorithms can tolerate noise, detecting and removing noisy instances from the data can help the induction of the target hypothesis. This paper evaluates distance-based pre-processing techniques on gene expression data, analyzing how effectively these techniques, alone and in combination, remove noisy data. Effectiveness is measured by the accuracy that different Machine Learning classifiers obtain on the pre-processed data. The results indicate that the pre-processing techniques employed were effective for noise detection.
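The abstract does not name the specific distance-based techniques or how they are combined, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a simple k-nearest-neighbour label-disagreement filter (in the spirit of edited nearest neighbour rules) run with several values of k, and combines the filters by majority vote. The function names, the choice of Euclidean distance, the values of k, and the toy data are all assumptions made for illustration; the toy points are not gene expression data.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_noise_flags(X, y, k):
    """Flag each sample whose label differs from the majority label
    of its k nearest neighbours (hypothetical helper, assumed filter)."""
    flags = []
    for i, xi in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: euclidean(xi, X[j]))[:k]
        majority_label, _ = Counter(y[j] for j in nearest).most_common(1)[0]
        flags.append(majority_label != y[i])
    return flags


def majority_vote(flag_lists):
    """Combine several filters: a sample is treated as noisy only if
    more than half of the filters flag it."""
    n_filters = len(flag_lists)
    return [2 * sum(votes) > n_filters for votes in zip(*flag_lists)]


# Toy data (made up): two well-separated clusters; sample 4 sits in the
# first cluster but carries the label of the second one.
X = [
    [0.0, 0.0], [1.0, 0.2], [0.3, 1.1], [1.2, 1.3], [0.6, 0.5], [0.1, 0.7],
    [10.0, 10.0], [11.0, 10.2], [10.3, 11.1], [11.2, 11.3], [10.6, 10.5], [10.1, 10.7],
]
y = ["A", "A", "A", "A", "B", "A", "B", "B", "B", "B", "B", "B"]

# Odd k values avoid voting ties in this two-class example.
flags_per_k = [knn_noise_flags(X, y, k) for k in (1, 3, 5)]
combined = majority_vote(flags_per_k)
noisy = [i for i, flagged in enumerate(combined) if flagged]

# Keep only the samples that the ensemble did not flag.
X_clean = [x for x, flagged in zip(X, combined) if not flagged]
y_clean = [label for label, flagged in zip(y, combined) if not flagged]
print("flagged as noisy:", noisy)
```

In this toy setting the k = 1 filter on its own also flags a correctly labelled neighbour of the mislabelled sample, while the vote across the three filters does not. That is one plausible motivation for combining filters, although the combination scheme actually used in the paper may differ.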
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Libralon, G.L., Carvalho, A.C.P.L.F., Lorena, A.C. (2009). Ensembles of Pre-processing Techniques for Noise Detection in Gene Expression Data. In: Köppen, M., Kasabov, N., Coghill, G. (eds) Advances in Neuro-Information Processing. ICONIP 2008. Lecture Notes in Computer Science, vol 5506. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02490-0_60
DOI: https://doi.org/10.1007/978-3-642-02490-0_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02489-4
Online ISBN: 978-3-642-02490-0
eBook Packages: Computer Science (R0)