Abstract
Most screening methods struggle with the high dimensionality of the data in virtual screening tasks. One of the most commonly used techniques for reducing high-dimensional data is principal component analysis (PCA). PCA and its variants have been introduced and re-introduced to solve problems in particular real-world applications. In this paper, PCA and four of its variants are compared and analyzed together in a virtual screening task using the fingerprint representation, one of the most regularly used descriptors in virtual screening. These methods have not previously been compared and studied together on high-dimensional, binary-valued data. The results show that the PCA variants are superior to PCA on the most heterogeneous classes, while they are competitive with PCA on the homogeneous classes. Supervised PCA is found to be the best technique and is competitive with Fisher discriminant analysis. It should be noted that Fisher discriminant analysis uses all the provided information, while Supervised PCA uses only a few components.
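As a rough illustration of the baseline discussed above, the following sketch applies plain PCA (via singular value decomposition) to binary fingerprint-like data. It is not the paper's implementation: the function name `pca_project`, the toy data, the fingerprint length of 64 bits, and the choice of three components are all assumptions made for illustration, and none of the four PCA variants are shown.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top principal components (plain PCA via SVD)."""
    X = np.asarray(X, dtype=float)
    # Center each bit position so the components describe variance around the mean.
    centered = X - X.mean(axis=0)
    # Economy-size SVD: rows of Vt are principal directions, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    scores = centered @ components.T
    explained_variance = singular_values[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance


# Hypothetical stand-in for fingerprints: 20 molecules x 64 bits,
# drawn from two synthetic classes with different bit densities.
rng = np.random.default_rng(0)
class_a = (rng.random((10, 64)) < 0.2).astype(int)
class_b = (rng.random((10, 64)) < 0.6).astype(int)
fingerprints = np.vstack([class_a, class_b])

scores, components, explained_variance = pca_project(fingerprints, n_components=3)
print(scores.shape)          # reduced representation: one row per molecule
print(explained_variance)    # variance captured by each retained component
```

The reduced `scores` matrix is what a downstream screening model would consume in place of the full bit vectors. Plain PCA ignores class labels and treats the bits as real values, which is the limitation that supervised and binary-data variants are designed to address.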
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pasupa, K. (2013). A Comparison of Dimensionality Reduction Techniques in Virtual Screening. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2013. Lecture Notes in Computer Science, vol 7895. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38610-7_28
DOI: https://doi.org/10.1007/978-3-642-38610-7_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38609-1
Online ISBN: 978-3-642-38610-7
eBook Packages: Computer Science (R0)