Unsupervised Dimension Reduction Methods for Protein Sequence Classification

Heider, Dominik; Bartenhagen, Christoph; Dybowski, J. Nikolaj; Hauke, Sascha; Pyka, Martin; Hoffmann, Daniel

doi:10.1007/978-3-319-01595-8_32

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Feature extraction methods are widely applied to reduce the dimensionality of data before classification, which decreases the risk of fitting noise. Principal Component Analysis (PCA) is a popular linear method for transforming high-dimensional data into a low-dimensional representation. Non-linear and non-parametric methods for dimension reduction, such as Isomap, t-distributed Stochastic Neighbor Embedding (t-SNE) and Interpol, are also used. In this study, we compare the performance of PCA, Isomap, t-SNE and Interpol as preprocessing steps for the classification of protein sequences. Using random forests, we compared the classification performance on two artificial and eighteen real-world protein data sets, covering HIV drug resistance, HIV-1 co-receptor usage and protein functional class prediction, after preprocessing with each of the four methods. Significant differences between these feature extraction methods were observed. The prediction performance of Interpol converges towards a stable value that is significantly higher than that of PCA, Isomap and t-SNE. This is probably due to the nature of protein sequences, in which amino acids often depend on and affect each other, for instance to achieve conformational stability. However, visualization of data reduced with Interpol is rather unintuitive compared to the other methods. We conclude that Interpol is superior to PCA, Isomap and t-SNE for feature extraction prior to classification, but is of limited use for visualization.
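The abstract does not spell out how Interpol reduces a sequence. As a rough illustration only, the following Python sketch shows the general idea behind interpolation-based preprocessing: each amino acid is mapped to a numerical descriptor and the resulting profile is resampled to a fixed length. It assumes the Kyte-Doolittle hydropathy scale as the descriptor and plain linear interpolation; the function names `encode`, `interpolate_linear` and `preprocess` are hypothetical and are not the API of the Interpol R package, which offers its own descriptors and interpolation options.

```python
from typing import List

# Kyte-Doolittle hydropathy values (assumed descriptor for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(seq: str) -> List[float]:
    """Map each residue to its hydropathy value (KeyError on unknown letters)."""
    return [KD[aa] for aa in seq.upper()]


def interpolate_linear(values: List[float], target_len: int) -> List[float]:
    """Resample a numeric profile to target_len points by linear interpolation."""
    if len(values) < 2 or target_len < 2:
        raise ValueError("need at least two values and target_len >= 2")
    # Spacing between output points, measured in input index units.
    step = (len(values) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        # Clamp so that lo + 1 is always a valid index.
        lo = min(int(pos), len(values) - 2)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


def preprocess(seq: str, target_len: int) -> List[float]:
    """Encode a protein sequence and normalize it to a fixed-length vector."""
    return interpolate_linear(encode(seq), target_len)


if __name__ == "__main__":
    # Sequences of different length end up as vectors of equal length,
    # so they can be fed to a classifier such as a random forest.
    for s in ("ACDEFG", "MKVLAAGIVGLL"):
        print(s, [round(v, 2) for v in preprocess(s, 8)])
```

Because neighbouring positions are blended during resampling, the reduced features keep information about the local sequence context, which is consistent with the dependence between amino acids suggested above as a reason for Interpol's performance.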

Author information

Authors and Affiliations

Department of Bioinformatics, University of Duisburg-Essen, Universitätsstr. 2, 45117, Essen, Germany
Dominik Heider, J. Nikolaj Dybowski & Daniel Hoffmann
Department of Medical Informatics, University of Münster, Domagkstr. 9, 48149, Münster, Germany
Christoph Bartenhagen
CASED, Technische Universität Darmstadt, Mornewegstr. 32, 64293, Darmstadt, Germany
Sascha Hauke
Department of Psychiatry and Psychotherapy, Philipps-University Marburg, Rudolf-Bultmann-Str. 8, 35039, Marburg, Germany
Martin Pyka

Corresponding author

Correspondence to Dominik Heider.

Editor information

Editors and Affiliations

Faculty of Computer Science, Otto-von-Guericke-Universität Magdeburg, Magdeburg, Germany
Myra Spiliopoulou
Institute of Computer Science, University of Hildesheim, Hildesheim, Germany
Lars Schmidt-Thieme
Institute of Computer Science, University of Hildesheim, Hildesheim, Germany
Ruth Janning

Cite this paper

Heider, D., Bartenhagen, C., Dybowski, J.N., Hauke, S., Pyka, M., Hoffmann, D. (2014). Unsupervised Dimension Reduction Methods for Protein Sequence Classification. In: Spiliopoulou, M., Schmidt-Thieme, L., Janning, R. (eds) Data Analysis, Machine Learning and Knowledge Discovery. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-01595-8_32

DOI: https://doi.org/10.1007/978-3-319-01595-8_32
Published: 10 October 2013
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-01594-1
Online ISBN: 978-3-319-01595-8
eBook Packages: Mathematics and Statistics; Mathematics and Statistics (R0)
