Abstract
The kernel method, especially the kernel-fusion method, is widely used in social networks, computer vision, bioinformatics, and other applications. It deals effectively with nonlinear classification problems, which can map linearly inseparable biological sequence data from low to high-dimensional space for more accurate differentiation, enabling the use of kernel methods to predict the structure and function of sequences. Therefore, the kernel method is significant in the solution of bioinformatics problems. Various kernels applied in bioinformatics are explained clearly, which can help readers to select proper kernels to distinguish tasks. Mass biological sequence data occur in practical applications. Research of the use of machine learning methods to obtain knowledge, and how to explore the structure and function of biological methods for theoretical prediction, have always been emphasized in bioinformatics. The kernel method has gradually become an important learning algorithm that is widely used in gene expression and biological sequence prediction. This review focuses on the requirements of classification tasks of biological sequence data. It studies kernel methods and optimization algorithms, including methods of constructing kernel matrices based on the characteristics of biological sequences and kernel fusion methods existing in a multiple kernel learning framework.
Similar content being viewed by others
Explore related subjects
Discover the latest articles and news from researchers in related subjects, suggested using machine learning.References
Vapnik V N. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 1999, 10(5): 988–999
Schölkopf B, Smola A, Müller K R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 1998, 10(5): 1299–1319
Nello C, John S T. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000
Mercer J. XVI. Functions of positive and negative type, and their connection the theory of integral equations. Philosophical Transactions of the Royal Society of London, 1909, 209(441–458): 415–446
Vapnik V. Statistical learning theory. New York: Wiley, 1998
Mika S, Ratsch G, Weston J, Scholkopf B, Mullers K R. Fisher discriminant analysis with kernels. In: Proceedings of the 1999 IEEE Signal Processing Society Workshop. 1999, 41–48
Cristianini N, John S T, Elisseeff A, Kandola J S. On kernel-target alignment. Advances in Neural Information Processing Systems, 2002, 367–373
Song L, Kolar M, Xing E P. KELLER: estimating time-varying interactions between genes. Bioinformatics, 2009, 25(12): i128–i136
Song L, Bedo J, Borgwardt K M, Gretton A, Smola A. Gene selection via the BAHSIC family of algorithms. Bioinformatics, 2007, 23(13): i490–i498
Kato T, Tsuda K, Asai K. Selective integration of multiple biological data for supervised network inference. Bioinformatics, 2005, 21(10): 2488–2495
Donini M, Monteiro J M, Pontil M, Shawe-Taylor J, Mourao-Miranda J. A multimodal multiple kernel learning approach to Alzheimer’s disease detection. In: Proceedings of the 26th IEEE International workshop on Machine Learning for Signal Processing. 2016, 1–6
Gu Y, Liu T, Jia X, Benediktsson J A, Chanussot J. Nonlinear multiple kernel learning with multiple-structure-element extended morphological profiles for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 2016, 54(6): 3235–3247
Han L, Yue Z, Guo X. Image segmentation implementation based on FPGA and SVM. In: Proceedings of the 3rd International Conference on Control, Automation and Robotics. 2017, 405–409
Leslie C S, Eskin E, Cohen A, Weston J, Noble W S. Mismatch string kernels for discriminative protein classification. Bioinformatics, 2004, 20(4): 467–476
Tsuda K, Noble W S. Learning kernels from biological networks by maximizing entropy. Bioinformatics, 2004, 20(suppl_1): i326–i333
Chou K C. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics, 2009, 6(4): 262–274
Swamidass S J, Chen J, Bruand J, Phung P, Ralaivola L, Baldi P. Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics, 2005, 21(suppl_1): i359–i368
Asa B H, Noble W S. Kernel methods for predicting protein-protein interactions. Bioinformatics, 2005, 21(suppl_1): i38–i46
Lanckriet G R, Cristianini N, Bartlett P, Ghaoui L E, Jordan M I. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 2004, 5(Jan): 27–72
Lanckriet G R, Tijl D B, Cristianini N, Jordan M I, Noble W S. A statistical framework for genomic data fusion. Bioinformatics, 2004, 20(16): 2626–2635
Lanckriet G R, Deng M, Cristianini N, Jordan M I, Noble W S. Kernelbased data fusion and its application to protein function prediction in yeast. Biocomputing: World Scientific, 2003
Bach F R, Thibaux R, Jordan M I. Computing regularization paths for learning multiple kernels. Advances in Neural Information Processing Systems, 2005, 73–80
Sonnenburg S, Rätsch G, Schäfer C. A general and efficient multiple kernel learning algorithm. Advances in Neural Information Processing Systems, 2006, 1273–1280
Jebara T. Multi-task feature and kernel selection for SVMs. In: Proceedings of the 21th International Conference on Machine Learning. 2004, 55
Lewis D P, Jebara T, Noble W S. Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure. Bioinformatics, 2006, 22(22): 2753–2760
Rätsch G, Sonnenburg S, Schäfer C. Learning interpretable SVMs for biological sequence classification. BMC Bioinformatics, 2006, 7(1): S9
Varma M, Babu B R. More generality in efficient multiple kernel learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, 1065–1072
Jain A, Vishwanathan S V, Varma M. SPF-GMKL: generalized multiple kernel learning with a million kernels. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2012, 750–758
Wu P, Hoi S C, Zhao P, Miao C, Liu Z-Y. Online multi-modal distance metric learning with application to image retrieval. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(2): 454–467
Borgwardt K M, Ong C S, Schönauer S, Vishwanathan S, Smola A J, Kriegel H-P. Protein function prediction via graph kernels. Bioinformatics, 2005, 21(suppl_1), i47–i56
Zien A, Ong C S. An automated combination of sequence motif kernels for predicting protein subcellular localization, 2006
Damoulas T, Girolami M A. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics, 2008, 24(10): 1264–1270
Vert J P, Qiu J, Noble W S. A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics, 2007, 8(10): 1–10
Vapnik V. The nature of statistical learning theory. Springer science & business media, 2013
Aronszajn N. Theory of reproducing kernels. Transactions of the American Mathematical Society, 1950, 68(3): 337–404
Boser B E, Guyon I M, Vapnik V N. A training algorithm for optimal margin classifiers. In: Proceedings of the 5th annual workshop on Computational learning theory. 1992, 144–152
Boyd S, Vandenberghe L. Convex optimization. Cambridge university press, 2004
Leslie C, Eskin E, Noble W S. The spectrum kernel: A string kernel for SVM protein classification. Biocomputing 2002: World Scientific, 2001
Saigo H, Vert J P, Ueda N, Akutsu T. Protein homology detection using string alignment kernels. Bioinformatics, 2004, 20(11): 1682–1689
Rätsch G, Sonnenburg S. Accurate splice site prediction for caenorhabditis elegans. Computational Molecular Biology, 2004, 277–298
Asa B H, Brutlag D. Remote homology detection: a motif based approach. Bioinformatics, 2003, 19(suppl_1): i26–i33
Nevill M, Craig G, Wu T D, Brutlag D L. Highly specific protein sequence motifs for genome analysis. In: Proceedings of the National Academy of Sciences. 1998, 95(11): 5865–5871
Huang J Y, Brutlag D L. The EMOTIF database. Nucleic Acids Research, 2001, 29(1): 202–204
Kuang R, WANG K, Wang K, Siddiqi M, Freund Y, Leslie C. Profile-based string kernels for remote homology detection and motif extraction. Journal of Bioinformatics and Computational Biology, 2005, 3(3): 527–550
Rätsch G, Sonnenburg S, Srinivasan J, Witte H, Müller K R, Sommer R J, Schölkopf B. Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS Computational Biology, 2007, 3(2): e20
Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong C S, Philips P, De Bona F, Hartmann L, Bohlen A. mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 2009, 19(11): 2133–2143
Jacob L, Vert J P. Efficient peptide—MHC-I binding prediction for alleles with few known binders. Bioinformatics, 2007, 24(3): 358–366
Röttig M, Rausch C, Kohlbacher O. Combining structure and sequence information allows automated prediction of substrate specificities within enzyme families. PLoS Computational Biology, 2010, 6(1): e1000636
Teramoto R, Aoki M, Kimura T, Kanaoka M. Prediction of siRNA functionality using generalized string kernel and support vector machine. FEBS letters, 2005, 579(13): 2878–2882
Kuksa P, Qi Y, Bai B, Collobert R, Weston J, Pavlovic V, Ning X. Semi-supervised abstraction-augmented string kernel for multi-level bio-relation extraction. Joint European Conference on Machine Learning and Knowledge Discovery in Databases: Springer, 2010, 128–144
Leslie C, Eskin E, Weston J, Noble W S. Mismatch string kernels for SVM protein classification. Advances in Neural Information Processing Systems, 2003, 1441–1448
Weston J, Leslie C, Ie E, Zhou D, Elisseeff A, Noble W S. Semi-supervised protein classification using cluster kernels. Bioinformatics, 2005, 21(15): 3241–3247
Kuksa P, Huang P H, Pavlovic V. Scalable algorithms for string kernels with inexact matching. Advances in Neural Information Processing Systems, 2008, 21, 881–888
Leslie C, Kuang R, Bennett K. Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 2004, 5(9)
Cichonska A, Pahikkala T, Szedmak S, Julkunen H, Airola A, Heinonen M, Aittokallio T, Rousu J. Learning with multiple pairwise kernels for drug bioactivity prediction. Bioinformatics, 2018, 34(13): i509–i518
Liao L, Noble W S. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. In: Proceedings of the Sixth Annual International Conference on Computational Biology. 2002, 225–232
Filatov G, Bauwens B, Attila K F. LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification. Bioinformatics, 2018, 34(19): 3281–3288
Smith T F, Waterman M S. Identification of common molecular subsequences. Journal of Molecular Biology, 1981, 147(1): 195–197
Altschul S F, Gish W, Miller W, Myers E W, Lipman D J. Basic local alignment search tool. Journal of Molecular Biology, 1990, 215(3): 403–410
Vert J P, Saigo H, Akutsu T. Local alignment kernels for biological sequences. Kernel Methods in Computational Biology, 2004, 131–154
Jaakkola T, Diekhans M, Haussler D. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 2000, 7(1–2), 95–114
Gönen M, Alpaydin E. Multiple kernel learning algorithms. Journal of Machine Learning Research, 2011, 12: 2211–2268
Bucak S S, Jin R, Jain A K. Multiple kernel learning for visual object recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(7): 1354–1369
Bach F R, Lanckriet G R, Jordan M I. Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the 21th International Conference on Machine Learning. 2004, 6
Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B. Large scale multiple kernel learning. Journal of Machine Learning Research, 2006, 7: 1531–1565
Papadopoulos A. Metric spaces, convexity and nonpositive curvature. European Mathematical Society, 2005
Rapcsak T. Geodesic convexity in nonlinear optimization. Journal of Optimization Theory and Applications, 1991, 69(1): 169–183
Zakeri P, Jeuris B, Vandebril R, Moreau Y. Protein fold recognition using geometric kernel data fusion. Bioinformatics, 2014, 30(13): 1850–1857
Jeuris B, Vandebril R, Vandereycken B. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 2012, 39(ARTICLE): 379–402
Wang Y C, Zhang C H, Deng N Y, Wang Y. Kernel-based data fusion improves the drug-protein interaction prediction. Computational Biology and Chemistry, 2011, 35(6): 353–362
Yu G, Rangwala H, Domeniconi C, Zhang G, Zhang Z. Protein Function Prediction by Integrating Multiple Kernels. In: Proceedings of Twenty-Third International Joint Conference on Artificial Intelligence. 2013
Yu G, Rangwala H, Domeniconi C, Zhang G, Zhang Z. Predicting Protein Function using Multiple Kernels. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015, 12(1): 219–233
Yu G, Fu G, Wang J, Zhu H. Predicting Protein Function via Semantic Integration of Multiple Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016, 13(2): 220–232
Cortes Corinna M M, Afshin Rostamizadeh. Learning non-linear combinations of kernels. Advances in neural information processing systems, 2009
Kloft M, Brefeld U, Sonnenburg S, Zien A. Non-sparse regularization and efficient training with multiple kernels. 2010, arXiv preprint arXiv: 1003.0079 2010, 186: 189–190
Eli M, Kisilev P. Nuc-mkl: A convex approach to non linear multiple kernel learning. Artificial Intelligence and Statistics, 2016, 610–619
Wilson C M, Li K, Yu X, Kuan P F, Wang X. Multiple-kernel learning for genomic data mining and prediction. BMC Bioinformatics, 2019, 20(1): 1–7
Sakakibara Y, Popendorf K, Ogawa N, Asai K, Sato K. Stem kernels for RNA sequence analyses. Journal of Bioinformatics and Computational Biology, 2007, 5(05): 1103–1122
Navarin N, Costa F. An efficient graph kernel method for non-coding RNA functional prediction. Bioinformatics, 2017, 33(17): 2642–2650
Brayet J, Zehraoui F, Laurence J L, Israeli D, Tahi F. Towards a piRNA prediction using multiple kernel fusion and support vector machine. Bioinformatics, 2014, 30(17): i364–i370
Chang C C, Lin C J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3): 27
Costello J C, Heiser L M, Georgii E, Gönen M, Menden M P, Wang N J, Bansal M, Hintsanen P, Khan S A, Mpindi J P. A community effort to assess and improve drug sensitivity prediction algorithms. Nature Biotechnology, 2014, 32(12): 1202
Cichonska A, Ravikumar B, Parri E, Timonen S, Pahikkala T, Airola A, Wennerberg K, Rousu J, Aittokallio T. Computational-experimental approach to drug-target interaction mapping: A case study on kinase inhibitors. PLoS Computational Biology, 2017, 13(8): e1005678
Gönen M. Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics, 2012, 28(18): 2304–2310
Gönen M, Khan S, Kaski S. Kernelized Bayesian matrix factorization. International Conference on Machine Learning, 2013, 864–872
Nascimento A C, Prudêncio R B, Costa I G. A multiple kernel learning algorithm for drug-target interaction prediction. BMC Bioinformatics, 2016, 17(1): 46
Kloft M, Brefeld U, Laskov P, Müller K-R, Zien A, Sonnenburg S. Efficient and accurate lp-norm multiple kernel learning. Advances in Neural Information Processing Systems, 2009, 997–1005
Sun Z, Ampornpunt N, Varma M, Vishwanathan S. Multiple kernel learning and the SMO algorithm. Advances in Neural Information Processing Systems, 2010, 2361–2369
Rakotomamonjy A, Bach F R, Canu S, Grandvalet Y. SimpleMKL. Journal of Machine Learning Research, 2008, 9: 2491–2521
Bucak S, Jin R, Jain A. Multi-label multiple kernel learning by stochastic approximation: Application to visual object recognition. Advances in Neural Information Processing Systems, 2010, 325–333
Gönen M, Alpaydin E. Localized multiple kernel learning. In: Proceedings of the 25th International Conference on Machine Learning. 2008, 352–359
Mostafavi S, Morris Q. Fast integration of heterogeneous data sources for predict-ing gene function with limited annotation. Bioinformatics, 2010, 26(14): 1759–1765
Kong X, Ng M K, Zhou Z-H. Trans-ductive multi-label learning via label set propagation. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(3): 704–719
Tang L, Chen J, Ye J. On multiplekernel learning with multiple labels. In: Proceedings of Twenty-First International Joint Conference on Artificial Intelligence. 2009, 1255–1260
Bucak S, Jin R, Jain A. Multi-label multiple kernel learning by stochastic approximation:Application to visual object recognition. Advances in Neural Information Processing Systems, 2010, 24: 325–333
Acknowledgements
The work was supported by the National Natural Science Foundation of China (Grant Nos. 61922020, 61771331, 61902259).
Author information
Authors and Affiliations
Corresponding author
Additional information
Ren Qi received the MS degrees in computer technology from the College of Intelligence and Computing, Tianjin University, China in 2018 and she is a doctoral student in College of Intelligence and Computing from Tianjin University, China in 2018.
Her research interests include machine learning, metric learning, and bioinformatics. Several related works have been published by IJCAI, Journal of software.
Fei Guo received the BSc and the PhD degrees in School of Computer Science and Technology, Shandong University, China in 2007 and 2012, respectively. She is currently an assistant professor in the College of Intelligence and Computing, Tianjin University, China.
Her research interests include bioinformatics, algorithms, machine learning. Several related works have been published by Briefings in Bioinformatics, Bioinformatics, IEEE/ACM Transactions on Computational Biology and Bioinformatics, etc. She is the PC member of APBC-2016, APBC-2017, and reviewer of BMC Bioinformatics, BMC System Biology, IEEE TCBB, PLoS ONE, etc.
Quan Zou (M’13-SM’17) received the BSc, MSc and the PhD degrees in computer science from Harbin Institute of Technology, China in 2004, 2007 and 2009, respectively. He worked in Xiamen University and Tianjin University, China, from 2009 to 2018 as an assistant professor, associate professor and professor. He is currently a professor in the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, China.
His research is in the areas of bioinformatics, machine learning and parallel computing. Several related works have been published by Science, Briefings in Bioinformatics, Bioinformatics, IEEE/ACM Transactions on Computational Biology and Bioinformatics, etc. Google scholar showed that his more than 100 papers have been cited more than 5000 times. He is the editor-in-chief of Current Bioinformatics, associate editor of IEEE Access, and the editor board member of Computers in Biology and Medicine, Genes, Scientific Reports, etc. He was selected as one of the Clarivate Analytics Highly Cited Researchers in 2018 and 2021.
Electronic Supplementary Material
Rights and permissions
About this article
Cite this article
Qi, R., Guo, F. & Zou, Q. String kernels construction and fusion: a survey with bioinformatics application. Front. Comput. Sci. 16, 166904 (2022). https://doi.org/10.1007/s11704-021-1118-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11704-021-1118-x