Abstract
We attempt to establish geometrical methods for amino acid sequences. To measure the similarities of these sequences, a kernel on strings is defined using only the sequence structure and a good amino acid substitution matrix (e.g. BLOSUM62). The kernel is used in learning machines to predict binding affinities of peptides to human leukocyte antigen DR (HLA-DR) molecules. On both fixed allele (Nielsen and Lund in BMC Bioinform. 10:296, 2009) and pan-allele (Nielsen et al. in Immunome Res. 6(1):9, 2010) benchmark databases, our algorithm achieves state-of-the-art performance. The kernel is also used to define a distance on an HLA-DR allele set, and a clustering analysis based on this distance precisely recovers the serotype classifications assigned by WHO (Holdsworth et al. in Tissue Antigens 73(2):95–170, 2009; Marsh et al. in Tissue Antigens 75(4):291–455, 2010). These results suggest that our kernel relates the sequence structure of both peptides and HLA-DR molecules well to their biological functions, and that it offers a simple, powerful and promising methodology for immunology and amino acid sequence studies.
Notes
Allele: an alternative form of a gene that occurs at a specified chromosomal position (locus) [22].
We have found from a number of different experiments that “they do not cluster”. (Perhaps the geometric phenomenon here is in the higher dimensional scaled topology, i.e. the Betti numbers \(\beta_{i} > 0\) for \(i > 0\) [4].)
Both the data set and the 5-fold partition are available at http://www.cbs.dtu.dk/suppl/immunology/NetMHCII-2.0.php.
Both the data set and the 5-part partition are available at http://www.cbs.dtu.dk/suppl/immunology/NetMHCIIpan-2.0.
The data set was downloaded from http://www.immuneepitope.org/list_page.php?list_type=mhc&measured_response=&total_rows=64797&queryType=true, on May 23, 2012.
The code is published in http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?netMHCIIpan.
Another way of measuring distance between clusters is the Hausdorff distance.
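For two finite clusters A and B and a distance d on their elements, the Hausdorff distance is the larger of the two directed quantities: the greatest distance from a point of A to its nearest point in B, and the same with A and B exchanged. The following is a minimal sketch of that definition. The function names are hypothetical, and the Hamming distance on equal-length strings is only a stand-in for illustration; it is not the kernel-based distance used in the paper.

```python
def hausdorff_distance(cluster_a, cluster_b, dist):
    """Hausdorff distance between two finite, non-empty sets under `dist`."""
    if not cluster_a or not cluster_b:
        raise ValueError("both clusters must be non-empty")

    def directed(src, dst):
        # Largest distance from a point of src to its nearest point in dst.
        return max(min(dist(x, y) for y in dst) for x in src)

    # Symmetrize by taking the larger of the two directed distances.
    return max(directed(cluster_a, cluster_b), directed(cluster_b, cluster_a))


def hamming(s, t):
    """Stand-in distance (assumption): mismatch count of equal-length strings."""
    if len(s) != len(t):
        raise ValueError("hamming() expects strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s, t))


# Toy clusters of made-up strings, for illustration only.
cluster_a = ["AAAA", "AAAT"]
cluster_b = ["AATT", "TTTT"]
print(hausdorff_distance(cluster_a, cluster_b, hamming))
```

Any symmetric, non-negative `dist` can be passed in, so a distance derived from a kernel could replace `hamming` without changing `hausdorff_distance`.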
References
M. Andreatta, Discovering sequence motifs in quantitative and qualitative peptide data. Ph.D. thesis, Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2012.
N. Aronszajn, Theory of reproducing kernels, Trans. Am. Math. Soc. 68, 337–404 (1950).
A. Baas, X.J. Gao, G. Chelvanayagam, Peptide binding motifs and specificities for HLA-DQ molecules, Immunogenetics 50, 8–15 (1999).
L. Bartholdi, T. Schick, N. Smale, S. Smale, A.W. Baker, Hodge theory on metric spaces, Found. Comput. Math. 12(1), 1–48 (2012).
E.E. Bittar, N. Bittar (eds.), Principles of Medical Biology: Molecular and Cellular Pharmacology (JAI Press, London, 1997).
F.A. Castelli, C. Buhot, A. Sanson, H. Zarour, S. Pouvelle-Moratille, C. Nonn, H. Gahery-Ségard, J.-G. Guillet, A. Ménez, B. Georges, B. Maillère, HLA-DP4, the most frequent HLA II molecule, defines a new supertype of peptide-binding specificity, J. Immunol. 169, 6928–6934 (2002).
F. Cucker, D.X. Zhou, Learning Theory: An Approximation Theory Viewpoint (Cambridge University Press, Cambridge, 2007).
W.H.E. Day, H. Edelsbrunner, Efficient algorithms for agglomerative hierarchical clustering methods, J. Classif. 1(1), 7–24 (1984).
I.A. Doytchinova, D.R. Flower, In silico identification of supertypes for class II MHCs, J. Immunol. 174(11), 7085–7095 (2005).
Y. El-Manzalawy, D. Dobbs, V. Honavar, On evaluating MHC-II binding peptide prediction methods, PLoS ONE 3, e3268 (2008).
M. Galan, E. Guivier, G. Caraux, N. Charbonnel, J.-F. Cosson, A 454 multiplex sequencing method for rapid and reliable genotyping of highly polymorphic genes in large-scale studies, BMC Genom. 11, 296 (2010).
G.H. Golub, M. Heath, G. Wahba, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21, 215–224 (1979).
D. Graur, W.-H. Li, Fundamentals of Molecular Evolution (Sinauer Associates, Sunderland, 2000).
W.W. Grody, R.M. Nakamura, F.L. Kiechle, C. Strom, Molecular Diagnostics: Techniques and Applications for the Clinical Laboratory (Academic Press, San Diego, 2010).
D. Haussler, Convolution kernels on discrete structures. Tech. report, 1999.
S. Henikoff, J.G. Henikoff, Amino acid substitution matrices from protein blocks, Proc. Natl. Acad. Sci. USA 89, 10915–10919 (1992).
R. Holdsworth, C.K. Hurley, S.G. Marsh, M. Lau, H.J. Noreen, J.H. Kempenich, M. Setterholm, M. Maiers, The HLA dictionary 2008: a summary of HLA-A, -B, -C, -DRB1/3/4/5, and -DQB1 alleles and their association with serologically defined HLA-A, -B, -C, -DR, and -DQ antigens, Tissue Antigens 73(2), 95–170 (2009).
R.A. Horn, C.R. Johnson, Topics in Matrix Analysis (Cambridge University Press, Cambridge, 1994).
L. Jacob, J.-P. Vert, Efficient peptide–MHC-I binding prediction for alleles with few known binders, Bioinformatics 24(3), 358–366 (2008).
C.A. Janeway, P. Travers, M. Walport, M.J. Shlomchik, Immunobiology, 5th edn. (Garland Science, New York, 2001).
N. Jojic, M. Reyes-Gomez, D. Heckerman, C. Kadie, O. Schueler-Furman, Learning MHC I–peptide binding, Bioinformatics 22(14), e227–e235 (2006).
T.J. Kindt, R.A. Goldsby, B.A. Osborne, J. Kuby, Kuby Immunology (Freeman, New York, 2007).
C. Leslie, E. Eskin, W.S. Noble, The spectrum kernel: a string kernel for SVM protein classification, in Pacific Symposium on Biocomputing, vol. 7 (2002), pp. 566–575.
H.H. Lin, G.L. Zhang, S. Tongchusak, E.L. Reinherz, V. Brusic, Evaluation of MHC-II peptide binding prediction servers: applications for vaccine research, BMC Bioinform. 9(Suppl 12), S22 (2008).
O. Lund, M. Nielsen, C. Kesmir, A.G. Petersen, C. Lundegaard, P. Worning, C. Sylvester-Hvid, K. Lamberth, G. Røder, S. Justesen, S. Buus, S. Brunak, Definition of supertypes for HLA molecules using clustering of specificity matrices, Immunogenetics 55(12), 797–810 (2004).
O. Lund, M. Nielsen, C. Lundegaard, C. Keşmir, S. Brunak, Immunological Bioinformatics (MIT Press, Cambridge, 2005).
M. Maiers, G.M. Schreuder, M. Lau, S.G. Marsh, M. Fernández-Viña, H. Noreen, M. Setterholm, C.K. Hurley, Use of a neural network to assign serologic specificities to HLA-A, -B and -DRB1 allelic products, Tissue Antigens 62(1), 21–47 (2003).
S.G.E. Marsh, E.D. Albert, W.F. Bodmer, R.E. Bontrop, B. Dupont, H.A. Erlich, M. Fernández-Viña, D.E. Geraghty, R. Holdsworth, C.K. Hurley, M. Lau, K.W. Lee, B. Mach, M. Maiers, W.R. Mayr, C.R. Müller, P. Parham, E.W. Petersdorf, T. Sasazuki, J.L. Strominger, A. Svejgaard, P.I. Terasaki, J.M. Tiercy, J. Trowsdale, Nomenclature for factors of the HLA system, 2010, Tissue Antigens 75(4), 291–455 (2010).
M. Nielsen, O. Lund, NN-align. An artificial neural network-based alignment algorithm for MHC class II peptide binding prediction, BMC Bioinform. 10, 296 (2009).
M. Nielsen, C. Lundegaard, T. Blicher, B. Peters, A. Sette, S. Justesen, S. Buus, O. Lund, Quantitative predictions of peptide binding to any HLA-DR molecule of known sequence: NetMHCIIpan, PLoS Comput. Biol. 4(7), e1000107 (2008).
M. Nielsen, S. Justesen, O. Lund, C. Lundegaard, S. Buus, NetMHCIIpan-2.0: improved pan-specific HLA-DR predictions using a novel concurrent alignment and weight optimization training procedure, Immunome Res. 6(1), 9 (2010).
D. Ou, L.A. Mitchell, A.J. Tingle, A new categorization of HLA DR alleles on a functional basis, Hum. Immunol. 59(10), 665–676 (1998).
J. Robinson, M.J. Waller, P. Parham, N. de Groot, R. Bontrop, L.J. Kennedy, P. Stoehr, S.G. Marsh, IMGT/HLA and IMGT/MHC: sequence databases for the study of the major histocompatibility complex, Nucleic Acids Res. 31(1), 311–314 (2003).
R. Sadiq, S. Tesfamariam, Probability density functions based weights for ordered weighted averaging (OWA) operators: an example of water quality indices, Eur. J. Oper. Res. 182(3), 1350–1368 (2007).
H. Saigo, J.-P. Vert, N. Ueda, T. Akutsu, Protein homology detection using string alignment kernels, Bioinformatics 20(11), 1682–1689 (2004).
H. Saigo, J.P. Vert, T. Akutsu, Optimizing amino acid substitution matrices with a local alignment kernel, BMC Bioinform. 7, 246 (2006).
J. Salomon, D.R. Flower, Predicting class II MHC-peptide binding: a kernel based approach using similarity scores, BMC Bioinform. 7, 501 (2006).
B. Schölkopf, A.J. Smola, Learning with Kernels (MIT Press, Cambridge, 2001).
A. Sette, J. Sidney, Nine major HLA class I supertypes account for the vast preponderance of HLA-A and -B polymorphism, Immunogenetics 50(3–4), 201–212 (1999).
A. Sette, L. Adorini, S.M. Colon, S. Buus, H.M. Grey, Capacity of intact proteins to bind to MHC class II molecules, J. Immunol. 143(4), 1265–1267 (1989).
J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis (Cambridge University Press, Cambridge, 2004).
J. Sidney, H.M. Grey, R.T. Kubo, A. Sette, Practical, biochemical and evolutionary implications of the discovery of HLA class I supermotifs, Immunol. Today 17(6), 261–266 (1996).
J. Sidney, B. Peters, N. Frahm, C. Brander, A. Sette, HLA class I supertypes: a revised and updated classification, BMC Immunol. 9, 1 (2008).
S. Smale, L. Rosasco, J. Bouvrie, A. Caponnetto, T. Poggio, Mathematics of the neural response, Found. Comput. Math. 10(1), 67–91 (2010).
S. Southwood, J. Sidney, A. Kondo, M.F. del Guercio, E. Appella, S. Hoffman, R.T. Kubo, R.W. Chesnut, H.M. Grey, A. Sette, Several common HLA-DR types share largely overlapping peptide binding repertoires, J. Immunol. 160(7), 3363–3373 (1998).
G. Thomson, N. Marthandan, J.A. Hollenbach, S.J. Mack, H.A. Erlich, R.M. Single, M.J. Waller, S.G.E. Marsh, P.A. Guidry, D.R. Karp, R.H. Scheuermann, S.D. Thompson, D.N. Glass, W. Helmberg, Sequence feature variant type (SFVT) analysis of the HLA genetic association in juvenile idiopathic arthritis, in Pacific Symposium on Biocomputing’2010 (2010), pp. 359–370.
J.-P. Vert, H. Saigo, T. Akutsu, Convolution and local alignment kernel, in Kernel Methods in Computational Biology, ed. by B. Schölkopf, K. Tsuda, J.-P. Vert (MIT Press, Cambridge, 2004), pp. 131–154.
G. Wahba, Spline Models for Observational Data (SIAM, Philadelphia, 1990).
L. Wan, G. Reinert, F. Sun, M.S. Waterman, Alignment-free sequence comparison (II): theoretical power of comparison statistics, J. Comput. Biol. 17(11), 1467–1490 (2010).
P. Wang, J. Sidney, C. Dow, B. Mothé, A. Sette, B. Peters, A systematic assessment of MHC class II peptide binding predictions and evaluation of a consensus approach, PLoS Comput. Biol. 4, e1000048 (2008).
C. Widmer, N.C. Toussaint, Y. Altun, O. Kohlbacher, G. Rätsch, Novel machine learning methods for MHC class I binding prediction, in Pattern Recognition in Bioinformatics, vol. 6282, ed. by T.M.H. Dijkstra, E. Tsivtsivadze, E. Marchiori, T. Heskes (Springer, Berlin, 2010), pp. 98–109.
R.R. Yager, On ordered weighted averaging aggregation operators in multicriteria decision making, IEEE Trans. Syst. Man Cybern. 18(1), 183–190 (1988).
J.W. Yewdell, J.R. Bennink, Immunodominance in major histocompatibility complex class I-restricted T lymphocyte responses, Annu. Rev. Immunol. 17, 51–88 (1999).
Acknowledgements
The authors would like to thank Shuaicheng Li for pointing out to us that the portions of DRB alleles that contact peptides can be obtained from the non-aligned DRB amino acid sequences by the use of two markers, “RFL” and “TVQ”. We thank Morten Nielsen for his criticism regarding over-fitting.
We thank Yiming Cheng for his suggestions on the computer code, which were very helpful for speeding up the algorithm for evaluating \(K^{3}\). He also discussed with us the influence on HLA–peptide binding prediction of using different representations of the alleles, and of adjusting the index β in the kernel according to the sequence length. Although these topics are not included in the paper, they have some potential for future work.
We also thank Felipe Cucker for reviewing our draft and making many improvements. We thank Santiago Laplagne for pointing out a bug in the code for Table 2.
The work described in this paper is supported by GRF grant [Project No. 9041544] and [Project No. CityU 103210] and [Project No. 9380050].
Additional information
Communicated by Teresa Krick.
Appendix: The BLOSUM62-2 Matrix
We list the whole BLOSUM62-2 matrix in Table 8. Table 9 explains the amino acids denoted by the capital letters.
From the Introduction, we see that the matrix Q can be recovered from the BLOSUM62-2 matrix once the marginal probability vector p is available. The latter vector is obtained by
\[p = [\text{BLOSUM62-2}]^{-1} v_{1},\]
where \(v_{1} = (1,\ldots,1)\in\mathbb{R}^{20}\) is the vector whose coordinates are all equal to 1. The matrix Q can be obtained precisely from http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/algo/blast/composition_adjustment/matrix_frequency_data.c#L391.
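The recovery of p and Q can be checked numerically on a small example. The sketch below assumes that the relation from the Introduction is \([\text{BLOSUM62-2}](x,y) = Q(x,y)/(p(x)p(y))\) with \(p(x) = \sum_{y} Q(x,y)\); that definition is not restated in this appendix, so it is an assumption here. Under it, the matrix applied to p gives the all-ones vector, so p solves a linear system and Q follows entry by entry. The function names and the 3-letter toy matrix are hypothetical and are not BLOSUM data.

```python
def solve_linear_system(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augmented matrix [a | rhs], copied so the inputs are not modified.
    m = [[float(v) for v in row] + [float(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                for k in range(col, n + 1):
                    m[row][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def recover_p_and_q(b):
    """Assuming b[x][y] = Q[x][y] / (p[x] p[y]), return (p, Q)."""
    n = len(b)
    p = solve_linear_system(b, [1.0] * n)  # b p = (1, ..., 1)
    q = [[b[x][y] * p[x] * p[y] for y in range(n)] for x in range(n)]
    return p, q


# Toy symmetric joint distribution on a 3-letter alphabet (made-up numbers).
q_true = [[0.20, 0.05, 0.05],
          [0.05, 0.30, 0.05],
          [0.05, 0.05, 0.20]]
p_true = [sum(row) for row in q_true]
b_toy = [[q_true[x][y] / (p_true[x] * p_true[y]) for y in range(3)]
         for x in range(3)]

p_rec, q_rec = recover_p_and_q(b_toy)
print(p_rec)
```

The hand-written solver only keeps the sketch free of dependencies; with NumPy available, `numpy.linalg.solve` applied to the matrix and a vector of ones does the same job.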
Cite this article
Shen, WJ., Wong, HS., Xiao, QW. et al. Introduction to the Peptide Binding Problem of Computational Immunology: New Results. Found Comput Math 14, 951–984 (2014). https://doi.org/10.1007/s10208-013-9173-9
DOI: https://doi.org/10.1007/s10208-013-9173-9
Keywords
- String kernel
- Peptide binding prediction
- Reproducing kernel Hilbert space
- Major histocompatibility complex
- HLA DRB allele classification