DOI: 10.1145/3674658.3674674
research-article

ESM-MHC: An Improved Predictor of MHC Using ESM Protein Language Model

Published: 18 November 2024

Abstract

The major histocompatibility complex (MHC) is a set of genes located on chromosome 6 in humans and chromosome 17 in mice. It plays a key role in autoimmune diseases, the success of organ transplants, and antigen presentation in tumor immunotherapy, so accurately identifying MHC is essential for understanding its role in immunology. In this study, we propose ESM-MHC, a machine learning-based model for predicting MHC. First, the ESM protein language model is used to encode protein sequences as 1280-dimensional feature vectors. Second, principal component analysis (PCA) is applied to these features, and the features whose distributions differ notably between positive and negative samples are selected as the final feature set. Finally, a multi-layer perceptron (MLP) is trained on the final feature set to obtain the identification model. The model achieved an accuracy of 95.97% in 10-fold cross-validation and 96% on the test set, outperforming existing methods.
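
The abstract does not say how a "notably different distribution" between positive and negative samples is measured. The sketch below illustrates one possible criterion, under stated assumptions: it scores each feature column by the absolute difference of class means divided by the pooled standard deviation, and keeps the columns above a cutoff. The function names, the cutoff of 1.0, and the toy data (a stand-in for PCA-transformed ESM features) are hypothetical and are not taken from the paper.

```python
import random
import statistics


def separation_score(pos, neg):
    """Absolute difference of class means divided by the pooled standard deviation."""
    pooled_var = (statistics.pvariance(pos) + statistics.pvariance(neg)) / 2
    if pooled_var == 0:
        return 0.0
    return abs(statistics.fmean(pos) - statistics.fmean(neg)) / pooled_var ** 0.5


def select_features(X, y, threshold=1.0):
    """Return indices of feature columns whose separation score exceeds threshold."""
    n_features = len(X[0])
    selected = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        if separation_score(pos, neg) > threshold:
            selected.append(j)
    return selected


# Toy stand-in for transformed features: 200 samples, 5 columns.
# Only columns 0 and 3 are shifted for the positive class (assumed data).
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = []
for label in y:
    row = [rng.gauss(0.0, 1.0) for _ in range(5)]
    if label == 1:
        row[0] += 3.0
        row[3] += 3.0
    X.append(row)

selected = select_features(X, y)
print(selected)
```

In a fuller pipeline, the selected columns would form the input to the classifier. To avoid leaking test information into the 10-fold cross-validation estimate, a selection step like this would normally be fitted on the training folds only.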

Published In

ICBBT '24: Proceedings of the 2024 16th International Conference on Bioinformatics and Biomedical Technology
May 2024
279 pages
ISBN: 9798400717666
DOI: 10.1145/3674658

Publisher

Association for Computing Machinery
New York, NY, United States
