DOI: 10.1145/3097983.3097984

Functional Annotation of Human Protein Coding Isoforms via Non-convex Multi-Instance Learning

Published: 04 August 2017

Editorial Notes

The authors have requested minor, non-substantive changes to the VoR and, in accordance with ACM policies, a Corrected VoR was published on August 30, 2022. For reference purposes the VoR may still be accessed via the Supplemental Material section on this page.

Abstract

Functional annotation of human genes is fundamentally important for understanding the molecular basis of various genetic diseases. A major challenge in determining the functions of human genes lies in the functional diversity of proteins: a gene can perform different functions because it may have multiple protein coding isoforms (PCIs). Differentiating the functions of PCIs can therefore significantly deepen our understanding of gene function. However, because isoform-level gold standards (ground-truth annotations) are lacking, many existing functional annotation approaches operate at the gene level. In this paper, we propose a novel approach that differentiates the functions of PCIs by integrating sparse simplex projection (a non-convex sparsity-inducing regularizer) with the framework of multi-instance learning (MIL). Specifically, we label the genes that are annotated with the function under consideration as positive bags and the genes without the function as negative bags. Then, through sparse projections onto the simplex, we learn a mapping that embeds the original bag space into a discriminative feature space. Our framework is flexible enough to incorporate various smooth and non-smooth loss functions, such as the logistic loss and the hinge loss. To solve the resulting highly nontrivial non-convex and non-smooth optimization problem, we further develop an efficient block coordinate descent algorithm. Extensive experiments on human genome data demonstrate that the proposed approaches significantly outperform state-of-the-art methods in both the accuracy of functional annotation of human PCIs and computational efficiency.
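The abstract does not say how the sparse projection onto the simplex is computed. As a rough illustration only, the sketch below assumes a greedy select-then-project scheme: keep the k largest entries of a vector, apply the standard sort-based Euclidean projection onto the simplex to those entries, and set the rest to zero. The function names `project_simplex` and `sparse_project_simplex`, the sparsity level `k`, and the `radius` parameter are hypothetical and not taken from the paper, and the authors' implementation may differ.

```python
import numpy as np


def project_simplex(v, radius=1.0):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = radius}.

    Uses the standard sort-and-threshold procedure.
    """
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                      # entries in descending order
    css = np.cumsum(u) - radius               # shifted cumulative sums
    idx = np.arange(1, len(v) + 1)
    # Largest position whose entry stays positive after thresholding.
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    tau = css[rho] / (rho + 1.0)              # threshold to subtract
    return np.maximum(v - tau, 0.0)


def sparse_project_simplex(v, k, radius=1.0):
    """Assumed greedy scheme: keep the k largest entries of v,
    project them onto the simplex, and zero out everything else."""
    v = np.asarray(v, dtype=float)
    support = np.argsort(v)[::-1][:k]         # indices of the k largest entries
    w = np.zeros_like(v)
    w[support] = project_simplex(v[support], radius)
    return w


if __name__ == "__main__":
    # Hypothetical isoform scores within one gene (bag); keep at most 2.
    scores = np.array([0.5, 0.1, 0.9, -0.2])
    print(sparse_project_simplex(scores, k=2))
```

In an MIL reading of this sketch, the nonzero weights would mark the few isoforms in a gene that are treated as carrying the function, while the simplex constraint keeps the weights non-negative and summing to one.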

Supplementary Material

3097984-VoR (3097984-vor.pdf)
Version of Record for "Functional Annotation of Human Protein Coding Isoforms via Non-convex Multi-Instance Learning" by Luo et al., Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17).
MP4 File (luo_human_protein_coding.mp4)


Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. alternative splicing
  2. human pcis
  3. key instance detection
  4. multiple instance learning
  5. non-convex problem

Qualifiers

  • Research-article

Funding Sources

  • NSF China
  • NSF
  • NIH

Conference

KDD '17
Acceptance Rates

KDD '17 Paper Acceptance Rate 64 of 748 submissions, 9%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (Last 12 months)97
  • Downloads (Last 6 weeks)9
Reflects downloads up to 03 Mar 2025

Cited By

  • (2025) Imbalanced multi-instance multi-label learning via tensor product-based semantic fusion. Frontiers of Computer Science: Selected Publications from Chinese Universities 19:8. DOI: 10.1007/s11704-024-40192-5. Online publication date: 1-Aug-2025
  • (2024) Multiple-Instance Learning from Pairwise Comparison Bags. ACM Transactions on Intelligent Systems and Technology 15:6 (1-22). DOI: 10.1145/3696460. Online publication date: 29-Sep-2024
  • (2024) Imbalanced Multi-instance Multi-label Learning via Coding Ensemble and Adaptive Thresholds. Proceedings of the 32nd ACM International Conference on Multimedia (5413-5422). DOI: 10.1145/3664647.3680911. Online publication date: 28-Oct-2024
  • (2022) Multiview Multi-Instance Multilabel Active Learning. IEEE Transactions on Neural Networks and Learning Systems 33:9 (4311-4321). DOI: 10.1109/TNNLS.2021.3056436. Online publication date: Sep-2022
  • (2022) Tissue Specificity Based Isoform Function Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 19:5 (3048-3059). DOI: 10.1109/TCBB.2021.3093167. Online publication date: 1-Sep-2022
  • (2021) FINER: enhancing the prediction of tissue-specific functions of isoforms by refining isoform interaction networks. NAR Genomics and Bioinformatics 3:2. DOI: 10.1093/nargab/lqab057. Online publication date: 22-Jun-2021
  • (2021) Imbalance deep multi-instance learning for predicting isoform-isoform interactions. International Journal of Intelligent Systems 36:6 (2797-2824). DOI: 10.1002/int.22402. Online publication date: 25-Feb-2021
  • (2020) Computational Methods for Predicting Functions at the mRNA Isoform Level. International Journal of Molecular Sciences 21:16 (5686). DOI: 10.3390/ijms21165686. Online publication date: 8-Aug-2020
  • (2020) Multi-typed Objects Multi-view Multi-instance Multi-label Learning. 2020 IEEE International Conference on Data Mining (ICDM) (1370-1375). DOI: 10.1109/ICDM50108.2020.00179. Online publication date: Nov-2020
  • (2019) DIFFUSE: predicting isoform functions from sequences and expression profiles via deep learning. Bioinformatics 35:14 (i284-i294). DOI: 10.1093/bioinformatics/btz367. Online publication date: 5-Jul-2019
