DOI: 10.1145/3637528.3671886
KDD Conference Proceedings · Research article

Advancing Molecule Invariant Representation via Privileged Substructure Identification

Published: 24 August 2024

Abstract

Graph neural networks (GNNs) have revolutionized molecule representation learning by modeling molecules as graphs, with atoms as nodes and chemical bonds as edges. Despite this progress, they struggle in out-of-distribution scenarios, such as changes in the size or scaffold of molecules with identical properties. Some studies attempt to mitigate this issue through graph invariant learning, which penalizes prediction variance across environments in order to learn invariant representations. In the molecular domain, however, core functional groups that form privileged substructures dominate molecular properties and remain invariant across distribution shifts. This highlights the need to integrate this prior knowledge and to ensure that the environment split is compatible with molecule invariant learning. To bridge this gap, we propose a novel framework named MILI. Specifically, we first formalize molecule invariant learning based on privileged substructure identification and introduce a substructure invariance constraint. Building on this foundation, we theoretically establish two criteria for environment splits conducive to molecule invariant learning. Guided by these criteria, we develop a dual-head graph neural network: a shared identifier locates privileged substructures, while the environment and task heads generate predictions from the variant and privileged substructures, respectively. Through the interaction of the two heads, environments are split and optimized to satisfy our criteria. By this design, MILI guarantees that molecule invariant learning and environment splitting mutually reinforce each other, supported by both theoretical analysis and the network architecture. Extensive experiments on eight benchmarks validate the effectiveness of MILI against state-of-the-art baselines.
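The dual-head design described in the abstract can be sketched in code. The following is a minimal, hypothetical illustration, not the authors' implementation: the class name `DualHeadSketch`, the single mean-aggregation layer, and all weight shapes are assumptions. The point is only the routing: a shared identifier produces a soft per-atom substructure mask, the task head reads the masked (privileged) part of the graph, and the environment head reads the variant remainder.

```python
import numpy as np

rng = np.random.default_rng(0)


def mean_aggregate(adj, h):
    """One round of mean-neighbor message passing (with self-loops)."""
    a = adj + np.eye(adj.shape[0])
    return a @ h / a.sum(axis=1, keepdims=True)


def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()


class DualHeadSketch:
    """Hypothetical sketch of a dual-head GNN: a shared identifier scores
    each atom's membership in a privileged substructure; the task head
    reads the masked (privileged) part, the environment head the rest."""

    def __init__(self, dim, n_classes, n_envs):
        self.w_id = rng.normal(scale=0.1, size=(dim, 1))            # shared identifier
        self.w_task = rng.normal(scale=0.1, size=(dim, n_classes))  # task head
        self.w_env = rng.normal(scale=0.1, size=(dim, n_envs))      # environment head

    def forward(self, adj, x):
        h = mean_aggregate(adj, x)                      # node embeddings
        mask = 1.0 / (1.0 + np.exp(-(h @ self.w_id)))   # soft substructure mask in (0, 1)
        z_priv = (mask * h).mean(axis=0)                # privileged-substructure readout
        z_var = ((1.0 - mask) * h).mean(axis=0)         # variant-substructure readout
        y_task = softmax(z_priv @ self.w_task)          # property prediction
        y_env = softmax(z_var @ self.w_env)             # environment assignment
        return mask, y_task, y_env
```

In a full training loop, the environment head's assignments would define the environment split, and an invariance penalty (for example, the variance of the task risk across the inferred environments) would be added to the task loss so that the split and the invariant representation improve together.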



Published In

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2024
6901 pages
ISBN:9798400704901
DOI:10.1145/3637528

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. environment split
  2. molecule invariant learning
  3. molecule representation learning
  4. privileged substructure identification


Conference

KDD '24

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


