DOI: 10.1145/3534678.3542606

New Frontiers of Scientific Text Mining: Tasks, Data, and Tools

Published: 14 August 2022

Abstract

Exploring the vast and rapidly growing body of scientific text is highly beneficial for real-world scientific discovery. However, scientific text mining is particularly challenging due to the lack of specialized domain knowledge in natural-language contexts, the complex sentence structures of scientific writing, and the multi-modal representations of scientific knowledge. This tutorial presents a comprehensive overview of recent research and development in scientific text mining, focusing on the biomedical and chemistry domains. First, we introduce the motivation and unique challenges of scientific text mining. Then we discuss a set of methods for effective scientific information extraction, such as named entity recognition, relation extraction, and event extraction. We also introduce real-world applications such as textual evidence retrieval, scientific topic contrasting for drug discovery, and molecule representation learning for reaction prediction. Finally, we conclude the tutorial by demonstrating, on real-world datasets (COVID-19 and organic chemistry literature), how this information can be extracted and retrieved, and how it can assist further scientific discovery. We also discuss emerging research problems and future directions for scientific text mining.

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2022
5033 pages
ISBN: 9781450393850
DOI: 10.1145/3534678
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. information extraction
  2. scientific discovery
  3. scientific text mining

Qualifiers

  • Abstract

Conference

KDD '22
Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
