Selecting Documents Relevant for Chemistry as a Classification Problem

Zhu, Zhemin; Akhondi, Saber A.; Nandal, Umesh; Doornenbal, Marius; Gregory, Michelle

doi:10.1007/978-3-319-58694-6_31

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10180)

Included in the following conference series:

European Knowledge Acquisition Workshop

Abstract

We present a first version of a system for selecting chemical publications for inclusion in a chemistry information database. This database, Reaxys (https://www.elsevier.com/solutions/reaxys), is a portal for retrieving structured chemistry information from published journals and patents. The task poses three challenges: (i) training and input data are highly imbalanced; (ii) high recall (\(\ge 95\%\)) is desired; and (iii) the data offered for selection is massive in volume yet incomplete. Our system handles the imbalance with undersampling and achieves relatively high recall by using chemical named entities as features. Experiments on a real-world data set of 15,822 documents show that chemical named-entity features boost recall by \(8\%\) over the n-gram features widely used in general document classification. To foster research on this challenging topic, part of the data set compiled for this paper is available on request.
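The abstract does not say which undersampling variant or class ratio the system uses. As a rough illustration only, the sketch below assumes plain random undersampling down to a 1:1 ratio, together with the recall measure that the \(\ge 95\%\) target refers to. The function names `undersample` and `recall` and the toy data are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def undersample(docs, labels, seed=0):
    """Randomly drop documents from larger classes until every class
    has as many documents as the smallest one (assumed 1:1 ratio)."""
    rng = random.Random(seed)
    by_label = {}
    for doc, label in zip(docs, labels):
        by_label.setdefault(label, []).append(doc)
    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        # Sample without replacement so no document is duplicated.
        for doc in rng.sample(group, n_min):
            balanced.append((doc, label))
    rng.shuffle(balanced)
    return balanced


def recall(y_true, y_pred, positive=1):
    """Fraction of truly relevant documents that the classifier selected."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data: 10 relevant (label 1) and 90 irrelevant (label 0) documents.
docs = [f"doc{i}" for i in range(100)]
labels = [1] * 10 + [0] * 90
balanced = undersample(docs, labels)
class_counts = Counter(label for _, label in balanced)
```

Undersampling is applied to the training data only; recall would still be measured on data with the original, imbalanced class distribution.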

Notes

1.
https://www.uima.apache.org.

Author information

Authors and Affiliations

Elsevier, Radarweg 29, 1043 NX, Amsterdam, The Netherlands
Zhemin Zhu, Saber A. Akhondi, Umesh Nandal, Marius Doornenbal & Michelle Gregory

Corresponding author

Correspondence to Zhemin Zhu.

Editor information

Editors and Affiliations

University of Bologna, Bologna, Italy
Paolo Ciancarini
University of Bologna, Bologna, Italy
Francesco Poggi
Stanford University, Palo Alto, California, USA
Matthew Horridge
Lancaster University, Lancaster, United Kingdom
Jun Zhao
Garvan Institute of Medical Research, Darlinghurst, New South Wales, Australia
Tudor Groza
Universidad Politecnica de Madrid, Boadilla del Monte, Spain
Mari Carmen Suarez-Figueroa
The Open University, Milton Keynes, United Kingdom
Mathieu d'Aquin
STLab ISTC-CNR, Rome, Italy
Valentina Presutti

About this paper

Cite this paper

Zhu, Z., Akhondi, S.A., Nandal, U., Doornenbal, M., Gregory, M. (2017). Selecting Documents Relevant for Chemistry as a Classification Problem. In: Ciancarini, P., et al. Knowledge Engineering and Knowledge Management. EKAW 2016. Lecture Notes in Computer Science, vol 10180. Springer, Cham. https://doi.org/10.1007/978-3-319-58694-6_31

DOI: https://doi.org/10.1007/978-3-319-58694-6_31
Published: 20 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58693-9
Online ISBN: 978-3-319-58694-6
eBook Packages: Computer Science, Computer Science (R0)
