Abstract
We present a first version of a system that selects chemical publications for inclusion in a chemistry information database. This database, Reaxys (https://www.elsevier.com/solutions/reaxys), is a portal for retrieving structured chemistry information from published journals and patents. The task poses three challenges: (i) the training and input data are highly imbalanced; (ii) high recall (\(\ge 95\%\)) is desired; and (iii) the data offered for selection is massive in volume yet incomplete. Our system handles the imbalance with undersampling and achieves relatively high recall by using chemical named entities as features. Experiments on a real-world data set of 15,822 documents show that chemical named entity features boost recall by \(8\%\) over the n-gram features widely used in general document classification. To foster research on this challenging topic, part of the data set compiled for this paper is available on request.
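The two ideas the abstract leans on, undersampling the majority class and measuring recall on the relevant class, can be illustrated with a minimal sketch. The abstract does not say which undersampling variant the system uses, so the code below assumes plain random undersampling to a configurable majority-to-minority ratio. The names `undersample`, `recall`, `ratio`, and `seed` are hypothetical, and the toy documents are made up for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def undersample(
    docs: Sequence[str],
    labels: Sequence[int],
    ratio: float = 1.0,
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Randomly drop majority-class items so that majority <= ratio * minority.

    Assumption: labels are binary, 1 = relevant, 0 = not relevant.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Whichever class is smaller is kept in full.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = min(len(majority), int(ratio * len(minority)))
    # Sort the kept indices so the original document order is preserved.
    kept = sorted(minority + rng.sample(majority, keep))
    return [docs[i] for i in kept], [labels[i] for i in kept]


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of truly relevant documents (label 1) that were selected."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


# Toy data (hypothetical): 2 relevant documents among 10.
toy_docs = [f"doc{i}" for i in range(10)]
toy_labels = [1, 1] + [0] * 8
bal_docs, bal_labels = undersample(toy_docs, toy_labels)
```

With the default `ratio=1.0`, the balanced toy set keeps both relevant documents and two randomly chosen irrelevant ones. In a setting like the one described, undersampling would be applied to the training data only, while recall would be measured on data with the original class distribution.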
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zhu, Z., Akhondi, S.A., Nandal, U., Doornenbal, M., Gregory, M. (2017). Selecting Documents Relevant for Chemistry as a Classification Problem. In: Ciancarini, P., et al. Knowledge Engineering and Knowledge Management. EKAW 2016. Lecture Notes in Computer Science, vol 10180. Springer, Cham. https://doi.org/10.1007/978-3-319-58694-6_31
DOI: https://doi.org/10.1007/978-3-319-58694-6_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58693-9
Online ISBN: 978-3-319-58694-6
eBook Packages: Computer Science (R0)