ABSTRACT
Automatically extracting chemical names from text has significant value to biomedical and life science research. A major barrier in this task is the difficulty of obtaining a sizable, good-quality training set from which to learn a reliable entity extraction model. Leveraging well-studied random text generation techniques based on formal grammars, we explore the idea of automatically creating training sets for the task of chemical named entity extraction. Assuming the availability of an incomplete list of chemical names, we are able to generate well-controlled, random, yet realistic chemical-like training documents. Compared to state-of-the-art models learned from manually labeled data and to rule-based systems, using real-world data, our solutions show comparable or better results with the least human effort.
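To make the idea concrete, the following is a minimal sketch of grammar-based training-set generation, not the method as implemented in the paper. It expands a small context-free grammar at random and fills a `CHEM` slot from an incomplete dictionary, so every generated token comes with a label for free. The grammar rules, the dictionary entries, the BIO labeling scheme, and the names `GRAMMAR`, `CHEMICALS`, `expand`, and `generate_corpus` are all assumptions made for illustration.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to a list of alternative
# expansions. Symbols that are not keys (and not "CHEM") are terminals.
GRAMMAR = {
    "S": [["NP", "VP", "."]],
    "NP": [["the", "N"], ["CHEM"], ["a", "solution", "of", "CHEM"]],
    "VP": [["V", "NP"], ["V", "NP", "with", "CHEM"]],
    "N": [["mixture"], ["sample"], ["catalyst"]],
    "V": [["reacts", "with"], ["was", "added", "to"], ["dissolves"]],
}

# Hypothetical incomplete dictionary of chemical names.
CHEMICALS = ["sodium chloride", "ethanol", "acetic acid", "benzene"]


def expand(symbol, rng):
    """Randomly expand a grammar symbol into a list of (token, label) pairs."""
    if symbol == "CHEM":
        # Fill the slot from the dictionary and label it with BIO tags.
        words = rng.choice(CHEMICALS).split()
        return [(w, "B-CHEM" if i == 0 else "I-CHEM") for i, w in enumerate(words)]
    if symbol not in GRAMMAR:
        # Terminal word outside any chemical name.
        return [(symbol, "O")]
    pairs = []
    for child in rng.choice(GRAMMAR[symbol]):
        pairs.extend(expand(child, rng))
    return pairs


def generate_corpus(n_sentences, seed=0):
    """Generate n_sentences labeled sentences, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [expand("S", rng) for _ in range(n_sentences)]


if __name__ == "__main__":
    for sentence in generate_corpus(3, seed=42):
        print(" ".join(f"{token}/{label}" for token, label in sentence))
```

The labeled sentences produced this way could then be fed to any sequence labeler in place of manually annotated text. A realistic setup would need a much richer grammar and dictionary than this toy one.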
- Learning to extract chemical names based on random text generation and incomplete dictionary