research-article

Learning to extract chemical names based on random text generation and incomplete dictionary

Authors:
Su Yan, IBM Almaden Research Lab, San Jose, CA
W. Scott Spangler, IBM Almaden Research Lab, San Jose, CA
Ying Chen, IBM Almaden Research Lab, San Jose, CA

BIOKDD '12: Proceedings of the 11th International Workshop on Data Mining in Bioinformatics, August 2012, Pages 21–25
https://doi.org/10.1145/2350176.2350180

Published: 12 August 2012

ABSTRACT

Automatically extracting chemical names from text has significant value for biomedical and life science research. A major barrier to this task is the difficulty of obtaining a sizable, good-quality training set with which to train a reliable entity extraction model. Leveraging well-studied random text generation techniques based on formal grammars, we explore the idea of automatically creating training sets for the task of chemical named entity extraction. Assuming the availability of an incomplete list of chemical names, we are able to generate well-controlled, random, yet realistic chemical-like training documents. Compared to state-of-the-art models learned from manually labeled data and rule-based systems using real-world data, our solutions show comparable or better results with the least human effort.
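
The general idea in the abstract (expanding a formal grammar at random and filling chemical-name slots from an incomplete dictionary, so that every generated token arrives with its label) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy grammar `GRAMMAR`, the four-entry dictionary `CHEM_DICT`, the BIO tag names, and the function names `generate` and `make_corpus` are all hypothetical. The authors' actual grammar, dictionary, and labeling scheme are not described on this page.

```python
import random

# Hypothetical toy grammar (not from the paper). Keys are non-terminals;
# any symbol that is not a key is emitted as a literal token.
# "CHEM" is a special slot filled from the dictionary below.
GRAMMAR = {
    "S": [["NP", "VP", "."]],
    "NP": [
        ["the", "compound", "CHEM"],
        ["CHEM"],
        ["a", "solution", "of", "CHEM"],
    ],
    "VP": [
        ["was", "dissolved", "in", "CHEM"],
        ["reacts", "with", "CHEM"],
        ["was", "heated"],
    ],
}

# Hypothetical incomplete dictionary of chemical names.
CHEM_DICT = ["ethanol", "sodium chloride", "acetic acid", "2-methylpropan-1-ol"]


def generate(symbol, rng):
    """Randomly expand `symbol` into a list of (token, BIO-tag) pairs."""
    if symbol == "CHEM":
        # Dictionary names are labeled as they are inserted, so the
        # training labels come for free.
        tokens = rng.choice(CHEM_DICT).split()
        return [(tok, "B-CHEM" if i == 0 else "I-CHEM")
                for i, tok in enumerate(tokens)]
    if symbol not in GRAMMAR:
        # Terminal symbol: an ordinary word outside any chemical name.
        return [(symbol, "O")]
    pairs = []
    for child in rng.choice(GRAMMAR[symbol]):
        pairs.extend(generate(child, rng))
    return pairs


def make_corpus(n_sentences, seed=0):
    """Generate `n_sentences` labeled sentences, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [generate("S", rng) for _ in range(n_sentences)]


if __name__ == "__main__":
    for sentence in make_corpus(3):
        print(" ".join(f"{tok}/{tag}" for tok, tag in sentence))
```

Token and tag pairs of this kind are the usual input for sequence labelers such as conditional random fields. A realistic setup would need a far richer grammar and some control over word frequencies, which this sketch does not attempt.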


Index Terms

1. Computing methodologies
  1. Artificial intelligence
2. Hardware
  1. Power and energy
    1. Power estimation and optimization

Published in
BIOKDD '12: Proceedings of the 11th International Workshop on Data Mining in Bioinformatics
August 2012
38 pages
ISBN: 9781450315524
DOI: 10.1145/2350176
General Chairs:
Jake Chen, Indiana University-Purdue University Indianapolis, Indianapolis, IN
Mohammed J. Zaki, Rensselaer Polytechnic Institute, Troy, NY
Program Chairs:
Tamer Kahveci, University of Florida, Gainesville, FL
Saeed Salem, North Dakota State University, Fargo, ND
Mehmet Koyutürk, Case Western Reserve University, Cleveland, OH
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 7 of 16 submissions, 44%