Extraction and search of chemical formulae in text documents on the web

Published: 08 May 2007

Abstract

Scientists often search the Web for articles related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets only articles that contain the exact keyword string expressing the formula. Exact keyword matching causes two problems in this domain: a) if the scientist searches for CH4 and the article has H4C, the article is not returned, and b) ambiguous searches such as "He" return all documents that mention helium as well as documents in which the pronoun "he" occurs. To remedy these deficiencies, we propose a chemical formula search engine. Building one requires solving the following problems: 1) extracting chemical formulae from text documents, 2) indexing chemical formulae, and 3) designing ranking functions for chemical formulae. Furthermore, query models are introduced for formula search, and for each a scoring scheme based on features of partial formulae is proposed to measure the relevance of chemical formulae to queries. We evaluate algorithms for identifying chemical formulae in documents using classification methods based on Support Vector Machines (SVM) and a probabilistic model based on conditional random fields (CRF). Different methods for SVM and CRF to tune the trade-off between recall and precision for imbalanced data are proposed to improve overall performance. A feature selection method based on frequency and discrimination is used to remove uninformative and redundant features. Experiments show that our approaches to chemical formula extraction work well, especially after trade-off tuning. The results also demonstrate that feature selection can reduce the index size without changing ranked query results much.
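
To make the CH4 versus H4C mismatch concrete, the following is a minimal sketch of order-independent formula matching. It is not the paper's extraction or indexing method: the names `parse_formula` and `canonical_key` are hypothetical, and the sketch assumes flat formulae with no parentheses, charges, or hydrates.

```python
import re
from collections import Counter

# One element symbol (capital letter, optional lowercase letter)
# followed by an optional integer count, e.g. "C", "H4", "Cl2".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Return element counts for a flat formula such as 'CH4'.

    Raises ValueError if any part of the string is not an element token.
    Hypothetical helper for illustration only.
    """
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:
            # A gap means some characters could not be read as an element.
            raise ValueError(f"cannot parse {formula!r} at position {pos}")
        element, number = match.groups()
        counts[element] += int(number) if number else 1
        pos = match.end()
    if not formula or pos != len(formula):
        raise ValueError(f"cannot parse {formula!r} at position {pos}")
    return counts


def canonical_key(formula: str) -> str:
    """Build an order-independent key: elements sorted, counts explicit."""
    counts = parse_formula(formula)
    return "".join(f"{element}{n}" for element, n in sorted(counts.items()))


# Both spellings map to the same key, so an index keyed this way
# would return a document containing "H4C" for the query "CH4".
print(canonical_key("CH4"), canonical_key("H4C"))
```

Note that case-sensitive parsing alone does not settle the "He" ambiguity: a sentence-initial "He" still parses as helium, which is why the abstract treats formula identification as a classification problem over the surrounding text.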

Published In

WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN: 9781595936547
DOI: 10.1145/1242572
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. chemical formula
  2. conditional random fields
  3. entity extraction
  4. feature boosting
  5. feature selection
  6. query models
  7. ranking
  8. similarity search
  9. support vector machines

Qualifiers

  • Article

Conference

WWW '07: 16th International World Wide Web Conference
May 8-12, 2007
Banff, Alberta, Canada

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

  • (2021) Extraction and Search of Relevant Chemical Documents from the Web. Proceedings of First International Conference on Mathematical Modeling and Computational Science, pp. 251-262. DOI: 10.1007/978-981-33-4389-4_24. Online publication date: 5-May-2021
  • (2019) Unsupervised clustering with smoothing for detecting paratext boundaries in scanned documents. Proceedings of the 18th Joint Conference on Digital Libraries, pp. 53-56. DOI: 10.1109/JCDL.2019.00018. Online publication date: 2-Jun-2019
  • (2018) Formula Ranking within an Article. Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, pp. 123-126. DOI: 10.1145/3197026.3197061. Online publication date: 23-May-2018
  • (2018) Multi-perspective and Domain Specific Tagging of Chemical Documents. Data Science Analytics and Applications, pp. 72-85. DOI: 10.1007/978-981-10-8603-8_7. Online publication date: 24-Feb-2018
  • (2015) Chemical entity extraction using CRF and an ensemble of extractors. Journal of Cheminformatics, 7:S1. DOI: 10.1186/1758-2946-7-S1-S12. Online publication date: 19-Jan-2015
  • (2014) Chemical named entities recognition: a review on approaches and applications. Journal of Cheminformatics, 6:1. DOI: 10.1186/1758-2946-6-17. Online publication date: 28-Apr-2014
  • (2012) ChemEx: information extraction system for chemical data curation. BMC Bioinformatics, 13:S17. DOI: 10.1186/1471-2105-13-S17-S9. Online publication date: 13-Dec-2012
  • (2011) Taking chemistry to the task. Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, pp. 325-334. DOI: 10.1145/1998076.1998137. Online publication date: 13-Jun-2011
  • (2011) Identifying, Indexing, and Ranking Chemical Formulae and Chemical Names in Digital Documents. ACM Transactions on Information Systems, 29:2, pp. 1-38. DOI: 10.1145/1961209.1961215. Online publication date: 1-Apr-2011
  • (2011) Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications. Molecular Informatics, 30:6-7, pp. 506-519. DOI: 10.1002/minf.201100005. Online publication date: 12-Jul-2011
