Domain-specific entity extraction from noisy, unstructured data using ontology-guided search

Bratus, Sergey; Rumshisky, Anna; Khrabrov, Alexy; Magar, Rajenda; Thompson, Paul

doi:10.1007/s10032-011-0149-5

Original Paper
Published: 03 March 2011

Volume 14, pages 201–211 (2011)

International Journal on Document Analysis and Recognition (IJDAR)

Sergey Bratus¹,
Anna Rumshisky³,
Alexy Khrabrov²,
Rajenda Magar¹ &
Paul Thompson¹

Abstract

Domain-specific knowledge is often recorded by experts in the form of unstructured text. For example, in the medical domain, clinical notes from electronic health records contain a wealth of information. Similar practices are found in other domains. The challenge we discuss in this paper is how to identify and extract part names from technicians' repair notes, a noisy, unstructured text data source from General Motors' archives of solved vehicle repair problems, with the goal of developing a robust and dynamic reasoning system to be used as a repair adviser by service technicians. In the present work, we discuss two approaches to this problem. We present an algorithm for ontology-guided entity disambiguation that uses existing knowledge sources, such as domain-specific taxonomies and other structured data. We illustrate its use in the automotive domain, using the GM parts ontology and the unit structure of repair manual text to build context models, which are then used to disambiguate mentions of part-related entities in the text. We also describe extraction of part names with a small amount of annotated data using hidden Markov models (HMM) with shrinkage, achieving an F-score of approximately 80%. Next, we used linear-chain conditional random fields (CRF) to model observation dependencies present in the repair notes. Using CRF did not lead to improved performance, but a slight improvement over the HMM results was obtained by using a weighted combination of the HMM and CRF models.
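The abstract does not say how the HMM and CRF outputs were combined. As an illustration only, one common way to form a weighted combination is to interpolate the per-token label probabilities from the two models and pick the highest-scoring label. The sketch below assumes that scheme; the function name `combine_posteriors`, the labels `PART` and `O`, the mixing weight, and the example probabilities are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# One token's label distribution: label -> probability (hypothetical format).
Posterior = Dict[str, float]


def combine_posteriors(
    hmm: List[Posterior], crf: List[Posterior], weight: float
) -> List[str]:
    """Label each token by maximizing weight * p_hmm + (1 - weight) * p_crf.

    This linear interpolation is an assumed scheme, not the paper's
    documented method.
    """
    if len(hmm) != len(crf):
        raise ValueError("both models must score the same number of tokens")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    labels = []
    for p_hmm, p_crf in zip(hmm, crf):
        # Sort candidates so that ties are broken deterministically.
        candidates = sorted(set(p_hmm) | set(p_crf))
        scores = {
            lab: weight * p_hmm.get(lab, 0.0) + (1.0 - weight) * p_crf.get(lab, 0.0)
            for lab in candidates
        }
        labels.append(max(candidates, key=lambda lab: scores[lab]))
    return labels


# Made-up scores for the three tokens of "replaced fuel pump".
hmm_scores = [{"O": 0.9, "PART": 0.1}, {"O": 0.6, "PART": 0.4}, {"O": 0.2, "PART": 0.8}]
crf_scores = [{"O": 0.8, "PART": 0.2}, {"O": 0.3, "PART": 0.7}, {"O": 0.1, "PART": 0.9}]

print(combine_posteriors(hmm_scores, crf_scores, weight=0.5))
```

In this made-up example the HMM alone would label "fuel" as `O`, while the equal-weight combination labels it `PART`. In practice the weight would be tuned on held-out annotated data.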

Author information

Authors and Affiliations

Department of Computer Science, Dartmouth College, Hanover, NH, USA
Sergey Bratus, Rajenda Magar & Paul Thompson
Thayer School of Engineering, Dartmouth College, Hanover, NH, USA
Alexy Khrabrov
Department of Computer Science, Brandeis University, Waltham, MA, USA
Anna Rumshisky

Corresponding author

Correspondence to Anna Rumshisky.

About this article

Cite this article

Bratus, S., Rumshisky, A., Khrabrov, A. et al. Domain-specific entity extraction from noisy, unstructured data using ontology-guided search. IJDAR 14, 201–211 (2011). https://doi.org/10.1007/s10032-011-0149-5

Received: 21 January 2010
Revised: 11 August 2010
Accepted: 24 January 2011
Published: 03 March 2011
Issue Date: June 2011
DOI: https://doi.org/10.1007/s10032-011-0149-5
