Review of information extraction technologies and applications

Small, Sharon Gower; Medsker, Larry

doi:10.1007/s00521-013-1516-6

Review
Published: 01 December 2013

Volume 25, pages 533–548 (2014)

Neural Computing and Applications

Sharon Gower Small¹ & Larry Medsker¹,²

2873 Accesses
29 Citations

Abstract

Information extraction (IE) is an important and growing field, in part because ubiquitous social media now connect millions of people and produce huge collections of textual information. Mined information is used in a wide array of application areas, from targeted marketing of products to intelligence gathering for military and security needs. IE has its roots in artificial intelligence fields including machine learning, logic and search algorithms, computational linguistics, and pattern recognition. This review summarizes the history of IE, surveys its various uses, identifies current technological accomplishments and challenges, and explores the role that neural and adaptive computing might play in future research. The review also aims to encourage practitioners of neural and adaptive computing to look for interesting applications in the important emerging area of IE.

Author information

Authors and Affiliations

Department of Computer Science and Siena College Institute for Artificial Intelligence, Siena College, Loudonville, NY, USA
Sharon Gower Small & Larry Medsker
Department of Physics and Astronomy, Siena College, Loudonville, NY, USA
Larry Medsker

Corresponding author

Correspondence to Larry Medsker.

About this article

Cite this article

Small, S.G., Medsker, L. Review of information extraction technologies and applications. Neural Comput & Applic 25, 533–548 (2014). https://doi.org/10.1007/s00521-013-1516-6

Received: 16 March 2013
Accepted: 06 November 2013
Published: 01 December 2013
Issue Date: September 2014
DOI: https://doi.org/10.1007/s00521-013-1516-6
