Abstract
Information extraction (IE) is an important and growing field, in part because ubiquitous social media now connect millions of people and produce huge collections of textual information. Mined information is used in a wide array of application areas, from targeted marketing of products to intelligence gathering for military and security needs. IE has its roots in artificial intelligence fields including machine learning, logic and search algorithms, computational linguistics, and pattern recognition. This review summarizes the history of IE, surveys its various uses, identifies current technological accomplishments and challenges, and explores the role that neural and adaptive computing might play in future research. A further goal of this review is to encourage practitioners of neural and adaptive computing to look for interesting applications in the important emerging area of IE.




Cite this article
Small, S.G., Medsker, L. Review of information extraction technologies and applications. Neural Comput & Applic 25, 533–548 (2014). https://doi.org/10.1007/s00521-013-1516-6
DOI: https://doi.org/10.1007/s00521-013-1516-6