Review of information extraction technologies and applications

  • Review
  • Published in: Neural Computing and Applications

Abstract

Information extraction (IE) is an important and growing field, in part because ubiquitous social media now connect millions of people and produce huge collections of textual information. Mined information is used in a wide array of application areas, from targeted marketing of products to intelligence gathering for military and security needs. IE has its roots in artificial intelligence fields including machine learning, logic and search algorithms, computational linguistics, and pattern recognition. This review summarizes the history of IE, surveys its various uses, identifies current technological accomplishments and challenges, and explores the role that neural and adaptive computing might play in future research. A further goal of this review is to encourage practitioners of neural and adaptive computing to look for interesting applications in the important emerging area of IE.

Author information

Corresponding author

Correspondence to Larry Medsker.

About this article

Cite this article

Small, S.G., Medsker, L. Review of information extraction technologies and applications. Neural Comput & Applic 25, 533–548 (2014). https://doi.org/10.1007/s00521-013-1516-6

  • DOI: https://doi.org/10.1007/s00521-013-1516-6
