Introduction to Information Extraction: Basic Notions and Current Trends

  • Focus article (Schwerpunktbeitrag)
Datenbank-Spektrum

Abstract

Transforming unstructured or semi-structured information into structured knowledge is one of the big challenges of today’s knowledge society. While this abstract goal is still unreached and probably unreachable, intelligent information extraction techniques are considered key ingredients on the way to generating and representing knowledge for a wide variety of applications. This is especially true for the current efforts to turn the World Wide Web, the world’s largest collection of information, into the world’s largest knowledge base. This introduction gives a broad overview of the major topics and current trends in information extraction.

Author information

Corresponding author

Correspondence to Wolf-Tilo Balke.

About this article

Cite this article

Balke, WT. Introduction to Information Extraction: Basic Notions and Current Trends. Datenbank Spektrum 12, 81–88 (2012). https://doi.org/10.1007/s13222-012-0090-x

  • DOI: https://doi.org/10.1007/s13222-012-0090-x
