Introduction to Information Extraction: Basic Notions and Current Trends

Balke, Wolf-Tilo

doi:10.1007/s13222-012-0090-x

Focus article (Schwerpunktbeitrag)
Published: 19 May 2012

Datenbank-Spektrum, volume 12, pages 81–88 (2012)

Abstract

Transforming unstructured or semi-structured information into structured knowledge is one of the big challenges of today’s knowledge society. While this abstract goal is still unreached and probably unreachable, intelligent information extraction techniques are considered key ingredients on the way to generating and representing knowledge for a wide variety of applications. This is especially true for the current efforts to turn the World Wide Web, the world’s largest collection of information, into the world’s largest knowledge base. This introduction gives a broad overview of the major topics and current trends in information extraction.

Author information

Authors and Affiliations

Institut für Informationssysteme, Technische Universität Braunschweig, Braunschweig, Germany
Wolf-Tilo Balke

Corresponding author

Correspondence to Wolf-Tilo Balke.

About this article

Cite this article

Balke, WT. Introduction to Information Extraction: Basic Notions and Current Trends. Datenbank Spektrum 12, 81–88 (2012). https://doi.org/10.1007/s13222-012-0090-x

Received: 06 May 2012
Accepted: 08 May 2012
Published: 19 May 2012
Issue Date: July 2012
DOI: https://doi.org/10.1007/s13222-012-0090-x

Keywords

Information extraction
