Fact-Aware Document Retrieval for Information Extraction

Boden, Christoph; Löser, Alexander; Nagel, Christoph; Pieper, Stephan

doi:10.1007/s13222-012-0088-4

Focus article (Schwerpunktbeitrag)
Published: 16 May 2012

Volume 12, pages 89–100, (2012)
Datenbank-Spektrum

Abstract

Querying the textual information in large document collections such as the Web with structured queries is a frequently requested but still unsolved requirement of many users. We present BlueFact, a framework for efficiently retrieving documents that contain structured, factual information from a full-text index. This is an essential building block for information extraction systems that enable ad-hoc analytical queries on unstructured text data as well as knowledge harvesting in a digital archive scenario.

Our approach is based on the observation that documents share a set of common grammatical structures and words for expressing facts. Our system observes these keyword phrases using structural, syntactic, lexical and semantic features in an iterative, cost-effective training process and systematically queries the search engine index with these automatically generated phrases. Next, BlueFact retrieves a list of document identifiers, combines the observed keywords as evidence for factual information and infers the relevance of each document identifier. Finally, we forward the documents in the order of their estimated relevance to an information extraction service. That way, BlueFact can efficiently retrieve all the structured, factual information contained in an indexed collection of text documents.
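The pipeline described above (query the index with keyword phrases, collect document identifiers, combine the phrase hits as evidence, rank by estimated relevance) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy in-memory "index", the example phrases and their weights, and the noisy-OR combination rule are hypothetical stand-ins and are not BlueFact's actual feature generation or scoring model.

```python
from collections import defaultdict


def search(index_docs, phrase):
    """Stand-in for a full-text index lookup: ids of documents containing the phrase."""
    phrase = phrase.lower()
    return [doc_id for doc_id, text in index_docs.items() if phrase in text.lower()]


def rank_documents(index_docs, phrase_weights):
    """Query once per phrase and combine the hits per document with a noisy-OR.

    phrase_weights maps a keyword phrase to an assumed probability that a
    document matching the phrase contains the target fact (hypothetical values).
    """
    miss_prob = defaultdict(lambda: 1.0)  # probability that no matched phrase signals a fact
    for phrase, weight in phrase_weights.items():
        for doc_id in search(index_docs, phrase):
            miss_prob[doc_id] *= 1.0 - weight
    scores = {doc_id: 1.0 - p for doc_id, p in miss_prob.items()}
    # Highest estimated relevance first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy collection and hypothetical phrases for an "acquisition" fact type.
docs = {
    "d1": "Acme Corp agreed to acquire Widget Ltd in a takeover bid worth $2 billion.",
    "d2": "Shares in Acme Corp rose after the takeover bid was announced.",
    "d3": "Heavy rain is expected across the region on Friday.",
}
phrase_weights = {"agreed to acquire": 0.8, "takeover bid": 0.5}

ranking = rank_documents(docs, phrase_weights)
for doc_id, score in ranking:
    print(doc_id, round(score, 2))
```

Documents that match no phrase never enter the ranking, so they are never forwarded to the (comparatively expensive) extraction service; documents matching several phrases accumulate evidence and move to the front of the queue.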

We report the results of a comprehensive experimental evaluation over 20 different fact types on the Reuters News Corpus Volume I (RCV1). BlueFact’s scoring model and feature generation methods significantly outperform existing approaches in terms of fact retrieval performance. BlueFact issues significantly fewer queries against the index, requires significantly less execution time and achieves very high fact recall across different domains.

Notes

The average execution time for an average-sized document in the Reuters news corpus (1.4 KB) was 1.3 seconds. The average extraction time for the largest document in the corpus (52.45 KB) was 28 seconds with the OpenCalais.com information extraction service we used.
Reuters Corpus Volume I (RCV1) available at http://trec.nist.gov/data/reuters/reuters.html.
Noun phrase.
http://lucene.apache.org/.
http://hsqldb.org/.

Acknowledgements

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. FP7-ICT-2009-5-257859, ‘Risk and Opportunity management of huge-scale BUSiness community cooperation’ (ROBUST). Alexander Löser also receives funding from the Federal Ministry of Economics and Technology (BMWi) under grant agreement No. 01MD11014A, ‘MIA-Marktplatz für Informationen und Analysen’ (MIA).

Author information

Authors and Affiliations

University of Technology Berlin, Einsteinufer 17, 10587, Berlin, Germany
Christoph Boden, Alexander Löser, Christoph Nagel & Stephan Pieper

Corresponding author

Correspondence to Christoph Boden.

About this article

Cite this article

Boden, C., Löser, A., Nagel, C. et al. Fact-Aware Document Retrieval for Information Extraction. Datenbank Spektrum 12, 89–100 (2012). https://doi.org/10.1007/s13222-012-0088-4

Received: 04 April 2012
Accepted: 16 April 2012
Published: 16 May 2012
Issue Date: July 2012
DOI: https://doi.org/10.1007/s13222-012-0088-4
