Fact-Aware Document Retrieval for Information Extraction

  • Focus article (Schwerpunktbeitrag)
Datenbank-Spektrum

Abstract

Exploiting textual information from large document collections such as the Web with structured queries is an often-requested but still unsolved requirement of many users. We present BlueFact, a framework for efficiently retrieving documents containing structured, factual information from a full-text index. This is an essential building block for information extraction systems that enable ad-hoc analytical queries on unstructured text data as well as knowledge harvesting in a digital archive scenario.

Our approach is based on the observation that documents share a set of common grammatical structures and words for expressing facts. Our system observes these keyword phrases using structural, syntactic, lexical and semantic features in an iterative, cost-effective training process and systematically queries the search engine index with these automatically generated phrases. Next, BlueFact retrieves a list of document identifiers, combines the observed keywords as evidence for factual information and infers the relevance of each document identifier. Finally, we forward the documents in the order of their estimated relevance to an information extraction service. That way, BlueFact can efficiently retrieve all the structured, factual information contained in an indexed collection of text documents.
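
The retrieval loop described above can be illustrated with a minimal sketch. It is not BlueFact's scoring model, which the abstract does not specify. The toy corpus, the phrase weights, the substring-based `search` function and the additive scoring in `rank_documents` are all assumptions chosen only to show the flow: query with phrases, collect document identifiers, combine the evidence, and order documents by estimated relevance.

```python
from collections import defaultdict

# Toy stand-in for a full-text index: document id -> text (hypothetical data).
DOCS = {
    "d1": "Acme Corp announced it will acquire Widget Inc for $2 billion.",
    "d2": "The weather in Berlin was sunny on Monday.",
    "d3": "Widget Inc said it agreed to acquire a smaller rival.",
}

# Hypothetical keyword phrases with made-up weights, standing in for the
# automatically generated phrases mentioned in the abstract.
PHRASES = {"will acquire": 0.9, "agreed to acquire": 0.8, "announced": 0.3}


def search(phrase, docs):
    """Return ids of documents containing the phrase (stand-in for an index query)."""
    needle = phrase.lower()
    return [doc_id for doc_id, text in docs.items() if needle in text.lower()]


def rank_documents(phrases, docs):
    """Sum the weights of all matching phrases per document, highest score first.

    Additive scoring is an assumption of this sketch, not the paper's model.
    """
    scores = defaultdict(float)
    for phrase, weight in phrases.items():
        for doc_id in search(phrase, docs):
            scores[doc_id] += weight
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_documents(PHRASES, DOCS)
for doc_id, score in ranking:
    print(doc_id, round(score, 2))
```

In this sketch, documents that match no phrase never enter the ranking, so they would not be forwarded to the (comparatively slow) extraction service at all.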

We report results of a comprehensive experimental evaluation over 20 different fact types on the Reuters News Corpus Volume I (RCV1). BlueFact’s scoring model and feature generation methods significantly outperform existing approaches in terms of fact retrieval performance. BlueFact fires significantly fewer queries against the index, requires significantly less execution time and achieves very high fact recall across different domains.

Notes

  1. The average execution time for an average-sized document in the Reuters news corpus (1.4 KB) was 1.3 seconds. The average extraction time for the largest document in the corpus (52.45 KB) was 28 seconds with the OpenCalais.com information extraction service we used.

  2. Reuters Corpus Volume I (RCV1) available at http://trec.nist.gov/data/reuters/reuters.html.

  3. Noun phrase.

  4. http://lucene.apache.org/.

  5. http://hsqldb.org/.

Acknowledgements

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. FP7-ICT-2009-5-257859, ‘Risk and Opportunity management of huge-scale BUSiness community cooperation’ (ROBUST). Alexander Löser also receives funding from the Federal Ministry of Economics and Technology (BMWi) under grant agreement 01MD11014A, ‘MIA-Marktplatz für Informationen und Analysen’ (MIA).

Author information

Corresponding author

Correspondence to Christoph Boden.

About this article

Cite this article

Boden, C., Löser, A., Nagel, C. et al. Fact-Aware Document Retrieval for Information Extraction. Datenbank Spektrum 12, 89–100 (2012). https://doi.org/10.1007/s13222-012-0088-4

  • DOI: https://doi.org/10.1007/s13222-012-0088-4
