ABSTRACT
Information Extraction (IE) pipelines analyze text through several stages. The pipeline's algorithms determine both its effectiveness and its run-time efficiency. In real-world tasks, however, IE pipelines often fail to achieve acceptable run-times because they analyze too much task-irrelevant text. This raises two interesting questions: 1) How much "efficiency potential" depends on the scheduling of a pipeline's algorithms? 2) Is it possible to devise a reliable method to construct efficient IE pipelines? This paper addresses both questions. In particular, we show how to optimize the run-time efficiency of IE pipelines for a given set of algorithms. We evaluate pipelines for three algorithm sets on an industrially relevant task: the extraction of market forecasts from news articles. Using a system-independent measure, we demonstrate that efficiency gains of up to one order of magnitude are possible without compromising a pipeline's original effectiveness.
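To illustrate why scheduling alone can change run-time without changing the output, here is a minimal toy sketch. It is not the method of the paper: the `Stage` class, the cost units, the keyword-based filters, and the example sentences are all hypothetical, and the sketch assumes that stages are independent filters, whereas real IE algorithms have input dependencies that constrain the admissible orders.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    """A hypothetical pipeline stage: a per-sentence cost and a relevance filter."""
    name: str
    cost_per_sentence: float        # assumed, system-independent cost units
    keep: Callable[[str], bool]     # True if the sentence is still relevant


def run(schedule: List[Stage], sentences: List[str]) -> Tuple[List[str], float]:
    """Run stages in order; each stage only sees sentences kept by earlier stages."""
    total_cost = 0.0
    remaining = list(sentences)
    for stage in schedule:
        total_cost += stage.cost_per_sentence * len(remaining)
        remaining = [s for s in remaining if stage.keep(s)]
    return remaining, total_cost


# Toy stand-ins for extraction algorithms (keyword checks, for illustration only).
has_time = Stage("time", 1.0, lambda s: "2025" in s)
has_money = Stage("money", 1.0, lambda s: "$" in s)
is_forecast = Stage("forecast", 10.0, lambda s: "will" in s or "expected" in s)

sentences = [
    "The market will grow to $5 billion by 2025.",
    "The company was founded in Berlin.",
    "Analysts met on Tuesday.",
    "Revenue reached $2 billion last year.",
    "Sales are expected to hit $9 million in 2025.",
    "The CEO will speak tomorrow.",
]

# Same algorithm set, two schedules: expensive stage first vs. cheap filters first.
expensive_first = [is_forecast, has_money, has_time]
cheap_first = [has_time, has_money, is_forecast]

out_slow, cost_slow = run(expensive_first, sentences)
out_fast, cost_fast = run(cheap_first, sentences)

print(out_slow == out_fast)   # both schedules keep the same sentences
print(cost_slow, cost_fast)   # the cheap-first schedule spends fewer cost units
```

Because every stage here is a pure filter, any order yields the same result set; only the amount of text reaching the expensive stage differs, which is where the savings in this toy setting come from.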
Index Terms
Constructing efficient information extraction pipelines