Authors:
Peter Exner and Pierre Nugues
Affiliation:
Lund University, Sweden
Keyword(s):
NLP Framework, Distributed Computing, Large-scale Processing, Hadoop, MapReduce.
Related Ontology Subjects/Areas/Topics:
Applications; Artificial Intelligence; Document Analysis and Understanding; Knowledge Engineering and Ontology Development; Knowledge-Based Systems; Natural Language Processing; Pattern Recognition; Software Engineering; Symbolic Systems
Abstract:
In this paper, we describe KOSHIK, an end-to-end framework to process the unstructured natural language content of multilingual documents. We built the framework on the Hadoop distributed computing infrastructure, which lets KOSHIK scale easily by adding inexpensive commodity hardware. We designed an annotation model that allows the processing algorithms to incrementally add layers of annotation without modifying the original document. We used the Avro binary format to serialize the documents. Avro is designed for Hadoop and allows other data warehousing tools to query the documents directly. This paper reports the implementation choices and details of the framework, the annotation model, the options for querying processed data, and the parsing results on the English and Swedish editions of Wikipedia.
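
As an illustration only, the idea of adding annotation layers without modifying the original document can be sketched with stand-off annotations, where each layer refers to the text by character offsets. This is a minimal in-memory sketch, not KOSHIK's implementation: the names `Annotation`, `Document`, `add_layer`, and `tokenize` are hypothetical, the tokenizer is a toy whitespace splitter, and Avro serialization and Hadoop are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A labelled span over the document text, given as character offsets."""
    begin: int
    end: int
    label: str


@dataclass
class Document:
    """Original text plus named annotation layers (hypothetical structure)."""
    text: str
    layers: dict = field(default_factory=dict)

    def add_layer(self, name, annotations):
        # Layers only point into the text by offsets; the text is never edited.
        if name in self.layers:
            raise ValueError(f"layer {name!r} already exists")
        self.layers[name] = list(annotations)

    def covered_text(self, annotation):
        # Recover the annotated string from the unmodified original text.
        return self.text[annotation.begin:annotation.end]


def tokenize(doc):
    """Toy whitespace tokenizer standing in for a real NLP processing step."""
    annotations = []
    pos = 0
    for word in doc.text.split():
        begin = doc.text.index(word, pos)
        end = begin + len(word)
        annotations.append(Annotation(begin, end, "token"))
        pos = end
    return annotations


doc = Document("Lund is in Sweden")
doc.add_layer("tokens", tokenize(doc))
# A later step adds its own layer on top of the existing ones.
doc.add_layer("entities", [Annotation(0, 4, "LOC"), Annotation(11, 17, "LOC")])

for entity in doc.layers["entities"]:
    print(entity.label, doc.covered_text(entity))
```

Because every layer stores offsets instead of rewriting the text, a later processing step can read earlier layers and add its own without invalidating them.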