A scalable architecture for data-intensive natural language processing†

ZUHAITZ BELOKI; XABIER ARTOLA; AITOR SOROA

doi:10.1017/S1351324917000092

Published online by Cambridge University Press: 09 May 2017

Affiliation (all authors): IXA NLP Group, University of the Basque Country (UPV/EHU), Donostia-San Sebastián
e-mail: zuhaitz.beloki@ehu.eus, xabier.artola@ehu.eus, a.soroa@ehu.eus

Abstract

Computational power needs have greatly increased in recent years, and this is also the case in the Natural Language Processing (NLP) area, where thousands of documents must be processed, i.e., linguistically analyzed, in a reasonable time frame. These computing needs have led to a radical change in the computing architectures and large-scale text processing techniques used in NLP. In this paper, we present a scalable architecture for distributed language processing. The architecture uses Storm to combine diverse NLP modules into a processing chain, which carries out the linguistic analysis of documents. Scalability requires designing solutions that are able to run distributed programs in parallel and across large machine clusters. Using the architecture presented here, it is possible to integrate a set of third-party NLP modules into a single processing chain that can be deployed onto a distributed environment, i.e., a cluster of machines, thus allowing the language-processing modules to run in parallel. No restrictions are placed a priori on the NLP modules apart from being able to consume and produce linguistic annotations following a given format. We show the feasibility of our approach by integrating two linguistic processing chains for English and Spanish. Moreover, we provide several scripts for building a whole distributed architecture from scratch, which can then be easily installed and deployed onto a cluster of machines. The scripts and the NLP modules used in the paper are publicly available and distributed under free licenses. In the paper, we also describe a series of experiments carried out in the context of the NewsReader project with the goal of testing how the system behaves in different scenarios.

Type: Articles
Information: Natural Language Engineering, Volume 23, Issue 5, September 2017, pp. 709–731

DOI: https://doi.org/10.1017/S1351324917000092
Copyright: © Cambridge University Press 2017

Footnotes

† This work has been partially funded by the NewsReader (FP7-ICT-2011-8-316404) project. Zuhaitz Beloki’s work is funded by a PhD grant from the University of the Basque Country.
