Abstract
The growing volume of machine-readable Arabic content on the web is a rich source from which to build new corpora or update existing ones. Such corpora benefit Arabic corpus linguistics, computational linguistics, and natural language processing. In this paper, we present ARARSS, a tool that automatically constructs and updates textual corpora from Rich Site Summary (RSS) feeds. ARARSS collects texts categorized according to user needs, together with their metadata (for example, location, time, and topic) as provided by the RSS sources. We used ARARSS to construct a Modern Standard Arabic corpus comprising 117,819 texts and more than 28 million words. ARARSS is an open-source tool, freely available for download (http://corpus.kacst.edu.sa/more_info.jsp) along with the constructed corpus.
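The general idea of collecting feed items with their metadata and grouping them by topic can be illustrated with a minimal Python sketch. This is not the ARARSS implementation: the sample feed, the function names `parse_items` and `group_by_topic`, and the choice of the RSS `category` element as the topic are all assumptions made for illustration, and a real collector would fetch feeds over the network and then download the full article text behind each link.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical inline RSS 2.0 feed, used so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>http://example.org/a</link>
      <category>Economy</category>
      <pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate>
      <description>Summary of the first item.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.org/b</link>
      <category>Sports</category>
      <pubDate>Tue, 02 Jan 2018 12:30:00 GMT</pubDate>
      <description>Summary of the second item.</description>
    </item>
  </channel>
</rss>"""


def parse_items(rss_xml):
    """Return one dict per <item>, keeping the metadata the feed provides."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # Assumption: the feed's <category> element serves as the topic.
            "topic": item.findtext("category", default="uncategorized"),
            "time": item.findtext("pubDate", default=""),
            "text": item.findtext("description", default=""),
        })
    return items


def group_by_topic(items):
    """Group parsed items into a topic -> list-of-items mapping."""
    corpus = defaultdict(list)
    for entry in items:
        corpus[entry["topic"]].append(entry)
    return dict(corpus)


if __name__ == "__main__":
    corpus = group_by_topic(parse_items(SAMPLE_FEED))
    for topic, entries in corpus.items():
        print(topic, len(entries))
```

Re-running such a collector periodically and skipping links that were already seen is one simple way a corpus could be kept up to date; whether ARARSS works this way is not stated in the abstract.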
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Al-Thubaity, A., Alhoshan, M. (2019). ARARSS: A System for Constructing and Updating Arabic Textual Resources. In: Hassanien, A., Tolba, M., Shaalan, K., Azar, A. (eds) Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2018. AISI 2018. Advances in Intelligent Systems and Computing, vol 845. Springer, Cham. https://doi.org/10.1007/978-3-319-99010-1_24
DOI: https://doi.org/10.1007/978-3-319-99010-1_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99009-5
Online ISBN: 978-3-319-99010-1
eBook Packages: Intelligent Technologies and Robotics (R0)