ARARSS: A System for Constructing and Updating Arabic Textual Resources

Al-Thubaity, Abdulmohsen; Alhoshan, Muneera

doi:10.1007/978-3-319-99010-1_24

Conference paper
First Online: 29 August 2018

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 845)

Abstract

The growing volume of electronically readable Arabic content on the web is a rich source from which to build new corpora or update existing ones. Such corpora benefit Arabic corpus linguistics, computational linguistics, and natural language processing. In this paper, we present ARARSS, a tool that automatically constructs and updates textual corpora from Rich Site Summary (RSS) feeds. ARARSS collects texts, categorized according to user needs, together with their metadata (for example, location, time, and topic) as provided by the RSS sources. We used ARARSS to construct a Modern Standard Arabic corpus comprising 117,819 texts and more than 28 million words. ARARSS is open source and freely available for download (http://corpus.kacst.edu.sa/more_info.jsp), along with the constructed corpus.
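
The general idea of collecting categorized texts and their metadata from RSS feeds can be illustrated with a minimal Python sketch. This is not the ARARSS implementation, which the abstract does not detail; the sample feed, the function names parse_feed and group_by_category, and the record fields are all hypothetical and chosen only for illustration. The sketch assumes a standard RSS 2.0 feed whose items carry title, link, pubDate, category, and description elements.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Markets open higher</title>
      <link>http://example.com/economy/1</link>
      <pubDate>Wed, 29 Aug 2018 10:00:00 +0000</pubDate>
      <category>economy</category>
      <description>Summary of the first article.</description>
    </item>
    <item>
      <title>Local team wins</title>
      <link>http://example.com/sports/2</link>
      <pubDate>Wed, 29 Aug 2018 12:30:00 +0000</pubDate>
      <category>sports</category>
      <description>Summary of the second article.</description>
    </item>
    <item>
      <title>Central bank statement</title>
      <link>http://example.com/economy/3</link>
      <category>economy</category>
      <description>Summary of the third article.</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one record per <item>, keeping the metadata the feed provides."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        records.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            # pubDate uses the RFC 822 date format; it may be absent.
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
            "category": (item.findtext("category") or "uncategorized").strip(),
            "summary": (item.findtext("description") or "").strip(),
        })
    return records


def group_by_category(records):
    """Group records by their feed-supplied category label."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["category"]].append(record)
    return dict(grouped)


if __name__ == "__main__":
    corpus = group_by_category(parse_feed(SAMPLE_FEED))
    for category, items in corpus.items():
        print(category, len(items))
```

An RSS description element usually holds only a summary, so a corpus builder would typically also fetch the page behind each link and strip its boilerplate; that step is omitted here.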

Author information

Authors and Affiliations

The National Center for Artificial Intelligence and Big Data, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia
Abdulmohsen Al-Thubaity & Muneera Alhoshan

Corresponding author

Correspondence to Muneera Alhoshan.

Editor information

Editors and Affiliations

Information Technology Department, Faculty of Computers and Information, Cairo University, Giza, Egypt
Aboul Ella Hassanien
Ain Shams University, Cairo, Egypt
Mohamed F. Tolba
Dubai International Academic City, The British University in Dubai, Dubai, United Arab Emirates
Khaled Shaalan
Faculty of Computers and Information, Benha University, Benha, Egypt
Ahmad Taher Azar

Cite this paper

Al-Thubaity, A., Alhoshan, M. (2019). ARARSS: A System for Constructing and Updating Arabic Textual Resources. In: Hassanien, A., Tolba, M., Shaalan, K., Azar, A. (eds) Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2018. AISI 2018. Advances in Intelligent Systems and Computing, vol 845. Springer, Cham. https://doi.org/10.1007/978-3-319-99010-1_24

DOI: https://doi.org/10.1007/978-3-319-99010-1_24
Published: 29 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99009-5
Online ISBN: 978-3-319-99010-1
eBook Packages: Intelligent Technologies and Robotics (R0)
