ARARSS: A System for Constructing and Updating Arabic Textual Resources

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 845)

Abstract

The growing volume of machine-readable Arabic content on the web is a rich source from which to build new corpora or update existing ones. Such corpora benefit Arabic corpus linguistics, computational linguistics, and natural language processing. In this paper, we present ARARSS, a tool that automatically constructs and updates textual corpora from Rich Site Summary (RSS) feeds. ARARSS collects texts in categories defined by the user's needs, together with their metadata (for example, location, time, and topic) as provided by the RSS sources. We used ARARSS to construct a Modern Standard Arabic corpus comprising 117,819 texts and more than 28 million words. ARARSS is an open-source tool, freely available for download (http://corpus.kacst.edu.sa/more_info.jsp) along with the constructed corpus.
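The collection step described in the abstract can be illustrated with a minimal sketch. It is not the ARARSS implementation: the feed content, the field names, and the functions `parse_feed` and `group_by_topic` are all hypothetical, and the sketch assumes a plain RSS 2.0 feed in which each item carries `category` and `pubDate` elements. It uses only the Python standard library and an inlined feed so that it runs without network access.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Arabic News</title>
    <link>http://example.com/</link>
    <description>Hypothetical feed for illustration</description>
    <item>
      <title>خبر اقتصادي</title>
      <link>http://example.com/economy/1</link>
      <category>economy</category>
      <pubDate>Mon, 01 Jan 2018 08:00:00 GMT</pubDate>
      <description>نص الخبر الاقتصادي</description>
    </item>
    <item>
      <title>خبر رياضي</title>
      <link>http://example.com/sports/2</link>
      <category>sports</category>
      <pubDate>Tue, 02 Jan 2018 09:30:00 GMT</pubDate>
      <description>نص الخبر الرياضي</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one record per <item>, keeping the text and its feed metadata."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        records.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # Assumption: the feed's <category> element serves as the topic label.
            "topic": item.findtext("category", default="uncategorized"),
            # RSS 2.0 dates follow RFC 822, which email.utils can parse.
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
            "text": item.findtext("description", default=""),
        })
    return records


def group_by_topic(records):
    """Bucket records by topic so each category can be stored separately."""
    groups = defaultdict(list)
    for record in records:
        groups[record["topic"]].append(record)
    return dict(groups)


if __name__ == "__main__":
    for topic, items in group_by_topic(parse_feed(SAMPLE_FEED)).items():
        print(topic, len(items))
```

A feed's `description` often holds only a summary, so a fuller collector would probably also fetch the page behind each `link` and extract the article body from it. Re-running the parser on a schedule and skipping links already stored is one simple way to keep such a corpus updated.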

Notes

  1. http://www.sqlabs.com/sqlitemanager.php

  2. http://rometools.github.io/rome/

  3. https://jsoup.org

Corresponding author

Correspondence to Muneera Alhoshan.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this paper

Al-Thubaity, A., Alhoshan, M. (2019). ARARSS: A System for Constructing and Updating Arabic Textual Resources. In: Hassanien, A., Tolba, M., Shaalan, K., Azar, A. (eds) Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2018. AISI 2018. Advances in Intelligent Systems and Computing, vol 845. Springer, Cham. https://doi.org/10.1007/978-3-319-99010-1_24
