Semantic similarity based approach for reducing Arabic texts dimensionality

Awajan, Arafat

doi:10.1007/s10772-015-9284-6

Special Issue Article
Published: 09 June 2015

Volume 19, pages 191–201, (2016)

International Journal of Speech Technology

Arafat Awajan (ORCID: orcid.org/0000-0002-7067-5658)

Abstract

An efficient method is introduced to represent large Arabic texts in a comparatively smaller size without losing significant information. The proposed method uses distributional semantics to build the word-context matrix, which represents the distribution of words across contexts, and to transform the text into a vector space model (VSM) representation based on word semantic similarity. The linguistic features of the Arabic language, together with semantic information extracted from lexical-semantic resources such as Arabic WordNet and named-entity gazetteers, are used to improve the text representation and to create clusters of similar and related words. Distributional similarity measures are used to capture the semantic similarity between words and to group similar words into clusters. The conducted experiments show that the proposed method significantly reduces the size of the text representation, by about 27% compared with the stem-based VSM and by about 50% compared with the traditional bag-of-words model. The results also show that the amount of dimension reduction depends on the size and shape of the analysis windows as well as on the content of the text.
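The general idea described in the abstract (a word-context matrix built over analysis windows, a distributional similarity measure, and clusters of similar words that replace individual words as dimensions) can be illustrated with a minimal sketch. This is not the author's implementation: the function names, the symmetric window, the cosine measure, the greedy clustering rule, and the threshold are all assumptions chosen for illustration, and the sketch omits the Arabic morphological analysis, Arabic WordNet, and gazetteer lookups that the article relies on.

```python
from collections import Counter, defaultdict
from math import sqrt


def word_context_matrix(tokens, window=2):
    """Count, for each word, the words seen within `window` positions of it.

    Assumption: a symmetric window; the article studies several window
    sizes and shapes that are not reproduced here.
    """
    matrix = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                matrix[word][tokens[j]] += 1
    return matrix


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cluster_words(matrix, threshold=0.99):
    """Greedy single-pass clustering (an illustrative stand-in).

    A word joins the first cluster whose representative (its first member)
    has a context vector at least `threshold` similar; otherwise it starts
    a new cluster.
    """
    clusters = []
    for word in sorted(matrix):
        for cluster in clusters:
            if cosine(matrix[word], matrix[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def reduced_vector(tokens, clusters):
    """Represent the text with one frequency count per word cluster."""
    label = {w: i for i, c in enumerate(clusters) for w in c}
    counts = Counter(label[t] for t in tokens)
    return [counts[i] for i in range(len(clusters))]


if __name__ == "__main__":
    # Toy English tokens stand in for a preprocessed Arabic text.
    tokens = "the cat drinks milk the dog drinks milk".split()
    matrix = word_context_matrix(tokens, window=1)
    clusters = cluster_words(matrix)
    print(clusters)
    print(reduced_vector(tokens, clusters))
```

In this toy input, words that occur in identical contexts end up in the same cluster, so the text needs fewer dimensions than a bag-of-words vector with one dimension per distinct word. How much reduction is obtained on real text depends on the window and on the threshold, which is consistent with the dependence on window size and shape reported in the abstract.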

Author information

Authors and Affiliations

The King Hussein Faculty of Computing Sciences, Princess Sumaya University for Technology, Khalil Saket Street, Al-Jubaiha, P.O. Box 1438, Amman, 11941, Jordan
Arafat Awajan

Corresponding author

Correspondence to Arafat Awajan.

About this article

Cite this article

Awajan, A. Semantic similarity based approach for reducing Arabic texts dimensionality. Int J Speech Technol 19, 191–201 (2016). https://doi.org/10.1007/s10772-015-9284-6

Received: 19 February 2015
Accepted: 28 May 2015
Published: 09 June 2015
Issue Date: June 2016
DOI: https://doi.org/10.1007/s10772-015-9284-6
