An effective short text conceptualization based on new short text similarity

  • Original Article
Social Network Analysis and Mining

Abstract

Short texts such as messages, tweets, and comments have recently become a large portion of online text data. They are limited in length and differ from traditional documents in their shortness and sparseness. As a result, short text tends to be ambiguous, and the degree of ambiguity varies across languages. Because Arabic is a highly inflectional language in which a single word can have multiple meanings, short text representation plays a vital role in any text mining task. To address these issues, we propose an efficient representation for short text based on concepts instead of terms, using BabelNet as an external knowledge source. However, during conceptualization, searching for the concepts that correspond to a polysemous term yields multiple matches. Assigning a term to a concept is therefore a crucial step, and we believe that short text similarity can help overcome the problem of mapping a term to its corresponding concept. In this paper, we reintroduce the Web-based Kernel function for measuring the semantic relatedness between concepts, in order to disambiguate an expression that matches multiple concepts. The proposed method has been evaluated using an Arabic short text categorization system, and the obtained results illustrate the value of our contribution.
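
The disambiguation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the web search engine is replaced by a hypothetical in-memory dictionary (`FAKE_SEARCH`), the candidate concepts that BabelNet would return are hand-written English glosses (English is used only for readability), and plain term frequencies stand in for the TF-IDF weights of the original Web-based Kernel. Each text is expanded into the normalised centroid of its retrieved snippets, and the kernel is the inner product of two such centroids.

```python
import math
from collections import Counter

# Hypothetical stand-in for a web search engine: query -> retrieved snippets.
FAKE_SEARCH = {
    "bank river water": [
        "the river bank flooded after heavy rain",
        "fishing from the bank of the river",
    ],
    "bank: financial institution": [
        "a bank is a financial institution that accepts deposits and lends money",
        "open an account at the bank for money deposits",
    ],
    "bank: sloping land beside a river": [
        "the bank is the land alongside a river or lake",
        "river bank erosion moves water and soil",
    ],
}

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "is", "that", "and", "at",
             "for", "or", "from", "after", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic non-stopword tokens."""
    return [w for w in text.lower().split()
            if w.isalpha() and w not in STOPWORDS]


def expand(query, search=FAKE_SEARCH):
    """Return the L2-normalised centroid of the query's snippet vectors."""
    snippets = search.get(query, [query])  # fall back to the query itself
    centroid = Counter()
    for snippet in snippets:
        tf = Counter(tokenize(snippet))
        norm = math.sqrt(sum(v * v for v in tf.values()))
        if norm == 0:
            continue
        for term, value in tf.items():
            centroid[term] += value / norm  # each snippet vector has unit length
    norm = math.sqrt(sum(v * v for v in centroid.values()))
    return {t: v / norm for t, v in centroid.items()} if norm else {}


def kernel(text_a, text_b):
    """Inner product of the two expanded, normalised representations."""
    vec_a, vec_b = expand(text_a), expand(text_b)
    return sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())


def disambiguate(context, candidates):
    """Pick the candidate concept whose expansion is closest to the context."""
    return max(candidates, key=lambda c: kernel(context, c))


CONTEXT = "bank river water"
CANDIDATES = ["bank: financial institution",
              "bank: sloping land beside a river"]
best = disambiguate(CONTEXT, CANDIDATES)
print(best, round(kernel(CONTEXT, best), 3))
```

With a real search backend, `expand` would issue the query and collect the returned snippets. The rest of the pipeline would stay the same, which is why the search step is isolated behind a single dictionary lookup here.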

Correspondence to Mohammed Bekkali.

Cite this article

Bekkali, M., Lachkar, A. An effective short text conceptualization based on new short text similarity. Soc. Netw. Anal. Min. 9, 1 (2019). https://doi.org/10.1007/s13278-018-0544-8

  • DOI: https://doi.org/10.1007/s13278-018-0544-8
