1 Introduction

With 490 million speakers [1] across the world, Hindi stands fourth in popularity after Mandarin, Spanish, and English [2]. On social media such as Twitter, Facebook, and WhatsApp, most Indians now prefer Hindi for communication, which generates large volumes of data. Mining sentiment from this data manually is tedious for individuals and organizations alike. Therefore, an automated system is required to identify sentiment in Hindi text.

Sentiment analysis identifies the orientation of a text towards a specific target such as a product, an individual, or an organization. When sarcasm is present, sentiment prediction often goes wrong. Sarcasm often conveys a negative meaning using positive or intensified positive words, for example, “I love waiting forever for the doctor”. At first glance, the sentence conveys positive sentiment, but it is sarcastic. As a result, most existing sentiment analyzers fail to detect the real sentiment.

Recently, researchers have developed many sarcasm detectors for text written in English [3,4,5,6,7,8,9]. However, only one reported work addresses sarcasm detection in Hindi text [10]. That work does not use natural Hindi tweets in its experiments: its training and testing sets consist of Hindi tweets translated from English tweets. In this article, we propose a framework for sarcasm detection in natural Hindi tweets using online Hindi news as the context. A sample of natural Hindi sarcastic tweets is shown in Fig. 1.

Fig. 1. A sample of Hindi sarcastic tweets.

Tweets and news are similar in nature, as both describe current happenings in their own way. News gives authenticated knowledge about real-time happenings across the world. Similarly, users worldwide share their feelings about current happenings through tweets, which may or may not be authentic; this depends on the individual user and their likes and dislikes. If a user likes a current happening, they share a positive feeling about it. If they do not, they may share either a directly negative or a sarcastic feeling. In this approach, news is used as the context of a given tweet to check the tweet against the facts. If a tweet follows the orientation of the related news, it is considered a simple (non-sarcastic) tweet, and the obtained sentiment is correct. If the tweet does not follow the orientation of the related news, it is classified as sarcastic, and its real sentiment is the opposite of the obtained one.

The rest of the paper is organized as follows: Sect. 2 describes related work. The proposed scheme is discussed in Sect. 3. The results are analyzed in Sect. 4, and the article is concluded in Sect. 5.

2 Related Work

Sarcasm detection in resource-rich languages like English is well explored [3,4,5,6,7,8,9]. In the context of Indian languages, it is yet to be explored, mainly because benchmark resources for training and testing are unavailable.

Desai and Dave [10] proposed a Support Vector Machine (SVM) based sarcasm detector for Hindi sentences. They trained and tested the SVM classifier on Hindi tweets. In the absence of annotated datasets, they translated English tweets into Hindi. As a result, they relied on the same kind of features used for sarcasm detection in English text, such as emoticons and punctuation marks. These methods cannot be applied directly to natural Hindi sarcastic tweets such as those shown in Fig. 1.

3 Proposed Scheme

This section describes the proposed framework for sarcasm detection in Hindi tweets, shown in Fig. 2. Online news is used as the context against which a given tweet is checked for agreement with actual happenings. We assume that online news is correct and authenticated.

Fig. 2. Proposed framework for sarcasm detection in Hindi tweets.

For every news item in the authenticated news corpus, keywords are extracted using Algorithm 1. These keywords are used to collect possibly related tweets. To predict whether a tweet is sarcastic, the framework takes the tweet as input and extracts its important keywords using Algorithm 1. The extracted keywords are then used to map the tweet to the related authenticated news in the news corpus. Finally, both sets of keywords (input tweet and related news) are fed to the sarcasm detection algorithm, which classifies the tweet as sarcastic or not.
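The mapping of a tweet to its related news can be sketched as a keyword-overlap lookup. The text does not state which similarity measure is used, so the Jaccard overlap below is an assumption, as are the function names and the transliterated sample keywords.

```python
from typing import Dict, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of keywords common to both sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def map_tweet_to_news(
    tweet_keywords: Set[str],
    news_keywords: Dict[str, Set[str]],
    min_overlap: float = 0.0,
) -> Optional[str]:
    """Return the id of the news item whose keywords overlap most with the tweet.

    Returns None when no news item scores strictly above min_overlap,
    i.e. by default when no keyword is shared at all.
    """
    best_id, best_score = None, min_overlap
    for news_id, keywords in news_keywords.items():
        score = jaccard(tweet_keywords, keywords)
        if score > best_score:
            best_id, best_score = news_id, score
    return best_id


# Hypothetical, transliterated keywords used only for illustration.
corpus = {
    "n1": {"bharat", "match", "jeet"},
    "n2": {"film", "release", "shukravar"},
}
related = map_tweet_to_news({"bharat", "match", "haar"}, corpus)  # "n1"
```

A tweet that shares no keyword with any news item is left unmapped (`None`) rather than forced onto an unrelated item.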

3.1 News Collection

After browsing several online news sources, we manually collected around 5000 one-line Hindi news items on recent topics from top-rated news sources, as shown in Fig. 3. The collected news belongs to different categories such as sports, movies, business, and politics. During preprocessing, redundant news items were eliminated. News related to murder, rape, bomb blasts, etc. was also discarded, because we believe that sarcastic tweets are unlikely to be posted on such serious topics. After preprocessing, the news corpus consists of 2000 unique authenticated news items.
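The preprocessing described above (removing redundant items and discarding serious topics) might look roughly like the following. The exclusion terms are transliterated placeholders chosen for illustration, since the actual list is not given, and whitespace and case normalization is only one possible definition of "redundant".

```python
from typing import Iterable, List, Set

# Placeholder topic terms (assumption): the text names murder, rape and
# bomb blasts as examples but does not give its exclusion list.
SERIOUS_TERMS: Set[str] = {"hatya", "balatkar", "bam"}


def preprocess_news(
    headlines: Iterable[str], serious_terms: Set[str] = SERIOUS_TERMS
) -> List[str]:
    """Drop duplicate headlines and headlines that mention a serious topic."""
    seen: Set[str] = set()
    kept: List[str] = []
    for headline in headlines:
        # Collapse whitespace and case so near-identical copies compare equal.
        normalized = " ".join(headline.split()).lower()
        if not normalized or normalized in seen:
            continue  # empty or redundant headline
        seen.add(normalized)
        if any(term in normalized.split() for term in serious_terms):
            continue  # serious topic, assumed unlikely to attract sarcasm
        kept.append(headline.strip())
    return kept


# Hypothetical sample input.
sample = [
    "Bharat ne match jeeta",
    "bharat ne  match jeeta",  # duplicate after normalization
    "shahar mein hatya",       # serious topic
    "Nayi film release",
]
cleaned = preprocess_news(sample)
```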

Fig. 3. Procedure for news collection.

3.2 Keyword Extraction

This section describes the procedure for extracting keywords from sentences, given in Algorithm 1.

Algorithm 1. Keyword extraction.

Algorithm 1 takes the authenticated news corpus \((\complement )\) as input and finds the Part-of-Speech (POS) tag of every token in every news item. For each news item, the tokens tagged as noun (NN), verb (V), adjective (ADJ), or adverb (ADV) are extracted as the \(\langle Set~of~Keywords \rangle \) for that news item.
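A minimal sketch of this filtering step, assuming the tagger returns `(token, tag)` pairs. The tag prefixes are an assumption: the text writes V, ADJ, and ADV, while IL-style tagsets commonly spell these VM/VAUX, JJ, and RB, so both spellings are accepted here. The tagged sentence is a made-up transliterated example, not taken from the corpus.

```python
from typing import List, Set, Tuple

# Tag prefixes treated as content words (assumed mapping, see lead-in).
KEYWORD_TAG_PREFIXES: Tuple[str, ...] = ("NN", "V", "ADJ", "JJ", "ADV", "RB")


def extract_keywords(tagged: List[Tuple[str, str]]) -> Set[str]:
    """Keep tokens whose POS tag starts with one of the keyword prefixes."""
    return {
        token
        for token, tag in tagged
        if tag.startswith(KEYWORD_TAG_PREFIXES)
    }


# Hypothetical tagged sentence for illustration only.
tagged_sentence = [
    ("kya", "WQ"),
    ("aap", "PRP"),
    ("ghar", "NN"),
    ("Dilli", "NNP"),
    ("hai", "VAUX"),
    ("?", "SYM"),
]
keywords = extract_keywords(tagged_sentence)
```

Because matching is by prefix, proper nouns (NNP) and auxiliary verbs (VAUX) are kept as well; whether the original algorithm keeps them is not stated.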

POS Tagging. To identify POS tags in Hindi sentences, we developed a Hidden Markov Model (HMM) based POS tagger. It uses the Indian Language (IL) standard tagset, which consists of 24 tags [11]. For example, a six-token Hindi question is tagged token by token as WQ | PRP | NN | NNP | VAUX | SYM, where the final token “?” receives the SYM tag. The Hindi POS tagger tool is available at http://www.taghindi.herokuapp.com.
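For readers unfamiliar with HMM tagging, decoding is typically done with the Viterbi algorithm. The sketch below is a generic textbook decoder, not the authors' tagger; the two-tag model, its probabilities, and the transliterated tokens are invented purely to make the example runnable.

```python
import math
from typing import Dict, List


def viterbi(
    tokens: List[str],
    tags: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
    unseen: float = 1e-6,
) -> List[str]:
    """Return the most probable tag sequence for tokens under an HMM."""
    if not tokens:
        return []

    def log_emit(tag: str, token: str) -> float:
        # Unseen (tag, token) pairs get a small floor probability.
        return math.log(emit_p[tag].get(token, unseen))

    # best[i][t]: log-probability of the best path ending in tag t at position i
    best = [{t: math.log(start_p[t]) + log_emit(t, tokens[0]) for t in tags}]
    back: List[Dict[str, str]] = []
    for token in tokens[1:]:
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + math.log(trans_p[p][t]))
            scores[t] = best[-1][prev] + math.log(trans_p[prev][t]) + log_emit(t, token)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model with invented probabilities (illustration only).
TAGS = ["NN", "VM"]
START = {"NN": 0.7, "VM": 0.3}
TRANS = {"NN": {"NN": 0.4, "VM": 0.6}, "VM": {"NN": 0.5, "VM": 0.5}}
EMIT = {"NN": {"ladka": 0.9, "khelta": 0.1}, "VM": {"ladka": 0.1, "khelta": 0.9}}

decoded = viterbi(["ladka", "khelta"], TAGS, START, TRANS, EMIT)
```

Log-probabilities are used to avoid numerical underflow on longer sentences; all start and transition probabilities must be non-zero for this simple version.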

3.3 Tweets Collection

To collect news-related tweets, we used the \(\langle Set~of~keywords \rangle \) extracted for every news item in the news corpus to search Twitter, as shown in Fig. 4. Using the keyword sets from all 2000 unique news items, around 5000 Hindi tweets were collected. A sample set of news and related tweets is available at https://github.com/rkp768/hindi-pos-tagger/tree/master/News%20and%20tweets.

Fig. 4. Procedure of tweets collection.

3.4 Sarcasm Detection

In this section, an algorithm is proposed to classify a tweet as sarcastic or not in the context of online news. The procedure is given in Algorithm 2.

Algorithm 2. Sarcasm detection.

Algorithm 2 takes both sets of keywords (one for the input tweet and one for the related news) as input and compares them. If both sets contain similar keywords, the news and the tweet have the same orientation; the tweet is therefore authentic and not sarcastic. If the sets do not contain similar keywords, the algorithm counts the positive and negative keywords in both the news and the tweet using a predefined list of Hindi words with polarity values, the Hindi SentiWordNet, available at https://github.com/smadha/SarcasmDetector/blob/master/Hindi%20SentiWordNet/HSWN_WN.txt. It then compares the counts. If the news contains more positive keywords than the input tweet, the user is intentionally negating the temporal fact (news): the orientation of the news is positive, and the orientation of the tweet is negative. Because of this contradiction, the input tweet is classified as sarcastic. Similarly, if the news contains more negative keywords than the input tweet, the tweet is classified as sarcastic. In all other cases, the tweet is classified as not sarcastic.
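The decision rule described above can be sketched as follows. Two points are assumptions rather than details from the text: "similar keywords" is interpreted as the share of tweet keywords that also occur in the news reaching a threshold (default: all of them), and the polarity lexicon is represented as a plain dictionary from word to a signed value rather than parsed from the Hindi SentiWordNet file. The sample words are transliterated placeholders.

```python
from typing import Dict, Set, Tuple


def polarity_counts(keywords: Set[str], lexicon: Dict[str, int]) -> Tuple[int, int]:
    """Count keywords with positive and with negative polarity in the lexicon."""
    positive = sum(1 for k in keywords if lexicon.get(k, 0) > 0)
    negative = sum(1 for k in keywords if lexicon.get(k, 0) < 0)
    return positive, negative


def is_sarcastic(
    tweet_kw: Set[str],
    news_kw: Set[str],
    lexicon: Dict[str, int],
    similarity_threshold: float = 1.0,  # assumed meaning of "similar keywords"
) -> bool:
    """Classify a tweet as sarcastic given the keywords of its related news."""
    # Step 1: similar keyword sets mean the tweet follows the news.
    if tweet_kw:
        overlap = len(tweet_kw & news_kw) / len(tweet_kw)
        if overlap >= similarity_threshold:
            return False
    # Step 2: otherwise compare positive and negative keyword counts.
    news_pos, news_neg = polarity_counts(news_kw, lexicon)
    tweet_pos, tweet_neg = polarity_counts(tweet_kw, lexicon)
    return news_pos > tweet_pos or news_neg > tweet_neg


# Hypothetical polarity lexicon and keyword sets for illustration.
lexicon = {"jeet": 1, "shandar": 1, "haar": -1, "bekar": -1}
news = {"bharat", "match", "jeet"}
contradicting = is_sarcastic({"bharat", "match", "bekar"}, news, lexicon)
agreeing = is_sarcastic({"bharat", "jeet"}, news, lexicon)
```

In the first call the news has one positive keyword and the tweet has none, so the tweet is flagged as sarcastic; in the second call every tweet keyword appears in the news, so the tweet is treated as authentic.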

4 Results and Discussion

This section describes the experimental results of the proposed approach for identifying sarcasm in Hindi tweets. Performance is measured with four metrics: Precision, Recall, F1-measure, and Accuracy. A set of 500 tweets drawn at random from the collected Hindi tweet corpus is used as the testing set. Three annotators labeled each tweet in the testing set as sarcastic or not, and their annotations serve as the ground truth. The confusion matrix for identifying sarcasm in the 500 tweets is given in Table 1. The precision, recall, F1-measure, and accuracy derived from it are given in Table 2.

Table 1. Confusion matrix for sarcasm detection in Hindi tweets.
Table 2. Precision, Recall, F1-measure and Accuracy attained by proposed approach
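For reference, the four measures follow from the confusion-matrix counts in the standard way. The sketch below uses placeholder counts that are not the values from Table 1; the function name is likewise only illustrative.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute precision, recall, F1-measure and accuracy from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Placeholder counts for illustration only (not the paper's results).
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```

Here "positive" means "sarcastic": `tp` counts sarcastic tweets correctly flagged, `fp` counts non-sarcastic tweets wrongly flagged, and so on.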

When identifying sarcasm in Hindi tweets with respect to the news context, we compare the \(\langle Set~of~Keywords \rangle \) of the input tweet with that of the corresponding related news. We assume that all news carries neutral sentiment, whereas tweets carry positive, negative, or neutral sentiment. Therefore, instead of comparing sentiment, we compare individual keywords and their orientation. If the news and the tweet have the same orientation, the tweet is non-sarcastic. If their orientations differ, the user is intentionally trying to negate the temporal fact, and hence the input tweet is sarcastic.

Limitations. The proposed framework has the following limitations:

  1. In this research, news time-stamps are not available. Hence, when mapping a tweet to a unique related news item, we depend entirely on keywords, which does not guarantee that the news and the tweet belong to the same time period.

  2. If a few keywords match between a news item and a tweet but the two belong to different time periods, the sarcasm prediction may or may not be correct.

5 Conclusion and Future Direction

In the absence of sufficient annotated datasets for training and testing, the traditional sarcasm detection methods used for English cannot be applied to Hindi tweets. Therefore, this article proposes a novel framework for sarcasm detection in Hindi tweets using online news as context. As news usually carries neutral sentiment, we used the important keywords of both the input tweet and its related news to decide whether the tweet is sarcastic with respect to that news. The proposed approach attains 79.4% accuracy.

In future work, we will address the current limitations of the framework by adding time-stamp verification when mapping a tweet to the news.
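One possible shape for such a time-stamp check is a simple window test applied before a tweet is paired with a news item. This is only a sketch of the idea: the two-day window, the function name, and the availability of both time-stamps are all assumptions.

```python
from datetime import datetime, timedelta


def within_window(
    tweet_time: datetime,
    news_time: datetime,
    window: timedelta = timedelta(days=2),  # assumed window size
) -> bool:
    """True when the tweet was posted within `window` of the news item."""
    return abs(tweet_time - news_time) <= window


# Hypothetical time-stamps for illustration.
news_time = datetime(2017, 3, 1, 9, 0)
same_period = within_window(datetime(2017, 3, 2, 18, 30), news_time)
different_period = within_window(datetime(2017, 4, 1, 9, 0), news_time)
```

A keyword match would then count only when `within_window` also holds, which targets the case where a tweet and a news item share keywords but refer to different events.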