
EiAP-BC: A Novel Emoji Aware Inter-Attention Pair Model for Contextual Spam Comment Detection Based on Posting Text

Published: 23 November 2024

Abstract

Detecting spam comments on social media remains an active research topic, especially on public figure and celebrity accounts in Indonesia. Previous studies, however, have focused only on the comments themselves, without considering the context of the post or the use of emojis on social media. This study proposes a new deep learning model called EiAP-BC (Emoji-aware Inter-Attention Pair BiLSTM CNN) for spam comment detection. The model takes a novel approach that considers the contextual information of the post, detecting spam comments through the relatedness between a comment and its corresponding post, information that is usually discarded. The model can also handle emoji content in comments and posts, which is widely used on social media. The model was tested on the SPAMID-PAIR dataset, created from Indonesian-language social media, achieving a highest accuracy of 88% and performing competitively with existing deep learning models. To assess its generalization capability, the EiAP-BC model was also evaluated on similar public datasets and models for sentence-pair classification tasks, and an ablation study was conducted to determine the importance of each layer and their coordination. The EiAP-BC model exhibits several advantages in size, training speed, and parameter count compared to existing state-of-the-art models.

1 Introduction

Spam on social media can be categorized into two types: spam content within user posts and spammer accounts [1, 2]. A user's post is labeled as spam content if it contains spam material. Spam content can take the form of text [3], images [4], or videos [5]. Text content can appear as posts or as comments on specific posts. On popular social media platforms like Facebook [6, 7], Twitter [8, 9], YouTube [10, 11], and Instagram [12–16], spam comments are more commonly found. A spam comment can be defined as a comment containing text unrelated (or only loosely related) to the post's context. On social media, many spam comments carry commercial content, such as product advertisements or enticing requests to click on a link. Spam comments can consist of regular text (in ASCII/Latin Unicode format) or a combination of emojis and text; in some cases, an entire comment consists only of emoji symbols. The prevalence of spam content disrupts the flow of information between posts and comments, resulting in confusion and ambiguity.
Detecting spam comments has emerged as a prevalent issue within natural language processing research. Various solution methods, encompassing machine learning and deep learning techniques, have been employed to detect spam content and comments. From existing research, it can be observed that (1) spam content detection is often performed solely on the comment content, without considering the context of the post, and (2) spam content detection tends to largely ignore or discard content that includes emojis and symbols, which can carry important meaning on social media [7, 17–21]. This approach to spam comment detection has weaknesses that need to be addressed for the following reasons: (1) To detect accurately whether a comment is spam, it is crucial to understand the relationship and context between the comment and the related post; relying solely on the comment fails to capture the contextual information from the post. (2) Spam comments on social media often heavily involve emojis. Emojis are symbols widely used by users to express their intentions and have become an informal language on social media platforms. In our recent study [22], we tried an ensemble machine learning method to solve this problem, but the performance still needs to be improved with a new technique, such as deep learning with paired input.
Based on the two issues mentioned in the previous paragraph, this research proposes an approach, method, and novel model to address them. The study develops a spam comment detection model that uses both the post and the comment as inputs for training the proposed deep learning model. This study employs a novel approach in which post and comment data are treated as paired data, and the problem of identifying spam comments is cast as a subtask of sentence-pair classification. In previous research, sentence-pair classification has been applied to tasks such as text matching [23], text similarity [24, 25], duplicate text detection [26–28], paraphrase identification, entailment classification [29, 30], and more. This novel approach detects spam comments by analyzing the post's context and accounting for the prevalence of emojis on social media. The developed model utilizes paired data as an inseparable pair, consisting of the post and the corresponding comment, enabling the exploration of the interdependencies and relationships between these two components to capture the appropriate context between the comment and the post. A comment is classified as spam if there is no meaningful connection between the comment and the post. Deep learning models are needed to fulfill these requirements.
In accordance with the previously mentioned specifications, a deep learning model architecture is needed that contains an input, embedding, and encoding layer that consists of three parts (an inter-attention layer, a fusion layer, and a feed-forward neural network layer) and ends with a final prediction layer, which will be developed in this study. The primary dataset used in this study is the SPAMID-PAIR dataset [15] and several other supporting datasets. This dataset was selected due to its format as paired post–comment data, including labels, which is rare for Indonesian language datasets. To compare the proposed model performance with similar model architectures, we also use some other datasets, including Entailment Wrete [31], SNLI (English) [32], SciTail [33], and IndoNLI [34].
The main contributions of this research are as follows: (1) Developing a spam comment detection model called the Emoji Aware Inter Attention Pair BiLSTM-CNN for Contextual Spam Comment Detection (EiAP-BC) model. This model identifies spam comments by analyzing the contextual information of their associated posts, using paired post–comment data as input and incorporating embedding, first and second Bi-LSTM encoding, inter-attention pair, fusion, and final CNN encoding layers to obtain the desired prediction. To the best of our knowledge, this approach has not previously been applied to spam comment detection on social media, especially in Indonesia. The model possesses several advantages over other architectures and demonstrates competitive performance against fine-tuned BERT, a state-of-the-art NLP architecture. (2) The EiAP-BC model architecture is emoji aware, allowing it to accept, handle, and utilize emoji features in text and symbol form, which previous researchers have often overlooked. (3) This research also conducts comparisons with similar models and datasets to demonstrate the generalization ability of the proposed model, and an ablation study is performed to determine which layers contribute significantly to the architecture. Lastly, (4) this research develops a web service prototype implementing the EiAP-BC model, enabling its direct use by other applications with token-based security.
The rest of the article is organized as follows. The related works section discusses various studies related to the topic, reviewing previous research as references and establishing the novel contribution of this work. The methodology section describes the research methods and procedures employed in this study. The results and discussion section presents and analyzes the findings. Finally, the conclusion section summarizes the research findings, outlines future research directions and limitations of the study, and provides recommendations for further development.

2 Related Works

Research on spam content detection, especially text spam on social media, has been extensively conducted. Previous studies have employed various methods, ranging from statistical-based [35–37], rule-based [38], and machine learning-based [39–41] to deep learning methods [42–44]. Most of these studies have been conducted on English-language data, which presents a challenge for research conducted in other languages.
Indonesia, the fourth most populous country in the world, is characterized by a wide range of regional languages and dialects. Despite having a national language, Bahasa Indonesia still exhibits distinct characteristics due to the influence of regional and foreign languages. In natural language processing research, Indonesian language data is not readily available. Until now, there are only a few public dataset repositories in Indonesia collected by some researchers [31, 34, 45, 46]. Among the available datasets, text spam content datasets are rarely found, especially in the case of social media in Indonesia.
On social media, spam content is often found in comments related to a specific post. Most NLP research across different languages overlooks emoji content in datasets and treats spam comment detection as a task over individual comments, separate from the related posts [47]. As a result, the detection of spam comments is based solely on the comment data. However, an appropriate approach to determining whether a comment is spam should consider the context of the comment within the post it refers to. Such datasets are challenging to find, even in English. To our knowledge, the SPAMID-PAIR dataset [48] is the first publicly available dataset that collects post–comment pairs from Indonesian-language social media for classification. This research uses it as the main dataset for training and evaluation.
Several related studies on classifying/categorizing text data as pairs have been conducted using machine learning or deep learning methods. The architecture developed in [49] addressed the problem of semantic textual similarity (STS) by utilizing a Siamese architecture with a pair of BiLSTMs, using semantic vector differences to capture the relationship between feature pairs. In 2019, the RE2 architecture was introduced to overcome the complexity of previous models, which resulted in lengthy training processes. RE2 introduced three essential features for obtaining similarity between vector pairs: residual features, embeddings, and encodings [23]. This architecture was further developed and served as the basis for the ASIM model [26], designed to identify duplicate questions on Stack Overflow. ASIM modified specific components of RE2, such as the encoding, attention, and fusion parts, and introduced an additional encoding layer with BiLSTMs. Another architecture, Bi-ISCA [50], was employed for sarcasm detection in comments from Twitter and Reddit; its case study likewise involves social media and comment pairs. In 2017, the Transformer model introduced by Vaswani et al. [51] emerged as a state-of-the-art NLP architecture, serving as the foundation for subsequent models, including BERT, introduced in 2018 [52]. The remarkable generalization capability of BERT across various cases has allowed it to be applied to diverse classification problems, including paired input data. BERT has been extensively fine-tuned for other cases, resulting in numerous fine-tuned models accessible in the Huggingface repository. Unfortunately, BERT requires significant computational resources and training time, making it computationally expensive. Nevertheless, BERT has become a benchmark for other models due to its excellent performance in various scenarios. One advantage of BERT, based on the Transformer architecture, is that it only requires attention mechanisms, token embeddings, and positional embeddings, eliminating the need for pre-trained word embeddings such as Word2Vec [53], FastText [54], GloVe [55], and ELMo [56].
Indeed, the previously mentioned architectures may not be directly applicable to spam comment detection based on post context in the Indonesian language, mainly because those architectures were primarily designed for text similarity tasks. The main difference between the two cases is that in text similarity tasks (such as duplicate question detection or text inference), a significant portion of the text content in the data pairs is likely to overlap. When identifying spam comments using the context of a post, however, there is no assurance of any overlapping textual parts. This presents a unique challenge, since the existing reference models aim to identify similarities between vector pairs and manipulate them through various operations to obtain their similarity.
To the best of our knowledge, applying Siamese model architectures in spam comment detection based on post contexts is a novel approach that has not been extensively explored. We employed a novel approach to detect spam comments using post–comment data pairs to capture contextual information. This research was inspired by the ASIM and Bi-ISCA models, aiming to combine their strengths by adding an encoding feature at the final stage. This encoding feature utilizes the original encoding vectors from both the post and comment components to preserve their original features. This approach proves beneficial in handling cases where the vector pairs have no overlapping segments.
To highlight and state the differences between the previously studied Siamese-based deep learning models and the proposed EiAP-BC model, a comprehensive comparison is presented in Table 1. This table showcases various categories of differences that were considered in this research. These differences demonstrate the contributions made in this study, particularly in the development of the novel EiAP-BC model.
Table 1.
Category | EiAP-BC Model Architecture | ASIM Architecture | Bi-ISCA Architecture | RE2 Architecture | BERT Architecture for Pair Input
Main Layers | 8 (Input, Embedding, Encoding, Inter-Alignment, Integration, Second Encoding (+Original Encoding), Pooling and Concatenation, Prediction) | 8 (Input, Embedding, Encoding, Attention, Fusion, Matching Composition, Pooling, Prediction) | 5 (Input, Embedding, Encoding, Bi-ISCA, Integration, Prediction) | 7 (Input, Embedding, Encoder, Fusion, Alignment, Pooling, Prediction) | 7 (Input, Multi-head Self-attention, Dropout, Add and Norm, Feed-forward, Dropout, Add and Norm) in the Transformer encoder
Total No. of Blocks | – | 3 | – | 1–3 (as hyperparameters) | 12 Transformer encoder blocks
Total Training Parameters | 6.8 M | Not mentioned | 1.12 M | 2.8 M | 109.5 M
Epochs | 10 | 30 | 20 | 10 | 3
Input Layer | Siamese pair | Siamese pair | Siamese pair | Siamese pair | Input pair
Embedding Layer | Word embeddings (Fasttext, GloVe, Word2Vec, Emoji2Vec; average and concatenation scenarios) trained from the dataset | GloVe word embedding specific to Stack Overflow | Fasttext with dimension 30, trained from Twitter | Pre-trained GloVe 840B word embedding, 300 dimensions | Pre-trained mdhugol/indonesia-bert-sentiment-classification; token and positional embedding, max positional embedding = 512
Encoding Layer | Bi-LSTM pair with 128 units each, stacked | BiLSTM pair, each with 200 units | BiLSTM pair | 3 CNNs | BERT Tokenizer
Attention Layer | Inter-attention; similarity score using dot product; attention score using softmax between the similarity score of vector A and the corresponding position of vector B | Inter-attention; similarity score using dot product; attention score using softmax between the similarity scores of vector A and vector B | Intra-attention from the last cell state of the BiLSTM; inter-attention from the hidden state of the dot product between the final BiLSTM states | Inter-attention; similarity score using dot product; attention score using a weighted sum of the similarity scores between vector A and vector B | BERT attention mask
Prediction Layer | Concatenation of the original encoding, pooling output, difference between pooling outputs, and element-wise multiplication of pooling outputs | Concatenation of the pooling output, difference between pooling outputs, and element-wise multiplication of pooling outputs | Feed-forward neural network over the concatenated flattened CNN outputs | Multi-layer feed-forward network over the relation between the final vectors v1 and v2: [v1; v2; v1 − v2; v1 · v2] | Sigmoid
Fusion/Integration Layer | Feed-forward network over the concatenation of the initial vector with attention, the initial vector with the subtraction of the initial vector and attention, and the initial vector with the multiplication of the initial vector and attention, then passed to the second BiLSTM encoding layer | Feed-forward network over the concatenation of the initial vector with attention, the initial vector with the subtraction of the initial vector and attention, and the initial vector with the multiplication of the initial vector and attention | Four CNN layers with 64 filters each, taking the comment forward/backward inter-attention and the reply forward/backward inter-attention, followed by flattening of each CNN unit | Feed-forward network over the concatenation of the original vector with attention, the initial vector with the subtraction of the initial vector and attention, and the initial vector with the multiplication of the initial vector and attention | –
Emoji Features | Yes (emoji text and symbol) | – | – | – | Emoji symbols added to the Tokenizer layer
Auxiliary Features | Yes (19 selected auxiliary generated features) | – | – | – | –
Alignment/Matching Composition Layer | Original encoding features, previously aligned features, aligned features of contextual information, and second encoding features; matching composition using a two-layer CNN (filter: 64) | Attention in the alignment layer; matching composition using a BiLSTM pair | Matching composition using four CNN units from each forward and backward Bi-ISCA direction (filter: 64) | Initial point-wise features, previously aligned features, and context information; no matching composition layer | BERT model
Pooling Layer | Global max pooling | Max pooling | – | Max-over-time pooling | 128 units in accordance with the 12 self-attention heads
Dense Layer | Units: 128, 64, 32; activation: ReLU | Units: 150 | Units: 1920 | Units: 150; activation: GeLU | –
Dropout Layer | 0.2 | 0.2 | Not mentioned | 0.2 | 0.1
Datasets | Spam comments based on posting context on social media (SPAMID-PAIR) | Duplicate questions, Stack Overflow question answering | Sarcasm comment detection: SARC Reddit, FigLang 2020 Workshop | Common text-pair classification: SNLI, SciTail, Quora Question Pair, WikiQA | Spam comments based on posting context on social media (SPAMID-PAIR)
Results | SPAMID-PAIR: emoji symbol 87%, emoji text 88% | Stack Overflow: 96% | SARC Reddit: 75.7%, FigLang 2020 Workshop: 91.7% | SNLI: 88.9%, SciTail: 86%, Quora: 89.2% | SPAMID-PAIR: emoji symbol 88%, emoji text 89%
Training Time | 1 hour 55 minutes | Not mentioned | Not mentioned | Not mentioned | 3 hours 47 minutes
Model Size | 179 MB | Not mentioned | Not mentioned | Not mentioned | 396 MB
Vocab Size | 50,992 | Not mentioned | Not mentioned | Not mentioned | 30,525
References | Our proposed model | Pei et al. [26] | Mishra et al. [50] | Yang et al. [23] | The model uses Mdhugol
Table 1. Differences between the EiAP-BC Model and Similar Deep Learning Models

3 Proposed Method

This research uses specific steps, as can be seen in Figure 1, with detailed explanations as follows:
Fig. 1. Research methodology.

3.1 The Input Stage Using the SPAMID-PAIR Dataset

This research uses the SPAMID-PAIR dataset, which was created from social media platforms in the Indonesian language. The dataset comprises pairs of posts and comments that have been labeled as either “spam” or “not spam” based on the level of relevance between the comment and the contextual information of the post being commented on. The reasons for using this dataset are as follows: (1) Indonesia is a developing country whose national language is widely used, ranking fourth in the world in terms of language users; however, it is still categorized as an under-resourced language, resulting in limited studies and datasets available for research. (2) There is a lack of publicly available datasets designed explicitly for detecting spam comments based on the contextual information of the associated posts. The dataset was collected from 14 celebrity Instagram accounts with over 15 million followers, containing 72,874 pairs of data. Additionally, it covers a relevant social media use case, including a substantial amount of informal text and emoji symbols.

3.2 Data Exploration Stage and Scenario Creation

During the data exploration phase, the SPAMID-PAIR dataset was analyzed, including metadata that describes the dataset, such as the presence of emojis, word length, number of unique emojis, and more; this metadata can be used as supplementary information in the pre-processing stage [22]. The SPAMID-PAIR dataset exhibits an imbalanced class distribution: of the 72,874 pairs, 53,837 are labeled non-spam and 19,037 are labeled spam. This issue requires special handling during the training phase to ensure balanced representation and prevent bias toward the majority class. Based on the data exploration stage, this research uses three scenarios for training, validation, and testing. The first scenario includes the entire dataset text, both comment text only and post–comment texts as pairs, with emoji text and emoji symbols. The second scenario is the same as the first but adds auxiliary data from the dataset. The third scenario uses class weighting because the SPAMID-PAIR dataset is imbalanced.

3.3 Data Pre-processing Stage

Pre-processing is the stage before feature extraction and selection. The pre-processing steps are not extensive because deep learning models can process data directly by transforming tokens into vector features, which are then used as input to the model. The following pre-processing steps were performed: (1) reading the dataset from an Excel file, (2) removing rows with null data (if any), (3) performing case folding to convert all text to lowercase, and (4) removing unnecessary characters, specifically '!∃%&+,-./:;[]^`{|.' After these steps, the data is divided into train, validation, and test sets at a predetermined ratio: 65% for training, 15% for validation, and 20% for testing. In the emoji-text scenario, all emoji symbols are converted into text based on their descriptions from the Unicode list; for example, the grinning face emoji is converted into the text “grinning face.” In the emoji-symbol scenario, all emoji symbols are used as is, without conversion. Lastly, all 19 additional metadata features from the SPAMID-PAIR dataset are utilized in the auxiliary feature scenario.
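As an illustration, a minimal pre-processing sketch in Python is given below. The file name and column names (post, comment, label) are hypothetical, and the third-party emoji package's demojize function is assumed here as one way to approximate the Unicode-description conversion described above; the exact tooling used by the authors may differ.

```python
import re
import pandas as pd
import emoji  # third-party "emoji" package, assumed for the symbol-to-text step

# Characters listed in Section 3.3, kept as a regex character class
UNWANTED = r"[!∃%&+,\-./:;\[\]^`{|]"

def clean_text(text, emoji_as_text=True):
    """Case folding, optional emoji-to-text conversion, and removal of unwanted characters."""
    text = str(text).lower()
    if emoji_as_text:
        # Emoji-text scenario: e.g. the grinning face symbol becomes "grinning face"
        text = emoji.demojize(text, delimiters=(" ", " ")).replace("_", " ")
    text = re.sub(UNWANTED, " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical file and column names for the SPAMID-PAIR Excel file
df = pd.read_excel("spamid_pair.xlsx").dropna(subset=["post", "comment", "label"])
df["post_clean"] = df["post"].apply(clean_text)
df["comment_clean"] = df["comment"].apply(clean_text)
```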

3.4 Tokenization and Feature Selection Stage

In this stage, the prepared training data includes the following attributes: clean comment data, clean post data, label data, and auxiliary attribute data. The data is tokenized using the Keras Tokenizer library. The tokenization process generates a list of tokens created from the combined text of the dataset (comment and post texts). Once the list of tokens and their token indexes is formed, it is saved as a pickle file for future use in the web service. The process continues by applying the Tokenizer to convert the comment and post texts into sequential vectors. The comment text is transformed into a sequential vector of indices based on the predefined token list, and the same applies to the post text. These sequential vectors will serve as inputs to the EiAP-BC deep learning model. As the input vectors to this model must have a fixed length, both comment and post sequences need to be padded to ensure they have the same vector length. Padding is performed using the pad_sequences function on the previous sequences, with a maximum length of 600. The value of 600 is determined based on the statistics of the maximum number of words observed in the dataset during the data exploration process. The scenario used in this experiment is conducted according to the following:
(1)
The first scenario includes the entire dataset text, both comment text only and the combination of post and comment texts as pairs. The text features are transformed into sequence vectors per word using Tokenizer. In the EiAP-BC model, the paired comment and post vectors are specifically processed because the model is designed to accept paired input vectors in a Siamese architecture. These paired vectors will then proceed sequentially through the subsequent layers of the model for further processing and analysis.
(2)
The second scenario is the same as the first but includes additional vector features from the existing auxiliary data in the dataset. The additional features are processed into an Input layer with 19 vectors for each dataset row, as in [22]. The auxiliary features will be appended to the output of either the comment-only model or the post–comment pair model in the final stage of the trained model. They will then proceed to the feed-forward neural network layers, including dense layers, until the final prediction layer.
(3)
The third scenario involves class weighting for each training data because the SPAMID-PAIR dataset is imbalanced. With this weighting, the dataset becomes balanced and can be processed similarly to other scenarios. The text data used in this case includes emoji text and emoji symbols.
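Continuing the sketch above (reusing df, post_clean, and comment_clean), the snippet below illustrates the tokenization with the Keras Tokenizer, right-padding to the 600-token maximum, a stratified 65/15/20 split, and the class weighting of the third scenario. The <OOV> token, the split implementation, and binary 0/1 labels are assumptions, not details stated in the paper.

```python
import pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 600  # maximum word count found during data exploration

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(df["post_clean"].tolist() + df["comment_clean"].tolist())
with open("tokenizer.pkl", "wb") as f:   # saved for later reuse by the web service
    pickle.dump(tokenizer, f)

X_post = pad_sequences(tokenizer.texts_to_sequences(df["post_clean"]),
                       maxlen=MAX_LEN, padding="post")
X_comment = pad_sequences(tokenizer.texts_to_sequences(df["comment_clean"]),
                          maxlen=MAX_LEN, padding="post")
y = df["label"].to_numpy()

# 65% train / 15% validation / 20% test
idx_train, idx_rest = train_test_split(np.arange(len(y)), test_size=0.35,
                                       stratify=y, random_state=42)
idx_val, idx_test = train_test_split(idx_rest, test_size=0.20 / 0.35,
                                     stratify=y[idx_rest], random_state=42)

# Scenario 3: class weighting for the imbalanced labels
weights = compute_class_weight("balanced", classes=np.unique(y), y=y[idx_train])
class_weight = dict(zip(np.unique(y), weights))
```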

3.5 Embedding Layer Generation Stage

This study uses word embedding in the EiAP-BC model as weights in the Embedding layer, eliminating the need to train this layer again in the primary model. The types of word embeddings utilized as representations are Fasttext [57], Word2Vec [53], GloVe [58], and Emoji2Vec [59]. All word embeddings are built directly from the SPAMID-PAIR dataset. The embedding layer is not constructed using pre-trained word embeddings due to the different nature of the text data compared to formal text typically used in pre-training, such as Wikipedia or news websites. Specifically, Emoji2Vec embedding is only used in the emoji-symbol scenario, while the three other word embeddings are used for all scenarios. Detailed specifications of the word embedding architecture used in this study can be found in Table 2. Fasttext word embedding exhibits the best representation as it effectively handles out-of-vocabulary (OOV) words in the dataset, while GloVe and Word2Vec perform adequately (both are similar). Unlike the three other embeddings, Emoji2Vec can only represent emoji data. In the model, Emoji2Vec will be used in the embedding layer alongside the three other embedding vectors specifically for the emoji-symbol scenario.
Table 2.
Word EmbeddingParameter(s)Value (s)Information
Fasttextdim300Fasttext is built using the original Fasttext from www.fasttext.cc with dimensions of 300
 sg1Using skip-gram
 wordsNGram1 or 2N-gram
 minn3N-word minimal neighbor
 maxn6N-word maximal neighbor
 epoch30Number of epochs
 min_count1Minimal word occurrences
 lr0.01Learning rate
 outputBinaryResult output of Fasttext model in the form of binary (bin) file
Word2Vecvector_size300Word2Vec is built using Gensim Word2Vec with dimensions of 300
 window5Window context from the left and right from the current token position
 min_count1Minimal token occurrences
 sg1Using skip-gram
 epoch30Number of epochs
GloVevector_size300The glove is built using the original Glove https://nlp.stanford.edu/projects/glove/ with vector dimensions of 300
 memory5Memory usage
 vocab_min_count1Minimal word occurrences
 max_iter30Number of epochs
 window_size15Window context from the left and right from the current token position
 num_threads8Number of threads used by the system
Emoji2Vecout_dim300Emoji2vec is built from the original emoji2vec from the Github repository https://github.com/uclnlp/emoji2vec dengan dimensi 300
 dropout0No dropout
 learning0.01Learning rate
 max_epoch40Number of epochs
Word Embedding Layerdimension300Each dimension
dimension300Average dimension from both three embeddings or four embeddings
dimension900Dimension concat from Fasttext, Word2Vec, and Glove
dimension1200Dimension concat from Fasttext, Word2Vec, Glove, dan Emoji2Vec
Table 2. Word Embedding Layer Configuration
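To make the configuration in Table 2 concrete, the sketch below trains the Gensim Word2Vec variant and builds an embedding matrix aligned with the Keras Tokenizer from the earlier sketches. FastText, GloVe, and Emoji2Vec are trained with their own tools as listed in Table 2; their averaging or concatenation is indicated only in comments, and the final placeholder name emb_matrix is an assumption used by the later model sketches.

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [t.split() for t in df["post_clean"].tolist() + df["comment_clean"].tolist()]

# Word2Vec configuration from Table 2 (300 dimensions, skip-gram, 30 epochs)
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1, epochs=30)

vocab_size = len(tokenizer.word_index) + 1

def embedding_matrix(keyed_vectors, dim=300):
    """Row i holds the vector of the token with Tokenizer index i (zeros when missing)."""
    matrix = np.zeros((vocab_size, dim))
    for word, idx in tokenizer.word_index.items():
        if word in keyed_vectors:
            matrix[idx] = keyed_vectors[word]
    return matrix

w2v_matrix = embedding_matrix(w2v.wv)
# Combining with FastText/GloVe (and Emoji2Vec in the emoji-symbol scenario):
#   average: emb_matrix = (ft_matrix + w2v_matrix + glove_matrix) / 3                      -> 300 dims
#   concat:  emb_matrix = np.concatenate([ft_matrix, w2v_matrix, glove_matrix], axis=-1)   -> 900 dims
emb_matrix = w2v_matrix  # placeholder so the later model sketches use a single matrix
```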

3.6 Proposed EiAP-BC Model and The Training Stage

The model developed and proposed in this research is called the EiAP-BC model. To compare its performance with other models, several different case studies are also used to validate the proposed model, including the SNLI [32] study from Stanford in both English and Indonesian, IndoNLU Wrete, and SciTail [60], all of which involve pairs of texts/sentences as input. To obtain the context between comments and their related posts, a deep learning model is required that can (1) store alignment features from each component in the previous layer (residual vectors), (2) retain the original features of each component (embedding vectors), (3) capture the context of each component (encoded vectors), (4) extract essential features from each component using multi-alignment, (5) assign weights to the relevant vector components using attention mechanisms, (6) capture the intent of emojis in symbol form, (7) leverage auxiliary features to enrich the training features, and (8) use a simpler architecture to expedite and optimize training without compromising capability.
This study presents a novel approach for contextually detecting spam comments. Our proposed model, EiAP-BC, leverages contextual information from related posts. The architecture combines the advantages of the BiLSTM-CNN, ASIM, and Bi-ISCA architectures, incorporating attention mechanisms to handle emoji context effectively. Later sections evaluate the proposed model's performance on the SPAMID-PAIR dataset, examine the functionality of each component, explore the utilization of word embeddings, and investigate the incorporation of auxiliary features in the emoji-text and emoji-symbol scenarios. The EiAP-BC model consists of the following layers:
(1)
Input Layer. The input layer receives vector sequences of the post and comment components. The input layer represents a sequential token vector of input data based on the result of the Tokenizer, which has been right-padded to maintain a consistent size. This input layer for posting (P) is defined in Equation (1) as
\begin{equation}P = {{x}_i}\ldots {{x}_m}\end{equation}
(1)
where x denotes a posting token, and m represents the sequence length of the posting, with a maximum length of 600 in this experiment. The input layer for comment (K) is defined in Equation (2) as
\begin{equation} K = {{y}_i}\ldots {{y}_n} \end{equation}
(2)
where y denotes a comment token, and n represents the sequence length of the comment, also with a maximum length of 600.
(2)
Embedding Layers. The embedding layer for the posting input is called EP, and the comment input is called EK. The EP layer is defined using Equation (3) for the posting and Equation (4) for the comment:
\begin{equation}EP = {{w}_i}\ldots {{w}_m}\end{equation}
(3)
where \({{w}_m} \in {{\mathcal{R}}^d}\) is the embedding of a posting token, and m represents the number of posting vectors in d embedding dimensions, in this case 300 dimensions. The embedding vectors of EP come from Fasttext, Word2Vec, and GloVe pre-trained on the dataset, combined by concatenation or averaging. The comment part (EK) is defined as
\begin{equation} EK = {{w}_j}\ldots {{w}_n} \end{equation}
(4)
where \({{w}_n} \in {{\mathcal{R}}^d}\) is the embedding of a comment token, and n represents the number of comment vectors in d dimensions, in this case 300 dimensions. Like EP, the embedding vectors of EK are derived from Fasttext, Word2Vec, and GloVe pre-trained on the dataset, combined by concatenation or averaging. In the emoji-symbol scenario, the Emoji2Vec vector is also added to the embedding vectors (by concatenation or averaging). After the embedding layer is generated, the original embedding for the posting (OP) and the original embedding for the comment (OK) are saved for further processing. They are defined in Equations (5) and (6):
\begin{equation} OP = EP \\ \end{equation}
(5)
\begin{equation} OK = EK \end{equation}
(6)
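A minimal Keras sketch of the input and embedding layers is shown below, reusing MAX_LEN, vocab_size, and emb_matrix from the earlier sketches. A single shared (Siamese) embedding layer is assumed, and freezing its weights reflects the statement that the embedding layer is not retrained.

```python
from tensorflow.keras import layers, Input

post_in = Input(shape=(MAX_LEN,), name="posting")     # P = x_1 ... x_m, Equation (1)
comment_in = Input(shape=(MAX_LEN,), name="comment")  # K = y_1 ... y_n, Equation (2)

# Shared embedding layer initialised with the averaged or concatenated pre-trained vectors
embedding = layers.Embedding(input_dim=vocab_size,
                             output_dim=emb_matrix.shape[1],
                             weights=[emb_matrix],
                             trainable=False)
EP = embedding(post_in)       # Equation (3)
EK = embedding(comment_in)    # Equation (4)
OP, OK = EP, EK               # Equations (5) and (6): the original embeddings are kept for later
```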
(3)
The Features Encoding Layer. This model uses a Bi-LSTM [61] as the feature encoding layer for the posting (Bi_Posting) and the comment (Bi_Comment). This layer consists of two LSTM [62] units that move forward and backward in parallel and are combined into a single BiLSTM unit.
\begin{equation}\overrightarrow {{{h}_i}} = LSTM\big( {\overrightarrow {{{h}_{i - 1}}} ,{{w}_i}} \big){{\forall }_i} \in \left[ {1,\ldots ,m} \right]\end{equation}
(7)
In Equation (7), \(\overrightarrow {{{h}_i}} \) denotes the forward hidden state at time step i, computed from the previous forward hidden state \(\overrightarrow {{{h}_{i - 1}}} \) and the posting input \({{w}_i}\) at time step i.
\begin{equation} \overleftarrow {{{h}_i}} = LSTM\big( {\overleftarrow {{{h}_{i - 1}}} ,{{w}_i}} \big){{\forall }_i} \in \left[ {1,\ldots ,m} \right] \end{equation}
(8)
Similarly, in Equation (8), \(\overleftarrow {{{h}_i}} \) denotes the backward hidden state at time step i, computed from \(\overleftarrow {{{h}_{i - 1}}} \) and the posting input \({{w}_i}\). These two parts are then combined to obtain the contextual representation of the posting token wi, as seen in Equation (9):
\begin{equation} {{X}_i} = \big[ {\overrightarrow {{{h}_i}} ,\ \ \overleftarrow {{{h}_i}} } \big] \end{equation}
(9)
The same notation Yi is used for comments, \({{Y}_i}\ {{\forall }_i} \in [ {1,\ldots ,n} ]\). The formulas for comments are omitted because they are identical. After that, this model concatenates the embedding and encoding layers for posting (CP) and comment (CK). This concatenation combines the output of the embedding layer with the output of the corresponding Bi-LSTM layer. The definitions are defined in Equations (10) and (11):
\begin{equation} CP = EP \oplus X \\ \end{equation}
(10)
\begin{equation} CK = EK \oplus Y \end{equation}
(11)
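Continuing the sketch, the first encoding layer and the embedding–encoding concatenation of Equations (7)–(11) can be written as follows; 128 units per direction and weight sharing between the two branches are assumptions consistent with Table 1.

```python
# First Bi-LSTM encoder, shared between the posting and comment branches
bilstm_1 = layers.Bidirectional(layers.LSTM(128, return_sequences=True))
X = bilstm_1(EP)                        # contextual posting representation, Equation (9)
Y = bilstm_1(EK)                        # contextual comment representation
CP = layers.Concatenate()([EP, X])      # Equation (10)
CK = layers.Concatenate()([EK, Y])      # Equation (11)
```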
(4)
Inter-alignment Layer. This layer is created using the concatenated output and the post–comment attention matrix, called the inter-attention matrix. The inter-attention matrix calculates the attention weights between the concatenated output and the post–comment components. The inter-attention matrix can be calculated using Equation (12):
\begin{equation}{{E}_{i,j}} = X_i^T. {{Y}_j}\end{equation}
(12)
where \(E \in {{\mathbb{R}}^{m \times n}}\), with m being the sequence length of the post outputs, n being the sequence length of the comment outputs, and the dot sign representing the inner-product operation. Inter-attention is based on the Bahdanau attention concept [63]. The softmax function normalizes the attention weights to ensure they sum to 1. The resulting inter-attention matrix represents the importance or relevance of each element in the concatenated output with respect to the post–comment components. The inter-attention vectors can be computed using Equations (13) and (14):
\begin{equation} {{e}_i} = softmax\left( {{{E}_{i,\cdot}}} \right) \\ \end{equation}
(13)
\begin{equation} {{e}_j} = softmax\left( {{{E}_{\cdot,j}}} \right) \end{equation}
(14)
Furthermore, the contextual features are created using Equation (15) for posting and Equation (16) for comments.
\begin{equation} {{F}_i} = {{Y}^T}. {{e}_i} \\ \end{equation}
(15)
\begin{equation} {{F}_j} = {{X}^T}. {{e}_j} \end{equation}
(16)
The dot product operation computes the weighted sum of the concatenated output elements based on the attention weights. The resulting Fi and Fj vectors represent the context-aware features for the post and comment components, respectively, incorporating the importance assigned to each element by the attention mechanism.
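A sketch of the inter-attention computation of Equations (12)–(16) using standard Keras layers follows. It operates on the Bi-LSTM outputs X and Y from the previous sketch (the concatenated CP/CK vectors could be substituted), and the softmax axes reflect the row and column normalization described above.

```python
# Similarity matrix E[i, j] = X_i . Y_j with shape (batch, m, n), Equation (12)
E = layers.Dot(axes=(2, 2))([X, Y])

e_post = layers.Softmax(axis=-1)(E)     # e_i: normalise each posting row over comment tokens
e_comment = layers.Softmax(axis=1)(E)   # e_j: normalise each comment column over posting tokens

F_post = layers.Dot(axes=(2, 1))([e_post, Y])        # F_i, Equation (15): comment-aware posting features
F_comment = layers.Dot(axes=(1, 1))([e_comment, X])  # F_j, Equation (16): posting-aware comment features
```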
(5)
The Integration Layer. This model creates three features to prepare the integration layer: the original concatenated, subtraction concatenated, and multiplication concatenated features. The original concatenated feature concatenates Fi and CP (written as Xi) for the post component (Oi). It concatenates Fj and CK (written as Yj) for the comment component (Oj). The formulas for the original concatenated feature can be seen in Equations (17) and (18):
\begin{equation} {{O}_i} = {{X}_i} \oplus {{F}_i} \\ \end{equation}
(17)
\begin{equation} {{O}_j} = {{Y}_j} \oplus {{F}_j} \end{equation}
(18)
The concatenation operation combines the context-aware features (Fi and Fj) with the original embedding and Bi-LSTM outputs (CP (Xi) and CK (Yj)) to create the concatenated feature vectors Oi and Oj. This feature captures a comprehensive representation of the post and comment components by integrating the context-aware features with the original information from the preceding layers. The second feature, the subtraction concatenated feature, concatenates Xi with the difference between Xi and Fi for the posting (Si), and Yj with the difference between Yj and Fj for the comment (Sj). The formulas are explicitly defined in Equations (19) and (20):
\begin{equation} {{S}_i} = {{X}_i} \oplus ({{X}_i} - {{F}_i}) \\ \end{equation}
(19)
\begin{equation} {{S}_j} = {{Y}_j} \oplus ({{Y}_j} - {{F}_j}) \end{equation}
(20)
Within this feature, the concatenated feature vectors (Xi and Yj) are enhanced by subtracting the context-aware features (Fi and Fj) from the corresponding Bi-LSTM outputs. The subtraction facilitates extracting precise information by comparing the original and context-aware features, enabling the model to focus on the unique characteristics of each component. The resulting subtraction concatenated feature vectors Si and Sj incorporate both the original data and the context-aware differences. The last feature, the multiplication concatenated feature, concatenates Xi with the element-wise product of Xi and Fi for the post component (Mi) and Yj with the element-wise product of Yj and Fj for the comment (Mj). The formulas can be found in Equations (21) and (22):
\begin{equation} {{M}_i} = {{X}_i} \oplus ({{X}_i}\, o \, {{F}_i}) \\ \end{equation}
(21)
\begin{equation} {{M}_j} = {{Y}_j} \oplus ({{Y}_j}\, o \, {{F}_j}) \end{equation}
(22)
where o is element-wise-multiplication. In this feature, the original concatenated feature vectors (Xi and Yj) are multiplied element-wise with the context-aware features to capture their informative interactions. This multiplication operation helps highlight the essential features and their relationships within each component. The resulting concatenated feature vectors Mi and Mj incorporate the original information and the enriched interactions between the components.
After that, this model passes the output to the dense layer for Oi and Oj, Si, Sj, and Mi and Mj. The formulas for the dense layer can be seen in Equations (23) and (24):
\begin{equation} X_i^1 = {{F}^1}\left({{{O}_i}} \right){\rm{\ and\ }}X_i^2 = {{F}^2}\left( {{{S}_i}} \right){\rm{\ and\ }}X_i^3 = {{F}^3}\left( {{{M}_i}} \right) \end{equation}
(23)
for posting, and F is a dense layer (FFNN).
\begin{equation} Y_j^1 = {{F}^1}\left( {{{O}_j}} \right){\rm{\ and\ }}Y_j^2 = {{F}^2}\left( {{{S}_j}} \right){\rm{\ and}}\ Y_j^3 = {{F}^3}\left( {{{M}_j}} \right) \end{equation}
(24)
for comment, and F is a dense layer (FFNN). This dense layer applies a dense transformation to the concatenated feature vectors Oi, Oj, Si, Sj, Mi, and Mj. The dense layer consists of trainable weights and biases, and it maps the input vectors to a higher-dimensional space by performing a linear transformation followed by an activation function. The resulting vectors capture the higher-level representations and extract more complex features from the concatenated inputs. The unit parameter represents the dimensionality of the output space.
Finally, the integration layer is created by the concatenation of dense Orig_Posting (\(X_i^1)\), Sub_Posting (\(X_i^2)\), and Multiply_Posting (\(X_i^3)\) and concatenation of Dense Orig_Comment (\(Y_j^1)\), Sub_Comment (\(Y_j^2\)), and Multiply_Comment (\(Y_j^3\)). The formulas for this part can be seen in Equations (25) and (26):
\begin{equation} C{{o}_i} = \left[ {X_i^1 \oplus X_i^2 \oplus X_i^3} \right] \\ \end{equation}
(25)
\begin{equation} C{{o}_j} = \left[ {Y_j^1 \oplus Y_j^2 \oplus Y_j^3} \right] \end{equation}
(26)
In this layer, the outputs of the dense layers for Orig_Posting, Sub_Posting, and Multiply_Posting are concatenated to form the vector Coi. Similarly, the outputs of the dense layers for Orig_Comment, Sub_Comment, and Multiply_Comment are concatenated to form the vector Coj. The integration layer combines the vectors element-wise along the specified axis (default is –1) to create a single concatenated vector. This layer enables the model to capture and combine the different representations obtained from the dense layers, incorporating various features and their interactions into the final representation for further processing.
This model also passes the last output to the dense and dropout layers for the posting and comment sections. The formulas for this part can be seen in Equations (27) and (28):
\begin{equation} C{{o}_i} = F\left( {C{{o}_i}} \right) \\ \end{equation}
(27)
\begin{equation} C{{o}_j} = F\left( {C{{o}_j}} \right) \end{equation}
(28)
where F is a dense and dropout layer. Dropout [64] is used to prevent overfitting of the model. Coi and Coj are passed through a dense layer with a predetermined number of units (dense_units) and a rectified linear unit (ReLU) activation function. The dense layer applies a linear transformation to the input vectors, followed by the activation function, which introduces nonlinearity. A dropout layer is then applied to the output of the dense layer to mitigate overfitting and enhance the model's capacity for generalization.
Finally, the integration layer concatenates the output of the dropout and dense posting (Coi) layer with the original posting embedding (OPi). Similarly, a concatenation layer connects the output of the dropout and dense comment (Coj) layer with the original comment embedding (OKj). The formulas for this layer can be seen in Equations (29) and (30):
\begin{equation} {{\tilde{X}}_i} = C{{o}_{i }} \oplus O{{P}_i} \\ \end{equation}
(29)
\begin{equation} {{\tilde{Y}}_j} = C{{o}_{j }} \oplus O{{K}_j} \end{equation}
(30)
Within this layer, the output of the dropout and dense layer for the posting section (Coi) is combined with the original posting embedding (OPi) using concatenation. Similarly, the output of the dropout and dense layers for the comment section (Coj) is combined with the original comment embedding (OKj). The concatenation merges the features acquired from the dropout and dense layers with the original features, enabling the model to capture both the enhanced representations from the dense layers and the unprocessed information from the original embedding layers. By performing the concatenation, the model can leverage the transformed features obtained from the dropout and dense layers together with the original features to make more informed predictions and capture comprehensive information from both sections.
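The integration layer of Equations (17)–(30) can be sketched as a small helper applied to both branches. Sharing the dense layers between the posting and comment branches, 128 units, ReLU activations, and a 0.2 dropout rate are assumptions consistent with the description above and Table 1.

```python
# Shared dense layers for the original (O), subtraction (S), and multiplication (M) features
dense_o = layers.Dense(128, activation="relu")
dense_s = layers.Dense(128, activation="relu")
dense_m = layers.Dense(128, activation="relu")
dense_co = layers.Dense(128, activation="relu")
drop = layers.Dropout(0.2)

def integrate(enc, F, original_emb):
    """Equations (17)-(30): build O, S, M, pass them through dense layers,
    concatenate, apply dense + dropout, and re-attach the original embedding."""
    o = dense_o(layers.Concatenate()([enc, F]))                              # Eqs. (17)/(18), (23)/(24)
    s = dense_s(layers.Concatenate()([enc, layers.Subtract()([enc, F])]))    # Eqs. (19)/(20)
    m = dense_m(layers.Concatenate()([enc, layers.Multiply()([enc, F])]))    # Eqs. (21)/(22)
    co = drop(dense_co(layers.Concatenate()([o, s, m])))                     # Eqs. (25)-(28)
    return layers.Concatenate()([co, original_emb])                          # Eqs. (29)/(30)

X_tilde = integrate(X, F_post, OP)        # posting branch
Y_tilde = integrate(Y, F_comment, OK)     # comment branch
```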
(6)
The Second Feature Encoding Layer. This layer is used to obtain the latest context for the posting and comment sections based on the previous outputs. The formulas for this layer can be seen in Equations (31) and (32):
\begin{equation} V_i^x = BiLSTM \big( {{{{\tilde{X}}}_i},i} \big)\forall i \in \left[ {1,\ldots ,m} \right] \\ \end{equation}
(31)
\begin{equation} V_j^y = BiLSTM \big( {{{{\tilde{Y}}}_j},j} \big)\forall j \in \left[ {1,\ldots ,n} \right]\end{equation}
(32)
The LSTM layer captures the temporal dependencies and relationships within the input sequences. By applying the LSTM layer to the concatenated outputs, the model can refine the representations and capture the contextual information specific to the posting and comment sections. This layer helps capture the nuanced meanings and dependencies between different parts of the input data. The 1D convolutional layer is used to obtain the final features for the posting (\(V_i^x\)) and comment (\(V_j^y\)). This model uses the same variable names for convenience. The formulas for this layer can be seen in Equations (33) and (34):
\begin{equation} V_i^x = CNN1D\left( {V_i^X,i} \right)\forall i \in \left[ {1,\ldots ,m} \right] \\ \end{equation}
(33)
\begin{equation} V_j^y = CNN1D\left( {V_j^Y,j} \right)\forall j \in \left[ {1,\ldots ,n} \right] \end{equation}
(34)
In this layer, the encoded representations from the previous layer are passed through 1D convolutional filters. The Conv1D layer performs convolution operations on the input sequences, enabling the identification of localized patterns and the extraction of relevant features. The number of filters employed in the convolution procedure directly influences the number of feature maps produced. Each filter detects specific patterns or features in the input sequences. The kernel_size represents the size of the convolution sliding window. By employing the Conv1D layer, the model can learn abstract features and higher-level representations from the encoded sequences. The convolution operation enables the identification of spatial dependencies and essential patterns within the input data.
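The second feature encoding layer of Equations (31)–(34) then re-encodes each branch with a second Bi-LSTM followed by a 1D convolution. The kernel size is not stated in the paper and is set to 3 here as an assumption, with 64 filters as in Table 1.

```python
# Second Bi-LSTM encoder and 1D convolution, shared between branches
bilstm_2 = layers.Bidirectional(layers.LSTM(128, return_sequences=True))
conv_2 = layers.Conv1D(filters=64, kernel_size=3, padding="same", activation="relu")

V_post = conv_2(bilstm_2(X_tilde))      # Equations (31) and (33)
V_comment = conv_2(bilstm_2(Y_tilde))   # Equations (32) and (34)
```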
(7)
The Pooling, Concatenation, and Dense Layers. The pooling layer is used to extract the most pertinent features from the convolutional outputs. This model uses a global max pooling formula, as can be seen in Equations (35) and (36):
\begin{equation} V_{max}^x = Gmax\left( {V_i^X,i} \right)\forall i \in \left[ {1,\ldots ,m} \right] \\ \end{equation}
(35)
\begin{equation} V_{max}^y = Gmax\left( {V_j^Y,j} \right)\forall j \in \left[ {1,\ldots ,n} \right]\end{equation}
(36)
The GlobalMaxPooling operation selects the maximum value across each feature map, effectively capturing the most salient features. Using global max pooling, the layer reduces the dimensionality of the feature maps while preserving the essential data. It summarizes the most discriminative features in the input sequences by aggregating the highest activation values for each feature. This model passes the original embedding into the 1D CNN to derive the final features from the original encoding sections that have not been subjected to inter-attention. The utilized formulas are identical to Equations (33) and (34), but they are applied to the initial variables X and Y. These layers' formulations can be found in Equations (37) and (38):
\begin{equation} V_i^{xori} = CNN1D \big( {X_i^{},i} \big)\forall i \in \left[ {1,\ldots ,m} \right] \\ \end{equation}
(37)
\begin{equation} V_j^{yori} = CNN1D \big({Y_j^{},j} \big)\forall j \in \left[ {1,\ldots ,n} \right] \end{equation}
(38)
Applying the CNN to the initial encoding variables makes it possible to derive additional features that capture local patterns and information that the inter-attention mechanism may not. This approach enables the model to consider the unique characteristics of the posting and comment sections before executing the inter-attention operation.
The concatenation layer combines the GlobalMaxPooling outputs of the posting and comment from the preceding layers with the original encodings of the posting and comment. This layer addresses the case in which the posting and the comment share no common components. The formula can be found in Equation (39):
\begin{equation} V = F\left( {V_{max}^X \oplus V_{max}^Y \oplus \left( {V_{max}^X - V_{max }^y} \right) \oplus \left( {V_{max}^X{\rm{*\ }}V_{max }^y} \right) \oplus {{X}_i} \oplus {{Y}_j}} \right) \end{equation}
(39)
The model can handle cases without shared parts by incorporating the original encoding of the posting and comment. This technique enables the model to capitalize on the unique characteristics of each section while taking into account the extracted features of the preceding layers. The concatenation and final dense layer enable the model to capture an exhaustive representation of the input and make a prediction using the combined information from the posting and comment sections.
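A sketch of the pooling and concatenation stage of Equations (35)–(39) is given below. Pooling the original-encoding branch as well is an assumption made here so that all concatenated tensors are fixed-length vectors; the subtraction and element-wise product mirror Equation (39).

```python
gmp = layers.GlobalMaxPooling1D()
v_post, v_comment = gmp(V_post), gmp(V_comment)           # Equations (35) and (36)

# Original-encoding branch: CNN over the first-encoder outputs, Equations (37) and (38)
conv_ori = layers.Conv1D(filters=64, kernel_size=3, padding="same", activation="relu")
v_post_ori, v_comment_ori = gmp(conv_ori(X)), gmp(conv_ori(Y))

merged = layers.Concatenate()([                            # Equation (39)
    v_post, v_comment,
    layers.Subtract()([v_post, v_comment]),
    layers.Multiply()([v_post, v_comment]),
    v_post_ori, v_comment_ori,
])
```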
(8)
Prediction Layer. This layer uses the binary_crossentropy loss function, sigmoid activation function, and Adam optimizer [65]. The output of this model is an array with the same number of rows as the input, containing probabilities between 0 and 1. The default threshold of 0.5 determines spam, but the best threshold can be determined through hyperparameter tuning.
The architecture of the EiAP-BC model can be seen in detail in Figure 2. The architecture receives a pair of inputs: the comment and post parts. The shape/dimension of each input part is determined by the maximum token count of the comment and post texts. After the input, the next layer is the embedding layer, which has an output of 300 dimensions. This layer can use either token embeddings or word embeddings previously trained with Fasttext, Word2Vec, GloVe, and Emoji2Vec. The next layer is the Bidirectional LSTM, which serves as an encoding layer to capture the context of each input part. The concatenation layer between the embedding and BiLSTM layers combines the outputs of the BiLSTM and embedding layers. The element-wise multiplication layer maintains the same output size as the initial input vector. The concatenation of the element-wise multiplication and subtraction layers yields an output whose size is the sum of the sizes of the previous layers.
Fig. 2. Model architecture of EiAP-BC.
The dense layers in the integration stage use 128 units each, so the concatenation of the three sections produces a threefold output. The final concatenation layer, combining the dense output with the original embedding, yields an output size equal to the sum of the embedding dimension and the dense output. The next layer is the second BiLSTM, which also uses 128 units; the concatenation of the second BiLSTM output and the previous concatenation result has an output size equal to their sum. The output size of the convolutional and GlobalMaxPooling layers is 64 each. Lastly, the final dense stack consists of 128, 64, and 32 units, with an output size of 1 for prediction. The model uses the binary cross-entropy loss function and the sigmoid activation function.
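Putting the pieces together, a minimal end-of-model sketch with the final dense stack, sigmoid output, and a training call is shown below, reusing names from the earlier sketches. The batch size is an assumption, and class_weight is passed only in the class-weighting scenario.

```python
from tensorflow.keras import Model

x = merged
for units in (128, 64, 32):                            # final dense stack with ReLU activations
    x = layers.Dense(units, activation="relu")(x)
output = layers.Dense(1, activation="sigmoid")(x)      # spam probability

eiap_bc = Model(inputs=[post_in, comment_in], outputs=output)
eiap_bc.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
eiap_bc.fit([X_post[idx_train], X_comment[idx_train]], y[idx_train],
            validation_data=([X_post[idx_val], X_comment[idx_val]], y[idx_val]),
            epochs=10, batch_size=64,                  # batch size is an assumption
            class_weight=class_weight)                 # only in the third scenario
```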

3.7 Evaluation Stage

In the evaluation phase, the evaluation metrics for the deep learning model are prepared as determined in the research methodology. The evaluation measures encompass accuracy, precision, F1-score, and the decision threshold. These metrics are stored in a Python dictionary so that all metrics used in model creation, training, evaluation, and testing can be stored and accessed conveniently. The F1-score is essential because the SPAMID-PAIR dataset is imbalanced. Binary cross-entropy is used as the loss function, sigmoid as the final activation function, and Adam as the optimizer for the prediction model. These choices reflect the study's focus on a binary classification task distinguishing between the spam and non-spam classes.
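A minimal evaluation sketch, storing the metrics in a dictionary as described above and applying a tunable decision threshold, might look as follows; the 0.5 default threshold and macro averaging follow the text, and variable names continue the earlier sketches.

```python
from sklearn.metrics import accuracy_score, precision_score, f1_score

probs = eiap_bc.predict([X_post[idx_test], X_comment[idx_test]]).ravel()
threshold = 0.5                                    # default; the best value is found by tuning
preds = (probs >= threshold).astype(int)

metrics = {                                        # metrics kept in a Python dictionary
    "accuracy": accuracy_score(y[idx_test], preds),
    "precision": precision_score(y[idx_test], preds, average="macro"),
    "f1": f1_score(y[idx_test], preds, average="macro"),
    "threshold": threshold,
}
print(metrics)
```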

4 Results and Discussion

Based on the evaluation plan and scenarios, this section discusses the performance comparison of the EiAP-BC model and other models on the SPAMID-PAIR dataset, the performance of the EiAP-BC model with additional auxiliary features, the generalization capability of the EiAP-BC model on different case studies and datasets within the sentence-pair classification scope, and the ablation study measuring the contribution of each layer and its effect on system performance. The evaluation uses accuracy and F1-score in most testing phases, plus the best threshold and macro average precision in the ablation study section. Based on the overall testing results, the performance of the EiAP-BC model on the SPAMID-PAIR dataset is excellent: it competes closely with the Google BERT model, with a difference of less than 1%. Table 3 shows the benchmark models used for comparison, including the regular LSTM Pair model using the concatenation of three word embeddings, the BiLSTM model with attention and the average of three word embeddings, and the LSTM-CNN model with the average of three embeddings.
Table 3.
Model Name | Accuracy (%) | Macro Average F1-score (%) | Best Threshold
Emoji-text Scenario
LSTM Pair 3 Emb Concatenation | 87.15 (≈87) | 82.72 (≈83) | 0.8
BiLSTM Pair Att 3 Embedding Avg | 87.23 (≈87) | 83.17 (≈83) | 0.73
LSTM-CNN Pair 3 Embedding Avg | 87.48 (≈87) | 83.27 (≈83) | 0.74
EiAP-BC using Token | 85.15 (≈85) | 80.06 (≈80) | 0.78
EiAP-BC 3-Embedding Concat* | 87.99 (≈88) | 83.58 (≈84) | 0.59
EiAP-BC 3-Embedding Avg | 87.94 (≈88) | 83.32 (≈83) | 0.78
EiAP-BC using BERT Embedding | 84.62 (≈85) | 79.07 (≈79) | 0.63
Fine-tuned BERT using Mdhugol | 88.07 (≈88) | 83.78 (≈84) | 0.50
Emoji-symbol Scenario
LSTM Pair 4 Emb Concatenation | 85.02 (≈85) | 78.92 (≈79) | 0.84
BiLSTM Pair 4 Embedding Avg | 84.77 (≈85) | 78.08 (≈78) | 0.68
LSTM-CNN Pair 4 Embedding Avg | 84.4 (≈84) | 78.06 (≈78) | 0.69
EiAP-BC using Token | 82.65 (≈83) | 77.68 (≈78) | 0.89
EiAP-BC 4-Embedding Concat | 85.35 (≈85) | 80.22 (≈80) | 0.88
EiAP-BC 4-Embedding Avg* | 85.94 (≈86) | 80.20 (≈80) | 0.77
EiAP-BC using BERT Embedding | 84.41 (≈84) | 77.01 (≈77) | 0.65
Fine-tuned BERT using Mdhugol | 86.94 (≈87) | 83.78 (≈84) | 0.5
Table 3. Accuracy and Macro Average F1-score Result of EiAP-BC Model and Other Models Using SPAMID-PAIR (Weighted)
The EiAP-BC model was also tested without using pre-trained word embeddings, and the results were not as good. However, in scenarios where the EiAP-BC model utilizes concatenation and averaging of word embeddings, the performance is significantly better and outperforms other scenarios. Interestingly, the performance of word embeddings built with BERT embeddings was inferior. Lastly, when comparing the performance of the EiAP-BC model with fine-tuned BERT from the Mdhugol study on sentiment analysis in the Indonesian language, the EiAP-BC model only has a 1% difference. The EiAP-BC model was also tested with scenarios involving text and symbol emojis. Based on the results, text emojis showed superior performance as they do not contain many symbols that may fail to be adequately trained in pre-trained word embeddings.
EiAP-BC utilizes three pre-trained word embeddings (FastText, Word2Vec, and GloVe) for the emoji-text scenario and four (FastText, Word2Vec, GloVe, and Emoji2Vec) for the emoji-symbol scenario. As observed in Table 3, concatenation of pre-trained embeddings proves more effective in the emoji-text scenario, while averaging performs better in the emoji-symbol scenario. This is likely because, in the text scenario, the richness of word vectors from the various embeddings improves performance through concatenation, providing a more comprehensive set of embedding vectors, whereas in the emoji-symbol scenario the diversity of different emoji vectors fails to capture the intended meaning adequately. Table 4 examines the effectiveness of the Emoji2Vec embedding in the emoji-symbol scenario: adding Emoji2Vec slightly lowers both accuracy and F1-score compared with using only the three text embeddings.
Table 4.
Model | Emoji-symbol Scenario and Dimension | Accuracy (%) | F1-score (%)
EiAP-BC 4 Embedding Concat | With Emoji2Vec embedding, concatenation (1,200 dimensions) | 85.46 | 79.63
EiAP-BC 3 Embedding Concat | Without Emoji2Vec embedding, concatenation (900 dimensions) | 85.56 | 80.35
Difference | | +0.1 | +0.72
EiAP-BC 4 Embedding Average | With Emoji2Vec embedding, average (300 dimensions) | 85.58 | 79.72
EiAP-BC 3 Embedding Average | Without Emoji2Vec embedding, average (300 dimensions) | 86.02 | 81.12
Difference | | +0.44 | +1.14
Table 4. Effectiveness of Emoji2Vec Embedding on EiAP-BC Emoji-symbol Scenario
The authors hypothesize that the following reasons contribute to this outcome: (1) The differences among emoji symbols are often not semantic but visual, reflecting how platforms such as Android, Windows, and Samsung render the same emoji, which complicates training. (2) The Word2Vec, GloVe, and FastText vectors are trained on mixed data containing both text and emoji symbols, so combining them with Emoji2Vec, which predominantly covers emoticons rather than textual elements, blurs the semantics. Consequently, the averaging technique, which enforces consistency across all four vectors, reduces ambiguity during model training. However, Emoji2Vec performs best when used on its own as the word embedding for detecting spam comments that consist entirely of symbols, because it is specifically trained on emoji symbols alone.
The EiAP-BC model was trained on the SPAMID-PAIR dataset, which underwent an exploratory data analysis (EDA) beforehand, and learns from social media data through paired posts and comments. Because the embedding, LSTM, and CNN layers require fixed-size inputs, the model cannot handle variable-length text directly. Based on the SPAMID-PAIR statistics, a maximum text length of 650 characters was set, and padding and truncation are applied: if an input is shorter than 650 characters it is padded, whereas if it exceeds 650 characters it is truncated. This ensures that all input sequences have the same length and are compatible with the fixed-size input requirements of the neural network layers.
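A minimal sketch of this fixed-length preparation is shown below, assuming word-level tokenization with Keras utilities; note that the paper states the 650 limit in characters, and the choice of post-padding/post-truncation here is our assumption rather than a detail taken from the study.

# Hedged sketch: pad/truncate tokenized posts and comments to a fixed length
# so they fit the fixed-size embedding, BiLSTM, and CNN inputs.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 650  # limit derived from the SPAMID-PAIR EDA (characters in the paper)
posts = ["Liburan dengan keluarga sungguh membahagiakan"]   # toy examples
comments = ["kakak cantik sekali, selamat berlibur"]

tok = Tokenizer(oov_token="<unk>")
tok.fit_on_texts(posts + comments)
post_ids = pad_sequences(tok.texts_to_sequences(posts), maxlen=MAX_LEN, padding="post", truncating="post")
comment_ids = pad_sequences(tok.texts_to_sequences(comments), maxlen=MAX_LEN, padding="post", truncating="post")
print(post_ids.shape, comment_ids.shape)   # (1, 650) (1, 650)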
The EiAP-BC model employs a combination of three pre-trained word embeddings for the emoji-text scenario and four pre-trained word embeddings for the emoji-symbol scenario, both trained on the SpamID-Pair dataset. These combinations are achieved through averaging and concatenation techniques. Averaging produces vectors of the same dimensionality, while concatenation results in vectors that are three to four times larger in dimension. In this study, a vector dimensionality of 300 was used, consistent with the original vector dimensions in FastText, GloVe, and Word2Vec trained on the SpamID-Pair dataset, as well as in Emoji2Vec, which uses its original pre-trained 300-dimensional vectors.
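A minimal sketch of the two combination strategies at the level of a single token vector is given below, assuming each pre-trained model supplies a 300-dimensional vector (random stand-in values are used instead of real lookups):

# Sketch of averaging vs. concatenating per-token vectors from several
# pre-trained embeddings; dimensions match those discussed in the text.
import numpy as np

fasttext_vec = np.random.rand(300)    # stand-ins for looked-up pre-trained vectors
word2vec_vec = np.random.rand(300)
glove_vec = np.random.rand(300)
emoji2vec_vec = np.random.rand(300)   # used only in the emoji-symbol scenario

avg_3 = np.mean([fasttext_vec, word2vec_vec, glove_vec], axis=0)                    # shape (300,)
concat_3 = np.concatenate([fasttext_vec, word2vec_vec, glove_vec])                  # shape (900,)
concat_4 = np.concatenate([fasttext_vec, word2vec_vec, glove_vec, emoji2vec_vec])   # shape (1200,)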
Although concatenation has proven to deliver better performance than averaging, it significantly increases computation time and requires substantial memory resources. The authors propose potential optimization strategies, such as reducing the vector dimensionality from 300 to a lower dimension, like 100 or 150. However, this approach would necessitate retraining Emoji2Vec to match the new dimensionality, introducing additional computational costs. Despite this, the overall tradeoff might be favorable, as retraining Emoji2Vec is likely less computationally expensive than continuing to use the 300-dimensional concatenated vectors. Future research should test and document whether the performance with reduced dimensions meets or exceeds that of the original 300-dimensional setup. This research uses an attention mechanism and parameter sharing between pair units to minimize the embedding layer's high-dimension problem.
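As one possible realization of the proposed dimensionality reduction (our own sketch, not the authors' procedure), the pre-trained embedding matrix could be projected with PCA before building the embedding layer; the vocabulary size below reuses the figure reported later for EiAP-BC and is otherwise illustrative.

# Illustrative sketch: reduce a 300-dimensional embedding matrix to 100
# dimensions with PCA, trading some information for memory and speed.
import numpy as np
from sklearn.decomposition import PCA

embedding_matrix = np.random.rand(50922, 300)   # vocab_size x original_dim (stand-in values)
reduced_matrix = PCA(n_components=100).fit_transform(embedding_matrix)
print(reduced_matrix.shape)   # (50922, 100)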
This study employed word embeddings from the pre-trained Emoji2Vec, concatenated with two other word embeddings. This approach limits our research as it relies solely on the pre-trained Emoji2Vec. Emoji2Vec combines data from unicode.org and iemoji.com into a dataset of 6,088 entries. For pre-processing, normalization, and conversion of emoji symbols to text, we used the Full Emoji Image Dataset from Kaggle, which includes 1,816 emoji data points from Apple, DoCoMo, Facebook, Gmail, Google, JoyPixels, KDDI, Samsung, Softbank, Twitter, and Windows. The attention mechanism in the EiAP-BC model was employed primarily to assign higher weights to the alignment relations between the vectors of posts and comments derived from the preceding LSTM layer. Therefore, attention was not explicitly applied to the unique context of emojis but rather in conjunction with the textual context of other embeddings. In this research, attention mechanisms might not adequately differentiate between emojis with similar appearances but different connotations or emojis conveying sentiment versus those that serve more as linguistic or stylistic elements.
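For illustration, a comparable emoji-to-text normalization step can be sketched with the open-source emoji package; this is only a stand-in for the Full Emoji Image Dataset mapping actually used in the study, and the exact textual descriptions will differ.

# Illustrative stand-in for emoji-to-text normalization using the `emoji`
# package instead of the Kaggle mapping described in the paper.
import emoji

comment = "Jual pemutih dan pelangsing 🔥🔥 DM aku 😍"
normalized = emoji.demojize(comment, delimiters=(" ", " "))
print(normalized)   # e.g., "Jual pemutih dan pelangsing  fire  fire  DM aku  smiling_face_with_heart-eyes "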
The EiAP-BC model can also incorporate additional auxiliary features, which improves performance, particularly in the emoji-symbol scenario, as seen in Table 5 (a minimal fusion sketch follows the table). The results indicate that adding the 19 auxiliary features to the EiAP-BC model, in both the emoji-text and emoji-symbol scenarios and with either concatenation or averaging of the three or four embeddings, can enhance accuracy and F1-score; in both scenarios, the concatenation variants with auxiliary features outperform the baseline EiAP-BC model. The improvement in accuracy and F1-score is more pronounced in the emoji-text scenario than in the emoji-symbol scenario. These findings suggest the potential for incorporating more than the 19 auxiliary features used in this study. Future research could explore additional features, focusing on those that can be extracted automatically so that supplementary features can be included without added complexity. Moreover, identifying which auxiliary features are most beneficial could guide future model design, enabling more sophisticated and nuanced approaches to handling emojis in text analysis tasks.
EiAP-BC Model (Auxiliary) | Accuracy (%) | Macro Average F1-score (%) | Best Threshold
Emoji-text Scenario
EiAP 3 Embedding Concat* | 88.07 (≈88) | 83.75 (≈84) | 0.76
EiAP 3 Embedding Avg | 87.95 (≈88) | 83.26 (≈83) | 0.77
Emoji-symbol Scenario
EiAP 4 Embedding Concat* | 85.55 (≈86) | 80.15 (≈80) | 0.71
EiAP 4 Embedding Avg | 85.34 (≈85) | 79.83 (≈80) | 0.72
Table 5. Accuracy and Macro Average F1-score of Auxiliary Features of EiAP-BC Model on SPAMID-PAIR (Weighted)
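A minimal sketch of how auxiliary features can be fused with a learned pair representation before the final classifier is shown below; the layer sizes, input names, and fusion point are illustrative assumptions rather than the exact EiAP-BC design.

# Hedged sketch: concatenating 19 auxiliary features with a pooled pair
# representation before the sigmoid output. Sizes are illustrative only.
from tensorflow.keras import layers, Model

pair_repr = layers.Input(shape=(256,), name="pair_representation")  # e.g., pooled pair features
aux_feats = layers.Input(shape=(19,), name="auxiliary_features")    # 19 handcrafted features

x = layers.Concatenate()([pair_repr, aux_feats])
x = layers.Dense(64, activation="relu")(x)
output = layers.Dense(1, activation="sigmoid", name="spam_probability")(x)

model = Model(inputs=[pair_repr, aux_feats], outputs=output)
model.summary()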
Despite the slightly lower figures reported in Tables 3, 4, and 5, the EiAP-BC model still competes with the fine-tuned BERT model, with only about a 1% difference in both the emoji-text and emoji-symbol scenarios. Given this small gap, the EiAP-BC model offers several advantages:
(1) File size: The EiAP-BC model has a smaller file size than the fine-tuned BERT model, making it more convenient for production stages. In this research, the EiAP-BC model is 179 MB, whereas the BERT model is 396 MB.
(2) Training parameters: The EiAP-BC model has fewer trainable parameters than the fine-tuned BERT model: 6,822,785 versus 109,486,082. However, the vocabulary stored by the EiAP-BC model is larger than that of the pre-trained BERT (50,922 vs. 30,525). (A short reporting sketch follows this list.)
(3) Pre-trained embeddings: The EiAP-BC model can utilize pre-trained embedding weights from various state-of-the-art word embedding models through concatenation or averaging.
(4) Auxiliary feature capability: The EiAP-BC model allows for the inclusion of auxiliary features.
(5) Training time: The EiAP-BC model trains faster than the fine-tuned BERT model. In this research, the EiAP-BC model was trained in 1 hour and 55 minutes, while BERT required 3 hours and 47 minutes on the same TPU machine on Kaggle.
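The file size and trainable-parameter figures above can be reported with a small utility such as the following sketch (our own helper; the numbers quoted in this list come from the paper, not from running this snippet):

# Hedged sketch: report trainable-parameter count and on-disk size of a Keras model.
import os

def report_size(model, path="eiap_bc.h5"):
    model.save(path)                                   # serialize weights and architecture
    size_mb = os.path.getsize(path) / (1024 * 1024)    # file size in MB
    print(f"parameters: {model.count_params():,}  size: {size_mb:.0f} MB")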
However, the EiAP-BC model still has limitations that need further research and development, especially regarding accuracy and its dependence on embedding weights. While BERT excels at understanding more complex contexts thanks to its transformer architecture, EiAP-BC is advantageous in scenarios involving short texts or heavy emoji usage, where it is more efficient and better at capturing specific semantic nuances. Empirical testing was even extended to ChatGPT, which likewise uses a transformer architecture.
Testing was also carried out on post–comment pairs that were deliberately handcrafted: examples taken from social media and modified to probe the capabilities of the EiAP-BC model. The five custom test pairs illustrate data outside the training and testing datasets used so far. The data are also deliberately tricky, with similar comment sentences carrying different meanings, and the spam/not-spam labels have medium and hard difficulty levels because of this ambiguity. These test data demonstrate the model's ability to determine the contextual relationship between the post and the comment. Comparisons were made using four ensemble machine learning models, the EiAP-BC models (text and symbol variants, with and without auxiliary features), and ChatGPT in a zero-shot setting (without training). The five expert-crafted post–comment pairs used for testing are listed below:
Test Data A:
Posting: “Olaharaga yoga dulu.” (En: “Exercise yoga first”)
Comment 1 (A1): “Hot mama!” Label: SPAM (En: “Hot mama!”)
Comment 2 (A2): “Cuaca hot, mama masih olaharaga. luar biasa!” Label: NOT SPAM (En: “The weather is hot, mom still does sports. it's amazing!”)
Test Data B:
Posting: “Iklan apa yang paling sering muncul di Instagram?” (En: “What ads appear most frequently on Instagram?”)
Comment 1 (B1): “Jual pemutih dan pelangsing, siapa minat segera kontak saya.” Label: SPAM (En: “Selling whitening and slimming, anyone interested contact me immediately”)
Comment 2 (B2): “Di tempat saya sering muncul jual pemutih dan pelangsing.” Label: NOT SPAM (En: “In my place, there are often selling bleach and slimming products”)
Test Data C:
Posting: “Liburan dengan keluarga sungguh membahagiakan.” (En: “Holidays with family are really happy”)
Comment 1 (C1): “kakak cantik sekali, selamat berlibur.” Label: NOT SPAM (En: “sister is very beautiful, happy holidays”)
Comment 2 (C2): “cantik, sexy, putih banget.” Label: SPAM (En: “beautiful, sexy, very white”)
Test Data D:
Posting: “Maju terus film indonesia! jayalah selalu...” (En: “Keep moving forward Indonesian films! always triumph.”)
Comment 1 (D1): “Bingung, pengen tau caranya menghilangkan perut buncit tanpa kesulitan? aku punya solusinya, mudah dan cepat...DM aku ya.” Label: SPAM (En: “Confused, want to know how to get rid of a distended stomach without difficulty? I have a solution, easy and fast... DM me okay”)
Comment 2 (D2): “Adegan lucu dari kak Reza waktu perut buncitnya kesulitan pakai baju yang mudah dan cepat...” Label: NOT SPAM (En: “Funny scene from Sis Reza when her big belly has trouble wearing clothes that are easy and fast...”)
Test Data E:
Posting: “Siap-siap politik indonesia tahun 2024 seperti apa ya.” (En: “Get ready for Indonesian politics in 2024, what will it be like?”)
Comment 1 (E1): “Sedia followers 1000 followers 20rb, 2000 followers 35rb, 3000 followers 50rb. Sedia followers Tiktok dan shopee juga... untuk youtube juga bisa lho!” Label: SPAM (En: “Ready for followers 1000 followers 20 thousand, 2000 followers 35 thousand, 3000 followers 50 thousand. Tiktok and Shopee followers are also available... for YouTube you can too!”)
Comment 2 (E2): “Siap ya mas, pilih siapa njenengan?” Label: NOT SPAM (En: “Ready, bro, who will you choose, sir?”)
The test results are shown in Table 6. The soft-voting ensemble machine learning models for emoji text and emoji symbol, as well as the EiAP-BC text and symbol models, both standard and with auxiliary features, all achieve an accuracy of 80%. In comparison, the ensemble machine learning models with auxiliary features drop to 50% accuracy, and ChatGPT v3.5 reaches only 70%. The EiAP-BC model therefore performs well on post–comment pairs outside the dataset it was trained on and can already detect spam comments based on the context of the post.
Test Data | EMT | EMS | EMT add | EMS add | EiAP BCT | EiAP BCS | EiAP BCT add | EiAP BCS add | ChatGPT
A-1 (S) | True | True | True | True | True | False | True | True | False
A-2 (NS) | True | True | False | False | True | True | True | True | True
B-1 (S) | True | True | True | True | True | True | True | True | True
B-2 (NS) | False | False | False | False | False | True | False | True | False
C-1 (NS) | True | True | True | False | True | True | True | True | True
C-2 (S) | False | False | False | True | False | True | False | False | False
D-1 (S) | True | True | True | True | True | True | True | True | True
D-2 (NS) | True | True | False | False | True | False | True | False | True
E-1 (S) | True | True | False | True | True | True | True | True | True
E-2 (NS) | True | True | True | False | True | True | True | True | True
Accuracy | 80% | 80% | 50% | 50% | 80% | 80% | 80% | 80% | 70%
Table 6. Accuracy Results of Custom Test Data
Abbreviations: EMT, Ensemble Machine Learning Soft Voting Emoji Text; EMS, Ensemble Machine Learning Soft Voting Emoji Symbol; EMT add, EMT with auxiliary features; EMS add, EMS with auxiliary features; EiAP BCT, EiAP-BC Emoji Text model; EiAP BCS, EiAP-BC Emoji Symbol model; EiAP BCT add, EiAP BCT with auxiliary features; EiAP BCS add, EiAP BCS with auxiliary features; ChatGPT, test data entered into the ChatGPT prompt, version 3.5 (free version); A-1 to E-2, names of the comment test data; NS, not spam; S, spam.
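A usage sketch for scoring one of these custom pairs with a trained emoji-text model is given below; the model file name, the preprocess_pair helper, and the reuse of the 0.59 threshold from Table 3 are illustrative assumptions rather than released artifacts.

# Hypothetical usage sketch: scoring one custom post-comment pair with a trained
# pair model and a chosen decision threshold.
from tensorflow.keras.models import load_model

model = load_model("eiap_bc_emoji_text.h5")                  # illustrative file name
post = "Liburan dengan keluarga sungguh membahagiakan."
comment = "cantik, sexy, putih banget."

post_ids, comment_ids = preprocess_pair(post, comment)       # hypothetical helper: tokenization + padding as above
prob = float(model.predict([post_ids, comment_ids])[0, 0])
print("SPAM" if prob >= 0.59 else "NOT SPAM", prob)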
The EiAP-BC model was also tested for generalization and reliability on several other datasets within a similar scope, namely classification over text pairs. The datasets and case studies used for comparison are Entailment Wrete (IndoNLU), SNLI (English), SciTail, and IndoNLI. The EiAP-BC model excels on the Indonesian Entailment Wrete (IndoNLU) dataset and achieves accuracy comparable to the RE2 model on the SciTail dataset. Table 7 shows that EiAP-BC is within 3% of the ESIM+ELMo model from Stanford on the English SNLI dataset and performs best on the IndoNLI dataset. These results indicate that the EiAP-BC model generalizes well, as it can capture the context of both texts in a pair.
The EiAP-BC model was trained intensively with deep learning techniques on the SPAMID-PAIR dataset, which consists of post–comment pairs containing emojis. The SPAMID-PAIR dataset inherently carries potential biases due to limitations in the labeling performed by the original annotators. Despite efforts to achieve consistent labeling through a shared manual understanding among annotators, errors or subjective label discrepancies could still arise with different annotators, and this labeling process is a limitation for the EiAP-BC model. Nevertheless, the EiAP-BC model has undergone generalization testing on similar problems (sentence-pair classification), as presented in Table 7. The results demonstrate that the EiAP-BC model remains applicable to similar datasets and even outperforms previous models on some of them. Although EiAP-BC was pre-trained on the SPAMID-PAIR dataset derived from Indonesian social media, its performance on the more formal Indonesian language of IndoNLI remains robust. The EiAP-BC model can also be retrained on other datasets, including English ones.
Model Name | Dataset/Study Case | Accuracy (%) | Parameters (millions) | Best Threshold
IndoNLU IndoBERT-base-p2 (Benchmark) | Entailment Wrete (IndoNLU) | 78.68 | 124.5 | -
IndoNLU fastText CC-ID (6L) (Benchmark) | Entailment Wrete (IndoNLU) | 61.13 | 15.1 | -
IndoNLU IndoBERT-large-p2 (Benchmark) | Entailment Wrete (IndoNLU) | 80.30 | 335.2 | -
EiAP-BC Embedding Token | Entailment Wrete (IndoNLU) | 70 | 5.07 | 0.11
EiAP-BC 3 Embedding Concatenation | Entailment Wrete (IndoNLU) | 81* | 6.9 | 0.52
EiAP-BC 3 Embedding Average | Entailment Wrete (IndoNLU) | 73 | 3.2 | 0.12
ESIM+ELMo (Stanford) | SNLI (English) | 89 | 8 | -
EiAP-BC Embedding Token | SNLI (English) | 74 | 25 | 0.5
EiAP-BC Fasttext | SNLI (English) | 86* | 3.2 | 0.5
EiAP-BC 3 Embedding Concatenation | SNLI (English) | 83 | 6.8 | 0.5
EiAP-BC 3 Embedding Average | SNLI (English) | 84 | 3.2 | 0.5
RE2 (Benchmark) [23] | SciTail | 86 | - | -
Hierarchical BiLSTM Max Pooling (Benchmark) [66] | SciTail | 86 | - | -
EiAP-BC Embedding Token | SciTail | 71 | 18.6 | 0.5
EiAP-BC Embedding Fasttext | SciTail | 85.5* | 3.2 | 0.5
EiAP-BC 3 Embedding Concatenation | SciTail | 84.2 | 6.8 | 0.5
EiAP-BC 3 Embedding Average | SciTail | 84.5 | 3.7 | 0.5
Fine-tuned RoBERTa (SotA) [34] | IndoNLI | 60.7 | - | -
EiAP-BC Embedding Token | IndoNLI | 46.62 | 12.2 | -
EiAP-BC Embedding Fasttext | IndoNLI | 65 | 3.2 | -
EiAP-BC 3 Embedding Concatenation | IndoNLI | 66* | 6.8 | -
EiAP-BC 3 Embedding Average | IndoNLI | 63 | 3.2 | -
Table 7. Performance of EiAP-BC Model in Other Datasets (Study Cases)
In addition to testing on other case studies, this research also examined the contribution of each main layer in the EiAP-BC model and its impact on overall performance. Ablation studies were conducted by removing (turning off) specific layers of the EiAP-BC model on the SPAMID-PAIR dataset for both the emoji-text and emoji-symbol scenarios; the detailed results are given in Table 8. In both scenarios, the layers that affect performance most are the embedding layer and the final CNN encoding layer. Removing them decreased accuracy by 0.6 to 2.84 percentage points in the emoji-text scenario, while in the emoji-symbol scenario the accuracy decrease ranged from 0.01 to 3.15 percentage points. In the emoji-symbol scenario, removing the concatenation layer that joins the attention outputs with the original embedding, subtraction, and multiplication features caused the most significant performance decrease, whereas in the emoji-text scenario removing the first and second encoding layers also caused a notable drop.
Removed Layers | Information (Using Emoji-Text Scenario, 3 Embedding Average) | Accuracy (%) | Precision Macro Average (%) | F1-score Macro Average (%) | Best Threshold
Normal (Base) | EiAP-BC 3 Embedding Concat | 87.99 | 85.98 | 83.58 | 0.59
Embedding* | Without using the weighted pre-trained embedding | –2.84 | –4.66 | –3.52 | 0.78
Original Embedding | Without original embedding for concatenation before the second encoding layer | –0.2 | –0.74 | –0.11 | 0.69
Encoding*** | Without using the first and second encoding layers | –0.46 | –1.78 | –0.04 | 0.66
Attention | Without inter-attention layer | –0.24 | –1.35 | –0.16 | 0.67
CNN** | Without CNN layer | –0.6 | –0.21 | –1.24 | 0.77
Enc1 + Att | Without first encoding and attention layer | –0.3 | –0.8 | –0.29 | 0.69
Enc2 + Att | Without second encoding and attention layer | –0.36 | –1.17 | –0.21 | 0.77
Att + CNN | Without attention and CNN layer | –0.1 | –12.7 | 0.42 | 0.7
Concat Att Features | Without concatenation of the attention layer with original embedding, subtraction, and multiplication | –0.13 | –0.5 | –0.31 | 0.68
Removed Layers | Information (Using Emoji-Text Scenario, 3 Embedding Concat) | Accuracy (%) | Precision Macro Average (%) | F1-score Macro Average (%) | Best Threshold
Original Embedding and Concat | Without original embedding for concatenation before the second encoding layer (concat embedding) | –0.2 | 0.04 | –0.5 | 0.76
Encoding*** | Without using the first and second encoding layers (concat embedding) | –0.53 | –1.47 | –0.39 | 0.66
Attention | Without inter-attention layer (concat embedding) | –0.2 | 0.04 | –0.51 | 0.74
CNN** | Without CNN layer (concat embedding) | –0.65 | –1.31 | –0.77 | 0.67
Enc1 + Att | Without first encoding and attention layer (concat embedding) | –0.13 | –0.64 | –0.03 | 0.62
Enc2 + Att**** | Without second encoding and attention layer (concat embedding) | –0.57 | –1.96 | –0.15 | 0.64
Att + CNN | Without attention and CNN layer (concat embedding) | –0.07 | –0.26 | –0.08 | 0.79
Concat Att Features | Without concatenation of the attention layer with original embedding, subtraction, and multiplication (concat embedding) | –0.16 | –0.37 | –0.23 | 0.76
Removed Layers | Information (Using Emoji-Symbol Scenario, 4 Embedding Average) | Accuracy (%) | Precision Macro Average (%) | F1-score Macro Average (%) | Best Threshold
Normal (Base) | EiAP 4 Embedding Average | 85.94 | 83.75 | 80.20 | 0.77
Embedding* | Without using the weighted pre-trained embedding | –3.15 | –5.36 | –2.76 | 0.89
Original Embedding | Without original embedding for concatenation before the second encoding layer | –0.28 | –0.87 | –0.12 | 0.65
Encoding | Without using the first and second encoding layers | –0.16 | –0.49 | –0.08 | 0.74
Attention | Without inter-attention layer | –0.15 | –0.49 | –0.06 | 0.81
CNN*** | Without CNN layer | –0.37 | –0.94 | –0.28 | 0.70
Enc1 + Att | Without first encoding and attention layer | –0.03 | –0.5 | –0.19 | 0.66
Enc2 + Att | Without second encoding and attention layer | –0.06 | –1.42 | 1.4 | 0.66
Att + CNN | Without attention and CNN layer | 0.0 | –0.61 | –0.79 | 0.72
Concat Att Features** | Without concatenation of the attention layer with original embedding, subtraction, and multiplication | –0.51 | –1.58 | –0.16 | 0.79
Removed Layers | Information (Using Emoji-Symbol Scenario, 4 Embedding Concat) | Accuracy (%) | Precision Macro Average (%) | F1-score Macro Average (%) | Best Threshold
Original Embedding and Concat | Without original embedding for concatenation before the second encoding layer (concat embedding) | –0.19 | –0.21 | –0.32 | 0.77
Encoding*** | Without using the first and second encoding layers (concat embedding) | –0.57 | –1.63 | –0.27 | 0.59
Attention | Without inter-attention layer (concat embedding) | 0.11 | –0.92 | 0.87 | 0.7
CNN** | Without CNN layer (concat embedding) | –2.6 | –0.93 | –0.84 | 0.59
Enc1 + Att | Without first encoding and attention layer (concat embedding) | –0.26 | –1.16 | 0.14 | 0.76
Enc2 + Att | Without second encoding and attention layer (concat embedding) | –0.25 | –0.64 | –0.19 | 0.78
Att + CNN | Without attention and CNN layer (concat embedding) | –0.19 | 0.68 | –0.79 | 0.67
Concat Att Features**** | Without concatenation of the attention layer with original embedding, subtraction, and multiplication (concat embedding) | –0.27 | –1.16 | 0.1 | 0.63
Table 8. Results of the Ablation Study (Layer Removal) on the EiAP-BC Architecture and the SPAMID-PAIR Dataset for Both Emoji-Text and Emoji-Symbol Scenarios
Based on this ablation study, the embedding layer plays a crucial role. In contrast, the inter-attention layer, hypothesized to be essential for capturing the most important information between vector pairs, is not as significant in the SPAMID-PAIR case study. This is likely because, in spam comment detection based on posting context, the comment and posting vectors often have no overlapping parts, rendering the inter-attention layer less influential. On the other hand, the CNN layer strongly influences the extraction of the final features that drive the detection capability of the EiAP-BC model. In the emoji-symbol scenario, the encoding layer has little impact, likely due to the inherent difficulty of tokenizing and forming vector contexts for symbol characters.
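A simplified sketch of how such layer-removal variants can be generated with configuration flags is given below; the architecture shown is a stripped-down stand-in (it omits the subtraction and multiplication features and other EiAP-BC details), and the layer sizes are illustrative assumptions.

# Hedged sketch: a flag-driven builder that skips individual blocks so each
# ablation row corresponds to one configuration, e.g. use_attention=False for
# the "Without inter-attention layer" variant.
from tensorflow.keras import layers, Model

def build_variant(use_encoding=True, use_attention=True, use_cnn=True,
                  vocab_size=50922, emb_dim=300, max_len=650):
    post_in = layers.Input(shape=(max_len,))
    com_in = layers.Input(shape=(max_len,))
    emb = layers.Embedding(vocab_size, emb_dim)          # shared embedding for both inputs
    p, c = emb(post_in), emb(com_in)
    if use_encoding:
        enc = layers.Bidirectional(layers.LSTM(64, return_sequences=True))
        p, c = enc(p), enc(c)
    if use_attention:
        att = layers.Attention()                          # cross-attention between post and comment
        p, c = att([p, c]), att([c, p])
    x = layers.Concatenate()([p, c])
    if use_cnn:
        x = layers.Conv1D(64, 3, activation="relu")(x)    # final convolutional encoder
    x = layers.GlobalMaxPooling1D()(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    return Model([post_in, com_in], out)

# Example: ablated_model = build_variant(use_cnn=False)   # "Without CNN layer"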
Regarding the scalability of the EiAP-BC model, the authors acknowledge that scalability is a limitation of this study. Future research should focus on exploring the scalability of the EiAP-BC model in terms of computational time (speed), model size, and compatibility with deep learning models for mobile versions, among other aspects. Due to the time constraints of the current research, scalability testing was not feasible. A potential area for practical application could involve developing a web service prototype to act as an intermediary between the model and clients and a web extension on the client side to demonstrate real-world implementation within the browser environment.
Using word embeddings built with BERT can indeed capture richer and more dynamic word contexts, but several considerations apply. First, BERT embeddings have very high dimensionality, which can significantly increase memory and computation time, especially in complex models like EiAP-BC; in experiments on Kaggle, the machine sometimes restarted automatically due to memory exhaustion. Second, because BERT is a context-based model, its embeddings can be highly sensitive to minor changes in the input text, resulting in output variations, particularly for short texts or texts containing emojis. Third, BERT embeddings remain less than optimal for particular domains or languages that are underrepresented in the training data and require additional adjustment or fine-tuning. Addressing these challenges therefore demands high computational power and domain-specific adjustments, involves expensive training processes, and may still yield unstable results.
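For reference, a minimal sketch of extracting contextual token embeddings with HuggingFace Transformers is shown below; indobenchmark/indobert-base-p2 is one publicly available Indonesian checkpoint (the IndoNLU benchmark model cited in Table 7) and is not necessarily the exact checkpoint used to build the BERT-embedding variant in these experiments.

# Hedged sketch: contextual embeddings from an Indonesian BERT checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p2")
bert = AutoModel.from_pretrained("indobenchmark/indobert-base-p2")

inputs = tokenizer("Jual pemutih dan pelangsing, DM aku ya", return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state   # shape: (1, seq_len, 768) contextual vectors
print(hidden.shape)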
Lastly, compared with the machine learning models from our previous research [22], the deep learning approach using the EiAP-BC model achieved a higher performance of 88%. The ML scenarios included comment-only input and post–comment pairs; the best-performing ML method was SVM-RBF, followed by Random Forest and then Extra Tree, in the scenario that added auxiliary features to emoji text with a balanced dataset. The proposed ensemble soft-voting method did not achieve the best single result but had the highest average performance across the individual classifiers. The best result obtained with the ML methods was only 86%.

5 Conclusion

This article introduces a context-based spam comment detection model for social media platforms. The proposed EiAP-BC model performs better than existing models and competes effectively with state-of-the-art fine-tuned BERT architectures. EiAP-BC exhibits several advantages over alternative models, is suitable for similar case studies, and enables the incorporation of auxiliary features. Furthermore, EiAP-BC is reliable when applied to various sentence-pair classification datasets. Ablation studies reveal the significance of each component within the model, particularly the embedding layer and the final CNN encoding layer, which greatly influence its performance. Despite its inherent limitations, the strengths of EiAP-BC render these shortcomings negligible, and the model is ready for prototype implementation. Future research will focus on developing a web service and browser extension to facilitate use by general users, and on improving the model through hyperparameter tuning and optimization.

Acknowledgments

The authors express heartfelt gratitude to Universitas Gadjah Mada and Universitas Kristen Duta Wacana for the generous financial support, assistance, and facilities that enabled the smooth progress and completion of this research.

References

[1]
S. Rao, A. K. Verma, and T. Bhatia. 2021. A review on social spam detection: Challenges, open issues, and future directions. Expert Syst. Appl. 186 (2021), 115742. DOI:
[2]
I. Inuwa-Dutse, M. Liptrott, and I. Korkontzelos. 2018. Detection of spam-posting accounts on Twitter. Neurocomputing 315 (2018), 496–511. DOI:
[3]
Isra AbdulNabi and Qussai Yaseen. 2021. Spam email detection using deep learning techniques. Procedia Computer Science 184, 2019 (2021), 853--858. DOI:
[4]
S. Khawandi, F. Abdallah, and A. Ismail. 2019. A survey on image spam detection techniques. In 3rd International Conference on Computer Science and Information Technology (COMIT’19). 13–27. DOI:
[5]
N. Alias, C. F. M. Foozy, and S. N. Ramli. 2019. Video spam comment features selection using machine learning techniques. Indones. J. Electr. Eng. Comput. Sci 15, 2 (2019), 1046–1053. DOI:
[6]
J. Kim, D. Seo, H. Kim, and P. Kang. 2017. Facebook spam post filtering based on Instagram-based transfer learning and meta information of posts. J. Korean Inst. Ind. Eng 43, 3 (2017), 192–202. DOI:
[7]
P. Shil, U. S. Rahman, M. Rahman, and M. S. Islam. 2021. An approach for detecting Bangla spam comments on Facebook. In Proceedings of the International Conference Electronics, Communications, and Information Technology (ICECIT’21). 14–16. DOI:
[8]
T. Wu, S. Wen, Y. Xiang, and W. Zhou. 2018. Twitter spam detection: Survey of new approaches and comparative study. Comput. Secur 76 (2018), 265–284. DOI:
[9]
M. Thomas and B. B. Meshram. 2023. ChSO-DNFNet: Spam detection in Twitter using feature fusion and optimized deep neuro fuzzy network. Adv. Eng. Softw 175 (2022), 103333. DOI:
[10]
N. M. Samsudin, C. F. B. Mohd Foozy, N. Alias, P. Shamala, N. F. Othman, and W. I. S. Wan Din. 2019. YouTube spam detection framework using naïve Bayes and logistic regression. Indones. J. Electr. Eng. Comput. Sci. 14, 3 (2019), 1508–1517. DOI:
[11]
H. Oh. 2021. A YouTube spam comments detection scheme using cascaded ensemble machine learning model. IEEE Access 9 (2021), 144121–144128. DOI:
[12]
W. Zhang and H.-M. Sun. 2017. Instagram spam detection. In 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing (PRDC’17). IEEE, 227–228. DOI:
[13]
F. Prabowo and A. Purwarianti. 2018. Instagram online shop's comment classification using statistical approach. In Proceedings of the 2017 2nd International Conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE’17). IEEE, 282–287. DOI:
[14]
N. A. Haqimi, N. Rokhman, and S. Priyanta. 2019. Detection of spam comments on Instagram using complementary naïve Bayes. IJCCS Indonesian J. Comput. Cybern. Syst 13, 3 (2019), 263. DOI:
[15]
A. Chrismanto, Y. Lukito, and A. Susilo. 2020. Implementasi distance weighted k-nearest neighbor Untuk Klasifikasi spam and non-spam pada komentar instagram. J. Edukasi Dan Penelit. Inform 6, 2 (2020), 236. DOI:
[16]
A. R. Chrismanto, A. Afiahayati, Y. Sari, A. K. Sari, and Y. Suyanto. 2022. Spam comments detection on Instagram using machine learning and deep learning methods. Lontar Komput. J. Ilm. Teknol. Inf 13, 1 (2022), 46. DOI:
[17]
M. Li, B. Wu, and Y. Wang. 2019. Comment spam detection via effective features combination. In 2019 IEEE International Conference on Communications (ICC’19). IEEE, 1–6. DOI:
[18]
A. Sinhal and M. Maheshwari. 2022. YouTube: Spam comments filtration using hybrid ensemble machine learning models. Int. J. Emerg. Technol. Adv. Eng 12, 10 (2022), 169–182. DOI:
[19]
N. Banik and M. H. H. Rahman. 2019. Toxicity detection on Bengali social media comments using supervised models. In 2nd International Conference on Innovation in Engineering and Technology (ICIET’19). 23–24. DOI:
[20]
Y. Tashtoush, A. Magableh, O. Darwish, L. Smadi, O. Alomari, and A. ALghazoo. 2022. Detecting Arabic YouTube spam using data mining techniques. In 2022 10th International Symposium on Digital Forensics and Security (ISDFS’22). IEEE, 1–5. DOI:
[21]
S. Sharmin and Z. Zaman. 2017. Spam detection in social media employing machine learning tool for text mining. In 2017 13th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS’17). IEEE, 137–142. DOI:
[22]
A. R. Chrismanto, A. K. Sari, and Y. Suyanto. 2023. Enhancing spam comment detection on social media with emoji feature and post-comment pairs approach using ensemble methods of machine learning. IEEE Access 11 (2023), 80246–80265. DOI:
[23]
R. Yang, J. Zhang, X. Gao, F. Ji, and H. Chen. 2019. Simple and effective text matching with richer alignment features. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 4699–4709. DOI:
[24]
Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigaa, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval’16). 497–511. DOI:
[25]
M. Mansoor, Z. ur Rehman, M. Shaheen, M. A. Khan, and M. Habib. 2020. Deep learning based semantic similarity detection using text data. Inf. Technol. Control 49, 4 (2020), 495–510. DOI:
[26]
J. Pei, Y. Wu, Z. Qin, Y. Cong, and J. Guan. 2021. Attention-based model for predicting question relatedness on stack overflow. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR’21). IEEE, 97–107. DOI:
[27]
A. Godbole, A. Dalmia, and S. K. Sahu. 2018. Siamese neural networks with random forest for detecting duplicate question pairs. arXiv (2018), 1–5 [Online]. http://arxiv.org/abs/1801.07288
[28]
L. Wang, L. Zhang, and J. Jiang. 2020. Duplicate question detection with deep learning in stack overflow. IEEE Access 8 (2020), 25964–25975. DOI:
[29]
K. N. Setya and R. Mahendra. 2023. Semi-supervised textual entailment on indonesian wikipedia data. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), 13396 LNCS (2023), 416–427. DOI:
[30]
E. Fonseca and J. P. R. Alvarenga. 2020. Wide and deep transformers applied to semantic relatedness and textual entailment. CEUR Workshop Proc 2583, 1 (2020), 68–76.
[31]
B. Wilie et al. 2020. IndoNLU: Benchmark and resources for evaluating indonesian natural language understanding. arXiv (Sep. 2020) [Online]. https://www.aclweb.org/anthology/2020.aacl-main.85
[32]
S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP’15). Association for Computational Linguistics.
[33]
T. Khot, A. Sabharwal, and P. Clark. 2018. SciTaiL: A textual entailment dataset from science question answering. Proc. AAAI Conf. Artif. Intell 32, 1 (2018), 5189–5197. DOI:
[34]
R. Mahendra, A. F. Aji, S. Louvan, F. Rahman, and C. Vania. 2021. IndoNLI: A natural language inference dataset for indonesian. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 10511–10527. DOI:
[35]
F. Ahmed and M. Abulaish. 2013. A generic statistical approach for spam detection in online social networks. Comput. Commun 36, 10–11 (2013), 1120–1129. DOI:
[36]
C. Chen, Y. Wang, J. Zhang, Y. Xiang, W. Zhou, and G. Min. 2017. Statistical features-based real-time detection of drifted twitter spam. IEEE Trans. Inf. Forensics Secur 12, 4 (2017), 914–925. DOI:
[37]
G. Lin, N. Sun, S. Nepal, J. Zhang, Y. Xiang, and H. Hassan. 2017. Statistical Twitter spam detection demystified: performance, stability and scalability. IEEE Access 5 (2017), 11142–11154. DOI:
[38]
T. Xia. 2020. A constant time complexity spam detection algorithm for boosting throughput on rule-based filtering systems. IEEE Access 8 (2020), 82653–82661. DOI:
[39]
Y. Kontsewaya, E. Antonov, and A. Artamonov. 2021. Evaluating the effectiveness of machine learning methods for spam detection. Procedia Comput. Sci. 190 (2021), 479–486. DOI:
[40]
N. Sun, G. Lin, J. Qiu, and P. Rimba. 2020. Near real-time Twitter spam detection with machine learning techniques. International Journal of Computers and Applications 44, 4 (2020), 338–348. DOI:
[41]
N. Govil, K. Agarwal, A. Bansal, and A. Varshney. 2020. A machine learning based spam detection mechanism. In Proceedings of the 4th International Conference on Computing Methodologies and Communication (ICCMC’20). 954–957. DOI:
[42]
G. Jain, M. Sharma, and B. Agarwal. 2019. Spam detection in social media using convolutional and long short term memory neural network. Ann. Math. Artif. Intell 85, 1 (2019), 21–44. DOI:
[43]
Z. Alom, B. Carminati, and E. Ferrari. 2020. A deep learning model for Twitter spam detection. Online Soc. Networks Media 18, (2020) 100079. DOI:
[44]
A. Barushka and P. Hajek. 2020. Spam detection on social networks using cost-sensitive feature selection and ensemble-based regularized deep neural networks. Neural Comput. Appl 32, 9 (2020), 4239–4257. DOI:
[45]
A. Abdiansah, A. Azhari, and A. K. Sari. 2018. INARTE: An Indonesian dataset for recognition textual entailment. In Proc. 2018 4th Int. Conf. Sci. Technol. ICST 2018 1 (2018), 1–5. DOI:
[46]
S. Cahyawijaya et al. 2021. IndoNLG: Benchmark and resources for evaluating indonesian natural language generation. In 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP’21), 8875–8898. DOI:
[47]
A. R. Chrismanto, A. K. Sari, and Y. Suyanto. 2022. Critical evaluation on spam content detection in social media. J. Theor. Appl. Inf. Technol 100, 8 (2022), 2642–2667 [Online]. http://www.jatit.org/volumes/Vol100No8/29Vol100No8.pdf
[48]
A. R. Chrismanto, A. K. Sari, and Y. Suyanto. 2022. SPAMID-PAIR: A novel Indonesian post–comment pairs dataset containing emoji. Int. J. Adv. Comput. Sci. Appl. 13, 11 (2022), 92–100. DOI:
[49]
L. Zhang and D. Moldovan. 2018. Rule-based vs. neural net approaches to semantic textual similarity. In Proceedings of the First Workshop on Linguistic Resoures for Natural Language Processing, 12–17 [Online]. https://www.aclweb.org/anthology/W18-3803
[50]
P. Mishra, S. Kaushik, and K. Dey. 2021. Bi-ISCA: Bidirectional inter-sentence contextual attention mechanism for detecting sarcasm in user generated noisy short text. CEUR Workshop Proc. 2995 (2021), 1–8.
[51]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, 6000--6010. Retrieved from
[52]
J. Devlin, M. W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference on the North American Chapter of the Association of Computational Linguistics: Human Language Technologies, 4171–4186.
[53]
T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space [Online]. http://ronan.collobert.com/senna/
[54]
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. 2017. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5 (2017), 135–146. DOI:
[55]
J. Pennington, R. Socher, and C. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14). Association for Computational Linguistics, 1532–1543. DOI:
[56]
M. E. Peters, M. Neumann, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the Conference on the North American Chapter of the Association of Computational Linguistics: Human Language Technologies (NAACL-HLT’18). Association for Computational Linguistics, 2227–2237 [Online]. http://allennlp.org/elmo
[57]
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. 2017. Enriching word vectors with subword information. In Transactions of the Association for Computational Linguistics, 135–146. DOI:
[58]
J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP'14), 1532–1543. DOI:
[59]
B. Eisner, I. Augenstein, T. Rockt, and S. Riedel. 2016. emoji2vec : Learning emoji representations from their description. In Proceedings of the 4th International Workshop on Natural Language Processing for Social Media. Association for Computational Linguistics.
[60]
T. Khot, A. Sabharwal, and P. Clark. 2018. SciTaiL: A textual entailment dataset from science question answering. Proc. AAAI Conf. Artif. Intell 32, 1 (2018), 5189–5197. DOI:
[61]
M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Trans. Signal Process 45, 11 (1997), 2673–2681. DOI:
[62]
S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Comput. 9, 8 (1997), 1735–1780. DOI:
[63]
D. Bahdanau, K. H. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR’15), 1–15.
[64]
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res 15 (2014), 1929–1958. DOI:
[65]
D. P. Kingma and J. Lei Ba. 2015. ADAM: A method for stochastic optimization. In International Conference on Learning Representations (ICLR'15). Ithaca, NY: ArXiv, San Diego. Retrieved April 23, 2023 from https://arxiv.org/abs/1412.6980
[66]
A. Talman, A. Yli-Jyrä, and J. Tiedemann. 2019. Sentence embeddings in NLI with iterative refinement encoders. Nat. Lang. Eng. 25, 4 (2019), 467–482. DOI:
