Early detection of deception and aggressiveness using profile-based representations
Introduction
Social media is perhaps the most widely used communication channel today: anyone can express an opinion about any topic in any context (Kuz, Falco, & Giandini, 2016). Despite this ease of communication, this kind of media and, more generally, e-communication media constitute a major threat to users, who are exposed to a number of risks and potential attacks. Consider, for example, the problem of detecting sexual predators approaching minors, or the identification of aggressive users. These threats pose a challenge to the research community, which must develop protective and preventive tools to avoid potential risks.
A considerable amount of research has been devoted to detecting these threats. However, current solutions operate in a forensic scenario, i.e., they are applied once the attack has already been accomplished. Although such solutions can be useful in certain contexts, preventive mechanisms would have a greater and more immediate impact on user security.
Taking into account the latter scenario, this paper proposes a novel and effective methodology to detect potential attacks as early as possible, i.e., while the communication is still taking place. A difficulty that arises in early recognition tasks is information scarcity: only partial information is available for detecting the attack before it is consummated. To face this problem, the proposed approach relies on profile and subprofile-based representations. Under these representations, each term (e.g., a word) is associated with a vector that accounts for its semantics, and a document can be represented by aggregating the vectors of the terms it contains. As a result, documents and terms lie in the same semantic space. Even when only a few terms are available, these representations can still be obtained – a convenient property that makes them suitable for early text classification. These representations also have the advantage of being non-sparse, low dimensional, and highly discriminative. This paper shows the benefits of using them to recognize the category of a document before it is entirely available. Specifically, we approach the problems of early sexual predator detection and early recognition of aggressive text. An extensive experimental evaluation reveals that the proposed methodology obtains state-of-the-art performance in these tasks while requiring a minimum amount of information from documents to make a decision. We foresee that this work will pave the way for novel methodologies for the problem and motivate further research from the intelligent systems and text mining communities.
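As a rough illustration of why aggregation-based representations suit partial documents, the following sketch averages per-term vectors; the term vectors here are hypothetical random stand-ins (the paper derives real term vectors from profile statistics), but the key property carries over: a document representation in the same space exists as soon as the first term is observed.

```python
import numpy as np

# Illustrative sketch (not the paper's exact construction): each term is
# associated with a semantic vector, and a document -- full or partial --
# is represented by averaging the vectors of the terms it contains.
rng = np.random.default_rng(0)
vocab = ["hi", "how", "old", "are", "you"]
term_vectors = {w: rng.normal(size=4) for w in vocab}  # stand-in vectors

def represent(terms):
    """Average the vectors of the known terms; works for any prefix."""
    vecs = [term_vectors[t] for t in terms if t in term_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(4)

full = represent(["hi", "how", "old", "are", "you"])   # complete document
partial = represent(["hi", "how"])                     # only two terms seen
# Both vectors lie in the same 4-dimensional semantic space.
```

Because the full and partial representations are directly comparable, a classifier trained on full documents can score a conversation prefix without any change to the pipeline.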
The contributions of this paper are as follows:
- The use and performance evaluation of profile and subprofile-based representations for the problems of sexual predator detection and aggressive text identification. This is the first time these representations are employed for these problems using full documents. We show that state-of-the-art performance can be obtained on the considered data sets, with the additional benefit of working with low-dimensional, non-sparse representations.
- The use, adaptation, and suitability evaluation of profile and subprofile-based representations for the problem of early text classification. We show how they can naturally represent documents containing only partial information; this is the first time this property is noticed and exploited. More importantly, our early recognition results outperform existing work on the sexual predator detection task by a large margin, while achieving comparable performance on the aggressive text detection problem.
- A comprehensive and extensive literature review on the automated detection of sexual predators and aggressive text in digital documents.
The rest of this paper is organized as follows. The next section provides a review of related work on automated deception detection in social media and early text classification. Section 3 describes the profile-based representations and how they are used for early recognition. Section 4 describes the experimental settings and the evaluation protocol. Section 5 reports experimental results and their analysis. Finally, Section 6 summarizes our main findings and outlines future work directions.
Section snippets
Related work
With the continued growth of the Internet as a worldwide communication tool, more and more people enjoy, and increasingly depend on, the convenience of the services it provides. Unfortunately, the wide use of computers and mobile devices in conjunction with the Internet has also been convenient for cyber-attackers. Nowadays, an Internet user faces many types of attack: computer viruses, operating-system flaws (open backdoors), phishing, fraud activities,
Profile-based representations for early recognition
This paper proposes the use of profile-based representations (PBRs) for early text classification of deception. PBRs fall within the category of second-order representations (Li et al., 2011; López-Monroy et al., 2015), which aim to extract/learn concepts (i.e., artificial dimensions capturing word-usage patterns) from simple co-occurrence statistics (see, e.g., Kolesnikova, 2016). PBRs capture discriminative information in a very low
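To make the idea of a second-order, profile-based representation concrete, here is a minimal sketch under a simplifying assumption that is not taken verbatim from the paper: a term's vector is its normalized occurrence frequency under each class profile, and a document is the average of its terms' profile vectors. The toy documents and labels are hypothetical.

```python
from collections import Counter
import numpy as np

# Toy corpus (hypothetical): each document carries a class label, and each
# class accumulates a "profile" of term counts.
docs = [("hi how old are you", "predator"),
        ("want to meet after school", "predator"),
        ("the match starts at noon", "normal"),
        ("see you at the game", "normal")]
classes = ["predator", "normal"]

counts = {c: Counter() for c in classes}
for text, label in docs:
    counts[label].update(text.split())

vocab = {w for text, _ in docs for w in text.split()}

def term_profile(term):
    """Term vector: relative frequency of the term under each class profile."""
    freqs = np.array([counts[c][term] for c in classes], dtype=float)
    total = freqs.sum()
    return freqs / total if total else freqs

def doc_vector(text):
    """Document vector: average of its terms' profile vectors."""
    vecs = [term_profile(w) for w in text.split() if w in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(len(classes))

print(doc_vector("want to meet"))  # leans toward the predator profile
```

The resulting document vectors have one dimension per class (or subprofile), which is why the representation is non-sparse and very low dimensional compared with a bag-of-words over the full vocabulary.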
Experimental settings
The experiments reported here employ the three data sets described in Table 3. As explained above, two tasks were approached: sexual predator detection and aggressive text identification. For the former task we used the only publicly available data set for sexual predator detection (Inches & Crestani, 2012). This data set was released in the context of the sexual predator identification task at PAN-CLEF’12 and comprises a large number of chat conversations that include real sexual predators. Thus,
Experiments and results
This section presents experimental results that evaluate the performance of the proposed methods in the three data sets described in Section 4. Section 5.1 reports recognition performance using full-length documents. Section 5.2 reports performance using increasing partial information. Finally, Section 5.3 highlights the main findings and their impact.
Conclusions
This paper proposed the use of profile-based representations for early recognition of deception and aggressiveness in written documents. Profile-based representations use class term-occurrence information to derive a non-sparse, low dimensional, discriminative representation for documents, where profiles can be further divided into subprofiles or subclasses. Because these representations can be estimated even when a single term is available, they are well suited to address problems where the
Acknowledgments
This work was supported by CONACYT under project grants CB-2014-241306, PDCPN2014-01-247870 and CB-2015-1-258588. The work was also supported by Red Temática CONACYT en Tecnologías del Lenguaje (projects 260178 and 271622). The authors are grateful to Juan David Carrillo for collecting and preparing the UANL data set.
References (62)
- Aggressive text detection for cyberbullying. Human-Inspired Computing and Its Applications.
- Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making. Policy & Internet (2015).
- Recognizing predatory chat documents using semi-supervised anomaly detection. Electronic Imaging (2016).
- Grooming: An operational definition and coding scheme. Sex Offender Law Report (2007).
- Follower behavior analysis via influential transmitters on social issues in Twitter. Computación y Sistemas (2016).
- Fast text categorization using concise semantic analysis. Pattern Recognition Letters (2011).
- A new document author representation for authorship attribution.
- Utilizing document classification for grooming attack recognition. Proceedings of the IEEE Symposium on Computers and Communications (2011).
- Morris, C. (2013). Identifying online sexual predators by SVM classification with lexical and behavioral features....
- Profanity use in online communities. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2012).
- A two-step approach for effective detection of misbehaving users in chats. Proceedings of Working Notes of CLEF, CEUR Workshop.
- INAOE's participation at PAN'15: Author profiling task. Proceedings of CEUR Workshop, Working Notes of CLEF 2015 – Conference and Labs of the Evaluation Forum.
- A step towards detecting online grooming: Identifying adults pretending to be children. Proceedings of the 2015 European Intelligence and Security Informatics Conference.
- Verbal offense detection in social network comments using novel fusion approach. AI Communications.
- Insult detection in social network comments using possibilistic based fusion approach.
- A linguistic analysis of grooming strategies of online child sex offenders: Implications for our understanding of predatory sexual behavior in an increasingly computer-mediated world. Child Abuse & Neglect.
- Exploring high-level features for detecting cyberpedophilia. Computer Speech & Language.
- Detecting child grooming behaviour patterns on social media.
- Detecting sexual predators in chats using behavioral features and imbalanced learning. Natural Language Engineering.
- Detecting offensive language in social media to protect adolescent online safety. Proceedings of the International Conference on Privacy, Security, Risk and Trust and the 2012 International Conference on Social Computing.
- Detecting predatory behavior in game chats. IEEE Transactions on Computational Intelligence and AI in Games.
- Learning rules that classify e-mail. Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access.
- Improving cyberbullying detection with user context.
- Suicide emotion detection in suicide notes. Expert Systems with Applications.
- Common sense reasoning for detection, prevention, and mitigation of cyberbullying. ACM Transactions on Interactive Intelligent Systems.
- Modeling the detection of textual cyberbullying. The Social Mobile Web.
- Text classification: A sequential reading approach.
- Detecting predatory conversations in social media by deep convolutional neural networks. Digital Investigation.
- Early text classification: A naive solution. Proceedings of NAACL-HLT 2016, 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.
- Sexual predator detection in chats with chained classifiers. Proceedings of NAACL-HLT 2013, 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.
- Techniques and applications for sentiment analysis. Communications of the ACM.