Elsevier

Expert Systems with Applications

Volume 89, 15 December 2017, Pages 99-111
Early detection of deception and aggressiveness using profile-based representations

https://doi.org/10.1016/j.eswa.2017.07.040

Highlights

  • Profile-based representations are used for early recognition of deception.

  • This is the first application of these representations to early recognition.

  • Sexual predator detection and aggressive text detection tasks are approached.

  • Profile-based representations outperform the state of the art.

Abstract

E-communication represents a major threat to users, who are exposed to a number of risks and potential attacks. Detecting these risks as far in advance as possible is crucial for prevention. However, much research so far has focused on forensic tools that can be applied only after an attack has been performed. This paper proposes a novel and effective methodology for the early detection of threats in written social media. The goal is to recognize a potential attack before it is consummated, using a minimal amount of information. The proposed approach uses profile-based representations (PBRs) for this goal. PBRs have multiple benefits, including non-sparsity, low dimensionality, and proven discriminative power. Moreover, representations for partial documents can be derived naturally with PBRs, which makes them suitable for the addressed problem. Results include empirical evidence on the usefulness of PBRs in the early recognition setting for two tasks in which anticipation is critical: sexual predator detection and aggressive text identification. These results reveal, on the one hand, that PBRs achieve state-of-the-art performance when using full-length documents (i.e., the classical task), and, on the other hand, that the proposed methodology outperforms previous work on early recognition of sexual predators by a considerable margin, while obtaining state-of-the-art performance in aggressive text identification. To the best of our knowledge, these are the best results reported on early recognition for the approached problems. We foresee this work will pave the way for the development of novel methodologies for the problem and will motivate further research from the intelligent systems and text mining communities.

Introduction

Social media is perhaps the most used communication channel nowadays: anyone can express their opinion about any topic in any context (Kuz, Falco, & Giandini, 2016). Despite this ease of communication, social media and, more generally, e-communication media pose a major threat to users, who are exposed to a number of risks and potential attacks. Consider, for example, the problem of detecting sexual predators approaching minors or the identification of aggressive users. These threats pose a challenge to the research community, which has to develop protective and preventive tools for avoiding potential risks.

A considerable amount of research has been devoted to detecting these threats. However, current solutions work in a forensic scenario, i.e., they are applied once the attack has been accomplished. Although these solutions can be useful in certain contexts, preventive mechanisms would have a greater and more immediate impact on user security.

With this preventive scenario in mind, this paper proposes a novel and effective methodology to detect potential attacks as early as possible, that is, while the communication is still taking place. A difficulty that arises with early recognition tasks is information scarcity, since only partial information is available to detect the attack before it is consummated. To face this problem, the proposed approach uses profile and subprofile-based representations. Under these representations, each term (e.g., a word) is associated with a vector that accounts for its semantics, and a document is represented by aggregating the vectors of the terms it contains. As a result, documents and terms lie in the same semantic space. Even when only a few terms are available, these representations can still be obtained, a convenient property that makes them suitable for early text classification. These representations, in addition, have the advantage of being non-sparse, low-dimensional, and highly discriminative. This paper shows the benefits of using these representations to recognize the category of a document before it is available in its entirety. Specifically, the problems of early recognition of sexual predators and of aggressive text are approached. An extensive experimental evaluation reveals that the proposed methodology obtains state-of-the-art performance in the aforementioned tasks, while requiring a minimal amount of information from documents to make a decision. We foresee this work will pave the way for the development of novel methodologies for the problem, and will motivate further research from the intelligent systems and text mining communities.
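The idea of term vectors aggregated into a document vector can be illustrated with a minimal sketch. The weighting below (log-scaled per-class occurrence counts, normalized per term, then averaged over the terms read so far) is an assumption chosen for brevity and is not necessarily the weighting used in the paper; the function names `build_term_profiles` and `represent` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
import math


def build_term_profiles(docs, labels):
    """Map each term to a vector with one weight per class (assumed weighting)."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # term -> class -> occurrence count
    for tokens, label in zip(docs, labels):
        for term in tokens:
            counts[term][label] += 1
    profiles = {}
    for term, per_class in counts.items():
        # Log-scaled counts, normalized so that each term vector sums to 1.
        raw = [math.log2(1 + per_class[c]) for c in classes]
        total = sum(raw)
        profiles[term] = [w / total for w in raw]
    return classes, profiles


def represent(tokens, profiles, n_classes):
    """Average the vectors of the known terms; works for any document prefix."""
    vectors = [profiles[t] for t in tokens if t in profiles]
    if not vectors:
        return [1.0 / n_classes] * n_classes  # uninformative fallback
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy training data (hypothetical), tokenized into terms.
train_docs = [
    ["you", "are", "an", "idiot"],
    ["shut", "up", "you", "idiot"],
    ["thank", "you", "for", "the", "help"],
    ["see", "you", "at", "the", "meeting"],
]
train_labels = ["aggressive", "aggressive", "neutral", "neutral"]

classes, profiles = build_term_profiles(train_docs, train_labels)

# Only the first two terms of a new document have been read so far.
partial = ["you", "idiot"]
doc_vector = represent(partial, profiles, len(classes))
predicted = classes[doc_vector.index(max(doc_vector))]
```

The document vector has one dimension per class regardless of vocabulary size, which reflects the low-dimensional, non-sparse property described above, and the same `represent` call applies unchanged to a two-term prefix or a full document.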

The contributions of this paper are as follows:

  • The use and performance evaluation of profile and subprofile-based representations for the problems of sexual predator detection and aggressive text identification. This is the first time that these representations are employed for these problems using full documents. We show that state-of-the-art performance can be obtained on the considered data sets, with the additional benefits of working with low-dimensional and non-sparse representations.

  • The use, adaptation, and suitability evaluation of profile and subprofile-based representations for the problem of early text classification. We show how they can naturally be used to represent documents containing partial information. This is the first time this feature has been noticed and exploited. More importantly, early recognition results outperform existing work in the sexual predator detection task by a large margin, while achieving comparable performance in the aggressive text detection problem.

  • A comprehensive and extensive literature review on the automated detection of sexual predators and aggressive text in digital documents.

The rest of this paper is organized as follows. The next section provides a review of related work on automated deception detection in social media and early text classification. Section 3 describes the profile-based representations and how they are used for early recognition. Section 4 describes the experimental settings and the evaluation protocol. Section 5 reports experimental results and their analysis. Finally, Section 6 summarizes our main findings and outlines future work directions.

Section snippets

Related work

With the continued growth and use of the Internet as a tool for communication worldwide, more and more people are enjoying, and becoming dependent on, the convenience of the services it provides. Unfortunately, the wide use of computers and mobile devices in conjunction with the Internet has also been convenient for cyber-attackers. Nowadays, there are many types of attacks that an Internet user has to face: computer viruses, flaws in the operating system (open backdoors), phishing, fraud activities,

Profile-based representations for early recognition

This paper proposes the use of profile-based representations (PBRs) for early text classification of deception. PBRs fall within the category of second order representations (Li, Xiong, Zhang, Liu, & Li, 2011; López-Monroy, Montes-y-Gómez, Escalante, Villasenor-Pineda, & Stamatatos, 2015), which aim to extract/learn concepts (i.e. artificial dimensions capturing word usage patterns) from simple co-occurrence statistics (see e.g., Kolesnikova, 2016). PBRs capture discriminative information in a very low

Experimental settings

The performed experiments employ the three data sets described in Table 3. As explained above, two tasks were approached: sexual predator detection and aggressive text identification. For the former task we used the only publicly available data set for sexual predator detection (Inches & Crestani, 2012). This data set was released in the context of the sexual predator identification task at PAN-CLEF’12 and comprises a large number of chat conversations that include real sexual predators. Thus,

Experiments and results

This section presents experimental results that evaluate the performance of the proposed methods in the three data sets described in Section 4. Section 5.1 reports recognition performance using full-length documents. Section 5.2 reports performance using increasing partial information. Finally, Section 5.3 highlights the main findings and their impact.
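Evaluation with increasing partial information can be sketched as follows: each test document is truncated to a growing fraction of its terms, classified, and scored. This is only an illustrative sketch; the fractions, the choice of F1 on a positive class as the metric, and the keyword classifier `toy_classify` (a stand-in for a trained model) are assumptions and do not reproduce the paper's protocol or data.

```python
import math


def prefix(tokens, fraction):
    """Return the first `fraction` of a document's terms (at least one term)."""
    n_terms = max(1, math.ceil(len(tokens) * fraction))
    return tokens[:n_terms]


def f1_score(gold, predicted, positive):
    """F1 of the positive class, computed from true/false positives and negatives."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def early_curve(classify, docs, gold, fractions, positive):
    """Score the classifier on document prefixes of increasing length."""
    curve = {}
    for fraction in fractions:
        predicted = [classify(prefix(tokens, fraction)) for tokens in docs]
        curve[fraction] = f1_score(gold, predicted, positive)
    return curve


def toy_classify(tokens):
    """Hypothetical keyword rule standing in for a trained classifier."""
    return "aggressive" if "idiot" in tokens else "neutral"


# Toy test data (hypothetical), tokenized into terms.
test_docs = [
    ["you", "are", "such", "an", "idiot"],
    ["idiot", "stop", "talking", "now"],
    ["thanks", "for", "your", "help", "today"],
    ["see", "you", "at", "the", "meeting"],
]
test_gold = ["aggressive", "aggressive", "neutral", "neutral"]

curve = early_curve(toy_classify, test_docs, test_gold,
                    [0.25, 0.5, 1.0], "aggressive")
```

Plotting `curve` against the fraction of terms read gives the kind of performance-versus-information curve that an early recognition study compares across methods: a method suited to early recognition reaches high scores at small fractions.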

Conclusions

This paper proposed the use of profile-based representations for early recognition of deception and aggressiveness in written documents. Profile-based representations use class term-occurrence information to derive a non-sparse, low dimensional, discriminative representation for documents, where profiles can be further divided into subprofiles or subclasses. Because these representations can be estimated even when a single term is available, they are well suited to address problems where the

Acknowledgments

This work was supported by CONACYT under project grants CB-2014-241306, PDCPN2014-01-247870 and CB-2015-1-258588. The work was also supported by Red Temática CONACYT en Tecnologías del Lenguaje (projects 260178 and 271622). The authors are grateful to Juan David Carrillo for collecting and preparing the UANL data set.

References (62)

  • E. Villatoro-Tello et al. A two-step approach for effective detection of misbehaving users in chats. Proceedings of working notes of CLEF and CEUR workshop (2012)

  • M. Álvarez-Carmona et al. INAOE's participation at PAN '15: Author profiling task. Proceedings of CEUR workshop, working notes of CLEF 2015 – Conference and labs of the evaluation forum (2015)

  • M. Ashcroft et al. A step towards detecting online grooming: identifying adults pretending to be children. Proceedings of the 2015 European intelligence and security informatics conference (2015)

  • O. Bchir et al. Verbal offense detection in social network comments using novel fusion approach. AI Communications (2015)

  • M.M. Ben Ismail et al. Insult detection in social network comments using possibilistic based fusion approach

  • P.J. Black et al. A linguistic analysis of grooming strategies of online child sex offenders: Implications for our understanding of predatory sexual behavior in an increasingly computer-mediated world. Child Abuse & Neglect (2015)

  • D. Bogdanova et al. Exploring high-level features for detecting cyberpedophilia. Computer Speech & Language (2014)

  • A.E. Cano et al. Detecting child grooming behaviour patterns on social media

  • C. Cardei et al. Detecting sexual predators in chats using behavioral features and imbalanced learning. Natural Language Engineering (2017)

  • Y. Chen et al. Detecting offensive language in social media to protect adolescent online safety. Proceedings of international conference on privacy, security, risk and trust and 2012 international conference on social computing (2012)

  • Y.-G. Cheong et al. Detecting predatory behavior in game chats. IEEE Transactions on Computational Intelligence and AI in Games (2015)

  • W. Cohen. Learning rules that classify e-mail. Proceedings AAAI spring symposium on machine learning in information access (1996)

  • M. Dadvar et al. Improving cyberbullying detection with user context

  • B. Desmet et al. Suicide emotion detection in suicide notes. Expert Systems with Applications (2013)

  • K. Dinakar et al. Common sense reasoning for detection, prevention, and mitigation of cyberbullying. ACM Transactions on Interactive Intelligent Systems (2012)

  • K. Dinakar et al. Modeling the detection of textual cyberbullying. The Social Mobile Web (2011)

  • G. Dulac-Arnold et al. Text classification: A sequential reading approach

  • M. Ebrahimi et al. Detecting predatory conversations in social media by deep convolutional neural networks. Digital Investigation (2016)

  • H.J. Escalante et al. Early text classification: A naive solution. Proceedings of NAACL-HLT 2016, 7th workshop on computational approaches to subjectivity, sentiment and social media analysis (2016)

  • H.J. Escalante et al. Sexual predator detection in chats with chained classifiers. Proceedings of NAACL-HLT 2013, 4th workshop on computational approaches to subjectivity, sentiment and social media analysis (2013)

  • R. Feldman. Techniques and applications for sentiment analysis. Communications of the ACM (2013)