Proppy: Organizing the news based on their propagandistic content

https://doi.org/10.1016/j.ipm.2019.03.005

Abstract

Propaganda is a mechanism for influencing public opinion, and it is inherently present in extremely biased and fake news. Here, we propose a model to automatically assess the level of propagandistic content in an article based on different representations, from writing style and readability level to the presence of certain keywords. We experiment thoroughly with different variations of such a model on a new publicly available corpus, and we show that character n-grams and other style features outperform existing alternatives based on word n-grams at identifying propaganda. Unlike previous work, we make sure that the test data comes from news sources that were unseen during training, thus penalizing learning algorithms that model the news sources used at training time rather than solving the actual task. We integrate our supervised model into a public website, which organizes recent articles covering the same event according to their propagandistic content. This allows users to quickly explore different perspectives on the same story, and it also enables investigative journalists to dig further into how different media use stories and propaganda to pursue their agenda.

Introduction

The landscape of news outlets is wide: from supposedly neutral to clearly biased. When reading a news article, every reader should be aware that, at least to some extent, it inevitably reflects the bias of both the author and the news outlet in which the article is published. However, it is difficult to identify exactly what that bias is. It could be that the author herself is not conscious of her own bias. Or it could be that the article is part of the author’s agenda to persuade readers about something on a specific topic. The latter situation represents propaganda. According to the now-classical work of the Institute for Propaganda Analysis (1938), propaganda can be defined as follows:

Definition 1

Propaganda is expression of opinion or action by individuals or groups deliberately designed to influence opinions or actions of other individuals or groups with reference to predetermined ends.

Propaganda is most effective when it goes unnoticed. That is, if a person reads a journalistic text, whether in a formal or an informal news outlet (e.g., a blog or social media), she should not be able to identify it as propagandistic. In that case, the reader is exposed to the propagandistic content without her knowledge, and some of her opinions might change as a result. A striking example of the use of propaganda was allegedly put in place to influence the 2016 US Presidential elections (Mueller, 2018). Given the wide landscape of news outlets (from tabloids to broadsheets, from printed to digital, from objective to biased), we believe that both news consumers and institutions might benefit from an automatic tool that can detect propagandistic articles.

Here we propose proppy, a system to organize news events according to the level of propagandistic content in the articles covering them. Proppy is a full architecture (cf. Fig. 3) that takes a batch of news articles as input, identifies the events they cover, and organizes the articles covering each event according to their level of propaganda. Our major contribution, and the focus of this manuscript, is a supervised model to compute what we refer to as the propaganda score: the estimated likelihood that a text document contains propagandistic devices deliberately designed to influence the reader’s opinion.

Proppy computes the propaganda score using a maximum entropy classifier. We chose this classifier in order to facilitate direct comparison with previous work (Rashkin, Choi, Jang, Volkova, & Choi, 2017) and to focus our efforts on improving the representation of the data in terms of features. Rashkin et al. (2017) used word n-grams but, as the authors themselves pointed out, this yielded a significant drop in performance when testing on articles from sources that were not seen during training. Here we aim to shed some light on why this could be the case. Therefore, we formulate the following hypothesis:

Hypothesis 1 (H1)

Representations based on writing style and readability can generalize better than currently-used approaches based on word-level representations.

We argue that this is because word-level representations tend to model the topic and the source, rather than whether the target article is propagandistic. In order to test the above hypothesis, we first replicated a pre-existing model for propaganda detection (Rashkin et al., 2017). We then compiled a new corpus, QProp, which, unlike most pre-existing corpora, keeps explicit information about the source of each article, thus allowing us to train on articles from some sources and to test on articles from different sources that were not used for training. We designed experiments that involve training and evaluating several supervised models using features based on text readability and style; such features have been widely used in authorship attribution tasks (Stamatatos, 2009). In our thorough experimentation, we obtain statistically significant improvements over existing approaches in terms of classification performance, especially when testing on articles from unseen sources.
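
To make the contrast behind H1 concrete, the sketch below extracts both kinds of representation: character n-grams as a style feature and a handful of readability scores. It is a minimal illustration assuming scikit-learn and the textstat package, not the code used in our experiments.

```python
# Illustrative sketch (not the released experimental code) of the two
# representation families contrasted by H1: character n-grams as a
# style feature and off-the-shelf readability scores. The choice of
# scikit-learn and textstat, and the n-gram range, are assumptions.
import numpy as np
import textstat
from sklearn.feature_extraction.text import TfidfVectorizer

def char_ngram_features(texts, n_min=3, n_max=5):
    """TF-IDF over character n-grams, a standard style representation."""
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(n_min, n_max))
    return vectorizer.fit_transform(texts), vectorizer

def readability_features(texts):
    """A few readability scores per document, as dense feature vectors."""
    return np.array([
        [
            textstat.flesch_reading_ease(t),   # higher = easier to read
            textstat.flesch_kincaid_grade(t),  # US school-grade level
            textstat.gunning_fog(t),           # Gunning (1968) fog index
        ]
        for t in texts
    ])
```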

Our contributions can be summarized as follows:

  1. We experiment with different families of feature representations (some of them used for the first time for this task) spanning readability, vocabulary richness, and style in a propaganda estimation model, and we demonstrate empirically that they actually detect propaganda, as opposed to learning the article’s source or its topic, as is the case in most previous work.

  2. We release a new dataset of 51k full-text news articles, together with the source code of our implementation. Unlike previous datasets, for each article we provide metadata, including the source and whether it is considered propagandistic.

  3. We release a webapp that allows users to explore the coverage of current news events on the basis of their propagandistic content.

The remainder of this article is organized as follows. Section 2 offers a soft introduction to propaganda. Section 3 presents related work on (automatic) propaganda identification and authorship-derived representations. Section 4 introduces our propaganda detection model. Section 5 presents the datasets we experiment with, including our new dataset. Section 6 covers our experiments and discusses the results. Section 7 describes the full architecture of proppy —as running on the Web— which includes retrieving the articles, grouping them into events, computing their propaganda score, and displaying the results. Finally, Section 8 concludes and points to possible directions for future work.

Background

The term propaganda was coined in the 17th century, meaning propagation of the Catholic faith (Jowett and O’Donnell, 2012, p. 2). The term soon took on a pejorative connotation, as it was intended not only to spread the faith in the New World, but also to oppose Protestantism; i.e., it was not neutral. Here, we are interested in propaganda from a journalistic point of view: how news management lacking neutrality purposefully shapes information by emphasizing positive or negative aspects (Jowett and O’Donnell, 2012).

Related work

Recently, there has been a lot of interest in studying disinformation and bias in the news and in social media. This includes challenging the truthfulness of news (Brill, 2001, Finberg, Stone, Lynch, 2002, Hardalov, Koychev, Nakov, 2016, Potthast, Kiesel, Reinartz, Bevendorff, Stein, 2018), of news sources (Baly, Karadzhov, Alexandrov, Glass, & Nakov, 2018), and of social media posts (Canini, Suh, Pirolli, 2011, Castillo, Mendoza, Poblete, 2011, Zubiaga, Liakata, Procter, Wong Sak Hoi, Tolmie, 2016).

Representations

We use a maximum entropy classifier with L2 regularization and default parameters to discriminate propagandistic from non-propagandistic articles. This is the same classifier as the one used by Rashkin et al. (2017), and we chose it in order to facilitate direct comparison with their work. We consider four families of features, which we describe below.
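
As an illustration of this setup, the pipeline below uses scikit-learn’s LogisticRegression, which implements a maximum entropy model with L2 regularization, paired with character n-gram features as one example feature family; the toy data and exact configuration are ours, not the experimental code.

```python
# Minimal sketch of the classification setup: a maximum entropy
# classifier (logistic regression) with L2 regularization and default
# parameters. Toy data and the pipeline configuration are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["An example propagandistic article ...",
               "An example non-propagandistic article ..."]
train_labels = [1, 0]  # 1 = propaganda, 0 = non-propaganda

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 5)),
    LogisticRegression(penalty="l2"),  # maxent with L2, default settings
)
clf.fit(train_texts, train_labels)

# The posterior for the positive class can serve as a propaganda score
# in [0, 1] for an unseen article.
score = clf.predict_proba(["A new article to assess ..."])[0, 1]
```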

Corpora

We use two corpora in our experiments. In Section 5.1, we introduce the corpus created by Rashkin et al. (2017), while in Section 5.2, we present QProp, our new corpus, which is tailored for the kind of analysis we want to do.
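
To illustrate what per-article source metadata enables, the following sketch performs a source-disjoint train/test split with scikit-learn’s GroupShuffleSplit; the record fields here are hypothetical, not the released corpus schema.

```python
# Sketch of a source-disjoint split of the kind QProp makes possible:
# all articles from a given outlet land entirely in train or entirely
# in test, so a model cannot score well by memorizing outlets.
from sklearn.model_selection import GroupShuffleSplit

articles = [
    {"text": "first article body", "source": "outlet-a.example", "label": 1},
    {"text": "second article body", "source": "outlet-b.example", "label": 0},
    {"text": "third article body", "source": "outlet-a.example", "label": 1},
]
texts = [a["text"] for a in articles]
labels = [a["label"] for a in articles]
sources = [a["source"] for a in articles]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(splitter.split(texts, labels, groups=sources))
# No source appearing in train_idx ever appears in test_idx.
```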

Experiments and evaluation

We designed three experiments to verify hypothesis H1. The first one compares our features against those used by Rashkin et al. (2017); for this, we experimented with a 4-way classifier: trusted vs. propaganda vs. hoax vs. satire. The second experiment focuses on our main 2-way classification task: propaganda vs. non-propaganda. We perform this experiment on both the TSHP-17 and the QProp corpora. The third experiment addresses the sizable drop in performance that we observe when testing on news coming from sources never seen during training.
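
As a toy illustration of how the 4-way setting can be scored, the snippet below computes per-class and macro-averaged F1 on fabricated labels; the exact evaluation protocol of our experiments may differ.

```python
# Toy scoring of the 4-way setting (trusted / propaganda / hoax /
# satire). Gold and predicted labels are fabricated for illustration.
from sklearn.metrics import classification_report, f1_score

gold = ["trusted", "propaganda", "hoax", "satire", "propaganda"]
pred = ["trusted", "propaganda", "satire", "satire", "trusted"]

print(classification_report(gold, pred, zero_division=0))
print("macro-F1:", f1_score(gold, pred, average="macro"))
```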

Prototype architecture

We further developed a prototype to demonstrate our propaganda identification model in action. Fig. 3 shows an overview of its architecture. The process begins when a batch of news articles is fed to the system. We rely on GDELT to retrieve articles, as for the construction of the QProp corpus (cf. Section 5.2), but this time live: we process articles from 56 sources every 24 h. The first module is in charge of identifying events in the incoming batch of articles.
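
As a rough illustration of this event-identification step, the sketch below groups a small batch of articles with density-based clustering (DBSCAN; Ester et al., 1996) over TF-IDF vectors; the parameters, and the choice of DBSCAN itself, are illustrative assumptions rather than the prototype’s exact implementation.

```python
# One plausible way to group a daily batch of articles into events:
# DBSCAN over TF-IDF vectors with cosine distance. Parameters are
# guesses for illustration, not the prototype's actual configuration.
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Parliament passes the new budget after a long debate.",
    "Lawmakers approve budget following lengthy parliamentary debate.",
    "Local team wins the championship final in extra time.",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(X)
# labels[i] is the event cluster of article i; -1 marks an article
# that was not grouped with any other (a singleton "event").
```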

Conclusion and future work

We performed a thorough experimental study of propaganda detection at the news article level. Our experimental results show that representations modeling writing style and text complexity are more effective than word n-grams, which model topics. Our comparison against existing models corroborates this hypothesis: models that consider stylistic features, such as character n-grams, always outperform alternative representations of the kind typically used in topic-related tasks. Different from previous work, we made sure that the test data comes from news sources unseen during training, thus penalizing models that learn the sources rather than solving the actual task.

References (58)

  • J.M. Sproule (2001). Authorship and origins of the seven propaganda devices: A research note. Rhetoric & Public Affairs.
  • M.L. Ba et al. (2016). VERA: A platform for veracity estimation over web data. Proceedings of the 25th International Conference Companion on World Wide Web.
  • R. Baeza-Yates (2018). Bias on the web. Communications of the ACM.
  • R. Baly et al. (2018). Predicting factuality of reporting and bias of news media sources. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • A. Barrón-Cedeño et al. (2018). Qlusty: Quick and dirty generation of event videos from written media coverage. Proceedings of the Second International Workshop on Recent Trends in News Information Retrieval.
  • C. Bazerman (2010). The informed writer: Using sources in the disciplines.
  • S. Bird et al. (2009). Natural language processing with Python.
  • A.M. Brill (2001). Online journalists embrace new marketing function. Newspaper Research Journal.
  • K.R. Canini et al. (2011). Finding credible information sources in social networks based on content and social structure. Proceedings of the IEEE International Conference on Privacy, Security, Risk, and Trust, and the IEEE International Conference on Social Computing.
  • C. Castillo et al. (2011). Information credibility on Twitter. Proceedings of the 20th International Conference on World Wide Web.
  • C. Chen et al. (2013). Battling the Internet Water Army: Detection of hidden paid posters. Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining.
  • H. Conserva (2003). Propaganda techniques.
  • N. Cristianini et al. (2000). An introduction to support vector machines and other kernel-based learning methods.
  • W. Davies (2016). The age of post-truth politics.
  • L. Derczynski et al. (2017). SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. Proceedings of the 11th International Workshop on Semantic Evaluation.
  • J. Ellul (1965). Propaganda: The formation of men’s attitudes.
  • M. Ester et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
  • H. Finberg et al. (2002). Digital journalism credibility study.
  • J. Graham et al. (2009). Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology.
  • R. Gunning (1968). The technique of clear writing.
  • M. Hardalov et al. (2016). In search of credible news. Proceedings of the 17th International Conference on Artificial Intelligence: Methodology, Systems, and Applications.
  • A. Honore (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin.
  • J. Hooper (1975). On assertive predicates.
  • B. Horne et al. (2018). Sampling the news producers: A large news and feature data set for the study of the complex media landscape. Proceedings of the 12th International AAAI Conference on Web and Social Media.
  • B.D. Horne et al. (2017). This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. Proceedings of the International Workshop on News and Public Opinion at ICWSM.
  • K. Hyland (2015). The international encyclopedia of language and social interaction.
  • Institute for Propaganda Analysis (1938). How to Detect Propaganda.
  • P. Jaccard (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles.
  • N. Japkowicz et al. (2011). Evaluating learning algorithms: A classification perspective.