Proppy: Organizing the news based on their propagandistic content
Introduction
The landscape of news outlets is wide: from supposedly neutral to clearly biased. When reading a news article, every reader should be aware that, at least to some extent, it inevitably reflects the bias of both the author and the news outlet where the article is published. However, it is difficult to identify exactly what that bias is. The author herself may not be conscious of her own bias, or the article may be part of the author's agenda to persuade readers about something on a specific topic. The latter situation represents propaganda. According to the now-classical work by the Institute for Propaganda Analysis (1938), propaganda can be defined as follows: Definition 1. Propaganda is expression of opinion or action by individuals or groups deliberately designed to influence opinions or actions of other individuals or groups with reference to predetermined ends.
Propaganda is most effective when it goes unnoticed. That is, if a person reads a journalistic text in a formal or an informal news outlet (e.g., in a blog or in social media), she should not be able to identify it as propagandistic. In that case, the reader is exposed to the propagandistic content without her knowledge, and some of her opinions might change as a result. A striking example is the propaganda campaign allegedly put in place to influence the 2016 US Presidential elections (Mueller, 2018). Given the wide landscape of news outlets (from tabloids to broadsheets, from printed to digital, from objective to biased), we believe that both news consumers and institutions might benefit from an automatic tool that can detect propagandistic articles.
Here we propose proppy, a system to organize news events according to the level of propagandistic content in the articles covering them. Proppy is a full architecture (cf. Fig. 3) that takes a batch of news articles as input, identifies the covered events, and organizes the articles covering each event according to their level of propaganda. Our major contribution, and the focus of this manuscript, is a supervised model to compute what we refer to as propaganda score: the estimated likelihood that a text document contains propagandistic mechanisms to deliberately influence the reader's opinion.
Proppy computes a propaganda score using a maximum entropy classifier. We chose this classifier in order to facilitate direct comparison with previous work (Rashkin, Choi, Jang, Volkova, & Choi, 2017) and to focus our efforts on improving the representation of the data in terms of features. Rashkin et al. (2017) used word n-grams but, as the authors themselves pointed out, this yielded a significant drop in performance when testing on articles from sources not seen during training. Here we aim to shed light on why this could be the case. Therefore, we formulate the following hypothesis: Hypothesis 1 (H1). Representations based on writing style and readability can generalize better than currently-used approaches based on word-level representations.
We argue that this is because word-level representations tend to capture topic and source, rather than whether the target article is propagandistic. In order to test the above hypothesis, we first replicated a pre-existing model for propaganda detection (Rashkin et al., 2017).2 We then compiled a new corpus, QProp, which, unlike most pre-existing corpora, keeps explicit information about the source of each article, thus allowing us to train on articles from some sources and to test on articles from different sources not used for training. We designed experiments that involve training and evaluating several supervised models using features based on text readability and style; such features have been widely used in authorship attribution tasks (Stamatatos, 2009). In our thorough experimentation, we obtain statistically significant improvements over existing approaches in terms of classification performance, especially when testing on articles from unseen sources.
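To make the notion of style and readability features concrete, the sketch below computes the Flesch reading-ease score and character n-gram counts in pure Python. The syllable counter is a rough vowel-group heuristic for illustration only; the paper does not specify its exact feature extraction code, so treat this as an assumption-laden sketch rather than the authors' implementation.

```python
import re
from collections import Counter

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups (an approximation, not a dictionary lookup).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Flesch reading ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    # Higher scores mean easier-to-read text.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * len(words) / len(sentences) - 84.6 * syllables / len(words)

def char_ngrams(text: str, n: int = 3) -> Counter:
    # Character n-grams capture stylistic regularities (function words,
    # affixes, punctuation habits) largely independently of topic.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

Unlike word n-grams, neither feature family names topical vocabulary directly, which is what lets them transfer across sources and topics.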
Our contributions can be summarized as follows:
- 1. We experiment with different families of feature representations (some used for the first time for this task), spanning readability, vocabulary richness, and style, in an effective propaganda estimation model, and we demonstrate empirically that they actually detect propaganda, rather than learning the article's source or topic, as is the case in most previous work.
- 2. We release a new dataset of 51k full-text news articles,3 together with the source code of our implementation. Unlike previous datasets, for each article we provide metadata including its source and whether it is considered propagandistic.
- 3. We release a webapp that allows users to explore the coverage of current news events on the basis of their propagandistic content.4
The remainder of this article is organized as follows. Section 2 offers a soft introduction to propaganda. Section 3 presents related work on (automatic) propaganda identification and authorship-derived representations. Section 4 introduces our propaganda detection model. Section 5 presents the datasets we experiment with, including our new dataset. Section 6 covers our experiments and discusses the results. Section 7 describes the full architecture of proppy —as running on the Web— which includes retrieving the articles, grouping them into events, computing their propaganda score, and displaying the results. Finally, Section 8 concludes and points to possible directions for future work.
Background
The term propaganda was coined in the 17th century, meaning propagation of the Catholic faith (Jowett and O'Donnell, 2012, p. 2). The term soon acquired a pejorative connotation, as it was intended not only to spread the faith in the New World, but also to oppose Protestantism; i.e., it was not neutral. Here, we are interested in propaganda from a journalistic point of view: how news management lacking neutrality purposefully shapes information by emphasizing positive or negative aspects (Jowett and O'Donnell, 2012).
Related work
Recently, there has been a lot of interest in studying disinformation and bias in the news and in social media. This includes challenging the truthiness of news (Brill, 2001; Finberg, Stone, & Lynch, 2002; Hardalov, Koychev, & Nakov, 2016; Potthast, Kiesel, Reinartz, Bevendorff, & Stein, 2018), of news sources (Baly, Karadzhov, Alexandrov, Glass, & Nakov, 2018), and of social media posts (Canini, Suh, & Pirolli, 2011; Castillo, Mendoza, & Poblete, 2011; Zubiaga, Liakata, Procter, Wong Sak Hoi, & Tolmie, 2016).
Representations
We use a maximum entropy classifier with L2 regularization and default parameters to discriminate propagandistic from non-propagandistic articles. This is the same classifier as the one used by Rashkin et al. (2017), and we chose it in order to facilitate direct comparison with their work. We consider four families of features, which we describe below.
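As a concrete illustration of the classification setup, the sketch below trains a binary maximum entropy (logistic regression) classifier with an L2 penalty by batch gradient descent, in pure Python. In practice one would use an off-the-shelf implementation (e.g., scikit-learn); the hyperparameters and the from-scratch training loop here are illustrative assumptions, not the authors' code.

```python
import math

def train_maxent(X, y, lr=0.5, l2=0.01, epochs=200):
    """Binary maximum-entropy (logistic regression) classifier with an L2
    penalty, trained by batch gradient descent.
    X: list of feature vectors; y: list of 0/1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        gw = [l2 * wi for wi in w]  # gradient of the L2 penalty term
        gb = 0.0
        for xi, yi in zip(X, y):
            z = sum(wi * xij for wi, xij in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # sigmoid(z) - label
            gw = [g + err * xij for g, xij in zip(gw, xi)]
            gb += err
        w = [wi - lr * gi / len(X) for wi, gi in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def predict(w, b, x):
    # Returns a probability in [0, 1]: interpretable as a propaganda score.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))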
Corpora
We use two corpora in our experiments. In Section 5.1, we introduce the corpus created by Rashkin et al. (2017), while in Section 5.2, we present QProp, our new corpus, which is tailored for the kind of analysis we want to do.14
Experiments and evaluation
We designed three experiments to verify hypothesis H1. The first aims at comparing our features with those used by Rashkin et al. (2017); thus, we experimented with a 4-way classifier: trusted vs. propaganda vs. hoax vs. satire. The second experiment focuses on our main 2-way classification task: propaganda vs. non-propaganda. We perform this experiment on both the TSHP-17 and the QProp corpora. The third experiment investigates the sizable drop in performance we observe when testing on news coming from sources never seen during training.
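The key ingredient of the cross-source evaluation is a split in which no source contributes articles to both sides. A minimal sketch follows; the `source` key in the article records is a hypothetical schema, not the actual QProp field names.

```python
import random

def source_disjoint_split(articles, test_frac=0.3, seed=0):
    """Split articles so that no source appears in both training and test.
    Each article is a dict with at least a 'source' key (hypothetical schema)."""
    sources = sorted({a["source"] for a in articles})
    rng = random.Random(seed)
    rng.shuffle(sources)
    n_test = max(1, int(len(sources) * test_frac))
    test_sources = set(sources[:n_test])
    train = [a for a in articles if a["source"] not in test_sources]
    test = [a for a in articles if a["source"] in test_sources]
    return train, test
```

Holding out whole sources, rather than random articles, is what exposes whether a model has merely memorized source-specific vocabulary.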
Prototype architecture
We further developed a prototype to demonstrate our propaganda identification model in action.22 Fig. 3 shows an overview of its architecture. The process begins when a batch of news articles is fed to the system. We rely on GDELT to retrieve articles, as for the construction of the QProp corpus (cf. Section 5.2), but this time live: we process articles from 56 sources every 24 hours. The first module is in charge of identifying events in the incoming stream of articles.
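The event-identification step groups articles that cover the same story. As a simplified stand-in for the clustering used in the actual system, the sketch below greedily groups article titles by bag-of-words cosine similarity; the threshold and the greedy single-pass strategy are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def group_into_events(titles, threshold=0.5):
    """Greedy single-pass clustering: attach each article to the first event
    whose centroid is similar enough, else start a new event."""
    events = []  # list of (centroid Counter, member indices)
    for i, title in enumerate(titles):
        vec = Counter(title.lower().split())
        for centroid, members in events:
            if cosine(vec, centroid) >= threshold:
                centroid.update(vec)  # grow the event centroid
                members.append(i)
                break
        else:
            events.append((vec, [i]))
    return [members for _, members in events]
```

Once articles are grouped into events, each article is scored by the classifier and the event's coverage can be ranked by propaganda score.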
Conclusion and future work
We performed a thorough experimental study of propaganda detection at the news article level. Our experimental results show that representations modeling writing style and text complexity are more effective than word n-grams, which model topics. Our comparison against existing models corroborates this hypothesis: models that consider stylistic features, such as character n-grams, always outperform alternative representations typically used in topic-related tasks. Unlike word-level models, these representations generalize to articles from sources unseen during training.
References
- Authorship and origins of the seven propaganda devices: A research note. Rhetoric & Public Affairs (2001)
- VERA: A platform for veracity estimation over web data. Proceedings of the 25th International Conference Companion on World Wide Web (2016)
- Bias on the web. Communications of the ACM (2018)
- Predicting factuality of reporting and bias of news media sources. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018)
- Qlusty: Quick and dirty generation of event videos from written media coverage. Proceedings of the Second International Workshop on Recent Trends in News Information Retrieval (2018)
- The informed writer: Using sources in the disciplines (2010)
- Natural language processing with Python (2009)
- Online journalists embrace new marketing function. Newspaper Research Journal (2001)
- Finding credible information sources in social networks based on content and social structure. Proceedings of the IEEE International Conference on Privacy, Security, Risk, and Trust, and the IEEE International Conference on Social Computing (2011)
- Information credibility on Twitter. Proceedings of the 20th International Conference on World Wide Web (2011)
- Battling the Internet Water Army: Detection of hidden paid posters. Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
- Propaganda techniques
- An introduction to support vector machines and other kernel-based learning methods
- The age of post-truth politics
- SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. Proceedings of the 11th International Workshop on Semantic Evaluation
- Propaganda: The formation of men's attitudes
- A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining
- Digital journalism credibility study
- Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology
- The technique of clear writing
- In search of credible news. Proceedings of the 17th International Conference on Artificial Intelligence: Methodology, Systems, and Applications
- Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin
- On assertive predicates
- Sampling the news producers: A large news and feature data set for the study of the complex media landscape. Proceedings of the 12th International AAAI Conference on Web and Social Media
- This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. Proceedings of the International Workshop on News and Public Opinion at ICWSM
- The International Encyclopedia of Language and Social Interaction
- Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles
- Evaluating learning algorithms: A classification perspective
1. Work carried out mostly while at the Qatar Computing Research Institute, HBKU.