Introduction

Artificial intelligence (AI)-driven systems (Footnote 1) have long been recognised as crucial factors in shaping online political information environments worldwide [15]. Among other things, these systems are applied for automated information curation, a process of selecting and presenting content from a pool of data following a set of decision-making principles [60]. Ranging from search engines to recommender systems, curation mechanisms pose multiple challenges for society [60]: from affecting information flows [73] to determining individual exposure to propaganda content [45], recent studies have highlighted how curation mechanisms can be prone to problems (e.g. algorithmic bias [16, 19]) and misused in the context of political microtargeting [32], content personalisation on digital platforms [13], and the presence and mitigation of disruptive content [1].

The development of Large Language Models (LLMs), a form of AI technology capable of processing and generating textual content [58], signifies a new stage in the complex relationship between AI and political communication. Compared with earlier, non-generative forms of AI, such as search engines, LLMs have more advanced capacities for evaluating the semantic qualities of user input and of the content generated in response to it. While this technology can be used to generate fake [52] or unsafe content [75], or to facilitate censorship [74], LLMs also offer new possibilities for content analysis, including the detection of false and misleading information. This task has attracted growing scholarly interest (e.g. [38]), but its realisation remains challenging due to the difficulty of automatically evaluating information veracity (e.g. [3]).

Despite initially promising findings concerning the potential of LLMs to facilitate political communication research [38, 72], important gaps remain. Like search engines and platforms mediated by non-generative forms of AI, generative AI technology is largely non-transparent to its users [43]. This lack of transparency amplifies the risk of LLM-based tools contributing to unequal information exposure for individual users, for instance due to substantial variation in LLM performance depending on the language of the prompt (e.g. [29]). Understanding how various factors influence LLM performance in detecting information veracity is therefore of particular relevance for academic research and policymakers.

In this paper we therefore set out to comparatively analyse two popular LLM-based chatbots, ChatGPT and Bing Chat (recently renamed Microsoft Copilot), in their ability to evaluate the veracity of claims related to issues that are frequently targeted by disinformation and associated with conspiracy theories. Given that prior research suggests that the performance of different chatbots can vary substantially due to their specific settings, we examine how well these chatbots detect the accuracy of given statements using AI auditing methodology. This novel methodology originates from the field of algorithm auditing and comprises the systematic evaluation of the performance of AI systems in relation to a specific issue or domain [14]. We chose two chatbots built on different versions of the same family of LLMs (GPT-3.5 for ChatGPT and GPT-4 for Bing Chat), as our focus is specifically on the settings and guardrails of the two chatbots and not on the performance of the underlying LLMs themselves.

Political misinformation and LLMs

Tackling false information online

A vast body of research has been preoccupied with online information quality. Algorithmically mediated information environments can be particularly vulnerable to propagation of falsehoods due to algorithms potentially increasing the reach of false information and making it more targeted [32]. Furthermore, disruptive actors are increasingly integrating different forms of AI into their strategies of manipulation, both in authoritarian [28, 78] and democratic contexts [20].

A direct consequence of this is a growing number of attempts to manipulate public opinion in online environments. In some cases, these attempts build on existing misleading narratives and amplify them via digital media, for example, in the case of Holocaust denial [34]. In other instances, online platforms serve as a breeding ground for new false (and often conspiratorial) narratives, which became particularly alarming during the COVID-19 pandemic [2]. In both cases, however, the spread of different forms of false information raises numerous concerns due to its potential to amplify societal polarisation [8], promote hate speech [35], and undermine democratic decision-making processes. The latter concern is particularly pronounced due to the growing use of false information by authoritarian states, such as Russia or China, to interfere in the electoral processes in Western democracies [48].

There have been multiple proposals on managing the risks of misinformed societies. One suggestion for improving information quality is to counter false narratives, for example through inoculation and pre-bunking [47], which have shown promise in building resistance to misinformation [51]. Scholars have also highlighted consistent psychological factors that underpin susceptibility to false narratives, such as a lack of analytical thinking and numeracy skills, low trust in science, and reliance on intuition, showing the potential of accuracy prompts and digital literacy tips for combating misinformation [7, 61].

In algorithmically mediated information environments, tools have been developed to prevent the spread of false information through its automated identification and removal [1, 67]. These tools face a range of pitfalls, primarily stemming from the semantic complexity of false information, a phenomenon that encompasses a broad range of concepts which can be difficult to operationalise for automated content analysis. Although the concepts of misinformation and disinformation are multidimensional, relating simultaneously to the accuracy of content, its semantics, hidden meanings and interpretations, as well as the intentions of content sponsors [68], the majority of current work focuses either on the content or on the source of information. While one-dimensional conceptualisations can suffice when misinformation consists of factually incorrect information, they are hardly applicable to more nuanced cases, for instance those dealing with ontologically contested subjects (e.g. [36]).

In addition to the semantic complexity of the concept of false information, there are a number of other problems related to its automated detection. Firstly, the continuous emergence of new false narratives makes it difficult to identify them automatically and in time [4], particularly when using relatively simple approaches that rely on a small set of content cues. Another problem concerns the scaling of automated approaches for detecting false information, given the sheer amount of false content online [10]. Finally, the quality of the datasets used for training and evaluating automated veracity detection approaches often raises questions, particularly regarding the potential presence of biases (e.g. [18]).

LLMs and information veracity detection

The viral launch of ChatGPT, which reached an unprecedented 100 million users within two months of its release, has opened up discussions about both the risks and the new opportunities connected to generative AI. The rapid growth of LLM-based chatbots has been linked to the variety of their applications, spanning computer science [58], business and innovation [24], education [9], and healthcare [53]. It has also amplified concerns regarding unethical uses of the new technology and threats to privacy [50]. Other threats posed by LLMs concern the reiteration and amplification of different forms of bias, such as gender [44, 76] or political bias [49], and the use of LLMs for censoring information [74]. In the context of false information, LLMs can facilitate its spread online or even generate new types of misleading narratives [69]. Moreover, the LLMs powering chatbots are often based on “ungoverned information”, making it ever more difficult to ensure sustainable user engagement with them ([24], p. 14).

Several studies have attempted to measure the political bias of LLMs by prompting them with measures of political leaning commonly used in questionnaires. For instance, Rozado [63] examined 24 conversational LLMs using 11 political orientation tests and showed that most of the models gravitate towards the left side of the political spectrum. The study also found that this is not the case for the base models (those that did not undergo supervised fine-tuning and reinforcement learning), which do not express any coherent political stance yet can easily be fine-tuned to express one political leaning or another [63]. Similar results were obtained by Rutinowski et al. [65], who concluded that ChatGPT tends to demonstrate progressive rather than conservative views. Motoki et al. [56], comparing the default responses of ChatGPT to the Political Compass questionnaire, found that these responses are more closely aligned with the Democrats in the US, Lula in Brazil, and the Labour Party in the UK [56]. Other studies have shown that even slight changes in the prompt may affect the generated response [62]. Overall, these findings suggest that LLMs can show political bias, although it depends on the context and phrasing of the prompt. We assume this may also affect the way in which LLMs evaluate the veracity of different political statements.

On the other hand, LLMs could be a promising technology for mitigating the risks of false information due to their capacity for recognising patterns in data and, to a certain degree, evaluating semantic aspects of content [30]. Caramancion [21] examined the ability of ChatGPT 3.5 to assess the veracity of textual news accompanied by images on a small sample and found 100% accuracy of veracity detection. Larger-scale studies have also shown promising results. Caramancion [21] compared four popular chatbots (ChatGPT 3.5, ChatGPT 4.0, Bard/LaMDA, and Bing Chat) in their ability to discern false information. On average, these chatbots achieved 65.25% accuracy, with ChatGPT 4.0 performing best. Similarly, comparing the two versions of ChatGPT, Deiana et al. [23] found that ChatGPT 4.0 performed better in terms of the correctness, clarity, and exhaustiveness of answers related to eleven popular misconceptions about vaccination. Hoes et al. [38] highlighted the potential of ChatGPT to label true and false statements on content from before and after its training data cutoff date, finding an overall accuracy of 68.79% on fact-checked data. It is important to note that some studies of human verification of potentially false claims report more impressive performance (e.g. [6]), although human fact-checking is harder to scale than veracity assessment using LLMs and LLM-based chatbots.

Although research on the use of LLMs for veracity detection is a fast-growing field, there are significant gaps in the existing literature. Firstly, current studies predominantly focus on English-language prompts and primarily take into account content semantics rather than sources of information. Secondly, the existing literature rarely differentiates between various types of problematic content, such as false, partially true, or conspiratorial statements. Lastly, LLMs' ability to work with given conceptual tasks is not included in veracity identification testing. Our study aims to remedy these limitations.

Shades of information quality

Misinformation is generally defined as information that is false, incomplete or unclear and therefore misleads the public [41, 77], while disinformation refers to the intentional production and dissemination of such information [68]. The research on misinformation and disinformation is extensive, and as a result various typologies of misinformation exist. These typologies are often organised by topic, type of information, or discursive style. For example, scholars have proposed classifications of misinformation related to COVID-19 [40] and climate change [66]. Focusing on the type of information, existing categorisations differentiate between types of news-related [37] and official misinformation [64], whereas depending on discursive style, misinformation can be classified into rumours, hoaxes, and conspiracy theories [42]. Despite this plethora of typologies, researchers still struggle to navigate the theoretical grounds of misinformation and disinformation scholarship.

With the exception of a few recent studies, most of the scholarship is limited to one dimension of the problem, namely lies and falsehoods, omitting the less obvious and more difficult to operationalise notion of borderline information: content that is factually correct but nevertheless misleading. This category is often embedded in the general definition of misinformation (e.g. [64]), making it challenging to differentiate between various levels of veracity. Such a general definition also makes the concept particularly difficult to operationalise for the purpose of automated information identification. Even the frameworks that do account for nuance in the veracity of statements [42] rarely integrate it into a joint misinformation and disinformation taxonomy. While Wardle & Derakshan (2020) have gone further and proposed the umbrella term “information disorder” as a strategy for conceptually aligning competing definitions, their framework lacks a separation between the veracity of the content and the actors promoting it.

In our theoretical toolkit, we distinguish between veracity, discursive style, and communicative form as three levels of information which can be assessed based solely on content, without taking into consideration the information's source (Fig. 1). Overall, we argue that a focus on intent is less relevant for tackling the problem of false information in algorithmically curated information environments, given that AI-driven systems are non-transparent and usually include an element of stochasticity in both the production and the distribution of content. Therefore, we focus primarily on the veracity of the messages [56].

Fig. 1

Typology of misinformation and disinformation (We acknowledge that true information can also be used with a malign intent, such as, for example, in the case of propaganda. However, in this case the phenomenon is no longer disinformation.)

Veracity. Drawing on [42], we include three types of information veracity: true, borderline, and false. By true information, we mean factually correct and verified content. The borderline category is an umbrella term for content that ranges, in terms of PolitiFact's methodology, from “mostly true” and “half true” to “mostly false”, referring to a lack of clarification, vagueness, omission of details, or of “important critical facts that would give a different impression” [39]. We suggest paying particular attention to the borderline category, as it has been shown to have a substantial effect on beliefs in misinformation statements. Barchetti et al. [12] define this phenomenon as the “half-truth effect”: in a survey experiment, they showed that individuals are more likely to believe misleading statements if a claim follows a true statement, regardless of whether the two assertions are logically connected.

Intent. The intent level introduces a second dimension to the veracity categorisation. It is particularly important to disentangle intent from veracity, given the methodological difficulty of grasping intent. Unlike Kapantai et al. [42] and in line with Guess and Lyons [33], we consider disinformation a subcategory of misinformation. In other words, all information that is misleading should be considered misinformation; only when a source's intention is known and can be proven can misinformation be classified as disinformation [68]. Disinformation is, therefore, information that is false or borderline on the veracity scale and promoted deliberately on the intent scale. Disinformation can be part of a propaganda strategy, when propaganda is understood as an intentional strategy to alter public opinion, or can be distributed with an intent to defame or to advertise a product.

Style of Communication. Lastly, we theoretically disentangle the discursive style of a message, such as satire or conspiratorial narrative, and its communicative form, such as news, advertising, or a message, from the category of information veracity. In other words, the level of accuracy of information is independent of the style and form of its presentation. This helps to clarify the distinction between misinformation and fake news, for instance. Moreover, such a delineation between information veracity and the communicative styles in which it can be presented might help with the problem of highly politicised concepts becoming a “floating signifier” used by opposing groups to delegitimise each other [26].
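To illustrate how these three content-based levels remain independent of one another (with intent kept apart as a source-level property), they could be represented as separate categorical dimensions. The following is a minimal Python sketch; the class and value names are our own illustration and not part of the framework itself:

```python
from dataclasses import dataclass
from enum import Enum


class Veracity(Enum):            # level 1: accuracy of the content itself
    TRUE = "true"
    BORDERLINE = "borderline"    # partially true but nevertheless misleading
    FALSE = "false"


class DiscursiveStyle(Enum):     # level 2: how the claim is framed
    NEUTRAL = "neutral"
    SATIRE = "satire"
    CONSPIRATORIAL = "conspiratorial"


class CommunicativeForm(Enum):   # level 3: the format in which the claim travels
    NEWS = "news"
    ADVERTISING = "advertising"
    MESSAGE = "message"


@dataclass
class Claim:
    text: str
    veracity: Veracity
    style: DiscursiveStyle
    form: CommunicativeForm
    # Intent is deliberately not a field here: it concerns the source rather than
    # the content and is only known in the rare cases where it can be proven
    # (which is what turns misinformation into disinformation).
```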

In this study, we examine the possible implications of the rise of generative AI for detecting different forms of false information. Specifically, we present a comparative analysis of two popular LLM-based chatbots, ChatGPT and Bing Chat (recently renamed Microsoft Copilot), in their ability to evaluate the veracity of claims related to issues that are frequently targeted by disinformation and associated with conspiracy theories. In line with the conceptual framework presented above, we differentiate between true, false, and borderline statements to examine how well these chatbots detect the accuracy of given statements and ask:

RQ1. What are the differences in chatbots’ evaluations of true, false, and borderline statements?

Secondly, we are interested in differences in the chatbots' performance across languages. Existing research [29] highlights substantial disparities in the quality of chatbot outputs depending on the language. In some cases, these differences are attributed to the chatbot censoring information in certain languages [74, 79]; in others, prior research has linked them to discrepancies between high- and low-resource languages [29], attributed to the lower volume of training data for the latter. We therefore expect to see some variation in the behaviour of the chatbots depending on the language of use and ask:

RQ2. What are the differences in chatbots’ performance in different languages?

Lastly, we explore the chatbots' ability to evaluate statements according to the concepts of disinformation and misinformation, using definition-oriented inquiries. Moreover, we include one of the most widespread and researched communicative styles, the conspiracy theory, to test how well the chatbots are able to identify it. Given that prior research on bias in algorithmically mediated environments has shown that mentions of specific information sources tend to affect chatbot outputs [71], we systematically test for the presence of biases in the chatbots' evaluations by attributing the statements to various political and social actors, and ask the following question:

RQ3. How does source attribution of statements influence their labeling by the chatbots?

Methodology: AI auditing

To examine the capacity of LLMs to evaluate information veracity, we conducted AI audits of two LLM-powered chatbots: ChatGPT and Bing Chat. A recent extension of the field of algorithm auditing [11, 54], a process of investigating the functionality and impact of algorithmic systems, AI auditing is a research method focused on the systematic examination of the performance of AI systems with the aim of understanding their functionality and impact. AI audits usually focus on system performance on specific tasks (e.g. unsafe content generation [75]), which is investigated and assessed to detect erroneous behaviour or the presence of systematic bias. With the growing impact of AI-driven applications and platforms on society, AI audits have been viewed as a crucial element of governance frameworks that can “help pre-empt, track and manage safety risk while encouraging public trust in highly automated systems” ([25], p. 566). Scholars have raised several ethical concerns surrounding AI audits, particularly the common vagueness of the concepts used in such frameworks and the lack of clear and ethical practices for involving stakeholders in the process, which in turn leads to insufficient accountability outcomes [14]. To ensure more ethically informed audits, scholars have called for better underlying conceptual frameworks that structure such audits, highlighting the importance of this method for improving information ecosystems. In the field of political communication, AI audits increasingly serve as a crucial method for investigating how technology can lead to systematic distortions of subject representations and how this, in turn, can affect how informed individuals are about politics [65].

Prompt development

The design of this study is structured around comparing the performance of two popular chatbots, ChatGPT and Bing Chat, in evaluating the veracity of statements related to different socio-political topics. We particularly focus on differences in the settings and guardrails in place for socially relevant political topics. To this end, we used 25 statements on 5 topics: COVID-19, the Russian aggression against Ukraine, the Holocaust, climate change, and LGBTQ+ debates. This selection was based on existing evidence that a plethora of false narratives surrounds these topics (e.g. false statements distributed by specific political groups and regimes, prejudice-based popular misconceptions resulting in partially false claims, and conspiracy theories) [5, 7, 34, 46, 70].

For each topic we developed a set of five statements split into three veracity categories: three false statements, one true, and one borderline (i.e. containing some true information but still misleading). One of the three false statements also contained a conspiracy claim, defined as “a belief that an event or a situation is the result of a secret plan made by powerful people” (“Conspiracy Theory,” 2023). The selection of statements was based on the most salient debunked topics from scientific sources or fact-checking websites such as BBC Verify, PolitiFact, and EU vs Disinfo (Footnote 2). We used false and previously debunked stories related to the above-mentioned topics that had circulated in the online information environment before 2021, preceding ChatGPT's training data cutoff. It is also important to note that we rely on a unique dataset constructed specifically for this study, meaning that the specific false and true statements were likely not included in the training data in the exact formulations used here.

To explore whether the chatbots' evaluations of information veracity vary depending on the source of the claim, each statement was presented in 5 conditions: (1) without a source, or attributed to (2) US officials, (3) Russian officials, (4) US social media users, or (5) Russian social media users. Source attribution was based on several theoretically grounded assumptions. Firstly, we chose a well-known disinformation agent, the Russian government and its officials [27]. Secondly, we selected US officials given the abundance of data available about the US and its profound impact on political communication and the broader realm of knowledge production [17]. We then introduced the groups of social media users in the two countries as a counterpart to government sources, with a less obvious political agenda.

Statements were first designed in English and then translated into Russian and Ukrainian by native speakers of these languages. Our interest in comparing how the chatbots react to prompts in different languages is motivated by evidence that their performance is substantially affected by the prompt language (e.g. [29]). Specifically, we are interested in whether the chatbots' ability to evaluate the veracity of information is lower for a low-resource language (i.e. Ukrainian) than for high-resource languages (i.e. English and, to a certain degree, Russian).

Furthermore, we are interested in whether the observed tendency of some chatbots to censor outputs generated in response to prompts in Russian regarding topics sensitive for the Kremlin [74] may affect chatbot performance, in particular as a number of false statements we included (e.g. regarding Russian aggression against Ukraine) fall into this category.

The above-mentioned conditions resulted in 375 unique prompts. In addition to the statement, each prompt included a task description. Specifically, we provided definitions of misinformation, disinformation, and conspiracy theory and asked the model to evaluate (1) whether the statement is true, false, or borderline, (2) whether it can be considered a conspiracy theory, and (3) whether it can be considered misinformation or disinformation (see Fig. 2 for an example). As this part of the study was primarily interested in the chatbots' ability to evaluate statements based on complex political communication concepts and in the potential presence of bias against specific political actors, we did not provide any information about the intent of the given sources (Table 1).
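As an illustration of how the full prompt set follows from these design choices, the sketch below assembles prompts from placeholder statement pools, source attributions, and an abridged task template. None of the wording corresponds to the exact prompts used in the study, and in practice the source phrasings and definitions would also be translated into each prompt language:

```python
from itertools import product

LANGUAGES = ["en", "ru", "uk"]

# Placeholder statement pools: in the study, 5 topics x 5 statements per language
# (3 false, of which 1 conspiratorial; 1 true; 1 borderline), i.e. 25 per language.
STATEMENTS = {
    "en": ["<statement 1>", "<statement 2>"],
    "ru": ["<утверждение 1>", "<утверждение 2>"],
    "uk": ["<твердження 1>", "<твердження 2>"],
}

# Five source conditions; phrasings here are illustrative and kept in English.
SOURCES = {
    "none": "",
    "us_officials": "The following statement was made by US officials: ",
    "ru_officials": "The following statement was made by Russian officials: ",
    "us_users": "The following statement was shared by US social media users: ",
    "ru_users": "The following statement was shared by Russian social media users: ",
}

# Abridged task template: in the study it also contained the definitions of
# misinformation, disinformation, and conspiracy theory.
TASK_TEMPLATE = (
    "{source}{statement}\n"
    "Given the definitions provided above, evaluate: "
    "(1) is the statement true, false, or borderline? "
    "(2) can it be considered a conspiracy theory? "
    "(3) can it be considered misinformation or disinformation?"
)

prompts = [
    {
        "language": lang,
        "source": src_key,
        "text": TASK_TEMPLATE.format(source=SOURCES[src_key], statement=stmt),
    }
    for lang, src_key in product(LANGUAGES, SOURCES)
    for stmt in STATEMENTS[lang]
]
# With 25 statements per language, this yields 25 x 5 x 3 = 375 unique prompts.
```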

Fig. 2

Example of a prompt used in the study

Table 1 List of Statements and Sources

Data collection and analysis

To evaluate the outputs of the chatbots, we designed a codebook (Footnote 3) with the following variables (summarised in the schema sketch after the list):

(1) Answer provided (yes/no): whether a chatbot clearly answered the question regarding (a) the veracity of the statement, (b) the presence of a conspiracy theory, (c) the presence of mis- or disinformation.

(2) Accuracy for detecting false/true/borderline statements (accurate/non-accurate): whether a chatbot correctly identified the veracity of a statement.

(3) Accuracy for detecting the conspiracy theory label (accurate/non-accurate): whether a chatbot correctly identified the presence of a conspiracy theory claim in a statement.

(4) Presence of mis- or disinformation (misinformation/disinformation/both/none): whether a chatbot identified the statement as mis- or disinformation, or found evidence for both (or neither). Since the main distinction between these types of false information is the presence or absence of intent in spreading it, which cannot be derived from the statement itself unless it is mentioned directly, we did not have baseline values for these variables and kept them explorative. We then unified different coding variations into one of the four labels outlined above.

(5) Mentioning of the source (positive/neutral/negative/none): whether and how the chatbot commented on the source to which the statement was attributed.
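For reference, the codebook can be summarised as a simple validation schema. The sketch below only restates the value sets listed above; the variable names are our own abbreviations, not the study's:

```python
# Allowed values per codebook variable; used to check manually coded
# chatbot outputs for typos or out-of-range codes before analysis.
CODEBOOK = {
    "answer_provided_veracity": {"yes", "no"},
    "answer_provided_conspiracy": {"yes", "no"},
    "answer_provided_misdis": {"yes", "no"},
    "accuracy_veracity": {"accurate", "non-accurate"},
    "accuracy_conspiracy": {"accurate", "non-accurate"},
    "misdis_label": {"misinformation", "disinformation", "both", "none"},
    "source_mention": {"positive", "neutral", "negative", "none"},
}


def validate_row(row: dict) -> list[str]:
    """Return the names of variables whose coded value is not in the codebook."""
    return [var for var, allowed in CODEBOOK.items() if row.get(var) not in allowed]
```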

The data, in the form of 750 chatbot outputs (i.e. 375 prompts × 2 chatbots), was collected manually by the researchers within a timeframe of one week (Footnote 4). To avert any effect of location, data was collected from the same location or with a VPN configured to that location. We tested the version of ChatGPT running on the GPT-3.5 LLM and Bing Chat running on GPT-4. To avoid effects of prior interaction with an LLM, each prompt was submitted in a new chat (for ChatGPT) or after a page refresh (for Bing Chat). We did not use the OpenAI API or API wrappers for Bing Chat because of our interest in keeping the process of data generation close to how we expect the majority of users to engage with the chatbots and in ensuring the comparability of the chatbot outputs. Additionally, chatbot outputs generated via API may differ from those generated via the standard human-chatbot interface, a possibility which, to our knowledge, has not yet been systematically investigated.

The data was coded manually and independently by 5 researchers to allow for a more detailed interpretation of the results. Coders were fluent in two or more of the output languages. Our intercoder reliability test produced an average Krippendorff's alpha of 0.8 across the five variables, which we considered satisfactory for the analysis. The remaining disagreements were resolved through consensus coding.
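Intercoder agreement of this kind can be computed with, for example, the open-source krippendorff Python package. The sketch below uses a small illustrative coders-by-units matrix of nominal codes rather than the study's data:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = coders, columns = coded units; np.nan marks units a coder did not code.
# Nominal codes, e.g. 0 = "non-accurate", 1 = "accurate".
reliability_data = np.array(
    [
        [1, 1, 0, 1, np.nan, 0],
        [1, 1, 0, 1, 1,      0],
        [1, 0, 0, 1, 1,      np.nan],
    ],
    dtype=float,
)

alpha = krippendorff.alpha(
    reliability_data=reliability_data,
    level_of_measurement="nominal",
)
print(f"Krippendorff's alpha: {alpha:.2f}")
```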

To analyse the data, we used a combination of descriptive statistics and regression analysis. For the latter, we used multinomial logistic regression to examine which factors influence how the chatbots evaluate false, true, or borderline statements and assign conspiracy theory, misinformation, and disinformation labels to them. As the reference category for all regression models, we used the “accurate” or “no disinformation/misinformation” categories of the outcome variables and presented the other accuracy-related (e.g. “inaccurate” and “no response”) and disinformation/misinformation-related (i.e. “no response”, “disinformation”, “misinformation”, and “both”) categories in relation to them. As predictors, we used the language of the prompt (with English as the reference level), the chatbot (with Bing Chat as the reference level), the topic of the prompt (with climate change as the reference level), and the mention of the source (with Russian officials as the reference level).
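A minimal sketch of how one such model could be specified, assuming the coded outputs sit in a data frame with hypothetical column names and using statsmodels' multinomial logit with Treatment coding for the predictor reference levels:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical coded dataset, one row per chatbot output (column names are ours):
#   accuracy - "accurate", "inaccurate", "no_response"
#   language - "English", "Russian", "Ukrainian"
#   chatbot  - "BingChat", "ChatGPT"
#   topic    - "climate", "covid19", "holocaust", "ukraine_war", "lgbtq"
#   source   - "ru_officials", "us_officials", "ru_users", "us_users", "none"
df = pd.read_csv("coded_outputs.csv")  # hypothetical file name

# MNLogit uses the lowest outcome code as the base category, so placing
# "accurate" first makes it the reference, mirroring the study's setup.
df["accuracy_code"] = pd.Categorical(
    df["accuracy"], categories=["accurate", "inaccurate", "no_response"]
).codes

model = smf.mnlogit(
    "accuracy_code ~ C(language, Treatment('English'))"
    " + C(chatbot, Treatment('BingChat'))"
    " + C(topic, Treatment('climate'))"
    " + C(source, Treatment('ru_officials'))",
    data=df,
).fit()
print(model.summary())
```

The same specification can be reused for the conspiracy theory and misinformation/disinformation outcomes by swapping the dependent variable.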

Results

Detection accuracy of false, true, and borderline statements

Firstly, we measured how close the two chosen LLM-based chatbots come to the baseline in identifying the statements as false, true, or borderline. Overall, 70% of prompts were labeled in accordance with the baseline veracity across all languages and chatbots. ChatGPT performed better than Bing Chat in all languages, with an accuracy of 79% compared to 66% for prompts in English (Fig. 3). In Russian, both ChatGPT and Bing Chat performed with 70% accuracy. Ukrainian was the language in which both chatbots performed worse than in the other languages (68% and 66%, respectively).

Fig. 3

Percentage of accurately detected false, true, and borderline statements

We also found that while both chatbots almost always provided an answer to the veracity question for English prompts, they sometimes gave no answer for prompts in Russian and Ukrainian. These included instances where the chatbots either clearly refused to answer, for example due to the complexity of a topic, or produced nonsensical answers unrelated to the prompt's topic. This rate was higher for Bing Chat, especially for prompts in Russian and Ukrainian, where the chatbot did not respond to 14% and 10% of prompts, respectively. ChatGPT tended to give relevant responses to Russian and Ukrainian prompts more often but, at the same time, provided more inaccurate answers in these languages.
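The shares reported in this subsection can be derived from the coded data with simple cross-tabulations; the sketch below assumes the same hypothetical data frame introduced in the regression example above:

```python
import pandas as pd

df = pd.read_csv("coded_outputs.csv")  # hypothetical file name, as above

# Share of outputs whose veracity label matched the baseline, by chatbot and language.
accuracy_share = (
    df.assign(is_accurate=df["accuracy"].eq("accurate"))
      .groupby(["chatbot", "language"])["is_accurate"]
      .mean()
      .mul(100)
      .round(1)
      .unstack("language")
)
print(accuracy_share)

# Non-response rate on the veracity task, by chatbot and language.
no_response_share = (
    df.assign(no_resp=df["accuracy"].eq("no_response"))
      .groupby(["chatbot", "language"])["no_resp"]
      .mean()
      .mul(100)
      .round(1)
      .unstack("language")
)
print(no_response_share)
```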

Accuracy of detecting conspiratorial statements

To investigate the ability of the chatbots to detect conspiratorial information, we used statements such as “Euromaidan was organized by the US to turn Ukraine into its colony” or “COVID-19 vaccines are used to implant trackable microchips in the bodies of individuals”. These statements contain the assumption that there is a secret plot behind an event. The task required the chatbots to identify hidden meanings in texts, a function long considered a prerogative of humans [57]. Interestingly, ChatGPT demonstrated rather high performance, with 81–86% correct responses across all languages (Fig. 4). Moreover, the chatbot produced a low proportion of non-responses and in most cases provided answers with a high level of certainty. Bing Chat, however, identified conspiracy labels with high accuracy only for English prompts (76%).

Fig. 4

Percentage of accurately detected conspiracy theory statements

For the other two languages, the accuracy of Bing Chat dropped substantially: only 26% of prompts in Ukrainian and 36% in Russian were identified correctly with regard to the presence of conspiratorial narratives. This is due not only to inaccurate responses but also to a high non-response rate: for 67% of prompts in Ukrainian and 61% in Russian, the chatbot did not provide a response. The considerable difference in the non-response rates between the veracity- and conspiracy theory-related evaluations can be explained as follows: as observed during data coding, Bing Chat in some cases did not refuse to answer altogether but responded to other questions in the prompt (e.g. regarding the veracity of the statement or whether it constitutes mis- or disinformation) while ignoring the question about the presence of a conspiracy theory.

Disinformation and misinformation detection

We also examined how the chatbots apply the labels “disinformation” and “misinformation” based on the provided definitions. Unlike in the previous evaluation tasks, the “disinformation” and “misinformation” assignments had no baseline against which to compare the chatbots' outputs. Our analysis of this category is therefore explorative and aims to study how the chatbots deal with complex theoretical concepts and whether there are biases against specific political actors. Overall, the “disinformation” label is used more often by both chatbots in most languages, with the exception of Bing Chat in English (Fig. 5). Its use is particularly high for ChatGPT in Russian (50% of responses) and Ukrainian (38%). One possible explanation is that the word “misinformation” in these languages is a rarely used neologism borrowed from English. Remarkably, the most common response (27%) for ChatGPT in English is that the statement can be both, for example depending on the source's intent.

Fig. 5

Distribution of misinformation and disinformation labels across chatbots in different languages

Although such a response means that ChatGPT did not fully follow the instructions provided in our prompt, it presents a more nuanced and, in fact, more accurate theoretical classification of the statement, because we did not provide information about the proven intent of the sources. In the other languages, ChatGPT chose this labeling option less frequently (13% of cases in Russian and 8% in Ukrainian). Bing Chat, on the other hand, showed less nuance in working with the theoretical concepts (the statement was labeled as “both” in only 2% of cases across all three languages). Unlike ChatGPT, which often provided a clear explanation of its reasons for labeling statements as either “misinformation” or “disinformation”, Bing Chat answered this question with more certainty but without theoretical reasoning.

Presence of biases in veracity- and conspiracy-related evaluation tasks

Analysing potential biases against the provided sources, we first focus on the proportions of statements mislabeled with respect to their veracity (i.e. true, false, or borderline) by source type (see Fig. 6a). We find that incorrectly labeled Ukrainian-language content on ChatGPT is mainly connected to prompts that specified Russian officials as the source of the statement (27% of mislabeled prompts).

Fig. 6

a Percentage of incorrectly labeled true, false, borderline statements by source, b Percentage of incorrectly labeled conspiracy statements by source

Accordingly, ChatGPT responses in Russian had the biggest share of mislabeled content connected to prompts that specified Russian social media users as the source of information (24%). At first glance, this could point to a bias against sources connected to Russia, which would be consistent with the plethora of documented evidence on Russian disinformation campaigns [27] and with research suggesting that the model behind ChatGPT is mostly liberal-leaning [49]. However, the biggest fractions of mislabeled English-language ChatGPT outputs were linked to prompts that specified either US officials or Russian social media users as the source of information (both 23%). On Bing Chat, statements with no indicated source formed the biggest proportion of mislabeled Ukrainian-language prompts (24%).

In English, the share of mislabeled prompts was equally high for statements attributed to Russian officials and to US officials (both 22%). In Russian, however, most mislabeled prompts were connected to US users or had no source (both 24%), and only 14% were linked to Russian officials. Content attributed to Russian officials fared best on Bing Chat in Russian and ChatGPT in English (connected to only 14% and 15% of mislabeled content, respectively) and worst on Bing Chat in English and ChatGPT in Ukrainian (linked to 22% and 27% of mislabeled prompts, respectively).

We then analysed the statements for which the presence of conspiracy was identified inaccurately (Fig. 6b). For ChatGPT in Ukrainian, these were mostly statements attributed to Russian officials, Russian users, or US users (25% each). For ChatGPT in Russian, the biggest proportion of mislabeled statements mentioned Russian officials (32%), while for ChatGPT in English it was US users (36%). Bing Chat most often mislabeled statements attributed to Russian officials for Ukrainian prompts (63%), Russian users for English prompts (24%), and statements with no source for Russian prompts (50%).

Regression analysis results

To test the effect of various factors on the chatbots' performance in veracity detection, we performed three regression analyses (Figs. 7, 8 and 9). First, we examined the factors influencing the labeling of the accuracy variable. Figure 7 shows that no factor has a statistically significant influence on the incorrect assessment of whether a statement is true, false, or partially true. However, some factors are statistically significant for the chatbots' decision to decline to answer the veracity-inquiring prompt: the chatbots were significantly more likely not to answer prompts in the low-resource languages than prompts in English.

Fig. 7

Multinomial logistic regression results for labeling of the accuracy variable (Footnote 6)

Fig. 8

Multinomial logistic regression results for assignment of the conspiracy theory label

Fig. 9

Multinomial logistic regression results for assignment of misinformation/disinformation labels

Besides the prompt language, the regression indicates significant differences between ChatGPT and Bing Chat, with the former being substantially less likely to avoid providing an answer, despite lacking integration with a web search engine and thus having more limited capacity to acquire the latest updates on a topic than Bing Chat. Finally, the chatbots were substantially more likely to avoid answering prompts dealing with the Holocaust and the Russian aggression against Ukraine. This may indicate that chatbot outputs regarding such sensitive topics are more constrained by guardrails implemented by their developers.

As with the chatbots' capacity to evaluate information veracity in general, we found that the accuracy of assigning the conspiracy theory label (Fig. 8) is primarily influenced by the language of the prompt and the chatbot model. The likelihood of declining to assign the respective label is significantly higher for prompts in Ukrainian and Russian and for prompts addressed to Bing Chat. Unlike in the case of accuracy, we did not observe any significant impact of the prompt topic; another difference is that the incorrect assignment of the conspiracy label was significantly affected by mentioning US officials in the prompts: such mentions decreased the likelihood of the conspiracy label being assigned incorrectly. In other words, in the majority of cases, mentions of US government officials made it less likely for a chatbot to treat a non-conspiratorial claim as a conspiratorial one.

Figure 9 shows that the assignment of misinformation- and disinformation-related labels followed a similar pattern regarding the significance of individual factors. Prompts in Ukrainian and Russian were significantly more likely to result in the chatbots declining to provide a response, but also less likely to be treated as both misinformation and disinformation. ChatGPT was significantly less likely to decline to respond to a prompt, whereas Bing Chat was more likely to treat a prompt as both a form of misinformation and disinformation.

Compared to the other tasks, the effect of the prompt topic was more pronounced for the assignment of misinformation and disinformation labels. Prompts related to the Holocaust and COVID-19 were significantly less likely to be labeled as non-intentionally false; similarly, prompts related to the Holocaust and LGBTQ+ were less likely to be labeled as containing both disinformation and misinformation (with the latter topic also being less likely to be treated as concerning disinformation). For prompts dealing with the Russian aggression against Ukraine, however, the chatbots were significantly more likely to treat the statements as intentionally false claims or not to give a response at all. Finally, we found that providing no source for the statement increased the likelihood of the chatbots treating the prompt as both a form of disinformation and misinformation.

On the whole, we find that source type is among the least statistically significant factors, with only two source conditions, mentioning US officials or providing no source at all, having a significant effect on the chatbots' performance for individual veracity assessment tasks. However, the language, the chatbot used and, to a certain degree, the topic influence the likelihood of getting correctly verified information. In terms of language, both chatbots have an overall higher likelihood of providing no response for the lower-resource languages, Ukrainian and Russian, than for English. We also find that Bing Chat is significantly more likely than ChatGPT to avoid answering the veracity-inquiring prompts. In terms of topic, both chatbots are more likely to avoid providing responses if prompts deal with the Holocaust and, in particular, with the war in Ukraine. This may be related to platforms' attempts to regulate sensitive topics; however, the ethical frameworks underlying such decisions are often non-transparent (Google, 2024).

Discussion and conclusion

In this study, we have presented a comparative analysis of the ability of ChatGPT and Bing Chat to evaluate the veracity of political information in three languages: English, Russian, and Ukrainian. We used AI auditing methodology to investigate how the chatbots label true, false, and borderline statements on five topics: COVID-19, the Russian aggression against Ukraine, the Holocaust, climate change, and debates related to LGBTQ+ issues. Comparing the chatbots' performance, we find an overall high misinformation identification potential of ChatGPT in English. Even though the performance of Bing Chat was comparatively low, our findings highlight the potential of chatbots for identifying different forms of false information in online environments. However, there is a strong imbalance in the chatbots' performance in lower-resource languages (e.g. Ukrainian), where we see substantial performance drops and, in the case of Bing Chat, a decrease in responsiveness. While lower performance in low-resource languages can be expected based on earlier research (e.g. [29]), it raises concerns regarding the use of LLM-based chatbots for evaluating the veracity of information in contexts where false information is likely to be generated in non-English languages and where information accuracy is of paramount importance, such as the ongoing war in Ukraine.

Our analysis of the chatbots' ability to classify conspiracy theory statements yielded surprisingly high performance, particularly in the case of ChatGPT (81% and above in all three languages). Given that this task required dealing with hidden meanings, the accurate identification of conspiratorial narratives highlights potential advantages of LLM-based approaches over traditional machine learning (ML) techniques for highly complex natural language processing tasks [57]. Such advantages can be crucial for improving the evaluation of content veracity and can help mitigate misinformation risks. At the same time, it is important to note that our selection of statements was relatively small and focused on well-established conspiratorial claims. Future research will benefit from a more in-depth investigation of the ability of LLM-based chatbots to evaluate different types of conspiratorial claims.

Furthermore, we have explored the chatbots' ability to deal with the political communication concepts of disinformation and misinformation, using definition-oriented prompts and systematically testing for the presence of source biases by attributing specific claims to various political and social actors. Even though humans still substantially outperform LLMs in tasks involving conceptual and abstract evaluation [55], generative AI has strong potential for these tasks. Our findings suggest that ChatGPT is particularly promising in this context, as it provides nuanced assessments of the task and well-detailed reasoning behind its evaluations.

We also observe that in most cases the topic of the prompt and the inclusion of a source were not statistically significant predictors of the assignment of disinformation- and misinformation-related labels by the chatbots or of the accuracy of veracity assessments. However, there were cases where these factors did matter. For instance, mentioning US officials as the source of a statement reduced the likelihood of an incorrect evaluation of whether the statement was conspiratorial, whereas for some topics (e.g. the Russian aggression against Ukraine and Holocaust denial) the chatbots were significantly less likely to respond to the prompts or to assign the misinformation label.

Taken together, our findings suggest that generative AI does have potential for automated content labeling in the context of political communication, including highly challenging tasks related to veracity evaluation. However, substantially more comparative research is needed to understand how chatbot performance varies depending on specific factors (e.g. whether the prompt is written in a high- or low-resource language) and whether it is subject to biases. It is important to continue investigating the possible impact of textual cues (e.g. mentions of the source of information) on the performance of LLM-based chatbots, as well as performance variation across the topical issues the chatbots are asked to deal with. Moreover, our findings highlight potential inequalities in the chatbots' performance across languages and socio-political contexts. We see more potential for this technology when it is used by professional fact-checkers, especially if general-use LLMs are fine-tuned specifically for disinformation detection tasks and value alignment is implemented strategically. This could help in dealing with large volumes of information or in automatically flagging problematic content for further evaluation. At the same time, it is crucial to recognise both the advantages (e.g. scalability and accessibility for non-experts) and the shortcomings (e.g. dependency on training data and the potential knowledge gaps and biases associated with it) of using LLM-based tools in this context. While promising for the detection of different forms of false information, LLM-based tools should not (at least currently) be treated as a silver bullet, and it is crucial that their users critically assess the capacities of these tools and are aware of their limitations, especially for semantically complex tasks whose realisation usually relies on human experts (e.g. professional fact-checkers).

This study has several limitations. First, we tested the performance of the chatbots using highly detailed instructions, which may not be common in real-world use. Second, some of the information types studied here are not always clear-cut: for example, a false claim might not fully fit the definition of a conspiracy theory but still be used as part of a larger conspiracy narrative. Finally, we evaluated the accuracy of the chatbots based on how they labeled a statement; this research did not aim to verify the context of the model's judgement (i.e. the arguments for why something is true or false), which can also contain factual errors. We therefore suggest that future research focus on analysing responses to less restricted and more natural (in the sense of being simpler and less structured) prompts, which are more likely to be used by chatbot users in everyday situations, and thoroughly analyse the veracity of the entire output. As another avenue for future research, we would like to highlight the importance of investigating the temporal aspect of misinformation detection by LLM-based chatbots. This is of particular importance when the aim is to quickly mitigate exposure to misinformation and prevent its viral spread. Such strategies would require a comprehensive set of guardrails and frequent adaptation to the political landscape in different countries and languages.