1 Introduction

Water is the source of all life on Earth, and it often flows across political boundaries. The latest statistics show that there are 310 transboundary river basins globally, shared by 150 countries; they cover 47.1% of the Earth’s land surface, are home to 52% of the world’s population, and account for almost 60% of the world’s freshwater flow (McCracken and Wolf 2019). With population growth, economic development and climate change, transboundary water resources management has emerged as a crucial global challenge of the twenty-first century, as it can spark conflicts and, in extreme cases, even wars (Biswas and Tortajada 2019; Baranyai 2020; Iyer 2020; Gökçekuş and Bolouri 2023). Understanding the dynamics of conflict and cooperation in transboundary river water management is therefore essential for addressing this global challenge (Uitto and Duda 2002; Phillips et al. 2006; Rai et al. 2017; Karim 2020; Honkonen 2022).

News media plays a crucial role in both mirroring and influencing public opinion on key issues (Miles and Morse 2007). It has a substantial impact on shaping policy agendas and provides a public sphere for deliberating and legitimizing policy options, thereby influencing the formulation of policy alternatives (Steffek 2009; Kleinschmit 2012). In transboundary river issues, news media articles can reflect the attitudes and values of a country or its public towards water events occurring within the region, while coverage from non-riparian countries can provide insights into international public opinion on water events in a specific transboundary river basin.

Sentiment analysis, also known as opinion mining, plays a crucial role in new media data research (Bukovina 2016). It encompasses a range of computational techniques aimed at identifying, extracting, and analyzing human emotions, feelings, or opinions from textual data (Sadegh et al. 2012; Fang and Zhan 2015; Elnagar et al. 2020). Sentiment analysis can be either binary or multi-class. In binary sentiment analysis, the text is divided into two classes, positive and negative, while multi-class sentiment analysis classifies the text into multiple levels or fine-grained labels. There are two kinds of methods for sentiment analysis: machine-learning methods and dictionary-based methods (Medhat et al. 2014). Machine learning methods can be further classified into supervised and unsupervised approaches, with a predominant reliance on supervised classification, which requires annotated data to train the classifiers (Gonçalves et al. 2013). Supervised methods need two separate sets of labeled data, one for training the model and another for testing its performance (Vinodhini and Chandrasekaran 2012; Ravi and Ravi 2015). Dictionary-based methods rely on pre-defined lists of words, lexicons, or dictionaries, in which each word is assigned a specific sentiment.

Revealing the dynamics of conflict/cooperation in transboundary water management places high demands on sentiment analysis of news media. Firstly, it requires grasping the delicate opinions of many stakeholders. Transboundary rivers such as the Lancang-Mekong River, the Indus River and the Nile River involve thousands of organizations, including IGOs, NGOs, River Basin Organizations (RBOs), government ministries/agencies of each country, industries, financial institutions, civil groups and research institutes/universities (Wei et al. 2021). It is well argued that conflict/cooperation in transboundary water management is an important part of international politics, so a delicate, multi-level classification of sentiments is required. Most previous studies using manual coding methods adopted from 3 up to 15 levels (Azar 1980; Yoffe and Larson 2001; Grünwald et al. 2020b). Secondly, conflict/cooperation in transboundary water management requires not only an understanding of historical sentiment patterns in conflict and cooperation dynamics (Turton 2005; Zeitoun and Mirumachi 2008; Wei et al. 2021) but also the capability for timely monitoring and prediction of public sentiment surrounding such water conflicts (Warner 2023). This involves the early detection and resolution of potential water conflicts before they escalate, facilitating a proactive and pre-emptive approach to transboundary river water management. Thirdly, given the exponential growth of accessible news text data relevant to transboundary rivers (Fesseha et al. 2020), manual examination and classification have become impractical. Addressing all these requirements calls for the automated classification and processing of such news data (Bobichev and Cherednichenko 2017). Recent advances in machine learning have provided strong support for the development of automated text categorization systems (Hartmann et al. 2019; Kadhim 2019; Barberá et al. 2021) and public opinion monitoring (Meng et al. 2022; Chen and Du 2023; Duan et al. 2023). However, different machine learning models exhibit variations in performance across diverse research domains (movie reviews, product development, restaurant reviews, etc.) in sentiment analysis tasks (Maulana et al. 2020; Zahoor et al. 2020; Dashtipour et al. 2021; Giannakis et al. 2022; Yang et al. 2023). The lack of a clear consensus on which machine learning model is more suitable for which domain restricts the effective application of machine learning methods to transboundary water conflict dynamics.

The study aims to compare and evaluate the performance of different machine learning models in simulating transboundary water conflict and cooperation dynamics. Ten commonly used machine learning models (K-Nearest Neighbors, Naive Bayes, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Decision Tree, Extreme Gradient Boosting, Multilayer Perceptron, Long Short-Term Memory and Bidirectional Encoder Representations from Transformers) will be assessed on a corpus of news media articles on global transboundary rivers, in which each article was annotated with one of five sentiment categories (− 2, − 1, 0, 1 and 2) according to its level of conflict or cooperation. The best-performing model will be compared with dictionary-based approaches and validated using historical water events in the Lancang-Mekong River, the Nile River, and the Indus River basins. By identifying the most effective models for detecting nuanced sentiments related to transboundary water conflict and cooperation, the findings from this study are expected to advance the capabilities of policymakers, stakeholders, and researchers for early detection and proactive management of potential water conflicts, improving cooperation on transboundary rivers.

2 Methods

The flowchart in Fig. 1 outlines the steps of the research methods employed in this study: data acquisition, followed by data labeling, data processing, model training and, finally, model evaluation.

Fig. 1

The flowchart of the methods in this study

2.1 Data acquisition

New media data collection is one of the major challenges in assessing the dynamics of transboundary water conflict and cooperation (Grünwald et al. 2020a). Since English is the most commonly used language in international communication, and the majority of countries have English-language news media, English news articles served as the primary focus of this study. The news articles were collected from the LexisNexis database, which is globally recognized and comprises full-text English news articles from more than 6000 newspapers worldwide, making it one of the most extensively utilized news repositories in social science research (Weaver and Bimber 2008; Racine et al. 2010; Jiang et al. 2017). The composition of the search terms significantly influences the coverage and relevance of the retrieved news articles. The search terms developed by Guo et al. (2022) for the LexisNexis database, which build on the Transboundary Freshwater Dispute Database (TFDD) (Yoffe and Larson 2001), were adopted in this study. These search terms consist of five components: Basin Name, Riparian Countries, Theme Terms, Conflict/Cooperation Terms and Excluded Terms. The first four components narrow the search to the intended range, while the list of excluded terms helps eliminate irrelevant topics. In total, 9,382 relevant news articles were collected and analyzed, covering 105 out of 310 transboundary river basins and published by 759 news media agencies from 84 countries. The time frame was from 1977 to 2022. These news articles comprised the corpus used in this study.

2.2 Data labeling

In the sentiment classification of text, a binary (positive or negative) classification is predominantly adopted. This reduces complexity and leads to high predictive accuracy, but it fails to capture subtle sentimental nuances and might overlook public sentiments that require special attention. Conversely, multi-level classification offers a more comprehensive sentiment categorization, enabling the classifier to distinguish different sentiment states more accurately. Nevertheless, the classifier must then predict in a larger decision space, which may lead to confusion between categories. Considering the dual requirements of classification and the data characteristics, the number of labels was set to 5 in this study. Building on the characteristics of the news media articles in this study and previous studies defining the intensity of conflict or cooperation in transboundary water events (Azar 1980; Yoffe and Larson 2001; Grünwald et al. 2020b), the public sentiment polarities reflected in news media articles were categorized into five classes: Cooperative response for actions (2), Oral expression of cooperative response (1), Neutrality (0), Oral expression of conflictive response (− 1) and Conflictive response for actions (− 2). Cooperative response for actions (2) signifies substantive collaborative actions in various fields jointly taken by the public or officials to achieve cooperation on water management, such as meetings and the signing of cooperation agreements/treaties. Conflictive response for actions (− 2), by contrast, represents public or official actions such as protest marches, various forms of hostile behavior across different domains, court arbitration and military conflicts. Oral expression of cooperative response (1) indicates verbal support from the public or authorities towards the associated water event. Analogously, oral expression of conflictive response (− 1) refers to verbal expressions displaying negativity, discord, opposition and hostility. Neutrality (0) means that the relevant news media articles have no sentimental preference and primarily offer an objective description of the water event.

The sentiment polarities of the collected news articles were classified by human reading. Apart from the authors of this paper, this study also invited several assistants to aid in judging the sentiment polarity of the news articles. All members were trained to judge the sentiment polarity of news articles with strict consistency. The sentiment polarity of each news article was judged by at least two group members; in the case of inconsistent judgments, the sentiment polarity of the article was determined by the consensus of the whole team. The result of data labeling is summarized in Table 1. The distribution of label values in the corpus aligned with the previous study by Yoffe and Larson (2001).

Table 1 News articles distribution across different labels

2.3 Data processing

Data processing is an essential prerequisite for cleaning the corpus and improving the results. Several processing steps were implemented to prepare the news text data for analysis. Firstly, all news text was converted to lowercase to ensure uniformity and reduce the impact of case sensitivity. Numbers, URLs and special symbols were systematically removed from the corpus, as were punctuation marks and a predefined list of stop words (words that occur frequently but carry little semantic information). Next, feature extraction was performed on the cleaned news text to transform it into numerical features usable by machine learning models. The Term Frequency-Inverse Document Frequency (TF-IDF) method was mainly utilized to extract features of the news text data (Baeza-Yates and Ribeiro-Neto 1999; Li et al. 2022). The essential idea underlying TF-IDF is that words appearing frequently in one news article but rarely in others should be more important, as they contribute more to classification (Chiny et al. 2021; Liu et al. 2022a). After feature extraction, each news media article was represented by a vector suitable for machine learning algorithms.
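The TF-IDF weighting described above can be sketched in a few lines of standard-library Python. Production pipelines would typically use scikit-learn's TfidfVectorizer, which adds smoothing and normalization; the documents and tokens below are illustrative only, not drawn from the study's corpus.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, using raw-count TF
    and idf = log(N / df). Terms appearing in every document get idf 0."""
    n = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))          # document frequency counts each doc once
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# Toy, already-cleaned documents (lowercased, stop words removed).
docs = [
    "mekong dam cooperation treaty".split(),
    "mekong dam protest dispute".split(),
    "nile treaty cooperation".split(),
]
vecs = tfidf_vectors(docs)
# "protest" occurs in only one document, so it outweighs "mekong",
# which occurs in two.
```

The weighting reflects the idea stated above: a term concentrated in few articles receives a higher weight than one spread across the corpus.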

2.4 Machine learning models

Many widely known machine learning models have been used to solve text classification problems. In this study, 10 machine learning models commonly used in the field of text sentiment analysis were selected to classify the news articles (Table 2).

Table 2 The strengths and weaknesses of the machine learning models used in this study

K-Nearest Neighbors (KNN), a supervised learning algorithm (Cover and Hart 1967), classifies a new text sample based on the class labels of its K nearest neighbors. The class with the highest number of votes among the K neighbors determines the sentiment category of the news text. Naive Bayes (NB) is a simple and fast probabilistic classifier (Maron 1961). It is based on Bayes’ theorem and assumes that the features are conditionally independent given the class label. NB calculates the probability of a text belonging to a particular sentiment class from the occurrence of its features in the training data. Support Vector Machine (SVM) is a widely used and robust supervised classifier (Cortes and Vapnik 1995; Morovati et al. 2021). For the five-class classification task in this research, the model combines multiple binary classifiers to achieve multi-class classification. Decision Tree (DT) is a classification algorithm with a tree structure (Mitchell 1997), in which each internal node represents a feature/attribute test and each leaf node represents a class or decision. It classifies a sample by recursively partitioning the feature space (Fesseha et al. 2020). Random Forest (RF) is a classifier that relies on an ensemble of decision trees (Breiman 2001). The final class is determined by aggregating the outcomes of the individual trees. Gradient Boosting Decision Tree (GBDT) is an ensemble learning algorithm. It constructs multiple decision trees iteratively, where each new tree focuses on correcting the errors made by the previous ones. By combining the predictions of all trees, GBDT improves the classification accuracy and effectively handles complex relationships in the input text data, making it suitable for sentiment analysis tasks. Extreme Gradient Boosting (XGBoost) is an advanced implementation of the GBDT algorithm (Chen and Guestrin 2016).
It optimizes the GBDT algorithm with enhancements in tree construction and regularization techniques, resulting in better performance and faster training, which makes it a powerful choice for sentiment analysis tasks. Multilayer Perceptron (MLP) is a type of feedforward neural network (Khalil Alsmadi et al. 2009). It consists of multiple layers of interconnected neurons, including an input layer, one or more hidden layers, and an output layer. It learns to extract relevant features from the input text and maps them to the corresponding sentiment class through a series of nonlinear transformations and weight adjustments during training. Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) designed to handle sequential data (Hochreiter and Schmidhuber 1997; Li et al. 2023), such as news media articles and other textual data. It can capture long-range dependencies and contextual information from the text, making it effective for understanding and classifying the sentiment of the input text. It processes sequential input step by step, updating its memory cell and hidden state to retain relevant information and make accurate sentiment predictions. Finally, Bidirectional Encoder Representations from Transformers (BERT) is a widely used and powerful language representation model for text classification (Devlin et al. 2018). It employs a transformer-based architecture pre-trained on vast amounts of text data, enabling it to capture rich contextual information. Fine-tuning BERT on sentiment analysis allows it to understand the nuances of text and make accurate predictions for sentiment classification, achieving high performance in natural language processing tasks. As shown in Table 2, each of these 10 machine learning models has its own strengths and weaknesses when performing text sentiment analysis tasks (Ting and Zhang 2003; Gupte et al. 2014; Bhavitha et al. 2017; Yang et al. 2017; Hemmatian and Sohrabi 2019; Prabha and Srikanth 2019; Srivastava et al. 2020; Mohammadi and Shaverizade 2021; Saifullah et al. 2021; Hariguna and Ruangkanjanases 2023; Syriopoulos et al. 2023).
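As an illustration of the KNN voting rule described above, the following is a minimal stdlib sketch. The 2-D vectors stand in for the much higher-dimensional TF-IDF features actually used, and the labels use the study's − 2/+ 2 coding; everything else is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_vecs, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training vectors (Euclidean distance)."""
    dists = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vecs, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D feature vectors standing in for TF-IDF document vectors.
train_vecs = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8), (0.2, 0.1)]
train_labels = [-2, -2, 2, 2, -2]
label = knn_predict(train_vecs, train_labels, query=(0.05, 0.05), k=3)
```

The same voting logic underlies scikit-learn's KNeighborsClassifier, which the study's implementation plausibly used.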

In our study, the generalization of the modeling process was ensured through carefully designed training conditions and validation methods. Firstly, the ten models were run under the same training condition, with 80% of the news text data in the corpus used as the training dataset and 20% as the testing dataset. Secondly, recognizing that the performance of machine learning models can be highly sensitive to the choice of hyperparameters, we conducted an extensive hyperparameter tuning process using GridSearchCV, which performs an exhaustive search over all combinations of hyperparameter values in the parameter grid, such as the combinations of the parameter C, kernel type and parameter gamma for the Support Vector Machine (SVM). During each iteration of the grid search, the model's performance was evaluated using cross-validation. We performed fivefold cross-validation on the training dataset (80% of the news articles in the corpus): the training dataset was divided into five equal subsets, of which four were used for training while the remaining one served as the validation set. This process was repeated five times, each time selecting a different subset as the validation set, which helps assess the models’ performance more reliably across different subsets of the data.
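The mechanics of this procedure (enumerating every combination in the parameter grid and averaging validation scores over five folds) can be sketched with the standard library alone. The study itself used scikit-learn's GridSearchCV; here `toy_score` is a hypothetical stand-in for "train an SVM on the training folds and score it on the validation fold", not the actual pipeline.

```python
import itertools
import random
import statistics

def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def grid_search(grid, score_fn, n_samples, k=5):
    """Try every hyperparameter combination; score each as the mean
    validation score across the k folds. Returns (best_score, best_params)."""
    keys = sorted(grid)
    best_score, best_params = float("-inf"), None
    for values in itertools.product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        folds = kfold_indices(n_samples, k)
        scores = []
        for i, val_fold in enumerate(folds):
            train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
            scores.append(score_fn(params, train_idx, val_fold))
        mean = statistics.mean(scores)
        if mean > best_score:
            best_score, best_params = mean, params
    return best_score, best_params

# Hypothetical scoring function: pretends C=1.0 with an rbf kernel
# validates best, standing in for real SVM training/evaluation.
def toy_score(params, train_idx, val_idx):
    return -abs(params["C"] - 1.0) + (0.1 if params["kernel"] == "rbf" else 0.0)

grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
best_score, best_params = grid_search(grid, toy_score, n_samples=100, k=5)
```

With GridSearchCV the equivalent call would pass the same kind of parameter grid plus `cv=5`; the exhaustive-product-plus-mean-fold-score logic is identical.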

2.5 Model evaluation metrics

The performance of the trained models in analyzing the sentiment polarities of news media articles was first assessed using metrics commonly applied in machine learning. Four metrics (accuracy, precision, recall and F1-Score) were chosen for comparing the ten models in this study (Hemmatian and Sohrabi 2019; Wang et al. 2019). Accuracy (Acc) is the proportion of correctly predicted samples among all samples. Precision (Pre) is the proportion of samples predicted as a specific category that actually belong to that category. Recall (Rec) is the proportion of samples belonging to a specific category that are correctly predicted by the classifier. F1-Score (F1) is the harmonic mean of precision (Pre) and recall (Rec), taking both into account in a single metric. Precision, recall and F1-Score are vital metrics for unbalanced test datasets. In terms of evaluating classifiers, accuracy and F1-Score stand out as the primary metrics for assessing text classification methods (Wadud et al. 2022; Al Mahmoud et al. 2023). In this study, the precision, recall and F1-Score of each label were calculated separately to analyze performance not only overall but also for each individual label. In addition, the evaluation metrics can be influenced by the proportions of the label values in the training dataset (Ertekin 2013). Both under-sampling and over-sampling can tackle the issue of imbalanced label distribution (Amin et al. 2016), but over-sampling has a more substantial positive impact on classification performance when dealing with intricate data types (Ertekin 2013; Chen et al. 2020).
Given that the number of collected news articles with the extreme conflictive label value (− 2) is significantly smaller than that of the other labels, the over-sampling method was adopted to balance the distribution of news articles across label values by replicating existing news text data from the minority class.
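The per-label metrics defined above can be computed with one-vs-rest counts of true positives, false positives and false negatives; F1 is then the harmonic mean of precision and recall. The sketch below is stdlib-only (scikit-learn's classification_report does the same job), and the label sequences are illustrative, not drawn from the study's testing dataset.

```python
def per_label_metrics(y_true, y_pred, labels):
    """Precision, recall and F1 for each label (one-vs-rest),
    plus overall accuracy."""
    metrics = {}
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        pre = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        metrics[lab] = {"precision": pre, "recall": rec, "f1": f1}
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, metrics

# Illustrative true and predicted labels on the study's 5-level scale.
y_true = [-2, -1, 0, 1, 2, -1, 0]
y_pred = [-1, -1, 0, 1, 2, -1, 1]
acc, m = per_label_metrics(y_true, y_pred, labels=[-2, -1, 0, 1, 2])
```

Note how the single misclassified − 2 article drives that label's recall to zero even though overall accuracy stays above 70%, which is exactly why the per-label view matters for the rare conflictive class.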

Then, the trained models’ performance was assessed against the results obtained from conventional dictionary-based approaches, which are widely employed in text sentiment analysis (Hardeniya and Borikar 2016; Zhang et al. 2018; Catelli et al. 2023). A sentiment dictionary contains a list of words, each associated with a sentiment polarity label. By matching the vocabulary in the preprocessed text with the words in the sentiment dictionary, a sentiment score is calculated for each matched word based on its label and weight in the dictionary. The scores of all matched words are then aggregated to obtain the overall score of the entire text. We compared the performance of machine learning methods and dictionary-based methods on the 5-level sentiment classification, highlighting the possible advantages and disadvantages of the two approaches.

As no specialized sentiment dictionary is available for this domain, the widely adopted AFINN lexicon (Nielsen 2011; Huang et al. 2017; Shuvo et al. 2023) was selected to represent the dictionary-based methods. AFINN contains 2477 scored words, each word’s score ranging from − 5 (very negative) to + 5 (very positive) (Nielsen 2011), so the sentiment score of a text is likewise confined to the range of − 5 to + 5. The sentiment polarities of the news articles calculated by AFINN fell within the range − 2 to + 3, which does not align with the label categories defined in this study. Based on the mapping between AFINN prediction scores and the true label values, AFINN scores − 2, − 1, 0 and 1 were mapped to label values − 2, − 1, 0 and 1 respectively, while AFINN scores 2 and 3 were both mapped to label value 2. In this way, the sentiments calculated by AFINN were transformed into the five labels defined in this study. We chose the best-performing model as the representative of all machine learning models for comparison with the dictionary-based approach.
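The dictionary scoring and the score-to-label mapping described above can be sketched as follows. The miniature lexicon is purely illustrative (the real AFINN list assigns each of its 2477 words an integer from − 5 to + 5), and the clamping function reproduces the study's mapping over the observed score range of − 2 to + 3 (scores 2 and 3 both map to label 2).

```python
# Tiny illustrative lexicon standing in for the real AFINN word list.
LEXICON = {"cooperation": 2, "agreement": 2, "support": 1,
           "dispute": -2, "protest": -2, "hostile": -3}

def dictionary_score(tokens):
    """Sum the scores of matched lexicon words; unmatched words count 0."""
    return sum(LEXICON.get(t, 0) for t in tokens)

def to_study_label(score):
    """Map a dictionary score onto the study's five labels by clamping:
    <= -2 -> -2, -1 -> -1, 0 -> 0, 1 -> 1, >= 2 -> 2."""
    return max(-2, min(2, score))

tokens = "india pakistan dispute protest".split()
label = to_study_label(dictionary_score(tokens))
```

Because the mapping only collapses the upper end of the scale (2 and 3 to label 2), clamping is equivalent to the score-by-score correspondence stated above for every score AFINN actually produced.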

Finally, we validated the trained models’ performance with historical water events in three case transboundary rivers. Among the 77 transboundary river basins covered by news media articles in the testing dataset, the Lancang-Mekong River Basin in South-East Asia, the Nile River Basin in North-East Africa and the Indus River Basin in South Asia were selected as case study areas, all of which are important and representative transboundary river basins. The Lancang-Mekong River is the longest transboundary river in Southeast Asia. It flows from the Tibetan Plateau in China through Myanmar, Laos, Thailand, Cambodia and Vietnam, and finally into the South China Sea. The Lancang-Mekong River system is relied upon by over 70 million people for water supply, food production and transportation (Junlin et al. 2021; Lu et al. 2021; Liu et al. 2022b). Riparian countries in the Lancang-Mekong River basin hold divergent interests in the development and conservation of the basin. In the face of impacts from geopolitical shifts, hydrological changes and socio-economic development, the basin is undergoing fluctuations in water conflict and cooperation. The Nile River is the longest transboundary river in the world, at 6670 km. It originates from the plateau in Burundi and flows northward through Rwanda, Tanzania, Kenya, Uganda, the Democratic Republic of the Congo, South Sudan, Sudan, Ethiopia and Egypt before entering the Mediterranean Sea. Some 300 million people rely on the waters of the Nile River, and the population within the basin continues to grow rapidly (Paisley and Henshaw 2013). The upstream and downstream countries in the Nile River basin face significant disputes over the allocation of water resources (Alemu and Dinar 2000; Whittington 2022).
The Indus River, with a total length of 3200 km, originates from the Tibetan Plateau in China, flows through India and Pakistan, and empties into the Arabian Sea (Abro et al. 2020). It is one of Pakistan’s main rivers and an important source of agricultural irrigation. The long-standing territorial disputes along the river between India and Pakistan, together with both countries’ continuous domestic over-development of the shared water resources, have resulted in a tense situation in the Indus River basin (Yaqoob 2019; Rigi and Warner 2020; Janjua et al. 2021). The predictions of the best-performing model were validated against the most conflictive water events occurring within these three basins.

3 Results

3.1 Model performance assessment with the machine learning metrics

The evaluation of model performance is essential for any machine learning classifier as it measures the classifier’s predictive capacity (Guisan et al. 2017; Khanal 2022). Table 3 summarizes the accuracy of the ten machine learning models. The best accuracy, 72.7%, was recorded by the BERT model, while the lowest, 57.8%, was scored by the DT model. The accuracy of the remaining eight models ranged from 58.6% to 71.4%. The average accuracy was 66.5%. One notable observation was that the accuracy of the LSTM model was only 62.2%.

Table 3 The accuracy of each model

Figure 2 shows the precision of all models for each label. For sentiment label − 2, BERT and GBDT achieved precision rates of 79.3% and 77.6% respectively, both demonstrating strong performance in predicting extreme conflictive sentiment tendencies. KNN exhibited a significantly higher precision rate of 88.3% than the other models for label value − 1, showing excellent performance in predicting conflictive sentiment. RF outperformed the other models in precision for sentiment label 0, signifying good performance in predicting neutral sentiment. For sentiment value 1, KNN and SVM attained relatively high precision rates of 72.7% and 70.1% respectively, underscoring their effectiveness in predicting news media articles with a moderately cooperative sentimental inclination. BERT and SVM showed the highest precision rates for sentiment label 2, both at 80.1%, emphasizing their exceptional capability in predicting cooperative sentiments.

Fig. 2

The precision of each model across different label values

Further, Fig. 3 presents a comparative view of the recall of all models across label values. For sentiment label − 2, the recall rates of all models were relatively low, the highest being KNN and MLP at 66.7%, suggesting that all models had difficulty identifying extreme conflictive sentiment. This could be attributed to the limited representation of label − 2 in the corpus. With a recall rate of 85.4% on sentiment label − 1, SVM demonstrated excellent capability in identifying milder conflictive sentiment. Despite its lower accuracy, LSTM stood out with a significantly higher recall on sentiment label 0, achieving 71.5%, indicating that LSTM was more accurate in recognizing neutral news media articles. GBDT and BERT displayed relatively higher recall rates on sentiment label 1, reaching 69.1% and 68.4% respectively, implying that these two models excelled in identifying articles with a lighter cooperative sentiment. Finally, with a recall of 86.1% for sentiment label 2, KNN performed exceptionally well in identifying highly cooperative sentiment.

Fig. 3

The recall of each model across different label values

The F1-Score for each label is visualized in Fig. 4. For extremely conflictive sentiment (label − 2), the F1-Scores of all models were relatively low; the best and worst results were reported by GBDT (66.2%) and LSTM (44.3%) respectively, indicating subpar performance of all classifiers in predicting extremely conflictive sentiment. For sentiment label − 1, aside from KNN, DT and LSTM, the remaining models showed similar F1-Scores, falling within the range of 70% to 77.7%. For neutral sentiment (label 0), LSTM achieved the highest F1-Score at 70.7%, while the remaining nine models fluctuated between 45.7% and 58.0%. For cooperative sentiment (label 1), the F1-Scores of all models varied from 54.1% to 67.2%, signalling a challenge in accurately predicting news articles with mild cooperative sentiment. Lastly, for highly cooperative sentiment (label 2), BERT and XGBoost achieved the highest F1-Scores of 81.0% and 79.4%, excelling in predicting highly cooperative sentiment.

Fig. 4

The F1-Score of each model across different label values

The performance of the machine learning models in predicting conflictive sentiments (− 1 and − 2) deserves special attention, as these are foci for transboundary water management. Although the precision of KNN on sentiment label − 1 was 88.3%, significantly higher than the other models, its recall was only 42.3%, indicating that over half of the news media articles with a true label of − 1 in the testing dataset were not successfully identified. Considering precision, recall and F1-Score comprehensively, BERT remained superior in predicting and identifying sentiment label − 1. NB and DT exhibited lower precision, at 47.7% and 46.3% respectively, suggesting that they are not suitable for the prediction and recognition of sentiment label − 2. Among all the news media articles with sentiment label − 2 in the testing dataset, LSTM successfully predicted only 38.0%, the poorest performance in recognizing sentiment label − 2. The recall of DT was also below 50%, at 48.7%. Comparing the recall of the models shows that LSTM and DT failed to meet the requirements of the task. Both BERT and GBDT demonstrated high F1-Scores in predicting sentiment label − 2; BERT exhibited higher precision than GBDT, while GBDT displayed slightly higher recall than BERT, illustrating their similar competence in predicting and identifying the most conflictive sentiment.

3.2 Model performance assessment by comparing with sentiment dictionary

Given the comprehensive performance of BERT and its outstanding capability in predicting and identifying conflictive sentiment labels, it was chosen as the representative of all models. Figure 5 shows the correspondence between the sentiment polarities predicted by BERT and AFINN and the true labels of the news media articles.

Fig. 5

The landscape of the correspondence relationship between the sentiment polarities predicted by BERT and AFINN and the true labels of the news media articles

Firstly, accuracy scores were calculated to evaluate the overall performance of the two approaches. BERT achieved an accuracy of 72.7%, notably higher than AFINN's 27.4%. Regarding the prediction and identification of conflictive sentiment labels, among the 84 news media articles with sentiment label − 2 in the testing dataset, BERT incorrectly predicted 21 articles as label − 1, 6 as label 0, 7 as label 1 and 4 as label 2. Strikingly, AFINN failed to identify any news articles with label − 2, mispredicting 75% of them as neutral or cooperative. The performance of AFINN in predicting and identifying sentiment label − 1 was similarly poor.
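The recall implied by these error counts follows directly from the figures reported above; the counts come from the text and the arithmetic is a simple check:

```python
# Recall of BERT for label -2, derived from the error counts reported
# above: 84 true "-2" articles, of which 21 were predicted as -1,
# 6 as 0, 7 as 1 and 4 as 2.
errors = {-1: 21, 0: 6, 1: 7, 2: 4}
total_true = 84
correct = total_true - sum(errors.values())
recall = correct / total_true
print(correct, round(recall, 3))   # 46 articles correct, recall ~0.548
```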

3.3 Validation of the model’s performance with historical water events

BERT, chosen as the representative model for its strength in predicting and identifying conflictive sentiment labels, was further validated against historical water events occurring within transboundary river basins, focusing on sentiment label − 2. Table 4 reports, for each basin, the number of news media articles with true sentiment label − 2, the number predicted as label − 2, the number of true label − 2 articles successfully predicted, and the resulting precision (Pre) and recall (Rec).

Table 4 Number of news media articles with true/predicted sentiment label − 2 in each river basin and the calculated precision (Pre) and recall (Rec)

BERT performed worst in predicting and identifying news media articles with sentiment label − 2 in the Lancang-Mekong River basin, with precision and recall of 33.3% and 22.2% respectively, followed by the Nile River basin, with precision and recall of 83.3% and 41.7%. The best performance was observed in the Indus River basin, with precision and recall of 84.6% and 63.5%, respectively. As the number of news media articles with true label − 2 in a basin increased, both precision and recall improved. BERT predicted a minority of the articles with true label − 2 as neutral or cooperative. In the Lancang-Mekong River basin, for instance, a news media article about donors slashing funding for the MRC was incorrectly predicted as neutral. In the Nile River basin, five articles with the most conflictive sentiment were predicted as neutral or cooperative; notably, two of them, concerning Ethiopia's giant dam construction, were predicted as the most cooperative sentiment. In the Indus River basin, some articles discussing the arbitration between India and Pakistan over hydropower projects were still predicted as cooperative. In such cases, decision-makers could be misled by BERT's predictions, resulting in biased decision-making. Overviews of the main contents of news media articles with true label − 1 that were predicted as label − 2 in the Lancang-Mekong, Nile and Indus River basins are given in Tables A1, A2 and A3 in the Appendix.

4 Discussion and conclusions

Understanding the dynamics of conflict and cooperation over water is crucial for global transboundary river management. This paper compared the performance of different machine learning models in analyzing the sentiment polarity of news media articles on transboundary water conflict and cooperation. We developed a large corpus of 9382 news media articles on transboundary water conflict and cooperation, collected from the LexisNexis database, covering 105 of the 310 transboundary river basins globally. Each news article in the corpus was manually labeled with a value of − 2, − 1, 0, 1 or 2, where a higher label indicates a more cooperative sentiment and a lower label a more conflictive one. A total of 10 well-known machine learning models, including K-Nearest Neighbors, Naive Bayes, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Decision Tree, Extreme Gradient Boosting, Multilayer Perceptron, Long Short-Term Memory and Bidirectional Encoder Representations from Transformers, were applied to the newly created corpus. The performance of each model was first assessed by four metrics, namely accuracy, precision, recall and F1-Score, then compared with the results of the dictionary approach, and finally validated against historical water events in three case river basins. The key findings and their implications for future research and management practices are summarized below.

Regarding the four metrics, KNN, DT and LSTM fell short, whereas RF, GBDT and XGBoost displayed relatively stronger performance, with BERT leading overall. As a deep learning model, LSTM would be expected to achieve higher accuracy than traditional machine learning models (Abd El-Jawad et al. 2018; Alameri and Mohd 2021; Dhola and Saradva 2021), yet it did not here. One potential reason is that LSTM requires a substantial amount of training data (Li et al. 2018; Derbentsev et al. 2023); in this study, 80% of the news text data from the corpus was used for training, which may be relatively limited for optimal LSTM performance. Another is that LSTM may struggle to process long texts: when the news text is exceptionally lengthy, important feature information can be lost (Rao et al. 2018; Zhai et al. 2023).
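The long-text issue can be sketched as follows; the 512-token limit is an assumed illustrative value, not the configuration used in this study.

```python
# Sketch of the fixed-length truncation that sequence models such as
# LSTM (and BERT) commonly apply to long inputs: tokens beyond the
# limit are simply dropped, so sentiment cues appearing late in a long
# news article never reach the model. The 512-token cap is illustrative.
MAX_TOKENS = 512

def truncate(tokens, max_len=MAX_TOKENS):
    return tokens[:max_len]

article = ["w%d" % i for i in range(1200)]   # a long news article
kept = truncate(article)
print(len(kept), 1200 - len(kept))   # 512 tokens kept, 688 discarded
```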

The performance of these models in predicting and identifying news media articles with conflictive sentiments is extremely important for transboundary water management, because predicting such articles as neutral or cooperative may lead to overlooking public opinions that require attention, downplaying real conflicts, and misleading decision-making. KNN, NB, DT and LSTM displayed limited applicability in this respect. BERT not only excelled in overall performance but also showed strong capabilities in predicting and identifying conflictive sentiments.

The comparison between BERT and the sentiment dictionary AFINN indicated that AFINN was not proficient at predicting news articles with conflictive sentiment polarities and may ignore public opinion information that requires attention. AFINN predicted the sentiment polarities of many news articles as neutral for two main reasons. On the one hand, AFINN's coverage of sentiment words may be insufficient, lacking many terms that appeared in the news media articles of this study, so that no sentiment scores could be calculated for those words. On the other hand, AFINN calculates sentiment scores directly by adding and subtracting the sentiment of each word, overlooking the semantic relations between sentences in the news text. The main difference between BERT and AFINN was that the former concentrated its errors on news articles marked by human annotators as neutral or hard to judge, while the errors of AFINN depended primarily on the features of its dictionary. Although BERT performed significantly better than AFINN on this task, it operates as a black-box model, so the interpretability of its output is relatively poor (Li et al. 2022). Dictionary-based methods, by contrast, rest on manually curated vocabularies and sentiment rules, which lend their results a certain level of interpretability. Moreover, they require no pre-tagging or labeling of data, saving time and effort in preparation. In scenarios with limited computational resources and smaller datasets, dictionary-based methods may therefore still be preferred, even at the cost of somewhat lower performance.
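The additive, word-by-word scoring described here can be sketched with a toy lexicon; the word list and scores below are illustrative and far smaller than the real AFINN dictionary.

```python
# Sketch of dictionary-based scoring in the style of AFINN, which sums
# per-word valence scores. The tiny lexicon is illustrative only.
LEXICON = {"conflict": -2, "dispute": -2, "cooperation": 2,
           "agreement": 1, "dam": 0}

def lexicon_score(text):
    words = text.lower().split()
    # Words absent from the lexicon contribute 0 — the coverage problem.
    return sum(LEXICON.get(w, 0) for w in words)

s = lexicon_score("dispute over dam escalates the conflict")
print(s)   # -4: word scores are simply added, with no sentence context
```

Both weaknesses discussed above are visible here: "escalates" carries clear negative meaning but scores 0 because it is missing from the lexicon, and the summation cannot capture semantics that span more than one word.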

The alignment of BERT with historical water events in three case river basins indicates that its performance varied across basins, with the best prediction and recognition of the most conflictive articles in the Indus River basin. This was attributed to the availability of sufficient training data for that basin, allowing the model to better learn the news text features associated with sentiment label − 2. The performance of BERT in the Lancang-Mekong River basin showed that when the sample of label − 2 articles in a basin is very small, the over-sampling method used during training cannot effectively improve the model's precision and recall. Given the current inability to improve model performance on small samples, machine learning models should therefore be applied cautiously to extreme sentiment monitoring and forecasting in transboundary rivers that lack enough news media articles with these extreme sentiments. To classify extreme sentiments accurately, the predictive results of machine learning models should be combined with manual review and examination, which maintains efficient automated processing while improving the accuracy and reliability of the classification.
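Why duplicating a handful of minority-class articles adds little can be seen in a minimal sketch of random over-sampling; this is a generic illustration under assumed data, not the exact over-sampling procedure used in the study.

```python
import random

# Sketch of random over-sampling of a minority class before training.
# With only a handful of "-2" articles, the added samples are exact
# duplicates carrying no new text features — one reason over-sampling
# alone cannot fix precision and recall for that label.
def oversample(samples, labels, minority, target_count, seed=0):
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    extra = [rng.choice(minority_idx)
             for _ in range(target_count - len(minority_idx))]
    new_samples = samples + [samples[i] for i in extra]
    new_labels = labels + [labels[i] for i in extra]
    return new_samples, new_labels

X = ["a", "b", "c", "d", "e"]
y = [0, 0, 0, 0, -2]           # a single minority example
X2, y2 = oversample(X, y, minority=-2, target_count=4)
print(y2.count(-2))   # 4 copies of the same article, no new information
```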

By assessing the models' performance in predicting the sentiments of historical news articles on transboundary water conflict and cooperation, we established a benchmark for their reliability and robustness. This is a critical step in demonstrating that the trained models can be trusted for real-time monitoring in future sentiment analysis. The capability to predict sentiment trends allows potential conflicts to be identified early, so that transboundary water issues can be addressed proactively rather than reactively. This historical perspective also helps policymakers and stakeholders engage more effectively with communities, building on historical patterns of public sentiment towards transboundary water issues and fostering more inclusive and participatory approaches to transboundary water resources management. As a result, the capability to predict sentiment trends and the lessons learned from historical events provided by this study enable policymakers to detect the social, economic and environmental risks of transboundary rivers at an early stage, foster cooperation among riparian countries, ensure the equitable distribution of water resources for regional stability, and build a community with a shared future for mankind.