1 Introduction

On 22 March 2018, U.S. President Trump signed a memorandum that would impose tariffs on 60 billion U.S. dollars of Chinese goods. China then responded with plans for retaliatory tariffs on 3 billion dollars of US products, which marked the beginning of the US-China Trade War [1]. As trade tensions between the world's two largest economies grow, a wide-ranging discussion about the causes and impacts of this trade war is heating up.

In the information era, the Internet has become an important place for people to express opinions, share ideas and discuss issues. A variety of applications such as Weibo, WeChat and Twitter make such communication even more convenient. After the US-China Trade War began, more and more people used these applications to express their attitudes towards it [2, 3].

This paper collects Weibo data about the US-China Trade War and applies natural language processing methods and tools, including the "Jieba" Chinese text segmentation system, TF-IDF text vectorization and the K-Means algorithm, to analyze people's attitudes towards the trade war, laying a foundation for predicting its future course.

This paper is divided into six sections:

  • Section 1: Introduction. It introduces the research background, significance and structure of this paper.

  • Section 2: Related Research. It describes clustering algorithms for topic analysis and existing research on the US-China trade war.

  • Section 3: Data Collection. It describes the data collection process, including the challenges, methods and results.

  • Section 4: Data Cleaning. It discusses the data cleaning methods and shows the results of data pre-processing.

  • Section 5: Data Mining. It introduces the K-Means algorithm and uses it to extract keywords and topics from the collected data.

  • Section 6: Conclusion. It summarizes the conclusions of the data mining.

2 Related Research

There are many research papers on the US-China trade war. Some focus on the causes, consequences and countermeasures of the trade war. For instance, Luo [4] and Wu [5] use strategic game analysis to examine why the US started this trade war and how China could respond. Other papers pay attention to the influence on international trade: Chen takes forest products as an example to discuss the influence of the trade war [6], and Bai analyzes changes in soybean trade after the trade war started [7]. However, there are no papers that use social media data to analyze public attitudes, which could influence government decisions and thus change the course of this trade war.

Topic analysis of social media data tends to involve large amounts of data. Supervised learning methods need labeled data, which makes them unsuitable for this kind of topic analysis. Clustering is an unsupervised learning method that is popular in text analysis because it does not require spending large amounts of time and effort labeling data. Three algorithms are mainly used in text clustering: K-Means, hierarchical clustering and DBSCAN. Shiva Shankar uses the K-Means algorithm on tweets to recognize news topics pervasive in both social media and the news media [8]. Metre uses hierarchical clustering to measure the similarity between text documents [9]. Pandey uses DBSCAN for topic analysis of georeferenced documents from tweets [10]. This paper analyzes short texts from Weibo, which have a large number of features, and the K-Means algorithm needs less time and storage space than hierarchical clustering and DBSCAN, so K-Means is adopted here.

3 Data Collection

Either a web crawler or the API provided by Weibo could be used to collect data about the US-China Trade War. The API is easy to use, but it places many restrictions on topic search. For instance, only the latest 200 results of a topic search can be returned, and only a limited number of requests can be made at a time [11]. Furthermore, the API cannot specify the type or publication time of the posts returned by a topic search. These restrictions preclude the large data volume that the following data mining relies on. As a result, this paper uses a web crawler for data collection.

This paper designs a web crawler in Python to collect Weibo data. The process can be divided into three steps:

The first step is to get the web pages. Because JavaScript technology such as Ajax is used on Weibo to modify web pages dynamically [12], this paper imports the Selenium module into the Python crawler so that complete web pages can be accessed [13]. The search keyword is set to "trade war". It is noteworthy that each Weibo keyword search returns at most fifty pages of results, so this paper limits the publication time scope of each search to one day and performs 30 searches in total, covering 22 March 2018 to 21 April 2018, in order to obtain a large data volume. In addition, Weibo takes many anti-spider measures to prevent large-volume data acquisition. If the crawler browses Weibo's web pages too frequently, Weibo returns pages with no results, or even pops up a validation page to interrupt page acquisition. To solve this problem, this paper introduces a time delay to slow down the browsing speed and repeatedly requests the URLs of pages that return no results.
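A minimal sketch of this page-acquisition step is shown below, assuming Selenium with the Chrome driver; the search URL pattern, the "card-wrap" result marker and the retry policy are illustrative assumptions rather than the exact implementation used for this paper.

```python
import random
import time

from selenium import webdriver

# Illustrative Weibo search URL for the keyword "trade war" (贸易战), with a
# one-day timescope and a page number filled in at request time.
SEARCH_URL = ("https://s.weibo.com/weibo?q=%E8%B4%B8%E6%98%93%E6%88%98"
              "&timescope=custom:{start}:{end}&page={page}")

def fetch_search_pages(start, end, max_pages=50, base_delay=3.0):
    """Fetch up to 50 result pages for one day's search, with delays and retries."""
    driver = webdriver.Chrome()
    pages = []
    try:
        for page in range(1, max_pages + 1):
            url = SEARCH_URL.format(start=start, end=end, page=page)
            for _ in range(3):                 # retry pages that come back empty
                driver.get(url)
                # time delay to slow browsing and avoid anti-spider measures
                time.sleep(base_delay + random.random() * 2)
                html = driver.page_source
                if "card-wrap" in html:        # assumed marker that result cards are present
                    pages.append(html)
                    break
    finally:
        driver.quit()
    return pages

# example: one day's search within the collection window
pages = fetch_search_pages("2018-03-22-0", "2018-03-22-23")
```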

Secondly, this paper extracts information from the collected web pages. The nickname, publication time and content of each collected blog are needed.

Finally, this paper uses Microsoft Office Excel to store the collected data; 19201 blogs in total are crawled.
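A sketch of the extraction and storage steps follows, assuming BeautifulSoup for parsing and pandas for writing the Excel file; the CSS selectors and column names are illustrative placeholders rather than Weibo's exact markup.

```python
import pandas as pd
from bs4 import BeautifulSoup

def extract_blogs(html_pages):
    """Pull nickname, publication time and content out of each saved search page."""
    records = []
    for html in html_pages:
        soup = BeautifulSoup(html, "html.parser")
        for card in soup.select("div.card-wrap"):        # one card per blog (placeholder selector)
            name = card.select_one("a.name")
            time_tag = card.select_one("p.from a")
            text = card.select_one("p.txt")
            if name and time_tag and text:
                records.append({
                    "nickname": name.get_text(strip=True),
                    "published_time": time_tag.get_text(strip=True),
                    "content": text.get_text(strip=True),
                })
    return records

# store the extracted records in an Excel workbook
blogs = extract_blogs(pages)                              # `pages` from the crawling sketch above
pd.DataFrame(blogs).to_excel("weibo_trade_war.xlsx", index=False)
```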

4 Data Cleaning

First, this paper uses Excel's built-in functions to check for and delete collected blogs that are unsuitable for data mining. Blogs with missing values are found and deleted with Excel's locate function, and duplicated blogs are removed with Excel's deduplication function. Besides, during data collection the publication time scope of the blogs was set from 22 March 2018 to 21 April 2018, so it is necessary to check for blogs outside this time scope and delete them. After cleaning, 18813 blogs remain.
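The same checks can also be reproduced programmatically; below is a minimal pandas sketch, assuming the workbook and column names from the collection sketch above.

```python
import pandas as pd

df = pd.read_excel("weibo_trade_war.xlsx")

# drop blogs with missing nickname, publication time or content
df = df.dropna(subset=["nickname", "published_time", "content"])

# drop duplicated blogs
df = df.drop_duplicates()

# keep only blogs published inside the collection window
df["published_time"] = pd.to_datetime(df["published_time"], errors="coerce")
df = df[(df["published_time"] >= "2018-03-22") &
        (df["published_time"] < "2018-04-22")].reset_index(drop=True)
```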

Most Weibo blogs are written in Chinese, which, unlike English, has no spaces between words, so word segmentation is required. Many open source Chinese text segmentation tools have sprung up recently. Among them, "Jieba" is a very popular tool: it not only has over 16000 stars on GitHub [14], but also provides many functions built on text segmentation, such as keyword extraction and part-of-speech tagging [15]. Based on these considerations, this paper chooses the "Jieba" module to segment the Chinese blogs.

After text segmentation, many words that express people's attitudes towards the US-China Trade War become available. However, there are still special characters (e.g. "@" and "/") and meaningless words such as "of (的)", "I (我)" and "you (你)" among them, which need to be filtered out before data mining. This paper uses Python's regular expression functions to delete special characters and then employs a stop word dictionary to remove meaningless words.
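A minimal sketch of this cleaning pipeline is given below, assuming a stop word dictionary stored as a plain-text file (one word per line, file name is a placeholder) and the `df` table from the previous sketch; the regular expression shown is illustrative.

```python
import re

import jieba

# load the stop word dictionary
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def clean_and_segment(text):
    """Strip special characters, segment with Jieba, and drop stop words."""
    text = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", " ", text)   # keep Chinese characters, letters, digits
    tokens = jieba.lcut(text)
    return [t for t in tokens if t.strip() and t not in stopwords]

df["tokens"] = df["content"].apply(clean_and_segment)
```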

Python has many visualization tools that help recognize the words that come up most frequently. For example, WordCloud is a very popular tool for displaying word frequency visually [16]. Based on the word frequency map, unwanted high-frequency words such as nicknames, web links and hyperlinks are identified, and these meaningless words are added to the stop word dictionary so that they are removed automatically. In this way, people's attitudes towards the trade war can be captured more precisely. The word frequency maps before and after the manual check are shown in Figs. 1 and 2.
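A hedged sketch of how such a word frequency map can be produced with the WordCloud module is shown below; the font path is a placeholder that must point to a font containing Chinese glyphs.

```python
from collections import Counter

from wordcloud import WordCloud

# count word frequencies over all cleaned blogs
freq = Counter(token for tokens in df["tokens"] for token in tokens)

wc = WordCloud(font_path="simhei.ttf",          # placeholder path to a Chinese font
               width=800, height=600, background_color="white")
wc.generate_from_frequencies(freq)
wc.to_file("word_frequency_map.png")
```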

Fig. 1. Word frequency map before manual check

Fig. 2. Word frequency map after manual check

5 Data Mining

5.1 Text Vectorization

In order to do data mining with K-Means, the collected blog content must be converted into vectors. The TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is a very popular text vectorization method composed of two parts:

The first is TF (Term Frequency), which calculates the frequency of each word in each document [17]. The more frequently a word appears in a document, the more important the word is for that document.

$$ tf_{ij} = \frac{{n_{ij} }}{{\sum\nolimits_{k} {n_{kj} } }} $$
(1)

\( tf_{ij} \) is the frequency of word i in document j, \( n_{ij} \) is the count of word i in document j, and \( \sum\nolimits_{k} {n_{kj} } \) is the total count of words in document j.

The second is IDF (Inverse Document Frequency), which measures how many documents contain a certain word. The less frequently a word appears across all documents, the more important it is for the documents in which it does appear.

$$ idf_{i} = \log (\frac{\left| D \right|}{{1 + \left| {D_{i} } \right|}}) $$
(2)

\( \left| D \right| \) is the total number of documents, and \( \left| {D_{i} } \right| \) is the number of documents that contain word i.

TF is combined with IDF to perform text vectorization:

$$ tf \times idf(i,j) = tf_{ij} \times idf_{i} = \frac{{n_{ij} }}{{\sum\nolimits_{k} {n_{kj} } }} \times \log (\frac{\left| D \right|}{{1 + \left| {D_{i} } \right|}}) $$
(3)

When this is finished, an 18813 × 45054 matrix is available, meaning that the 18813 blogs share a bag of 45054 words, or features.
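A minimal sketch of this vectorization step using scikit-learn's TfidfVectorizer is given below (note that scikit-learn's IDF formula adds smoothing terms and so differs slightly from Eq. (2)); the segmented blogs from the cleaning step are joined back into space-separated strings first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# join the Jieba tokens back into space-separated pseudo-documents
documents = [" ".join(tokens) for tokens in df["tokens"]]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)   # sparse matrix of shape (n_blogs, n_terms)
print(tfidf_matrix.shape)                            # roughly (18813, 45054) for the corpus described above
```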

5.2 Clustering Based on K-Means

For a feature matrix of 18813 × 45054, there are too many features, which harms subsequent storage and computation. To solve this problem, this paper performs dimensionality reduction to improve computation speed. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are the most popular dimensionality reduction methods, but the feature matrix is sparse, which makes it unsuitable for PCA [18]. Therefore, this paper uses SVD to remove the less important features.
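Scikit-learn's TruncatedSVD works directly on sparse matrices, so the reduction step can be sketched as follows (100 components are shown here, matching the feature size ultimately chosen below; the random seed is an assumption).

```python
from sklearn.decomposition import TruncatedSVD

# truncated SVD applied directly to the sparse TF-IDF matrix
svd = TruncatedSVD(n_components=100, random_state=42)
reduced = svd.fit_transform(tfidf_matrix)            # dense 18813 x 100 matrix
```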

To select an appropriate number of features, this paper evaluates the performance of K-Means clustering for different feature sizes. A popular assessment criterion for clustering is the silhouette coefficient, denoted s [19].

$$ s_{i} = (b_{i} - a_{i} )/{ \hbox{max} }(a_{i} ,b_{i} ) $$
(4)

\( a_{i} \) is the mean distance between blog i and the other blogs in the same cluster as blog i. \( b_{i} \) is the minimum, taken over the clusters that do not contain blog i, of the mean distance between blog i and the blogs in that cluster. The mean silhouette coefficient over all blogs is used to evaluate the performance of clustering; it lies between −1 and 1, and the larger the value, the better the clustering. The silhouette coefficient values are shown in Fig. 3.

Fig. 3. Silhouette coefficient based on different sizes of features and clusters
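A hedged sketch of how such a comparison can be produced with scikit-learn follows; the exact parameter grid, random seeds and silhouette subsampling are assumptions rather than the settings used for Fig. 3.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import silhouette_score

scores = {}
for n_features in (100, 300, 500):                 # candidate feature sizes after SVD
    X = TruncatedSVD(n_components=n_features, random_state=42).fit_transform(tfidf_matrix)
    for n_clusters in range(2, 16):                # candidate numbers of clusters (assumed grid)
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(X)
        # subsample when scoring to keep the O(n^2) silhouette computation tractable
        scores[(n_features, n_clusters)] = silhouette_score(
            X, labels, sample_size=5000, random_state=42)
```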

As can be seen from Fig. 3, the silhouette coefficient based on 100 feature dimensions is better than that based on 300 or 500 dimensions. In addition, with the number of dimensions set to 100, the silhouette coefficient is suitable for 8-11 clusters, so the number of clusters is set to 10. The clusters and their keywords are described in Table 1. Because of limited space, only 4 clusters are shown.

Table 1. Clusters and keywords
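One way to recover keywords for each cluster (the paper does not detail its exact extraction method) is to map each cluster centre back into the original term space; the sketch below assumes the 100-dimensional `svd`, `reduced` and `vectorizer` objects from the earlier sketches.

```python
import numpy as np
from sklearn.cluster import KMeans

# final model: 100 SVD dimensions and 10 clusters
km = KMeans(n_clusters=10, n_init=10, random_state=42)
labels = km.fit_predict(reduced)

# project cluster centres back into the 45054-dimensional term space
centres = svd.inverse_transform(km.cluster_centers_)
terms = vectorizer.get_feature_names_out()

for c in range(10):
    top = np.argsort(centres[c])[::-1][:10]        # ten highest-weighted terms per cluster
    print(f"cluster {c}:", [terms[i] for i in top])
```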

5.3 Topic Analysis

Based on the keywords in each cluster, topics can be summarized. For example, cluster one pays attention to the influence of the trade war on the stock market and on trade, as well as Trump's tariff policy. Cluster two concerns which stock varieties, such as pharmaceutical shares, could make more money in the short and medium term in the context of the trade war. The main point of cluster three is that the trade war is bad news for the financial investment market, which calls for a wait-and-see attitude and corresponding measures to offset the risks. Cluster five includes two subjects: one is about trading positions, full, short or otherwise; the other focuses on China's confidence in fighting back against the US.

6 Conclusion

This paper uses a Python crawler to collect Weibo blogs about the "US-China Trade War", cleans the data with Excel's built-in functions, the "Jieba" text segmentation module and the WordCloud module, performs text vectorization with the TF-IDF algorithm, produces clusters with the K-Means algorithm, and discusses the topics in each cluster. In a nutshell, Weibo blogs on the trade war focus on its influence on the financial investment market. In particular, many people pay attention to stock varieties and positions in the short and medium term in order to profit in the context of the trade war. In addition, it is widely believed that China can cope with the difficulties even though the trade war has a bad influence on trade and the economy. The topic analysis above shows that most Chinese people tend to be optimistic and confident about the "US-China Trade War", which helps China respond to the trade war rationally.