
1 Introduction

Social media provides rich and timely feedback on news events taking place around the world. According to a report from the Pew Research Center, 63 % of Twitter and Facebook users accessed news online, and roughly a quarter of them actively expressed their opinions on daily news through these social applications [2]. This user-generated content (UGC) not only enriches the news with different events and perspectives, but also spurs additional coverage of an event. Conversely, reading social media and understanding and responding to the public voice in a timely and objective manner can help news media strengthen their influence on the public.

Example. During the Asia-Pacific Economic Cooperation (APEC) 2014 summit in China, regional cooperation and the global economy were the topics the news media were expected to report on. In fact, however, social media users posted a significant amount of comments on the APEC blue, the rare blue sky over Beijing during the summit caused by an emission reduction campaign directed by the Chinese government. The news media quickly followed and paid great attention to this topic, which was beyond the original news agenda: we found that 38 of the 176 news articles on Sina were related to the APEC blue. Furthermore, how news responds to the public voice has a significant impact on government credibility. For example, two severe earthquakes struck China in 2014, in Yunnan and in Sichuan. Both reports covered the major topics, but in the Yunnan earthquake the news media responded to the public in a timely manner and pictured a comprehensive image of the event's progress from the public's perspective, and thus harvested better support from the public. Therefore, investigating the public influence on news is of great benefit to public opinion management and the improvement of government credibility.

Related research on the joint analysis of news and UGC streams mainly follows three lines. The first studies event evolution within an individual news stream [1, 8]; e.g., Mei and Zhai adapt PLSA to extract topics in a news stream, then identify coherent topics over time, and finally analyze their lifecycles [17]. The second focuses on simultaneously modeling multiple news streams, e.g., identifying the characteristics of social media and news media [26] or their common and private features [10, 23]. However, both lines pay little attention to the interactions between the two streams, which could inspire their co-evolution. The last line comes from journalism and communication. It applies agenda setting theory [16] to analyze the interactions between different news agendas, and it is often carried out via questionnaire surveys or manual work on a limited number of events. In the era of social media, however, agenda setting is not a one-way pattern from news to the public, but rather a complex and dynamic interaction.

Detecting the public influence on news poses unique technical challenges: (i) most research uses latent topics to model news and UGC, but the traditional word distribution representation [5, 17] suffers from the sparsity problem caused by the short and fragmented nature of UGC, making it difficult to track topic changes; (ii) how to detect the cross-media influence links remains another problem, since the commonly used measures (e.g., KL-divergence [12, 17]) often lead to heuristic results without statistical explanation.

In this paper, we propose a novel topic-aware dynamic Granger test framework to automatically study the public influence on news media. To address the sparsity problem, we first represent words as low-dimensional vectors through the skip-gram model [19], and further transform the word representation via sparse coding so that each dimension captures a latent semantic. We then employ the Granger causality test [9] to detect the public influence on news in a theoretically grounded way. In particular, for a pair of topics extracted from UGC and news respectively, we propose a topic-aware dynamic strategy that chronologically splits the topic-related documents into disjoint bins with dynamic time intervals, computes the topic representations based on the documents falling into each bin, and applies the multivariate Granger test to judge whether a UGC-to-news influence exists. Finally, we quantify the influence [12] based on the discovered influence links, and validate the influence measures by calculating their correlations with the professional, manually compiled results provided by China Youth Online.

The main contributions can be summarized as follows:

  • We address the problem of analyzing public influence on news through a unified Granger-based framework. Extensive experiments are conducted on 45 real-world events to demonstrate its effectiveness, and the results can provide useful guidance on handling public hot topics in event reporting.

  • We propose a novel textual feature extraction method. Instead of directly using the popular word2vec, it further maps words and documents into a low-dimensional space with each dimension denoting a more compact semantic, which facilitates topic extraction and representation.

  • To track the temporal changes of a topic pair from news and UGC respectively, we propose a novel topic-aware dynamic binning strategy that splits both streams into chronological bins to achieve smooth topic representations for each bin.

The rest of the paper is organized as follows. In Sect. 2, we define the related concepts and the problem of influence analysis from UGC to news. Section 3 presents our proposed textual feature extraction method and the Granger-based influence analysis. Our experimental results are reported in Sect. 4. Section 5 reviews the related literature, and finally Sect. 6 concludes this paper with future research directions.

2 Problem Definition

A particular event often brings forth two correlated streams: news articles from the newsroom form a news stream, and the public voice from different social applications converges into a UGC stream. Both the news stream \( NS \) and the UGC stream \( US \) are text streams, defined as follows:

Definition 1

Text Stream. A text stream \(TS=\langle s_{1},s_{2},\ldots ,s_{n} \rangle \) is a sequence of documents, where each \(s_{i}\ (i=1, 2, \ldots , n)\) is associated with a pair (\(d_{i}\), \(t_{i}\)): \(d_{i}\) is a document comprising a list of words and \(t_{i}\) is the publication time, with documents ordered by non-descending time, i.e. \(t_{i} \le t_{i+1}\).

It has been shown that news and UGC streams are mutually dependent [24]. Topics, which bridge these two different streams, play an important role. In order to study the cross-stream interactions, we first define a topic as follows:

Definition 2

Topic. Conceptually, a topic z expresses an event-related subject/theme within a time period. Mathematically, a topic \(\mathbf {z}\) is characterized as a vector with each dimension denoting a word feature or a latent aspect. A topic z covers multiple documents (news articles or user comments).

The interaction between media, the public and government has been theoretically studied in journalism and communication; e.g., agenda setting theoryFootnote 1 evaluates the ability of mass media to influence the salience of topics among the public [15]. Nowadays, the proliferation of social media is changing the way news diffuses, i.e., the public may in turn affect or even drive the news media. It is useful to explore to what extent traditional news depends on social media and how long the public influence lasts; we thus formulate the following research problem.

Definition 3

Analyzing Public Influence on News. Given a news stream \( NS \) and a UGC stream \( US \), influence analysis from UGC to news aims to discover a set of influence links \(\{(\mathbf {z}_{u},\mathbf {z}_{n},\zeta )\}\), where \(\mathbf {z}_{u}\in Z_{u}\) and \(\mathbf {z}_{n}\in Z_{n}\) are topics extracted from \( US \) and \( NS \) respectively, and \(\zeta \in \{0,1\}\) indicates whether \(\mathbf {z}_{u}\) influences (or contributes to) \(\mathbf {z}_{n}\).

From the definition above, topic representation and extraction, and influence detection are the two major steps needed to complete this novel task. As mentioned in the introduction, existing methods suffer from various technical deficiencies, i.e., sparse representation and the lack of a theoretical foundation. To tackle these issues, we focus on the following two problems: (i) given news and UGC streams, properly represent the documents and extract latent topics from both streams; (ii) given a topic pair \((\mathbf {z}_{u},\mathbf {z}_{n})\), determine whether a causality link exists and provide a statistical evaluation of how \(\mathbf {z}_{u}\) contributes to \(\mathbf {z}_{n}\).

3 Our Approach

In this paper, we propose a topic-aware dynamic Granger-based method to automatically detect the influence from UGC to news. Specifically, we develop a text representation method to better represent news and UGC in a low-dimensional space and extract their corresponding topics (Sect. 3.1). We then incorporate temporal information to transform news and UGC topics into serialized representations and apply the Granger causality test to detect the public influence on news (Sect. 3.2).

3.1 Text Representation and Topic Extraction

Text representation and topic extraction aims to properly represent the documents in \( NS \) and \( US \) and extract the topic sets \(Z_n\) and \(Z_u\). However, traditional TF-IDF representations suffer from the curse of dimensionality and the feature independence assumption when dealing with short and fragmented UGC. These methods often ignore the semantic relationships among word features, which leads to sparse document representations with many zero feature values.

To alleviate the sparse representation, many methods have been proposed to unveil the hidden semantics of words, such as topic models (e.g., LDA [3]) and external knowledge enrichment (e.g., ESA [7]). However, topic models rely heavily on word co-occurrence, which cannot be accurately estimated on short texts, while ESA requires plenty of high-quality knowledge that is often not available in practice. In this paper, we propose a novel textual feature extraction pipeline, which gradually maps words and documents into a low-dimensional space where each dimension represents a unique semantic meaning. It consists of the following three steps:

Word Vectorization. Words are the basic elements of text, so we first transform words into continuous low-dimensional vectors. Let V denote the vocabulary of \( NS \) and \( US \); we employ the skip-gram model [19] to learn a mapping function \(V \rightarrow \mathbb {R}^M\), where \(\mathbb {R}^M\) is an M-dimensional vector space. Specifically, given a document \(s\in NS\cup US\) associated with a word sequence \(\langle w_{1},w_{2},\ldots , w_{W}\rangle \), the skip-gram model maximizes the co-occurrence probability among words that appear within a contextual window k:

$$\begin{aligned} \mathop {\max }_{\mathbf {w}} \frac{1}{W}\sum _{i=1}^{W}\sum _{j=i-k, j\ne i}^{j=i+k} \log p(w_{j}|w_{i}) \end{aligned}$$
(1)

The probability \(p(w_{j}|w_{i})\) is formulated as:

$$\begin{aligned} p(w_{j}|w_{i}) = \frac{\exp (\mathbf {w}_{j}^{\mathrm {T}}\mathbf {w}_{i})}{\sum _{l=1}^{V}\exp (\mathbf {w}_{l}^{\mathrm {T}}\mathbf {w}_{i})} \end{aligned}$$
(2)

where \(\mathbf {w}_{i}\in \mathbb {R}^M\) is the M-dimensional representation of \(w_{i}\).
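For concreteness, the following is a minimal sketch of this step using the gensim skip-gram implementation (the same library used in Sect. 4.2); gensim \(\ge \) 4 is assumed, the corpus file, tokenization and window size are placeholders, and only \(M=200\) follows the setting reported later.

```python
# Minimal sketch of word vectorization with gensim's skip-gram implementation
# (gensim >= 4 is assumed; the corpus file and window size are placeholders,
# only M = 200 follows the setting reported in Sect. 4.2).
from gensim.models import Word2Vec

# Documents from NS and US, assumed to be pre-tokenized, one per line.
corpus = [line.split() for line in open("news_and_ugc.txt", encoding="utf-8")]

model = Word2Vec(
    sentences=corpus,
    vector_size=200,   # M: dimensionality of the word vectors
    window=5,          # k: contextual window of Eq. (1)
    sg=1,              # skip-gram objective (rather than CBOW)
    min_count=1,
    workers=4,
)

word_vectors = {w: model.wv[w] for w in model.wv.index_to_key}  # V -> R^M
```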

Mid-level Feature Learning. Intuitively, a document representation can be obtained by composing its word vectors. However, each dimension of a word vector represents a latent meaning, and a word's semantics are scattered over almost all dimensions, so simple composition of individual word vectors ignores the potential correlations between dimensions [20]. To prevent the information loss caused by simple composition, we reconstruct each word vector into a mid-level feature [4], where each dimension represents a unique dense semantic. In other words, we learn a mapping \(\mathbb {R}^M \rightarrow \mathbb {R}^N\), which is typically a sparse coding problem with the objective:

$$\begin{aligned} \mathop {\min }_{\mathbf {W^{*},D}} \sum _{i=1}^{V}{\Vert \mathbf {w}_{i} - \mathbf {D}\mathbf {w}_{i}^{*}\Vert }_{2}^{2} + \lambda {\Vert \mathbf {w}_{i}^{*}\Vert }_{1} \end{aligned}$$
(3)

where \(\mathbf {w}_{i}\in \mathbb {R}^M\) is the vector obtained in word vectorization; \(\mathbf {w}_{i}^{*}\in W^{*}\subseteq \mathbb {R}^N\) is the N-dimensional sparse representation (\(N > M\)); \(\mathbf {D}\) is an \(M\times N\) matrix with each column denoting a dense semantic; \({\Vert \cdot \Vert }_{1}\) denotes the \(\ell _{1}\)-norm of the input vector; and \(\lambda > 0\) is a hyperparameter controlling the sparsity of the resulting representation, i.e., a larger (or smaller) \(\lambda \) induces more (or less) sparseness in \(\mathbf {w}_{i}^{*}\). Because the vocabulary V usually contains tens of thousands of words, optimizing this non-convex problem directly would be very time-consuming.

To solve the problem efficiently, we apply a two-step approximation method. First, we learn the matrix \(\mathbf {D}\) offline: we cluster the learned word vectors into N clusters through K-means, where each cluster denotes a compact semantic, and use the cluster centers as the columns of \(\mathbf {D}\). Second, based on the assumption that locality is more essential than sparsity [22], we select the K-nearest neighbors in \(\mathbf {D}\) for each word vector \(\mathbf {w}_{i}\) based on Euclidean distance, and then adopt Locality-constrained Linear Coding (LLC) to learn its transformation \(\mathbf {w}_{i}^{*}\):

$$\begin{aligned} \begin{aligned}&\mathop {\min }_{\mathbf {W^{*}}} \sum _{i=1}^{V}{\Vert \mathbf {w}_{i} - \mathbf {B}_{i}\mathbf {w}_{i}^{*}\Vert }_{2}^{2} + \lambda {\Vert \mathbf {w}_{i}^{*}\Vert }^{2}_{2}\\&s.t.\mathbf {1}^{\mathrm {T}}\mathbf {w}_{i}^{*} = 1, \forall i \end{aligned} \end{aligned}$$
(4)

where \(\mathbf {B}_{i}\) consists of the K-nearest neighbors of \(\mathbf {w}_{i}\) in \(\mathbf {D}\). The problem can be solved analytically by:

$$\begin{aligned} \begin{aligned}&\widehat{\mathbf {w}_{i}^{*}} =(\mathbf {V}_{i} + \lambda \mathbf {I}) \backslash \mathbf {1} \\&\mathbf {w}_{i}^{*}=\widehat{\mathbf {w}_{i}^{*}}/\mathbf {1}^{\mathrm {T}}\widehat{\mathbf {w}_{i}^{*}} \end{aligned} \end{aligned}$$
(5)

where \(\mathbf {V}_{i}={(\mathbf {B}_{i}-\mathbf {1}\mathbf {w}^{\mathrm {T}}_{i})}^\mathrm {T}(\mathbf {B}_{i}-\mathbf {1}\mathbf {w}^{\mathrm {T}}_{i})\) denotes the covariance matrix.
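A minimal sketch of this two-step approximation is given below, assuming scikit-learn and NumPy; the dictionary size, number of neighbors and \(\lambda \) are illustrative values taken from Sect. 4.2, and variable names are our own.

```python
# Sketch of the two-step approximation: learn D offline with K-means, then
# encode each word vector with Locality-constrained Linear Coding using the
# closed-form solution of Eq. (5). scikit-learn/NumPy are assumed; N, K and
# lambda are illustrative values from Sect. 4.2.
import numpy as np
from sklearn.cluster import KMeans

def learn_dictionary(word_vecs, N=512, seed=0):
    """Cluster M-dimensional word vectors into N centers; returns D (M x N)."""
    km = KMeans(n_clusters=N, random_state=seed, n_init=10).fit(word_vecs)
    return km.cluster_centers_.T

def llc_encode(w, D, K=5, lam=1e-4):
    """Encode one word vector w (length M) as a sparse N-dimensional code."""
    N = D.shape[1]
    # Locality constraint: keep only the K nearest dictionary columns.
    dists = np.linalg.norm(D - w[:, None], axis=0)
    idx = np.argsort(dists)[:K]
    B = D[:, idx]                          # local base B_i (M x K)
    Z = B - w[:, None]                     # shifted base
    cov = Z.T @ Z                          # covariance matrix V_i of Eq. (5)
    c = np.linalg.solve(cov + lam * np.eye(K), np.ones(K))
    c /= c.sum()                           # enforce 1^T w_i* = 1
    code = np.zeros(N)
    code[idx] = c                          # sparse code in R^N
    return code
```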

Document Representation and Topic Extraction. We employ spatial pooling to represent each document as an N-dimensional vector in \(\mathbb {R}^N\) based on the learned sparse word vectors. Given a document \(s_{i}\) consisting of W words with vector representations \(\mathbf {w}_{k}^{*}, k = 1,2,\ldots ,W\), we try two different pooling functions to obtain the document representation \(\mathbf {s}_{i}\):

$$\begin{aligned} s_{ij} = \mathop {\underbrace{\frac{1}{W}\sum \nolimits _{k=1}^{W}|\mathbf {w}_{kj}^{*}|}}_{average}\quad or\quad s_{ij} = \mathop {\underbrace{\sqrt{\frac{1}{W}\sum \nolimits _{k=1}^{W}{\mathbf {w}_{kj}^{*}}^2}}}_{square\ root} \end{aligned}$$
(6)

where \(\mathbf {s}_{i}\) denotes the final representation of \(s_{i}\) and \(s_{ij}|_{j=1}^{N}\) is its j-th entry. Note that the two pooling functions make different assumptions about the underlying distribution. Once the document representation is complete, we feed the news and comment vectors into the K-means algorithm separately to obtain the topic sets \(Z_{n}\) and \(Z_{u}\). The resulting topics have more compact distributed representations than TF-IDF, which is convenient for further computation and analysis.
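A brief sketch of this step is shown below; the container `doc_codes`, which is assumed to hold one \(W\times N\) array of LLC codes per document, and the helper names are illustrative.

```python
# Sketch of document representation via the pooling functions of Eq. (6)
# followed by K-means topic extraction. `doc_codes` is assumed to hold, for
# each document, a (W x N) array of its words' LLC codes.
import numpy as np
from sklearn.cluster import KMeans

def pool_document(codes, mode="sqrt"):
    """Average pooling or square-root pooling over a document's word codes."""
    if mode == "average":
        return np.mean(np.abs(codes), axis=0)
    return np.sqrt(np.mean(codes ** 2, axis=0))

def extract_topics(doc_codes, n_topics, mode="sqrt", seed=0):
    """Pool each document into R^N, then cluster documents into topics."""
    X = np.vstack([pool_document(c, mode) for c in doc_codes])
    km = KMeans(n_clusters=n_topics, random_state=seed, n_init=10).fit(X)
    return X, km.labels_                    # document vectors, topic labels
```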

3.2 Topic Influence Detection

Topic influence detection analyzes the relationship between news and UGC topics, which manifests as inter-stream influence. Normally, KL-divergence is employed to evaluate topic transition within a news stream [13, 17] or topic interaction across streams [12], but the idea is heuristic and the results are often restricted to time periods too short to track topic evolution.

Therefore, we perform the influence detection in a more principled way through the Granger causality testFootnote 2. Its basic idea is that a cause should be helpful in predicting the future values of a time series, beyond what can be predicted solely from its own historical values [9]. That is, a time series x is said to Granger-cause another time series y if and only if regressing y on the past values of both y and x is statistically significantly more accurate than regressing y on the past values of y only.

Granger-Based Influence Detection. In this paper, Granger causality analysis is performed on two topics \(z_{u}\in Z_{u}\) and \(z_{n}\in Z_{n}\) to test whether \(z_{u}\) is the Granger cause of \(z_{n}\).

The previous subsection gives us the news and UGC topic sets and their associated documents, but the Granger causality test requires two time series. We therefore need to turn the topics in \(Z_{n}\) and \(Z_{u}\) into time-varying topic series. For each \(\mathbf {z} \in Z_{n}\cup Z_{u}\), we represent it as \(\langle \mathbf {z}^{t}\rangle _{t=1}^{T}\), where \(\mathbf {z}^{t}\) is the status of topic z in the t-th interval and T is the number of time intervals. A straightforward way is to partition both streams into disjoint slices with fixed time intervals (e.g., one day), i.e., equal-size binning. An alternative is equal-depth binning, i.e., evenly partitioning all documents into T bins. For an obtained partition \(\langle S^{t}\rangle _{t=1}^{T}\), the representation of topic z in the t-th bin, \(\mathbf {z}^{t}\), can simply be computed by averaging the related document vectors within that bin:

$$\begin{aligned} \mathbf {z}^{t} = \frac{1}{|S_{z}^{t}|}\sum \nolimits _{s\in S_{z}^{t}}\mathbf {s} \end{aligned}$$
(7)

where \(S_{z}^{t}\) denotes the documents within the t-th bin that are related to topic z.
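A minimal sketch of this serialization step is given below, showing equal-size (fixed-interval) binning; any other partition plugs into the same averaging of Eq. (7). The data layout (a list of timestamp/vector pairs with numeric timestamps) is an assumption for illustration.

```python
# Sketch of turning a topic into a time series: partition its documents into
# T chronological bins and average the document vectors per bin (Eq. (7)).
# Equal-size (fixed-interval) binning is shown; any other partition plugs
# into the same averaging step. `docs` is a list of (timestamp, vector) pairs
# with numeric timestamps (e.g., days since the event start).
import numpy as np

def topic_series_equal_size(docs, T):
    docs = sorted(docs, key=lambda d: d[0])
    t0, t1 = docs[0][0], docs[-1][0]
    width = (t1 - t0) / T or 1.0                       # interval length
    dim = len(docs[0][1])
    bins = [[] for _ in range(T)]
    for ts, vec in docs:
        bins[min(int((ts - t0) / width), T - 1)].append(vec)
    # z^t = mean vector of bin t; empty bins give zero vectors, which is
    # exactly the weakness that the dynamic binning below addresses.
    return np.array([np.mean(b, axis=0) if b else np.zeros(dim) for b in bins])
```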

Once we obtain the time-varying representations of the two target topics \(\langle \mathbf {z}_{n}^{t}\rangle _{t=1}^{T}\) and \(\langle \mathbf {z}_{u}^{t}\rangle _{t=1}^{T}\), we fit two vector autoregressive (VAR) models over these two series:

$$\begin{aligned} \mathbf {z}_{n}^{t} = a_{0} + \sum _{i=1}^{q}a_{i}\mathbf {z}_{n}^{t-i} + \mathbf {r} \end{aligned}$$
(8)
$$\begin{aligned} \mathbf {z}_{n}^{t} = a_{0} + \sum _{i=1}^{q}(a_{i}\mathbf {z}_{n}^{t-i} + b_{i}\mathbf {z}_{u}^{t-i}) + \mathbf {r}_{u} \end{aligned}$$
(9)

where (8) predicts a news topic \(\mathbf {z}_{n}^{t}\) at time stamp t purely from its historical values, i.e., \(\mathbf {z}_{n}^{t-i}\), while (9) uses the historical values of both the news and UGC streams, i.e., \(\mathbf {z}_{n}^{t-i}\) and \(\mathbf {z}_{u}^{t-i}\); q is a predefined maximum lag measuring how long the influence lasts; \(\mathbf {r}_{u}\) and \(\mathbf {r}\) denote the residuals with and without considering the topic \(z_{u}\), respectively.

Then, to test whether (9) yields a statistically significantly better regression than (8), we apply an F-test (other similar tests could also be used). More specifically, we calculate the residual sums of squares RSS and \(RSS_{u}\), based on which we obtain the F-statistic:

$$\begin{aligned} F = \frac{(RSS-RSS_{u})/q}{RSS_{u}/(n-2q-1)}\sim F(q,n-2q-1) \end{aligned}$$
(10)

Given a confidence coefficient \(\alpha \), we say that \(\mathbf {z}_{u}\) Granger-causes \(\mathbf {z}_{n}\) if F is greater than the critical value \(F_{\alpha }\), i.e., \(\zeta = 1\) as defined in Sect. 2; otherwise \(\zeta = 0\).
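The following sketch illustrates this test. For clarity it handles the topic series one latent dimension at a time and sums the residual sums of squares over dimensions before applying the F-test; this is one simple reading of the multivariate test under stated assumptions, not necessarily the exact formulation used in our implementation.

```python
# Sketch of the Granger test of Eqs. (8)-(10): fit the restricted regression
# (news history only) and the unrestricted one (news + UGC history) by least
# squares, then compare residual sums of squares with an F-test. The
# per-dimension handling and RSS summation are simplifying assumptions.
import numpy as np
from scipy.stats import f as f_dist

def lagged_design(series, q, extra=None):
    """Stack q lags of `series` (and optionally `extra`) plus an intercept."""
    rows = []
    for t in range(q, len(series)):
        row = [1.0] + [series[t - i] for i in range(1, q + 1)]
        if extra is not None:
            row += [extra[t - i] for i in range(1, q + 1)]
        rows.append(row)
    return np.array(rows), np.array(series[q:])

def _rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def granger_influences(z_u, z_n, q=5, alpha=0.9):
    """True if UGC topic series z_u Granger-causes news topic series z_n."""
    rss, rss_u = 0.0, 0.0
    n = len(z_n) - q                                    # usable observations
    for d in range(z_n.shape[1]):                       # per latent dimension
        X_r, y = lagged_design(z_n[:, d], q)                    # Eq. (8)
        X_f, _ = lagged_design(z_n[:, d], q, extra=z_u[:, d])   # Eq. (9)
        rss += _rss(X_r, y)
        rss_u += _rss(X_f, y)
    F = ((rss - rss_u) / q) / (rss_u / (n - 2 * q - 1))         # Eq. (10)
    return F > f_dist.ppf(alpha, q, n - 2 * q - 1)
```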

However, both streams, especially the news articles, are often generated nonuniformly. Equal-size binning performs poorly on such streams since it produces many empty intervals without any news or comments, while equal-depth binning often leads to extremely unbalanced time spans. Either empty intervals or unbalanced spans have a detrimental effect on the Granger test, making it fail or become meaningless.

Topic-aware Dynamic Granger Test. To address the uneven distribution problem, we propose a topic-aware dynamic binning strategy that partitions both streams into disjoint intervals. The strategy is topic-aware because different topics follow their own patterns and exhibit various distributions along the timeline, and the Granger causality test actually processes one topic pair at a time rather than the whole streams; hence one partition only needs to deal with the documents of the target topics from news and UGC respectively. The dynamic binning aims to alleviate the uneven distribution problem. Let \(S_{z}\) denote the streaming documents associated with topic z and \(\langle S_{z}^{t}\rangle _{t=1}^{T}\) a partition result; we define the following two types of dispersion:

  • \(dis_{amount}\): the difference between the largest and the smallest bin size, where bin size is defined as the number of contained documents;

  • \(dis_{span}\): the difference between the largest and the smallest time span.

Our objective is to balance these two dispersions, namely,

$$\begin{aligned} \min \limits _{\langle S_{z}^{t}\rangle _{t=1}^{T}} dis_{amount} + dis_{span}\quad s.t.|S_{z}^{t}| > 0, \forall t \end{aligned}$$
(11)

Due to the extremely unbalanced volume of news and comments, we perform the optimization on the news stream and let the comments follow the same partition. The problem can be solved efficiently using dynamic programming (hence the name dynamic) and the best solution is always available [13].
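To make objective (11) concrete, the sketch below enumerates all ways to cut a topic's chronologically sorted documents into T non-empty bins and keeps the partition with the smallest combined dispersion. This brute-force version is only an illustration feasible for small inputs; the efficient dynamic-programming solution follows [13].

```python
# Brute-force illustration of objective (11): enumerate the ways to cut a
# topic's chronologically sorted documents into T non-empty bins and keep the
# partition minimizing dis_amount + dis_span. The paper solves this with an
# efficient dynamic program [13]; this exhaustive version only makes the
# objective concrete and is feasible for small T and document counts.
from itertools import combinations

def dispersion(bins_ts):
    """bins_ts: one list of numeric timestamps per bin (all non-empty)."""
    sizes = [len(b) for b in bins_ts]
    spans = [b[-1] - b[0] for b in bins_ts]
    return (max(sizes) - min(sizes)) + (max(spans) - min(spans))

def best_partition(timestamps, T):
    """Return the interior cut indices minimizing Eq. (11)."""
    n = len(timestamps)
    best_cost, best_cuts = float("inf"), None
    for cuts in combinations(range(1, n), T - 1):       # T-1 interior cuts
        edges = (0,) + cuts + (n,)
        bins_ts = [timestamps[edges[i]:edges[i + 1]] for i in range(T)]
        cost = dispersion(bins_ts)
        if cost < best_cost:
            best_cost, best_cuts = cost, cuts
    return best_cuts
```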

4 Experiments

In this section, we first briefly introduce our datasets, and then present the detailed experimental results on topic extraction, topic influence detection and further analysis.

4.1 Dataset Description

To evaluate the effectiveness of the proposed methods, we prepare the following two kinds of datasets:

Datasets from Hou's paper [12]. These comprise five datasets covering four international events: the Federal Government Shutdown in both Chinese and English (cFGS/eFGS), Jang Sung-taek's (Jang), the Boston Marathon Bombing (Boston) and the India Election (India). They were collected from influential news portals and social media platforms (i.e., Sina, New York Times, Twitter), and the detailed statistics are summarized in Table 1. These datasets are used to evaluate the effectiveness of our topic extraction and influence detection.

Table 1. Datasets: duration, numbers of comments and news articles, max and average number of comments per news

Datasets from China Youth Online (CYOLFootnote 3). CYOL is one of the biggest and leading public opinion analysis websites in China. It publishes a monthly opinion index based on questionnaire surveys of experts and scholars, civil servants, media people, opinion leaders and ordinary Internet users. The index includes five well-defined metrics: information coverage, activeness, response arrival rate, response recognition rate and satisfaction. For each event reported by CYOL in 2014, we crawled the news articles and comments from SinaFootnote 4 if there existed a corresponding special issue. In total, we collected 40 events, and for each event there are on average 140 news articles and 12,849 comments. Due to the space limit, the detailed statistics and data will be published later. We combined these datasets with the published opinion index to evaluate the influence measures automatically calculated by our approach.

4.2 Results for Topic Extraction

In this section, we report the evaluation on text representation and topic extraction, including the experiment setup (settings, baselines and metrics), comparison results and the parameter analysis.

Settings. We use the gensimFootnote 5 implementation of word2vec to learn word vectors with \(M=200\), and K-means to generate the transformation matrix D with \(N\in \{\)128, 256, 512, 1024\(\}\). For mid-level feature learning, we apply LLC with various numbers of nearest neighbors, \(K \in \{1,5,10,50\}\). The parameter \(\lambda \) is set to \(1e^{-4}\) as suggested in [22].

Baselines. We use DeepDoc to denote our proposed text representation method, and compare it with TF-IDF based method (TF-IDF) and state-of-the-art topic models on news and UGC, i.e., Document Comment Topic Model (DCT) [11] and Cross Dependence Temporal Topic Model (CDTTM) [12].

Metrics. As evaluation metrics, we calculate the inner- and inter-cluster distances for all topics. The inner-cluster distance (inner) is defined as the average distance between documents within a topic, and a smaller value indicates a more compact cluster. The inter-cluster distance (inter) is the average distance from one topic to all the other topics, and a larger value indicates a better result. We also calculate their relative ratio (ratio), where a larger value indicates better performance.
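A small sketch of these metrics is given below; the choice of Euclidean distance and the use of topic centers for the inter-cluster distance are assumptions for illustration, since the exact distance instantiation is not spelled out above.

```python
# Sketch of the clustering metrics reported in Table 2, assuming Euclidean
# distance: `inner` averages pairwise distances between documents of the same
# topic, `inter` uses distances between topic centers as a proxy for the
# distance from one topic to the others, and `ratio` = inter / inner.
import numpy as np

def cluster_metrics(X, labels):
    """X: document vectors (n_docs x dim); labels: topic id per document."""
    topics = np.unique(labels)
    inner_vals = []
    for t in topics:
        D = X[labels == t]
        if len(D) > 1:
            dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
            inner_vals.append(dists[np.triu_indices(len(D), k=1)].mean())
    inner = float(np.mean(inner_vals))
    centers = np.array([X[labels == t].mean(axis=0) for t in topics])
    pair = [np.linalg.norm(centers[i] - centers[j])
            for i in range(len(centers)) for j in range(i + 1, len(centers))]
    inter = float(np.mean(pair))
    return inner, inter, inter / inner
```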

Comparison Results. Table 2 presents the comparison results, from which we can see that: (i) macroscopically, our proposed DeepDoc consistently outperforms the three baselines, DCT is more stable than the other baselines, and the TF-IDF representation obtains the worst performance; (ii) under this measurement, CDTTM is not as sensitive to the stream distribution as described in [12], and DeepDoc does not have this problem since we do not include temporal information in clustering.

Table 2. Results for topic extraction: inner and inter stand for the average inner/inter-cluster distances, and they are related to the dimension of document representation; ratio is calculated through dividing the inter by inner to measure the clustering performance, and a larger ratio indicates a better clustering result

Parameter Analysis. We then choose the Boston dataset to investigate the effects of the number of neighbours, the pooling function and the number of matrix columns; the results are presented in Table 3. We draw the following conclusions:

  • Number of neighbours (K). Regardless of the other settings, a smaller number of neighbors generally leads to better clustering results. This is a promising finding, because the fewer neighbors used (i.e., the sparser the codes), the faster the computation runs and the less memory is consumed.

  • Pooling function. Different choices of pooling function lead to very different clustering results. Square-root pooling achieves better performance than average pooling under almost every setting, and the less sparse the codes (larger K and smaller matrix), the more significant the gap between the two pooling functions.

  • Number of matrix columns (N). This is the dimensionality of the transformed space. Intuitively, if the number of dimensions is too small, the mid-level representation loses discriminative power, but if it is too large, words from the same category of documents become less similar. Here we mainly focus on the trade-off between reasonably small and large sizes. As the results show, a larger size leads to better results when \(K>10\), while a smaller matrix is likely to be sufficient under higher levels of sparseness.

Table 3. Clustering results on Boston dataset with various number of neighbors (K), pooling functions and number of matrix columns (N).

4.3 Results for Topic Influence Detection

To evaluate our proposed topic-aware dynamic Granger test method (TDG), we perform three series of experiments, namely, (1) the overall comparison with KL-divergence based method in [12], (2) the comparison of different binning methods, and (3) the effect of the maximum lag.

TDG vs. KL Divergence. Hou et al. evaluated their method on manually labeled data, and it achieved results comparable to the human annotation. To make the comparison fair, we compare the Granger results obtained with \(\alpha =0.9/0.8\) against their top 10 %/20 % links (Hou et al. included links with distance less than the median value). Through manual evaluation, the Granger test achieves 94 % precision while KL achieves only 82 %, indicating that our method significantly outperforms theirs. This comes as no surprise: their KL-divergence based method only finds similar patterns in the other stream (it assumes similar topics share similar patterns along the timeline, which may not hold), whereas our Granger-based method discovers the UGC topics that are most useful for predicting the target news topic and are thus more likely to influence the news.

Dynamic Binning vs. Equal-Size Binning. Table 4 shows the number of detected Granger causal links when different time split methods are applied. We find that: (i) equal-size binning gets the worst performance because the streams (especially the news stream) are distributed nonuniformly, which often leads to zero vectors for bins with no documents; although mean linear interpolation is employed to deal with the zeros, the results are still unsatisfactory. (ii) dynamic binning optimizes (11) over the whole news stream without distinguishing topics; it can handle the unevenly distributed streams to some extent and thus finds more influence links. (iii) since our proposed method tests one pair of topics at a time and different topics may follow different patterns, applying dynamic binning to the whole streams might not perform well on every topic pair; the topic-aware binning therefore further improves the performance.

Table 4. Granger causality links with different time split methods (0.8 and 0.9 are confidence coefficients)

How Long the Influence Lasts. To choose a proper maximum lag q (i.e., how many historical values are included in the regression), we select five topic pairs, conduct the Granger causality test with the maximum lag ranging from 1 to 10, and determine the value that achieves the best F-statistic (divided by \(F_{0.8}\) due to the different time spans). Table 5 shows the results for q from 3 to 7; we observe that \(F/F_{0.8}\) increases initially and stabilizes at \(q=5\). We therefore execute all the Granger tests with q set to 5. Note that \(q=5\) does not mean 5 days, since topic-aware binning is used for the stream split; the average time difference is actually about 3.2 days, which tells us that news and UGC from the previous roughly 3 days have much more influence on the current news report.

Table 5. F-statistic with maximum lag (q): \(F/F_{0.8}\) denotes the average values of the 5 selected topic pairs.
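For concreteness, the lag-selection procedure described above can be sketched as a simple sweep; `granger_f` is a hypothetical helper assumed to return the F value and the critical value \(F_{0.8}\) for a given topic pair and lag.

```python
# Sketch of the lag-selection procedure: sweep the maximum lag q over a few
# sampled topic pairs and keep the lag with the best average normalized
# F-statistic. `granger_f` is a hypothetical helper returning (F, F_crit).
def select_max_lag(topic_pairs, granger_f, lags=range(1, 11)):
    scores = {}
    for q in lags:
        ratios = [f / f_crit
                  for f, f_crit in (granger_f(z_u, z_n, q)
                                    for z_u, z_n in topic_pairs)]
        scores[q] = sum(ratios) / len(ratios)           # average F / F_0.8
    return max(scores, key=scores.get), scores
```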

4.4 Influence Usage Analysis

This experiment explores whether our automatically obtained results are consistent with the objective CYOL public opinion index. Specifically, with the obtained influence links \(\{(z_u,z_n)\}\), we quantify the public influence on news through the news response rate (NRR), promptness (NRP) and effect (NRE) as defined in [12]. Their comparable measures in the CYOL Public Opinion Index are information coverage (IC), response activity (RA) and satisfaction (SA). We compute correlation coefficients for the three pairs of measures, NRR-IC, NRP-RA and NRE-SA, where higher correlations indicate better results. For comparison, we use LDA+KL-divergence and Hou's method (CDTTM+KL) as our baselines. We further try using only the first half of the event data for analysis (Ours\(^\frac{1}{2}\)) to test whether the approach is helpful in predicting future influence. Table 6 shows the comparison results.

Table 6. Influence usage results.

As shown in Table 6, our method achieves higher correlations with the CYOL measures than the other two methods. Furthermore, we notice that when using only the first half of the event data, our method achieves results comparable to those obtained on all the data. This implies that it can be used to predict whether an event will be handled properly at an early stage.

Case Review. We now review the events mentioned in the introduction. The APEC 2014 summit is a good example of social media influencing news media. Besides APEC blue, we identify another topic beyond the scheduled ones, i.e., tourist, which covered the leisure activities of the dignitaries' wives, especially their clothing. The news media initially reported these leisure activities only casually. However, the public was very enthusiastic about the wives' sightseeing and discussed their clothing extensively, so to satisfy people's curiosity, the news presented a systematic introduction of the first ladies' activities and dress. We also compare the news response in the two earthquakes: both reports covered the major topics and both NRRs are quite high (Yunnan 84 % and Sichuan 82 %), but in the Yunnan earthquake the news media responded to the public more promptly, with an NRP much smaller than that in Sichuan (roughly 0.8 days vs. 1.4 days). The final satisfaction scores show that it is very important to properly handle heatedly discussed topics. Our analysis can summarize which topics the news should respond to and at what time, thus benefiting public opinion management.

5 Related Work

Our work is related to three lines of research as follows:

5.1 Distributed Text Representation

Representing words in a continuous vector space has been an appealing pursuit since 1986 [25]. Recently, Mikolov et al. developed an efficient method to learn high-quality word vectors [19], and a host of follow-up achievements have been made on phrase or document representation, such as paragraph-to-vector [20]. Different from these attempts, we borrow the state-of-the-art feature extraction pipeline from computer vision [4] to represent words and documents in a new space where each dimension denotes a more compact semantic than when using word2vec directly.

5.2 Social News Analysis and Topic Evolution

The proliferation of social media encourages researchers to study its relationship with traditional news media; e.g., Zhao et al. employed Twitter-LDA to analyze Twitter and the New York Times and found that Twitter actively helped spread news of important world events although it showed low interest in them [26]. Petrovic et al. examined the relation between Twitter and newswire and concluded that neither stream consistently leads the other on major events [21]. Beyond the common and specific characteristics of news and social media, we pay more attention to the cross-stream interaction.

As for topic evolution, Mei et al. addressed the problem of discovering evolutionary theme patterns from a single text stream [17], and Hu et al. modeled topic variations and identified topic breakpoints in a news stream [13]. Wang et al. aimed at finding bursty topics in coordinated text streams based on their proposed mixture model [24]. Lin et al. formalized the evolution of a topic and its latent diffusion paths in a social community as a joint inference problem, and solved it through a mixture model (for text generation) and a Gaussian Markov Random Field (for user-level social influence) [14]. In this paper, we study the interplay of news and UGC within specific events, trying to analyze the cross-media influence and figure out how the two streams co-evolve over time.

5.3 Agenda Setting and Granger Causality

Agenda setting is the creation of public awareness of and concern about salient issues by the news media. McCombs and Shaw discussed the agenda-setting function of mass media [16] in 1972. Many researchers have studied the interactions between the public agenda and the news agenda; e.g., Meraz employed time series analysis to measure the influence between political blogs and news media [18]. Our work falls into second-level agenda setting (also called attribute agenda setting), and a major advantage of our framework is that the attributes need not be predefined: we extract the latent topics automatically.

The Granger causality test [9] is a statistical hypothesis test for determining whether one time series is useful in forecasting another. It has been utilized in many areas for causality analysis or prediction; e.g., [6] adapted it to model temporal dependence in large-scale time series data, and Chang et al. used it for modeling Twitter user influence. In this paper, we apply agenda setting theory and the multivariate Granger test to automatically analyze how social media influence traditional news.

6 Conclusion

In this paper, we analyze the public influence on news through a Granger-based framework: we first represent words and documents in a distributed low-dimensional space and extract topics from the news and UGC streams, then dynamically split the streams to obtain time-varying topic representations, on which we employ the Granger causality test to detect influence links. Experiments on real-world events demonstrate the effectiveness of the proposed methods, and the results show good prospects for predicting whether an event will be properly handled.

It should be noted that the Granger test attempts to capture an interesting aspect of causality but is certainly not meant to capture all of them; e.g., it has little to say about situations in which there is a hidden common cause behind the two streams. In future work, we will try to address this important but challenging issue.