1 Introduction

The increasing volume of online user activity represents a vital new opportunity for data scientists and analysts to measure the collective behavior of social, economic, and other important evolutions [15, 30, 60, 72, 73].

Given real-time, online user activity sequences, such as the search volume for the keywords “Xbox” and “PlayStation”, how can we find patterns and rules to perform, e.g., sociological, behavioral, and even marketing research? If we know nothing about the sequences, we could (and should) try using Fourier, Wavelets, AR, Kalman filters and the other time series analysis tools. However, we are told that the sequences correspond to online user activity (e.g. the search volume for a keyword) — Can we do better than the existing methods?

This is exactly the idea behind our work. We conjecture that the volume per keyword/activity will behave like a species in an “ecosystem”. It will compete with other species for food and also exhibit seasonal behavior. Here we propose that “food” corresponds to user resources: given a set of users and their resources (e.g., attention, time, money), the d keywords/activities compete for the user resources.

In this paper, we present an intuitive model, namely EcoWeb, which provides a good description of large collections of co-evolving online activities.Footnote 1 In short, the problem we wish to solve is as follows:

Informal Problem 1

Given a large collection of co-evolving sequences \(X = \{\mathbf{x}_{1}, \dots, \mathbf{x}_{d}\}\), which consists of d keywords/activities of duration n, where each record \(x_{i}(t)\) corresponds to a user activity (e.g., queries, time/dollars spent) for keyword i at time tick t, we want to

  • detect competition (e.g., “Xbox” vs. “PlayStation”)

  • find seasonal events (e.g., Christmas, summer vacations)

  • forecast future dynamics

Preview of our results

Figure 1 shows our discoveries related to the video game industry consisting of d = 4 activities, namely, the search volumes for “Xbox” (\(\mathbf{x}_{1}\)), “PS2, PS3” (\(\mathbf{x}_{2}\)), “Wii” (\(\mathbf{x}_{3}\)), and “Android” (\(\mathbf{x}_{4}\)), taken from Google,Footnote 2 and spanning a decade (2004-2014), with weekly measurements. EcoWeb discovered the following important patterns:

  • Long-term fitting: Figure 1a shows the original volume of the four activities/keywords as circles, and our fitted model as solid lines. Notice that our fit is even visually very good, and it detects seasonalities and up- or down-trends: For example, our model fitted the success of “Wii” (which launched in 2006 and apparently drew attention from the competing “Xbox”). Similarly, it fitted the fall in the popularity of “Wii” in 2011, which coincided with the ascent of “Android”, possibly indicating that mobile and social games attracted the attention of Wii gamers.

  • Interspecies interaction: Recently, video games have been facing increasing competition (from online/social games), and our model automatically identifies this latent competition: Figure 1b shows the interaction network that captures the interaction between the four activities/keywords. Edges indicate interaction/competition between two keywords; the thicker the edge, the stronger the interaction. For example, the red edge from “Wii” to “Android” indicates that the latter is drawing attention away from “Wii”. Similarly, “Xbox” has strong connections to “PlayStation” and “Wii” (blue edges), summarizing the fact that the attention for “Xbox” was anti-correlated with “Wii” and “PlayStation”, during 2007-2010.

  • Seasonal activities: Figure 1c succinctly summarizes the seasonality of all four keywords. There is a clear yearly periodicity, with peaks every November (“Black Friday”) and December (Christmas); a small peak in June (coinciding with the Electronic Entertainment Expo (E3), an annual trade show for video games); and sustained, medium-level activity during the summer vacations.

Figure 1 Modeling power of EcoWeb: (a) our model (solid lines) fits the original data (circles) very well, (b) it reveals latent interaction networks, such as “Xbox” vs. “PlayStation” and “Wii” vs. “Android”, and (c) it captures seasonal activities (i.e., they all peak on Black Friday and at Christmas). Moreover, our fitting algorithm is fully automatic, requiring no user intervention.

Contributions

We propose EcoWeb, a succinct, yet powerful model, which is inspired by the competition between biological species, and which captures the evolution of multiple online activities. EcoWeb has the following desirable properties:

  1. Effective: EcoWeb captures long-range dynamics, important patterns and seasonalities that agree with human intuition.

  2. Automatic: EcoWeb-Fit requires no training set, no parameters to tune, and no user intervention.

  3. Scalable: It is carefully designed to be linear in the input size.

  4. Practical: It can provide long-term forecasting, outperforming existing methods (Sections 6 and 7).

Outline

The rest of the paper is organized in the conventional way: Section 2 discusses related work and Section 3 describes some fundamental concepts. In Sections 4 and 5, we describe our proposed model and algorithms. Sections 6 and 7 describe our experimental results and applications. We conclude in Section 8.

2 Related work

The related work falls into the following large subgroups:

Similarity search and forecasting

There is a lot of interest in mining time series and data streams [1, 6, 12, 22, 42–44, 51, 58, 59, 61, 69]. Traditional approaches applied to data mining include auto-regression (AR), linear dynamical systems (LDS), Kalman filters (KF) and their variants [20, 33, 34, 66]. Similarity search, indexing and pattern discovery in time sequences have also attracted huge interest [10, 14, 27, 51, 57, 61–63, 65, 67, 68].

Large-scale sequence mining

Here, TriMine [39] is a scalable method for forecasting multiple (thousands of) co-evolving sequences, while FUNNEL [40] is a non-linear model for spatially co-evolving epidemic tensors. The work in [37] is the first attempt to bridge the theoretical modeling of a biological ecosystem and user activities on the Web, while [36] developed a fully automatic mining algorithm for co-evolving sequences. Rakthanmanon et al. [55] proposed a similarity search algorithm for “trillions of time series” under the DTW distance. Yang et al. [71] developed a new model for mining time-evolving event sequences. As regards parameter-free mining, the work in [5, 8] focused on summarization and clustering based on the MDL principle.

Social media analysis

Analyses of social media and online user behavior have attracted considerable interest [3, 11, 24, 26, 28, 31, 35, 38, 53, 64]. Gruhl et al. [17] explored online “chatter” (e.g., blogging) activity, and measured the actual sales ranks on Amazon.com. Ginsberg et al. [15] examined a large number of search engine queries to track influenza epidemics. They reported that the evolutions of search engine keywords are highly correlated with actual flu virus activity. The work reported in [9, 16, 54] studied keyword volume to predict consumer behavior.

Spikes and propagation

The work in [41] studied the rise and fall patterns in the information diffusion process through online social media. The work in [13] investigated the effect of revisits on content popularity, while [56] focused on the daily number of active users. Prakash et al. [52] described a case where two competing products/ideas spread over the network, and provided a theoretical analysis of the propagation model (winner takes all: WTA) for arbitrary graph topology.

Economic models

Leontief [29] developed the “input-output model”, which represents an economy as d interdependent industries (i.e., sectors). This model represents an economy as a system of equations, with producer-consumer relationships (analogous to prey-predator equations).

Contrast with competitors

Table 1 illustrates the relative advantages of our method. Only our EcoWeb matches all requirements, while,

  • The Lotka-Volterra (LV) model [45], the logistic function (LF) [7], the susceptible-infected (SI) model [2], and other non-linear equations [19, 40, 47, 52] incorporate domain knowledge; however, they are not intended to capture co-evolving user activities and seasonal patterns.

  • Wavelets and Fourier transforms (i.e., DWT, DFT, DCT) focus on a single time sequence, and cannot detect interaction between multiple co-evolving sequences.

  • The traditional AR, ARIMA and related forecasting methods including AWSOM [49], PLiF [34] and TriMine [39] are fundamentally unsuitable for our setting, because they are based on linear equations, while we employ non-linear equations. Moreover, (a) they cannot incorporate domain knowledge, and (b) most of them require parameter tuning.

  • AutoPlait [36], SWAB [23] and pHMM [70] have the ability to capture the dynamics of sequences and perform segmentation; however, they cannot model the long-range evolution of multiple time series.

Table 1 Capabilities of approaches. Only our approach meets all specifications

In short, none of the existing methods focuses specifically on the automatic mining of non-linear dynamics in co-evolving online activities.

3 Background - Ecological considerations

Let us consider a biological ecosystem by analogy with a jungle where herbivores feed on plants, carnivores feed on other animals, and so on. How many spider monkeys should we expect to have in the next time tick, given the current count of spider monkeys, bananas, squirrel monkeys, etc? This is exactly the focus of population ecology, namely, to develop mathematical models to predict the evolution of the population of each species [47, 48].

Competition between species

There are two major mechanisms that the equations try to model: (a) un-restricted growth, i.e., with infinite resources, every squirrel monkey generates r offspring in each time tick, and (b) competition, i.e., with finite resources, the “carrying capacity” K of the environment is the maximum number of squirrel monkeys it can support.

There is competition between the members of the same species (two squirrel monkeys competing for fruit), as well as between different species (e.g., squirrel monkey vs. spider monkey, all competing for fruit). This competition is what keeps the population size of a species from exploding exponentially: If an ecosystem has too many squirrel monkeys and too few fruits, competition for those resources increases, throttling the growth of the squirrel monkey population.

One of the simplest models that captures the above phenomena is the Lotka-Volterra population model of competition [46]. It describes the interaction of d species with the following non-linear differential equations:

$$\begin{array}{@{}rcl@{}} \frac{dP_{i}}{dt} = r_{i} P_{i} \left( 1- \frac{ {\sum}_{j=1}^{d} a_{ij} P_{j}}{K_{i}} \right), \quad(i=1,2,\dots, d) \end{array} $$
(1)

where,

  • \(P_{i}\): Population size of species i.

  • \(r_{i}\): Intrinsic growth rate of species i, i.e., the rate of reproduction in the absence of density regulation (\(r_{i} \ge 0\)).

  • \(K_{i}\): Carrying capacity of species i when the other species are absent (\(K_{i} \ge 0\)).

  • \(a_{ii}\): Intraspecies competition, i.e., competition for resources between members of the same population (\(a_{ii} = 1\)).

  • \(a_{ij}\): Interspecies competition, i.e., competition between two different species (\(a_{ij} \ge 0\)).

Here, time t is considered continuous and \(dP_{i}/dt\) is the derivative. For each species i, the number of offspring per parent increases linearly with the size of the current population \(P_{i}\), and it corresponds to the intrinsic growth rate \(r_{i}\).

In the Lotka-Volterra equation, it is assumed that multiple (i.e., d) species are competing for some common resources. For example, Figure 2a shows the interaction between wild animals in the jungle.Footnote 3 Assume that these species share some of the resources (e.g., fruits). The number of individuals using the resources of species i can be described as: \( a_{i1}P_{1} + \cdots + a_{ij}P_{j} + \cdots + P_{i} + \cdots + a_{id}P_{d} = \sum_{j=1}^{d} a_{ij} P_{j}. \) Here, \(a_{ij}\) (\(i \neq j\)) is called “interspecies competition”, which measures the effect an individual of species j has on an individual of species i.Footnote 4

Figure 2 Illustration of jungle vs. Web: (a) ecosystem in the jungle (e.g., Amazon rainforest): squirrel monkeys partially share their food with spider monkeys and macaws, while capybaras are isolated (i.e., they have no competitors here); (b) ecosystem on the Web (e.g., game industry): the main targets of Xbox, PlayStation and Wii are kids and teenagers, while most adults are interested in Android games rather than Xbox and PlayStation.

4 Proposed model

In this section, we present our proposed model, namely, EcoWeb. Consider that we have a collection of activity volumes X of d keywords, with duration n. That is, we have \(X = \{\mathbf{x}_{1}, \dots, \mathbf{x}_{i}, \dots, \mathbf{x}_{d}\}\), where \(\mathbf{x}_{i}\) is the sequence of keyword i (i.e., \(\mathbf{x}_{i}=\{ x_{i}(t)\}_{t=1}^{n}\)). Given a set of co-evolving time series X, our goal is to (a) capture the evolutions of X, (b) find the hidden relationships between the sequences, and (c) forecast future dynamics.

So, how can we describe the evolutions of multiple keywords, and spot interactions between two different keywords? What exactly is the relationship, say, between Xbox and Wii, or Facebook and LinkedIn? Are there any differences or similarities? Do they compete with each other, like wild animals?

Ecosystem on the Web - intuition behind our model

So, what is an ecosystem on the Web? Can we find similar phenomena in virtual communities? If so, what kind of species live on the Web? How does the population size of each species evolve over time? —Our answers are that: (a) there are an infinite number of “virtual species” living on the Web (as in a “jungle”), and (b) they evolve naturally over time by interacting with other species.

Figure 2b shows an ecosystem on the Web. Similar to the biological community, which consists of multiple species (e.g., monkeys and macaws, as shown in Figure 2a), there is a community of virtual species on the Web (e.g., Xbox and PlayStation, as shown in Figure 2b).

Here, we provide two important analogies with respect to the ecosystem on the Web.

  • Keyword/activity (i.e., species): No keyword can survive on the Web if no one is paying attention to that topic. It behaves like a living organism. The relationship between keywords and users (e.g., between Wii and kids) is similar to the relationship between species and food resources (e.g., between squirrel monkeys and fruits or between capybaras and grass). No species can survive without resources.

  • User resources (i.e., food resources): Similar to an ecological system, there are a finite number of users and their resources on the Web. The user resources could be anything, such as user interest/attention, or an amount of the time and money they spend. Users cannot use their time/money for multiple purposes simultaneously.Footnote 5 As shown in Figure 2b, there are some groups of users, such as kids, teenagers and adults. For example, kids love video games, e.g., Xbox, PlayStation and Wii, while most adults prefer Android.

Although important, the above analogies are not immediately applicable to our setting. We need a few more concepts. Specifically, we want to describe the following three properties:

  • (G1): Non-linear evolution of keywords/activities

  • (G2): Interaction coefficients between keywords

  • (G3): Seasonality of user activities

In a real ecosystem, the population of each species varies continuously over time. It depends on the reproduction rate per generation and the number of offspring produced in a lifetime by each individual. The same thing happens on the Web: the popularity size of each keyword evolves over time. The popularity size corresponds to the aggregated volume of each user interest/attention. If a new product (say, Android) is attractive, the users would spend more time on it, or recommend it to their friends. Similarly, their friends would influence other users, and eventually, this would lead to an exponential growth in popularity size. To handle (G1), we propose using a non-linear difference equation.

For (G2), we assume that there are latent interactions between two different keywords. For example, in Figure 1a, the sequences of Xbox (i.e., \(\mathbf{x}_{1}\)) and PlayStation (i.e., \(\mathbf{x}_{2}\)) behave in opposite ways: When the volume of PlayStation increases, the volume of Xbox decreases considerably (please see Figure 1a from 2007 to 2010). That is, there must be competition/interaction between these two keywords.

We should also note that online activities have certain annual patterns, i.e., seasonality (G3). For example, in Figure 1a, all the sequences have a huge spike at Christmas. This is because the users modulate their activities based on a yearly cycle. Similar behavior is observed with wild animals in that their activities may depend on climate and season.

Table 2 describes our basic analogy, namely, the jungle ecosystem applied to the Web. We conjecture that users of the Web behave in the same way as wild animals in the jungle in that they interact and compete with each other for resources.

Table 2 Analogy: jungle vs. Web

Next, we introduce our model in steps of increasing complexity.

4.1 EcoWeb-individual (G1)

We begin with the simplest case, where we have a single sequence/keyword, i.e., there is no interspecies interaction/competition.

Let K be the quantity of available user resources that might be used (i.e., paid attention) as regards this keyword, and p represent the quantity of user resources that have already been used as regards this keyword at time tick t = 0 (i.e., initial condition).

In our model, we assume that the keyword/activity follows some very simple local rules:

  • It maintains its current popularity size (i.e., user attention) unless there is intra/interspecies competition.

  • For each time tick t, it obtains new user resources, and the popularity size increases by a constant percentage r.

Let P(t) be the popularity size of the keyword at time tick t. The evolution of a single keyword is described by the following difference equation:

$$\begin{array}{@{}rcl@{}} P(t+1) = P(t) \left[ 1 + r \left( 1- \frac{P(t)}{K} \right)\right], \end{array} $$
(2)

with the initial condition P(0) = p, where,

  • P(t): Popularity size of the keyword at time tick t, i.e., the aggregated volume of user attention to the keyword.Footnote 6

  • p: Initial condition, i.e., popularity size at time tick t = 0.

  • r: growth rate, i.e., the attractiveness/strength (i.e., impact) of the keyword.

  • K: Carrying capacity, i.e., maximum popularity size of the keyword (= available user resources).

Note that the term: \(\left [ 1 + r \left (1- \frac {P(t)}{K} \right )\right ]\) corresponds to the contribution of the current popularity to the next popularity growth, where \(\left (1- \frac {P(t)}{K}\right )\) is the percentage of available user resources for the keyword at time tick t. If the keyword runs out of user resources (i.e., P(t) = K), the expanding popularity will hit a constraint. Also note that (2) is a discrete version of the Lotka-Volterra differential equation, (1), when it has a single species (d = 1). Table 3 lists the major symbols and their definitions.
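As an illustration only (not the authors' implementation), the recurrence in (2) can be iterated directly. The Python sketch below uses hypothetical values for p, r and K; the function name `simulate_individual` is ours. It shows the popularity size growing almost exponentially at first and then saturating at the carrying capacity K.

```python
def simulate_individual(p, r, K, n):
    """Iterate P(t+1) = P(t) * [1 + r * (1 - P(t)/K)] for n time ticks, with P(0) = p."""
    popularity = [p]
    for _ in range(n):
        current = popularity[-1]
        # (1 - current / K) is the fraction of user resources still available.
        popularity.append(current * (1.0 + r * (1.0 - current / K)))
    return popularity


# Hypothetical parameters, chosen only for illustration.
trajectory = simulate_individual(p=1.0, r=0.1, K=100.0, n=200)
print(trajectory[0], trajectory[50], trajectory[-1])
```

With a small growth rate such as the one assumed here, the trajectory approaches K monotonically; larger values of r may overshoot K and oscillate.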

Table 3 Symbols and definitions

4.2 EcoWeb-interaction (G2)

We now move on to the next step, namely, spotting an interaction between co-evolving keywords (G2). In general, some keywords are competing for some common user resources. Obviously, there is some kind of competition between video game consoles, such as Xbox and PlayStation. Most users choose one of the consoles based on their preferences (e.g., price and available game titles).

Model 1 (EcoWeb-interaction)

Let \(P_{i}(t)\) be the popularity size of keyword i at time tick t. Our interaction model is governed by the following equations,

$$\begin{array}{@{}rcl@{}} P_{i}(t+1) = P_{i}(t) \left[ 1+ r_{i} \left( 1- \frac{ {\sum}_{j=1}^{d}a_{ij}P_{j}(t)}{K_{i}} \right) \right], \\ (i=1,\cdots, d), \end{array} $$
(3)

where \(r_{i} > 0\), \(K_{i} > 0\), \(a_{ii} = 1\), \(a_{ij} \ge 0\), and \(P_{i}(0) = p_{i}\).

In Model 1, it is assumed that competing keywords share some of the same user resources. At time tick t, the percentage of potential (i.e., available) user resources for keyword i Footnote 7 can be described as,

$$\begin{array}{@{}rcl@{}} \left( 1 - \frac{ {\sum}_{j=1}^{d} a_{ij} P_{j}(t)}{K_{i}} \right), \end{array} $$
(4)

where \(a_{ij}\) is the interaction coefficient, which describes the effect rate of keyword j on keyword i.

Please note that if there is no interspecies interaction/competition (that is, \(a_{ij} = 0\) for \(i \neq j\)), this model is identical to (2) (i.e., “neutralism”). In contrast, if \(a_{ij} = a_{ji} = 1\) for keywords i, j, this means that the two keywords compete with each other by sharing exactly the same user resource group. If \(a_{ij} = 1, a_{ji} = 0\), the model describes an asymmetric competitive interaction, which is known as “amensalism”. In this case, keyword i is strongly affected by keyword j, while keyword j is almost unaffected by keyword i.

Example 1

Figure 1b shows the interaction between d = 4 keywords, where we have an interaction matrix:

$$\begin{array}{@{}rcl@{}} \mathbf{A}=\left[ \begin{array}{cccc} 1 &0.5&0.1& 0 \\ 0 & 1 & 0 &0.1\\ 0 & 0 & 1 &0.3\\ 0 & 0 & 0 & 1 \\ \end{array} \right]. \end{array} $$

Here, Xbox \(\mathbf{x}_{1}\) is affected by PlayStation \(\mathbf{x}_{2}\) (i.e., \(a_{12} = 0.5\)) and Wii \(\mathbf{x}_{3}\) (i.e., \(a_{13} = 0.1\)), while PlayStation and Wii are affected by Android \(\mathbf{x}_{4}\) (i.e., \(a_{24} = 0.1\), \(a_{34} = 0.3\)). Xbox and Android do not interact directly with each other (i.e., \(a_{14} = a_{41} = 0\)).
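To see how the interaction matrix of Example 1 shapes the dynamics of (3), the following Python sketch iterates Model 1 for d = 4 keywords. The matrix A is taken from Example 1; the initial sizes, growth rates and carrying capacities are hypothetical values chosen for illustration, and the function names are ours rather than part of EcoWeb.

```python
def step_interaction(P, r, K, A):
    """One time tick of (3): P_i(t+1) = P_i(t) * [1 + r_i * (1 - sum_j a_ij P_j(t) / K_i)]."""
    d = len(P)
    next_P = []
    for i in range(d):
        # Resources of keyword i that are already consumed by all d keywords.
        used = sum(A[i][j] * P[j] for j in range(d))
        next_P.append(P[i] * (1.0 + r[i] * (1.0 - used / K[i])))
    return next_P


def simulate_interaction(p, r, K, A, n):
    """Return the list of states P(0), ..., P(n), each a list of d popularity sizes."""
    states = [list(p)]
    for _ in range(n):
        states.append(step_interaction(states[-1], r, K, A))
    return states


# Interaction matrix from Example 1 (rows/columns: Xbox, PlayStation, Wii, Android).
A = [
    [1.0, 0.5, 0.1, 0.0],
    [0.0, 1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.3],
    [0.0, 0.0, 0.0, 1.0],
]
# Hypothetical individual parameters (not fitted values).
p = [1.0, 1.0, 1.0, 1.0]
r = [0.1, 0.1, 0.1, 0.1]
K = [100.0, 100.0, 100.0, 100.0]

final_state = simulate_interaction(p, r, K, A, n=500)[-1]
print([round(v, 2) for v in final_state])
```

Under these assumed parameters each keyword settles where its consumed resources equal its carrying capacity, so a keyword with strong competitors (here Xbox) ends up well below K, while an unaffected one (Android) reaches K.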

4.3 With seasonality (G3)

Thus far, we have discussed how to describe the long-range dynamics of d co-evolving sequences. Although important, it is not sufficient to capture the real keyword evolutions. Each keyword (e.g., Xbox and Amazon) always has a certain number of users (i.e., popularity), however, the users change their behavior dynamically, according to various seasonal events (e.g., Amazon.com has many visitors on Black Friday). We can observe similar behavior in an ecological system, where activities depend on season and climate: for example, most monkeys are active during warm and sunny days, while they sleep at night. Most importantly, these activities are often correlated with other related species/keywords, e.g., the sales of most retailers including Amazon peak on Black Friday. That is, there must be some groups of “hidden” seasonal activities, (e.g., seasonal retail sales).

So how can we reflect this phenomenon in our equation? We want a powerful yet simple model that can capture seasonal patterns (G3) in real co-evolving sequences, as well as long-range non-linear evolutions. We provide an answer below.

Model 2 (EcoWeb-full)

Let \(C_{i}(t)\) be the estimated volume of keyword i at time tick t. Our full model captures seasonal user activities with the following equations:

$$\begin{array}{@{}rcl@{}} C_{i}(t) = P_{i}(t) \left[ 1+e_{i}(t ) \right] \quad (i=1,\cdots, d), \end{array} $$
(5)

where \(e_{i}(t)\) describes seasonal activities of keyword i over time.

The estimated volume \(C_{i}(t)\) describes how many times keyword i appears at time tick t, and depends on the latent popularity size \(P_{i}(t)\) and seasonal activities \(\mathbf {E}=\{e_{i}(t)\}_{i,t=1}^{d,n}\). Each element in E describes the relative value of the potential popularity size versus the actual keyword volume, and it corresponds to seasonal events, holidays, etc. If there is no seasonal pattern in keyword i at time t (i.e., \(e_{i}(t) = 0\)), the keyword volume is equal to the popularity size (i.e., \(C_{i}(t) = P_{i}(t)\)).

Compact representation of seasonality

With respect to seasonal activities E, we need (d × n) parameters to describe the entire dataset X, and this is not feasible in our case. We want to avoid redundancy, and so it should be compressed into a small set of parameters. We are interested in capturing (a) yearly periodic patterns (e.g., Black Friday) as well as (b) hidden groups of seasonal activities (e.g., retail sales). So how can we deal with this issue? We propose decomposing E, to achieve much better modeling. Specifically, we decompose E into two matrices, namely, seasonality matrix B of size \((k \times n_{p})\) and participation matrix W of size (d × k). Here, B represents a set of k seasonal components of period \(n_{p}\), while W describes the participation weight of each sequence for each seasonal component. Consequently, the seasonal activities \(\mathbf {E}=\{e_{i}(t)\}_{i,t=1}^{d,n}\) can be described as the following function:

$$\begin{array}{@{}rcl@{}} e_{i}(t) \simeq f(i,t|\mathbf{W}, \mathbf{B}) = \sum\limits_{j=1}^{k}w_{ij} b_{j}(\tau), \quad \tau = [t \bmod n_{p}], \end{array} $$
(6)

where,

  • \(n_{p}\): Period (say, 52 weeks in one year).

  • k: Number of latent seasonal components.

  • \(\mathbf {W}=\{w_{ij}\}_{i,j=1}^{d,k}\): Participation matrix, i.e., participation weight of keyword i for the j-th seasonal component.

  • \(\mathbf {B}=\{b_{j}(\tau )\}_{j,\tau =1}^{k,n_p}\): Seasonality matrix, i.e., temporal activity at time tick τ for the j-th seasonal component.

Note that the number of components k should be estimated automatically, and we will describe this in the next section.
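A small Python sketch may help to make (5) and (6) concrete. It is only an illustration under our own assumptions: the matrices W and B below are made-up toy values with k = 2 and a period of 4, and the code uses zero-based time ticks so that τ = t mod n_p indexes the columns of B directly.

```python
def seasonal_activity(W, B, i, t):
    """Approximate e_i(t) as sum_j w_ij * b_j(tau), with tau = t mod n_p (zero-based ticks)."""
    n_p = len(B[0])   # period: number of columns of the seasonality matrix
    tau = t % n_p
    return sum(W[i][j] * B[j][tau] for j in range(len(B)))


def estimated_volume(popularity, seasonal):
    """Equation (5): C_i(t) = P_i(t) * [1 + e_i(t)]."""
    return popularity * (1.0 + seasonal)


# Toy seasonality matrix B (k = 2 components, period n_p = 4) and
# participation matrix W (d = 2 keywords); all values are hypothetical.
B = [
    [0.0, 0.0, 0.0, 1.0],   # component 1: spike at the end of each period
    [0.2, 0.0, 0.0, 0.0],   # component 2: small bump at the start of each period
]
W = [
    [0.5, 1.0],             # keyword 0 participates in both components
    [0.0, 2.0],             # keyword 1 participates only in component 2
]

e = seasonal_activity(W, B, i=0, t=7)      # tau = 3, the same phase as t = 3
print(e, estimated_volume(10.0, e))
```

The point of the decomposition is visible in the sizes: W and B together hold d·k + k·n_p numbers, instead of the d·n entries of E.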

EcoWeb: full model parameter set

Figure 3 shows our modeling framework. Given a set of d co-evolving sequences X, our goal is to find important patterns with respect to three aspects: (G1) individual properties, i.e., initial popularity size: \(\mathbf {p}=\{p_{i}\}_{i=1}^{d}\), growth rate: \(\mathbf {r}=\{r_{i}\}_{i=1}^{d}\), carrying capacity: \(\mathbf {K}=\{K_{i}\}_{i=1}^{d}\); (G2) interaction matrix: \(\mathbf {A}=\{a_{ij}\}_{i,j=1}^{d,d}\); (G3) a set of k seasonal activities, which consists of participation matrix W and seasonality matrix B.

Figure 3 Illustration of EcoWeb structure. Given a set of d sequences X of length n, we extract (G1) individual properties, i.e., initial popularity size: p, growth rate: r, carrying capacity: K, (G2) interaction matrix: A, as well as (G3) a set of k seasonal components, i.e., participation matrix: W and seasonality matrix: B.

Definition 1 (Complete set of EcoWeb)

Let S be a complete set of parameters (namely, S = {p, r, K, A, W, B}) that describe the individual/interactive/seasonal patterns of X.

5 Optimization algorithm

In the previous section, we have seen how we can describe the evolutions of multiple sequences with respect to three properties that we observed with real time series data. Now, we want to figure out how to estimate an optimal parameter set. Specifically, we need to answer the following two questions: (1) How can we find an optimal set of seasonal components, (i.e., W, B)? (2) How can we efficiently and effectively estimate full parameter set S that best captures the important patterns in X? Each question is dealt with in the following subsections.

5.1 Automatic seasonal component analysis

Let us begin with the first question, namely, how to find an appropriate set of seasonal components W and B. Here, we divide the question into two parts:

  • Seasonal component detection: Find good seasonal matrices W and B, when given a fixed number of components k.

  • Automatic component analysis: Search for the best number of components among all possible k values (\(k=1,2,\dots \)).

Seasonal component detection

Assume that we are given X, and also a set of base model parameters for our model, i.e., {p, r, K, A}. According to Models 1 and 2, each element in E can be simply computed by:

$$\begin{array}{@{}rcl@{}} e_{i}(t) = \frac{x_{i}(t)-P_{i}(t)}{P_{i}(t)} \quad (i=1,\dots,d;\ t=1,\dots,n). \end{array} $$
(7)

After computing E of size (d × n), our next step is to decompose it into an optimal set consisting of W and B.

The most straightforward solution would be to assume that there is a set of k = d different temporal activities of length n for all d sequences. However, this solution requires (d × n) parameters to capture the entire sequence set X. Also, it gives a very poor representation, and cannot capture seasonal dynamics among multiple keywords.

We thus propose an efficient and effective algorithm that can find an optimal set of k distinct seasonal patterns among all sequences X. Figure 4 illustrates our approach. Given a set of seasonal activities E of size (d × n), our algorithm splits each sequence into non-overlapping subsequences of length \(n_{p}\), and constructs a matrix \(\hat {\textbf {E}}\) of size \(([d \times \lceil n/n_{p} \rceil] \times n_{p})\). It then finds a set of k components from \(\hat {\textbf {E}}\) and creates a seasonality matrix B of size \((k \times n_{p})\). After finding B, it estimates a participation matrix W of size (d × k) so that we can reconstruct the original matrix E as described in (6) (i.e., \(\mathbf{E} \simeq f(\mathbf{W}, \mathbf{B})\)).
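The first two steps of this procedure, computing E via (7) and folding it into \(\hat {\textbf {E}}\), can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' code: the function names are ours, and padding an incomplete final window with zeros is our assumption, since the text does not say how a trailing partial period is handled. The component extraction itself (ICA) is not shown.

```python
def seasonal_residuals(X, P):
    """Equation (7): e_i(t) = (x_i(t) - P_i(t)) / P_i(t) for every keyword i and tick t."""
    return [
        [(x - p) / p for x, p in zip(x_row, p_row)]
        for x_row, p_row in zip(X, P)
    ]


def fold_into_windows(E, n_p, pad=0.0):
    """Split each row of E into non-overlapping windows of length n_p.

    The result has d * ceil(n / n_p) rows and n_p columns. A short final
    window is padded with `pad` (an assumption made for this sketch).
    """
    windows = []
    for row in E:
        for start in range(0, len(row), n_p):
            window = row[start:start + n_p]
            window = window + [pad] * (n_p - len(window))
            windows.append(window)
    return windows


# Toy data: d = 2 keywords, n = 5 time ticks, period n_p = 2 (hypothetical values).
X = [[2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 1.0, 1.0, 1.0, 1.0]]
P = [[2.0, 2.0, 4.0, 4.0, 4.0], [1.0, 1.0, 1.0, 1.0, 1.0]]

E = seasonal_residuals(X, P)
E_hat = fold_into_windows(E, n_p=2)
print(len(E_hat), len(E_hat[0]))
```

Each row of the folded matrix is one period of one keyword, which is what allows a component analysis to look for seasonal shapes shared across keywords and across years.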

Figure 4 Illustration of seasonal component analysis (for \(n_{p} = 2\)). Given a set of seasonal activities E of size (d × n), it creates a matrix \(\hat {\textbf {E}}\) of \(d \times \lceil n/n_{p} \rceil\) disjoint windows. It then finds their k major components, i.e., B of size \((k \times n_{p})\).

There is an important issue here: what is the best way of finding typical seasonal components B in \(\hat {\mathbf {E}}\)? The first idea would be to perform principal component analysis (PCA) [21] as employed in [25, 50]. However, PCA has pitfalls: it uses an orthogonal transformation. Given an input matrix \(\hat {\mathbf {E}}\), it tries to find the best component that goes through \(\hat {\mathbf {E}}\); and then the second best component (orthogonal to the first), and so on, until it obtains k components. That is, it cannot capture “real” activities. We thus propose employing independent component analysis (ICA) [18], which is also known as blind source separation (BSS). Unlike PCA, it finds a set of k components that are both statistically independent and non-Gaussian. That is, it seeks components that are the most independent from each other.

Automatic component analysis

As regards seasonal component analysis, we need to determine the number of components, k. We thus provide an intuitive coding scheme, which enables our algorithm to find appropriate sizes for W and B, automatically. Our coding scheme is based on the minimum description length (MDL) principle. In short, it follows the assumption that the more we can compress the data, the more we can learn about its underlying patterns.

The description complexity of model parameter set S consists of the following terms: The number of dimensions d and time ticks n require \(\log ^{*}(d)+ \log ^{*}(n)\) bits.Footnote 8 The initial popularity size, growth rate, and carrying capacity, i.e., {p, r, K}, and the interaction matrix A require d × 3 and (d × d − d) parameters, respectively, i.e., \(Cost_{M}(\mathbf{p}, \mathbf{r}, \mathbf{K}) + Cost_{M}(\mathbf{A}) = c_{F} \cdot d(3 + d - 1)\), where \(c_{F}\) is the floating point cost.Footnote 9 Similarly, the model description cost of k seasonal components is \( Cost_{M}(k,\mathbf {W},\mathbf {B}) = \log ^{*}(k) + \log ^{*}(n_p) + c_{F} (dk + k n_p) \).

Once we have decided the full parameter set S, we can encode the original data X using Huffman coding [4], i.e., a number of bits is assigned to each value in X, which is the logarithm of the inverse of the probability (i.e., the negative log-likelihood) of the value. The encoding cost of X given k is computed by:

$$ Cost_{C}(X|S)= {\sum}_{i,t=1}^{d, n} \log_{2} p^{-1}_{Gauss(\mu,\sigma^{2})} (x_{i}(t) - C_{i}(t)), $$
(8)

where \(x_{i}(t)\) and \(C_{i}(t)\) are the original and estimated volumes of keyword i at time tick t (i.e., Model 2). Also, μ and \(\sigma^{2}\) are the mean and variance of the distance between the original and estimated values.Footnote 10

The total code length for X with respect to a given parameter set S can be described as follows:

$$\begin{array}{@{}rcl@{}} Cost_{T}(X;S)&=& \log^{*}(d)+\log^{*}(n) + Cost_{M}(\mathbf{p},\mathbf{r},\mathbf{K}) \\ &&+ Cost_{M}(\mathbf{A}) + Cost_{M}(k, \mathbf{W}, \mathbf{B}) +Cost_{C}(X|S) \end{array} $$
(9)

Consequently, our algorithm automatically determines the optimal number of seasonal components \(k_{opt}\) according to the above function, i.e., \(k_{opt}=\arg \min \limits _{k} Cost_{T}(X; S)\).
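The cost terms above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the floating point cost `C_F = 32` bits, the normalizing constant used in `log_star`, and all function names are assumptions introduced here, and the residuals passed to `choose_k` are assumed to come from fitting the model once per candidate k.

```python
import math

C_F = 32  # assumed floating point cost in bits (hypothetical value)

def log_star(n: int) -> float:
    """Universal code length, in bits, for a positive integer n."""
    if n < 1:
        raise ValueError("log_star is defined for integers >= 1")
    bits = math.log2(2.865064)  # normalizing constant of the universal code
    term = math.log2(n)
    while term > 0:             # add log2(n) + log2(log2(n)) + ... while positive
        bits += term
        term = math.log2(term)
    return bits

def model_cost(d: int, n: int, k: int, n_p: int, c_f: float = C_F) -> float:
    """Description cost of {p, r, K}, A and k seasonal components (W, B)."""
    base = log_star(d) + log_star(n) + c_f * d * (3 + d - 1)
    seasonal = log_star(k) + log_star(n_p) + c_f * (d * k + k * n_p)
    return base + seasonal

def coding_cost(residuals) -> float:
    """Negative log2-likelihood of the residuals x_i(t) - C_i(t) under a Gaussian."""
    m = len(residuals)
    mu = sum(residuals) / m
    var = max(sum((r - mu) ** 2 for r in residuals) / m, 1e-12)  # avoid log(0)
    const = 0.5 * math.log2(2 * math.pi * var)
    return sum(const + (r - mu) ** 2 / (2 * var * math.log(2)) for r in residuals)

def total_cost(d, n, k, n_p, residuals) -> float:
    return model_cost(d, n, k, n_p) + coding_cost(residuals)

def choose_k(d, n, n_p, residuals_by_k) -> int:
    """Return the k whose fit gives the smallest total code length."""
    return min(residuals_by_k,
               key=lambda k: total_cost(d, n, k, n_p, residuals_by_k[k]))

# Toy usage: k = 2 fits much better than k = 1, and k = 3 adds nothing further.
d, n, n_p = 3, 520, 52
coarse = [0.2 if i % 2 else -0.2 for i in range(d * n)]
fine = [0.01 if i % 2 else -0.01 for i in range(d * n)]
best_k = choose_k(d, n, n_p, {1: coarse, 2: fine, 3: fine})
print(best_k)
```

In the toy run, the third component is rejected because it increases the model cost without reducing the coding cost, which is the trade-off the MDL criterion is meant to capture.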

5.2 Multi-step fitting algorithm

We have described how to find seasonal activities {W, B} in X, when a set of base parameters {p, r, K, A} were given. Next, we tackle the most important and challenging question, namely, how to efficiently and effectively estimate a full parameter set S. We would like to estimate (G1) individual parameters {p, r, K}, (G2) interaction matrix A and (G3) seasonal activities {W, B}, simultaneously.

So how do we go about finding the optimal solution S? The most straightforward approach would be simply to estimate all the parameters in S simultaneously. This approach requires us to estimate \(3d + (d^{2} - d) + k(d + n_{p})\) parameters for each iteration. It also requires us to compare all possible solutions for a different number k (1 ≤ k ≤ d). This method is both extremely expensive and ineffective in that it is difficult to optimize all the parameters directly.

We thus propose an efficient algorithm, StepFit, which divides a parameter set S into two subsets, {p, r, K, A} and {W, B}, and estimates the parameters alternately (see Algorithm 1). The first step assumes that there is no seasonality, i.e., k = 0, and estimates the base parameters. In the next step, the base parameters are fixed, and B and W are computed using automatic seasonal component analysis as described in Section 5.1. Here, we use the Levenberg-Marquardt (LM) [32] algorithm to minimize the cost function (i.e., Eq. (9)). The algorithm continues to estimate the parameters until convergence.
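The alternating structure of StepFit can be illustrated on a deliberately simplified toy problem. In the sketch below, the non-linear base model is replaced by a single constant level, the seasonal components by one multiplicative profile of period `n_p`, and the LM/MDL steps by closed-form least-squares updates; all of these substitutions and all function names are assumptions made for illustration only.

```python
def squared_error(x, level, season, n_p):
    """Fitting error of the toy model x(t) ~ level * (1 + season[t mod n_p])."""
    return sum((xt - level * (1 + season[t % n_p])) ** 2 for t, xt in enumerate(x))

def fit_level(x, season, n_p):
    """Step 1: with the seasonal profile fixed, least-squares estimate of the level."""
    g = [1 + season[t % n_p] for t in range(len(x))]
    return sum(xt * gt for xt, gt in zip(x, g)) / sum(gt * gt for gt in g)

def fit_season(x, level, n_p):
    """Step 2: with the level fixed, least-squares seasonal profile per phase."""
    season = []
    for j in range(n_p):
        phase = x[j::n_p]                      # all ticks with t mod n_p == j
        season.append(sum(phase) / len(phase) / level - 1)
    return season

def step_fit_toy(x, n_p, max_iter=50, tol=1e-12):
    """Alternate the two steps until the error stops decreasing."""
    season = [0.0] * n_p                       # start with no seasonality (k = 0)
    level = fit_level(x, season, n_p)
    prev = cur = squared_error(x, level, season, n_p)
    for _ in range(max_iter):
        season = fit_season(x, level, n_p)     # base parameters fixed
        level = fit_level(x, season, n_p)      # seasonal parameters fixed
        cur = squared_error(x, level, season, n_p)
        if prev - cur < tol:
            break
        prev = cur
    return level, season, cur

# Toy usage: positive data with a spike every fourth tick.
x = [2.0, 1.0, 1.0, 1.0] * 10
level, season, err = step_fit_toy(x, n_p=4)
print(level, season, err)
```

Because each step is an exact least-squares update for its own block of parameters, the error never increases from one iteration to the next, which is what makes the simple convergence test sufficient in this toy setting.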

However, StepFit still needs to update the parameters of interaction matrix A of size (d × d), as well as all d individual parameters i.e., {p, r, K} for every iteration. In other words, StepFit tries to find the best solution S among all possible combinations of d keywords. One subtle but important issue is that, compared with the linear model, it is difficult to find the optimal parameter set for non-linear equations. So, how can we efficiently and effectively estimate all the parameters S? We want to find the optimal solution in terms of both the individual and interactive parameters.
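To make the role of the interaction matrix A concrete, the following sketch simulates a discrete-time competitive Lotka-Volterra system. The update rule used here is a standard textbook form and is an assumption: the exact equations of the paper's model are defined in an earlier section and are not reproduced here. The function name and the toy parameter values are hypothetical.

```python
def simulate_competition(p, r, K, A, n):
    """Simulate d competing populations for n time ticks.

    Assumed update: P_i(t+1) = P_i(t) * (1 + r_i * (1 - sum_j a_ij * P_j(t) / K_i)).
    p: initial sizes, r: growth rates, K: carrying capacities, A: d x d interactions.
    """
    d = len(p)
    trajectory = [list(p)]
    for _ in range(n - 1):
        cur = trajectory[-1]
        nxt = []
        for i in range(d):
            pressure = sum(A[i][j] * cur[j] for j in range(d)) / K[i]
            nxt.append(cur[i] * (1 + r[i] * (1 - pressure)))
        trajectory.append(nxt)
    return trajectory

# No competition (A is the identity): each sequence follows its own logistic curve.
independent = simulate_competition(
    p=[0.01, 0.01], r=[0.1, 0.1], K=[1.0, 1.0],
    A=[[1.0, 0.0], [0.0, 1.0]], n=500)

# Sequence 0 is suppressed by sequence 1 (a_01 = 0.5); sequence 1 is unaffected.
competing = simulate_competition(
    p=[0.01, 0.01], r=[0.1, 0.1], K=[1.0, 1.0],
    A=[[1.0, 0.5], [0.0, 1.0]], n=500)

print(independent[-1], competing[-1])
```

Every off-diagonal entry of A changes the trajectory of a whole sequence, so the parameters cannot be estimated one at a time, which is the difficulty the paragraph above describes.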

figure a
figure b

Algorithm - EcoWeb-Fit

We thus extend StepFit and introduce a partitioning approach for analyzing a large number of keywords, which yields a dramatic reduction in the computation cost. Algorithm 2 describes the overall procedure. The idea is that instead of fitting all the parameters of X simultaneously, it first assumes that there is no interspecies competition (that is, it sets \(\mathbf{A} = \mathbf{I}_{d}\), i.e., \(a_{ij} = 0\) for \(i \neq j\)), and estimates a model parameter set \(S_{i} = \{p_{i}, r_{i}, K_{i}, w_{ii}, b_{i}\}\) for each individual sequence \(x_{i}\ (i=1,\dots ,d)\) separately using StepFit. In the next step, it assumes that there is competition between two keywords i and j. Specifically, for each iteration, the algorithm tries to find the best pair \((x_{i}, x_{j})\) so that it minimizes the cost function, i.e., \(Cost_{T}(x_{i}, x_{j} | S_{ij})\). It continues pair-fitting until convergence. Finally, the algorithm optimizes the full parameter set S using the entire sequence set X.
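The pair-fitting loop can be read as a greedy search over candidate interactions. The sketch below is one possible reading of the prose above, not a transcription of Algorithm 2: `total_cost` is a hypothetical callback that, in a real system, would refit the affected sequences with StepFit and return the resulting code length, and the final joint optimization over all of X is omitted.

```python
from itertools import combinations

def greedy_pair_fitting(d, total_cost):
    """Greedily enable pairwise interactions while the total cost decreases.

    d: number of sequences.
    total_cost: callable taking a set of active pairs (i, j), i < j,
                and returning the total cost with those interactions enabled.
    """
    active = set()                  # start from A = I (no competition)
    best = total_cost(active)
    while True:
        candidates = [
            (total_cost(active | {pair}), pair)
            for pair in combinations(range(d), 2)
            if pair not in active
        ]
        if not candidates:
            break
        cost, pair = min(candidates)   # best pair for this iteration
        if cost >= best:               # no pair improves the cost: converged
            break
        active.add(pair)
        best = cost
    return active, best

# Toy cost: each enabled pair costs 5 units of model description,
# and only some pairs reduce the coding cost.
GAINS = {(0, 1): 30.0, (1, 2): 2.0}

def toy_cost(pairs):
    return 100.0 + sum(5.0 - GAINS.get(pair, 0.0) for pair in pairs)

selected, final_cost = greedy_pair_fitting(3, toy_cost)
print(selected, final_cost)
```

In the toy run only the pair (0, 1) is kept: the pair (1, 2) saves fewer bits than its parameters cost, so enabling it would increase the total.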

6 Experiments

In this section we demonstrate the effectiveness of EcoWeb with real data. The experiments were designed to answer the following questions:

  (Q1) Effectiveness: How successful is our method in spotting meaningful patterns in given input sequences?

  (Q2) Accuracy: How well does our method match the data?

  (Q3) Scalability: How does our method scale in terms of computational time?

6.1 Q1: Effectiveness

We now demonstrate the power of our model in terms of capturing important and informative patterns of online activities. We performed experiments on sequence sets of keywords/activities from seven areas (i.e., games, programming languages, social media, apparel companies, retail companies, beers and online TV) on GoogleTrend. Note that the dataset is scaled so that each sequence has a peak volume of 1.0.
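The peak scaling mentioned above amounts to dividing each sequence by its own maximum. A minimal sketch, with a hypothetical function name and toy values:

```python
def scale_to_peak(sequence):
    """Rescale a non-negative sequence so that its largest value is 1.0."""
    peak = max(sequence)
    if peak <= 0:
        raise ValueError("sequence must contain at least one positive value")
    return [value / peak for value in sequence]

weekly_volume = [2, 4, 1]          # toy search volumes
print(scale_to_peak(weekly_volume))
```

Scaling each sequence separately makes keywords with very different absolute volumes comparable in shape, at the price of discarding their relative magnitudes.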

#1. Video games

The result for this area has already been presented in Figure 1 of Section 1. Our method captures long-range evolving dynamics between three game consoles (i.e., Xbox, PlayStation and Wii), and the appearance of Android, as well as important annual events, e.g., Electronic Entertainment Expo (E3), Black Friday and Christmas.

#2. Programming languages

Figure 5a shows our discoveries on the programming language activities: “C”, “R” and “MATLAB”.

  • Long-range evolution and interaction: Figure 5a-i shows the fitting results (lines) and the original sequences (circles). Again, our method fits the real data very well. Moreover, it captures the interaction: Figure 5a-ii shows the interaction network, indicating competition between the “C” programming language and the “R” statistical system, while “MATLAB” seems not to be involved. Indeed, the time sequences show that the interest in “R” has increased constantly since 2004, at the expense of “C” - possibly, due to an emphasis on big data analytics.

  • Seasonal activities: Figure 5a-iii shows the full parameter set of EcoWeb (darker gray corresponds to a higher value). With respect to the seasonal activities (W and B, shown at the bottom), our method discovered, to our surprise, that there is a strong correlation with the academic calendar. For example, during the spring, summer and winter breaks, the attention paid to each keyword (especially MATLAB) decreases significantly. Apparently, most of those issuing queries are students (as opposed to professional programmers), and they enjoy their vacation instead of coding.

Figure 5

Fitting results of EcoWeb for #2. Programming languages, #3. social media. Our model (solid lines) fits the original data (in circles) very well; spots competitors (indicated by edges); and spots the strongest seasonal patterns. See text for more observations

#3. Social media

Figure 5b shows the fitting result for the social media activities: “Tumblr”, “Facebook” and “LinkedIn”.

  • Long-range evolution and interaction: Most social media sites have been attracting searches only recently (say, after 2008 - p≈0, see (b-iii)). For example, Tumblr is a blog platform that was founded in 2007, and it has been attracting huge numbers of users (i.e., the growth rate r of Tumblr is steep). Figure 5b-ii shows that there is competition between Tumblr and Facebook, but there is no competitor for LinkedIn.

  • Seasonal activities: The bottom figure (b-iii) shows that there is an opposite seasonality as regards social media: during Christmas and New Year’s day, the number of Facebook users increases, while the number using LinkedIn drops significantly. This is probably because the former is used for private purposes, while the latter is a business-oriented SNS.

#4. Apparel companies

Figure 6c shows the result for four heavily-searched fashion-related companies: Nordstrom (an upscale department store); Kohl’s (a discount retailer); JCPenney (a mid-range department store, with CEO problems); and Forever21 (which focuses on young girls, and recently added a line of bigger sizes).

  • Long-range evolution and interaction: Our method captures the competition between Kohl’s and Nordstrom, and between JCPenney and Forever21. Arguably due to the recession (2008 onwards), shoppers moved away from upscale Nordstrom and towards discount-priced Kohl’s (which also engaged in some brilliant marketing: offering discounts for seniors, and issuing its own credit card to encourage increased customer loyalty). Similarly, Forever21 grew significantly, probably due to their decision to add a line of bigger sizes; thus, it apparently lured attention away from JCPenney, which was damaged by poor decisions made by the new CEO, Ron Johnson, who was eventually fired.

  • Seasonal activities: All keywords have clear patterns of annual activity. There is a huge spike on Black Friday: the biggest sale event of the year. There is also a small spike in August, which is the “back to school” period.

Figure 6

Fitting results of EcoWeb for #4. apparel and #5. retail companies. Our model (solid lines) fits the original data (in circles) very well; spots competitors (indicated by edges); and spots the strongest seasonal patterns. See text for more observations

#5. Retail companies

Figure 6d shows the results for the top six retail companies (i.e., Amazon, Walmart, Home Depot, Best Buy, Lowes and Costco).

  • Long-range evolution and interaction: Clearly, every keyword is steadily increasing, with Best Buy being the only exception (arguably suffering, due to the success of online retailers). There is no clear interaction, except between Home Depot and Lowes, which are home improvement and appliance retailers, or, do it yourself (DIY) stores. In Figure 6d-i, our method captures both the individual and interaction dynamics of retail activities.

  • Seasonal activities: As described in (d-iii), our method automatically discovered two hidden seasonal patterns (i.e., k = 2) in retail companies. The first component (\(b_{1}\), in light brown) corresponds to Home Depot and Lowes, and the second component (\(b_{2}\), in purple) corresponds to Amazon, Walmart, Best Buy and Costco. In addition to a huge clear spike on Black Friday in both components, there are multiple spikes in Home Depot and Lowes, corresponding to the national holidays in summer (see \(b_{1}\)): Memorial Day (last Monday in May), Independence Day (4th of July) and Labor Day (first Monday in September).

#6. Beers

Figure 7e shows the results for the beer activities (i.e., Corona, Coors and Modelo).

  • Long-range evolution and interaction: Figure 7e-ii shows the latent competition between Modelo and Corona. Modelo and Corona are popular lagers produced by Grupo Modelo in Mexico, while Coors is brewed in Colorado, US. As shown in Figure 7e-i, compared with Corona, which is growing steadily, Modelo is declining significantly.

  • Seasonal activities: Figure 7e-iii shows the seasonal trend of the three beer brands. Our method discovered that many users are interested in these keywords during the summer, the season when drinking beer is most popular.

Figure 7

Fitting results of EcoWeb for #6. beers and #7. online TV. Our model (solid lines) fits the original data (in circles) very well; spots competitors (indicated by edges); and spots the strongest seasonal patterns. See text for more observations

#7. Online TV

Figure 7f shows the results for Netflix, Hulu, YouTube and Amazon Prime.

  • Long-range evolution and interaction: Recently, there has been a rapid increase in new video streaming services, and EcoWeb successfully captures the long-range evolution and the exponential rising and decaying patterns in all co-evolving keywords. For example, Hulu (shown as the green line) has been declining since 2011, which coincided with the ascent of Netflix and Amazon Prime. This possibly indicates competition/interaction between these services, with Netflix and Amazon Prime drawing customers’ attention away from Hulu.

  • Seasonal activities: Figure 7f-iii shows the latent seasonality for the online TV services. Specifically, there is a yearly cyclic spike, which corresponds to the New Year event.

6.2 Q2: Model accuracy

Next, we discuss the quality of our approach in terms of fitting accuracy. We compared EcoWeb-Fit with the standard LV model. To evaluate the effect of our efficient fitting algorithms, we also compared it with a special version of our method, EcoWeb-Plain, which uses only StepFit to estimate model parameters. Figure 8 shows the root mean square error (RMSE) between the original and estimated volumes for seven sequence sets (#1–#7). A lower value indicates a better fitting accuracy. As shown in the figure, our approach achieved high fitting accuracy. Since the LV model cannot capture seasonal patterns, it was strongly affected by multiple spikes and failed to capture co-evolving dynamics. EcoWeb-Plain has the ability to capture periodic patterns, but it was not completely successful in capturing complicated dynamics and interactions between multiple sequences.
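For reference, the RMSE used as the accuracy measure can be sketched as follows. The sketch assumes that the error is averaged over all d sequences and all n time ticks of a sequence set; the function name and the toy values are hypothetical.

```python
import math

def rmse(original, estimated):
    """Root mean square error between two d x n collections of sequences."""
    squared = [
        (x - c) ** 2
        for xs, cs in zip(original, estimated)   # one pair of sequences per keyword
        for x, c in zip(xs, cs)                  # one pair of values per time tick
    ]
    return math.sqrt(sum(squared) / len(squared))

original = [[1.0, 2.0], [3.0, 4.0]]
estimated = [[1.0, 2.0], [3.0, 6.0]]
print(rmse(original, estimated))
```

Because the sequences are scaled to a peak of 1.0, RMSE values are directly comparable across the seven sequence sets.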

Figure 8

Accuracy of EcoWeb-Fit: Fitting error (RMSE) between original and estimated volume for seven sequence sets (#1–#7) (lower is better)

6.3 Q3: Scalability

We also evaluated the scalability of our method. Figure 9 shows the average computational cost of EcoWeb-Fit. We varied the dataset size from five to ten years. Our method achieved a large reduction in terms of computation time as well as fitting error for every sequence set. We observed that EcoWeb-Fit was linear with respect to data length n, and was up to 20 times faster than EcoWeb-Plain. EcoWeb-Fit was also up to 7 times faster than the LV model, even though our method has the ability to capture seasonal dynamics.

Figure 9

EcoWeb-Fit scales linearly: Wall clock time vs. dataset size (years). EcoWeb-Fit is up to 7 times faster than LV and up to 20 times faster than EcoWeb-Plain

7 EcoWeb at work - forecasting

Here, we describe the most important application of EcoWeb, namely, forecasting the future dynamics of co-evolving activities. Figure 10 shows the forecasting accuracy for seven sequence sets (i.e., #1–#7). Figures 11 and 12 show the results of our forecasting for the seven sequence sets (#1–#7). We trained the model parameters by using the first 2/3 of the values for each sequence set (black lines in Figs 11 and 12), and then forecasted the following years (colored lines, from 2012). We compared EcoWeb with the autoregressive (AR) model. For a fair comparison, we used coefficients that were the same size as our model parameters. In Figs 11 and 12, the top, middle and bottom rows show the original sequences, and the forecast results of EcoWeb and AR, respectively. As shown in the figures, our method successfully forecasted the long-range evolution of each sequence, as well as seasonal spikes, while AR failed to capture the non-linear evolutions. The forecasting error (RMSE) between the original and the forecasted volume of each dataset is shown in Fig 10. A lower value indicates a better forecasting accuracy. Unlike AR, our method achieves high forecasting accuracy for every sequence set.
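The evaluation protocol (train on the first two-thirds, forecast the remainder) can be sketched with a deliberately small baseline. The example below uses a first-order AR model with an intercept, fitted in closed form; this order is an assumption chosen for brevity and is much smaller than the AR models used in the comparison above. All function names and the toy sequence are hypothetical.

```python
def split_train_test(sequence):
    """Use the first two-thirds of the sequence for training, the rest for testing."""
    cut = (2 * len(sequence)) // 3
    return sequence[:cut], sequence[cut:]

def fit_ar1(train):
    """Least-squares fit of x(t) = intercept + slope * x(t-1)."""
    prev, nxt = train[:-1], train[1:]
    mean_prev = sum(prev) / len(prev)
    mean_next = sum(nxt) / len(nxt)
    var = sum((x - mean_prev) ** 2 for x in prev)
    if var == 0:                       # constant training data: predict the mean
        return mean_next, 0.0
    cov = sum((x - mean_prev) * (y - mean_next) for x, y in zip(prev, nxt))
    slope = cov / var
    return mean_next - slope * mean_prev, slope

def forecast_ar1(last_value, intercept, slope, steps):
    """Iterate the fitted recurrence, feeding each forecast back as input."""
    forecasts, current = [], last_value
    for _ in range(steps):
        current = intercept + slope * current
        forecasts.append(current)
    return forecasts

# Toy sequence generated by x(t) = 0.5 + 0.5 * x(t-1), starting from 3.
sequence = [3.0, 2.0, 1.5, 1.25, 1.125, 1.0625, 1.03125, 1.015625, 1.0078125]
train, test = split_train_test(sequence)
intercept, slope = fit_ar1(train)
predicted = forecast_ar1(train[-1], intercept, slope, steps=len(test))
print(predicted, test)
```

The toy sequence is exactly linear in its previous value, so the baseline recovers it; on sequences with non-linear growth and competition, a linear recurrence of this kind has no mechanism to reproduce those dynamics.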

Figure 10

Forecasting error for each sequence set (#1-#7). Lower is better. Our method achieves high forecasting accuracy for every sequence set

Figure 11

Forecasting future evolutions. Top: original sequence set. Middle and bottom: EcoWeb clearly outperforms AR. Both methods train the model parameters using 2/3 of each sequence set, and then start forecasting (at the vertical line, i.e., 2012)

Figure 12

Forecasting future evolutions. Top: original sequence set. Middle and bottom: EcoWeb clearly outperforms AR. Both methods train the model parameters using 2/3 of each sequence set, and then start forecasting (at the vertical line, i.e., 2012)

8 Conclusions

We presented EcoWeb, an intuitive model for mining large scale co-evolving online activities. Our main idea is that online activities behave like species in an ecological system in that they compete for resources (such as user attention), and they evolve over time according to a non-linear dynamical system. Our proposed method has the following appealing properties:

  1. Effective: it detects important patterns, hidden interactions and seasonalities that match human intuition.

  2. Automatic: it needs no parameter tuning, thanks to our coding scheme.

  3. Scalable: it is linear on the input size.

  4. Practical: it can undertake long-range forecasting and outperforms existing methods (Section 7).