
1 Introduction

Advances in data logging and mobile computing devices have enabled biologists to collect scientific data on the movement, behavior, and physiology of moving animals. This approach is called bio-logging and is considered a promising methodology for acquiring knowledge and insights about animal behavior and the natural environment. Data collected in bio-logging studies are naturally formulated as multi-dimensional time series. For example, with GPS loggers, the trajectory of a moving animal is represented as a two-dimensional time series. In this paper, we study computational data analysis techniques for extracting biological knowledge from such multi-dimensional time-series data collected in bio-logging studies.

In bio-logging data analysis, it is important to extract knowledge that is both interesting and interpretable for biologists. Traditional descriptive statistics, such as the averages and variances of certain animal behaviors, are often not very interesting and rarely lead to new scientific findings. On the other hand, highly complicated nonparametric and nonlinear data analysis methods such as artificial neural networks tend to produce results that are too complicated for biologists to interpret. The goal of this paper is to introduce sequential pattern mining into bio-logging data analysis and to demonstrate that it is useful for obtaining both interesting and interpretable knowledge about animal behavior.

Sequential pattern mining has been studied in the data mining community [1] for extracting knowledge from discrete symbol sequences. The knowledge extracted by pattern mining methods is called patterns. To apply sequential pattern mining methods to bio-logging data analysis, we first represent a multi-dimensional time series as a discrete symbol sequence. Then, by using a method called frequent sequential pattern mining [2,3,4,5,6,7,8,9,10,11,12,13], we can extract the set of frequent subsequences that appear in the bio-logging trajectory data. When biologists are interested in a comparison between two groups or conditions, we can also find patterns that appear frequently in one group or condition and do not appear in the other, which is called discriminative sequential pattern mining.

The rest of the paper is organized as follows. First, we formulate the problem setup in Sect. 2. Then, in Sect. 3, we present the basic idea of frequent and discriminative sequential pattern mining and one of the well-known algorithms, called PrefixSpan. In Sect. 4, we apply the sequential pattern mining method to two published animal movement data sets on the Streaked Shearwater and the nematode C. elegans. In the former, we extracted sequential patterns that frequently appear only in male or only in female birds. In the latter, we extracted sequential patterns that frequently appear only in a group of worms that have lost a specific function through a mutation in one of their genes. Section 5 concludes the paper.

Notations

We use the following notations in the rest of the paper. For any natural number n, we define \([n] := \{1, \ldots , n\}\). A sequence (an ordered list of discrete symbols) with length T is represented as \(\langle x_1, x_2, \ldots , x_T \rangle \).

2 Problem Setup

Data taken from a bio-logging study of an animal are generally represented as a multi-dimensional time series. We assume that appropriate preprocessing operations such as outlier removal, missing value imputation, and noise reduction have been applied to the raw data before obtaining the multi-dimensional time series. To apply sequence mining methods to bio-logging data, a multi-dimensional time series is first transformed into a sequence by discretization. A sequence is an ordered list of discrete symbols. We denote the number of different symbols as m and the set of those symbols as \(\mathcal{S}:= \{s_1, \ldots , s_m\}\).
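The discretization step can be illustrated with a minimal sketch. The paper does not specify the discretization rule at this point, so the per-dimension thresholds, the labels, and the function name `discretize` below are all assumptions made for illustration: each dimension is binned independently, and the tuple of bin labels at a time point serves as one symbol.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def discretize(
    series: Sequence[Sequence[float]],
    thresholds: Sequence[Sequence[float]],
    labels: Sequence[Sequence[str]],
) -> List[Tuple[str, ...]]:
    """Map each time point of a multi-dimensional series to one symbol.

    thresholds[d] holds sorted cut points for dimension d (hypothetical values);
    labels[d] holds len(thresholds[d]) + 1 bin labels for that dimension.
    """
    sequence = []
    for point in series:
        # bisect_right returns the index of the bin the value falls into.
        symbol = tuple(
            labels[d][bisect_right(thresholds[d], value)]
            for d, value in enumerate(point)
        )
        sequence.append(symbol)
    return sequence


# Hypothetical two-dimensional series: (speed, temperature) at three time points.
toy_series = [(1.0, 12.0), (6.0, 25.0), (7.5, 18.0)]
toy_thresholds = [[5.0], [15.0, 20.0]]
toy_labels = [["L", "H"], ["L", "M", "H"]]
print(discretize(toy_series, toy_thresholds, toy_labels))
```

With this encoding, the symbol set \(\mathcal{S}\) is the Cartesian product of the per-dimension label sets.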

In this paper, we consider a bio-logging study of two groups of animals, such as male/female or infant/adult. Let n denote the total number of animals. We denote the first and the second groups of animals as \(\mathcal{G}_{+}, \mathcal{G}_{-} \subseteq [n]\) and their sizes as \(n_+ := |\mathcal{G}_+|, n_- := |\mathcal{G}_-|\), respectively.

We denote a data set in a bio-logging study transformed into sequences of symbols as

$$\begin{aligned} \mathcal{D}:= \{ \varvec{x}_1, \varvec{x}_2, \ldots , \varvec{x}_n \} \end{aligned}$$

where \(\varvec{x}_i\) represents the sequence of the i-th animal. Each sequence \(\varvec{x}_i\) is written as

$$\begin{aligned} \varvec{x}_i := \langle x_{i1}, x_{i2}, \ldots , x_{iT(i)} \rangle , i \in [n], \end{aligned}$$

where \(x_{it}\) represents the symbol of the i-th animal at the t-th time point, which takes one of the symbols in \(\mathcal{S}\), and T(i) indicates the length of the i-th sequence.

The goal of sequence mining is to extract a set of patterns \(\varvec{q}_1, \varvec{q}_2, \ldots \) each of which is also defined as a sequence in the form of

$$\begin{aligned} \varvec{q}_k := \langle q_{k1}, q_{k2}, \ldots , q_{kL(k)} \rangle , k = 1, 2, \ldots , \end{aligned}$$

where L(k) is the length of the pattern \(\varvec{q}_k\) for \(k = 1, 2, \ldots \). We say that a sequence \(\varvec{x}_i\) contains a pattern \(\varvec{q}_k\) if

$$\begin{aligned} \exists \, 1 \le t_1< \cdots < t_{L(k)} \le T(i)\,\text { such that }\,q_{k1} = x_{i t_1}, q_{k2} = x_{i t_2}, \ldots , q_{k L(k)} = x_{i t_{L(k)}}, \end{aligned}$$

and represent this relationship as \(\varvec{q}_k \sqsubseteq \varvec{x}_i\). We denote the set of all possible patterns contained in any one of the sequences \(\{\varvec{x}_i\}_{i \in [n]}\) as \(\mathcal{Q}\). Note that \(\mathcal{Q}\) is very large in general: a single sequence of length T has up to \(2^T - 1\) non-empty subsequences.

For a set of sequences \(\mathcal{G}\subseteq [n]\), we define the support of the pattern \(\varvec{q}_k\) as

$$\begin{aligned} \mathrm{support}_\mathcal{G}(\varvec{q}_k) := \left| \{ \varvec{x}_i \mid i \in \mathcal{G}\text { and } \varvec{q}_k \sqsubseteq \varvec{x}_i \} \right| , \end{aligned}$$

i.e., \(\mathrm{support}_\mathcal{G}(\varvec{q}_k)\) indicates the number of sequences in the set \(\mathcal{G}\) that contain the pattern \(\varvec{q}_k\).
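The containment relation \(\sqsubseteq\) and the support can be sketched in a few lines of Python. This is a minimal sketch of the definitions above (subsequences with gaps allowed); the function names and the toy data are hypothetical, the toy data are not the contents of Table 1, and indices are 0-based rather than the 1-based \([n]\) of the text.

```python
from typing import Iterable, Sequence


def contains(x: Sequence[str], q: Sequence[str]) -> bool:
    """Return True if pattern q is a subsequence of x (gaps allowed)."""
    remaining = iter(x)
    # `symbol in remaining` consumes the iterator up to the first match, so each
    # pattern symbol must be found strictly after the previous one.
    return all(symbol in remaining for symbol in q)


def support(
    dataset: Sequence[Sequence[str]], group: Iterable[int], q: Sequence[str]
) -> int:
    """Number of sequences indexed by `group` that contain q."""
    return sum(1 for i in group if contains(dataset[i], q))


# Hypothetical toy data; each string is one symbol sequence.
toy_dataset = ["eadf", "eabd", "abbf", "cabb"]
group_plus, group_minus = [0, 1], [2, 3]
print(support(toy_dataset, group_plus, "ed"))
print(support(toy_dataset, group_minus, "ab"))
```

Note that a sequence contributes at most one to the support, no matter how many times the pattern occurs in it.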

In this study, we are interested in finding patterns that appear more frequently in one of the two groups than in the other. We define the difference of the supports of a pattern \(\varvec{q}_k\) in the two groups as

$$\begin{aligned} \delta _+ (\varvec{q}_k) = \mathrm{support}_{\mathcal{G}_+}(\varvec{q}_k) - \mathrm{support}_{\mathcal{G}_-}(\varvec{q}_k), \end{aligned}$$
$$\begin{aligned} \delta _- (\varvec{q}_k) = \mathrm{support}_{\mathcal{G}_-}(\varvec{q}_k) - \mathrm{support}_{\mathcal{G}_+}(\varvec{q}_k). \end{aligned}$$

We then want to extract the top C patterns with the largest \(\delta _+ (\varvec{q}_k)\) and the top C patterns with the largest \(\delta _- (\varvec{q}_k)\).

Table 1 shows an illustrative example of a data set with \(n=6\) \((n_+ = n_- = 3)\), \(T(1)=4\), \(T(2)=4\), \(T(3)=7\), \(T(4)=4\), \(T(5)=4\), \(T(6)=7\), and \(\mathcal{S}= \{a, b, c, d, e, f\}\). In this illustrative example, the movement of each of the three male and three female animals is represented as a sequence of T(i) symbols chosen from \(\mathcal{S}\). The goal of our analysis is to extract patterns (subsequences) that are found more frequently in the male animals than in the female animals, or vice versa. When we set \(C=2\), the top 2 patterns appearing more frequently in the males than in the females are \(\langle d \rangle \) and \(\langle e, a \rangle \), while the top 2 patterns appearing more frequently in the females than in the males are \(\langle b \rangle \) and \(\langle b, b \rangle \).

Table 1. An illustrative example of a data set with \(n=6 (n_+ = n_- = 3)\), \(T(1)=4\), \(T(2)=4\), \(T(3)=7\), \(T(4)=4\), \(T(5)=4\), \(T(6)=7\), and \(\mathcal{S}= \{a, b, c, d, e, f\}\). The goal is to extract patterns appearing more frequently in the male animals than the female animals, and vice-versa.

3 Sequential Pattern Mining

Sequential pattern mining is widely used for extracting frequent subsequences from a set of sequences. In this section, we describe the basic idea of sequential pattern mining, a well-known sequential pattern mining algorithm called PrefixSpan, and how to extract discriminative patterns between two groups. In this paper, we focus on finding contiguous patterns, i.e., patterns with no gaps between symbols. It is easy to extend the method to find non-contiguous sequential patterns.

3.1 Frequent Sequential Pattern Mining

Given a subset of sequences \(\mathcal{G}\subseteq [n]\), the set of all patterns that appear in at least \(\mathtt{{min\_sup}}\) sequences in \(\mathcal{G}\) is called the set of frequent sequential patterns and is denoted as

$$\begin{aligned} F_\mathcal{G}(\mathtt{{min\_sup}}) := \{\varvec{q}_k \in \mathcal{Q}\mid \mathrm{support}_\mathcal{G}(\varvec{q}_k) \ge \mathtt{{min\_sup}}\}. \end{aligned}$$

In the context of pattern mining, the threshold value \(\mathtt{{min\_sup}}\) is called minimum support. A method that can find frequent sequential patterns is called a frequent sequential pattern mining method. For example, in Table 1, when \(\mathtt{{min\_sup}}=2\),

$$\begin{aligned} F_\mathrm{male}(2)&= \{\langle a \rangle , \langle d \rangle , \langle e \rangle , \langle f \rangle , \langle a, d \rangle , \langle e, a \rangle \}, \\ F_\mathrm{female}(2)&= \{\langle a \rangle , \langle b \rangle , \langle e \rangle , \langle f \rangle , \langle a, b \rangle , \langle b, b \rangle , \langle b, f \rangle , \langle a, b, f \rangle \}. \end{aligned}$$

Since the number of possible patterns \(|\mathcal{Q}|\) is quite large in general, it is often infeasible to actually count the supports of all possible patterns. To circumvent this difficulty, sequential pattern mining methods exploit the fact that the support of a pattern is always smaller than or equal to the support of any of its subsequences. Consider two sequences \(\varvec{q}_{k^\prime }\) and \(\varvec{q}_k\) such that \(\varvec{q}_{k^\prime } \sqsubseteq \varvec{q}_k\), i.e., \(\varvec{q}_{k^\prime }\) is a subsequence of \(\varvec{q}_k\). Since every sequence that contains \(\varvec{q}_k\) also contains \(\varvec{q}_{k^\prime }\), it follows that

$$\begin{aligned} \mathrm{support}_\mathcal{G}(\varvec{q}_{k^\prime }) \ge \mathrm{support}_\mathcal{G}(\varvec{q}_k) ~~~ \forall \varvec{q}_{k^\prime } \sqsubseteq \varvec{q}_k. \end{aligned} \tag{1}$$

Eq. (1) indicates that, when we consider a tree as in Fig. 1, the support of the pattern at a node is always greater than or equal to the supports of the patterns at its descendant nodes, and smaller than or equal to those at its ancestor nodes. This anti-monotonicity of the support in the tree can be exploited for finding frequent sequential patterns. Namely, when we search over the tree, if the support of a node is already smaller than \(\mathtt{{min\_sup}}\), we can skip searching its subtree.
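The pruning rule can be sketched as a depth-first search over the pattern tree. This is a minimal sketch, not the algorithm used in the paper: it counts supports directly rather than with projected databases, it considers contiguous patterns only (the focus of this section), and the function names, the `max_len` cap, and the toy sequences are assumptions made for illustration.

```python
from typing import Dict, Sequence, Tuple

Pattern = Tuple[str, ...]


def occurs(x: Sequence[str], q: Pattern) -> bool:
    """True if q appears in x as a contiguous run of symbols."""
    m = len(q)
    return any(tuple(x[i:i + m]) == q for i in range(len(x) - m + 1))


def support(seqs: Sequence[Sequence[str]], q: Pattern) -> int:
    """Number of sequences that contain q contiguously."""
    return sum(occurs(x, q) for x in seqs)


def frequent_patterns(seqs, alphabet, min_sup: int, max_len: int):
    """Depth-first search over the pattern tree with support-based pruning."""
    found: Dict[Pattern, int] = {}
    visited = 0  # number of candidate patterns whose support was counted

    def dfs(prefix: Pattern) -> None:
        nonlocal visited
        for s in alphabet:
            candidate = prefix + (s,)
            visited += 1
            sup = support(seqs, candidate)
            if sup < min_sup:
                # By anti-monotonicity, no extension of `candidate` can be
                # frequent, so the whole subtree below it is skipped.
                continue
            found[candidate] = sup
            if len(candidate) < max_len:
                dfs(candidate)

    dfs(())
    return found, visited


# Hypothetical toy sequences over S = {a, b, c}.
toy_seqs = ["abca", "abcb", "cab"]
patterns, n_visited = frequent_patterns(toy_seqs, "abc", min_sup=2, max_len=3)
for q, sup in sorted(patterns.items()):
    print("".join(q), sup)
print("candidates counted:", n_visited)
```

Without pruning, the search would count all \(3 + 9 + 27 = 39\) candidates of length at most 3; with pruning, only the children of frequent nodes are visited.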

Fig. 1. Tree structure for a frequent sequential pattern mining problem with \(\mathcal{S}= \{ a,b,c \}\).

3.2 PrefixSpan

In the data mining literature, several types of sequential pattern mining methods have been proposed [12]. Among them, we use a pattern-growth type method and employ a popular algorithm called PrefixSpan [2]. PrefixSpan explores the search space of sequential patterns by depth-first search. It starts from sequential patterns containing only a single symbol and explores longer patterns by recursively appending symbols to the existing ones.

To formulate the PrefixSpan algorithm, let us define the concatenation of a sequence and a symbol. Given a sequence \(\varvec{z}= \langle z_1, z_2, \ldots , z_T \rangle \) and a symbol \(s \in \mathcal{S}\), the notation

$$\begin{aligned} \varvec{z}\diamond s = \langle z_1, z_2, \ldots , z_T, s \rangle \end{aligned}$$

indicates the concatenation of \(\varvec{z}\) and s. The concatenation of two sequences is defined similarly. Given two sequences \( \varvec{v}= \langle v_1, v_2, \ldots , v_{T_1} \rangle \) and \(\varvec{w}= \langle w_1, w_2, \ldots , w_{T_2}\rangle \), the notation

$$\begin{aligned} \varvec{v}\diamond \varvec{w}= \langle v_1, v_2, \ldots , v_{T_1}, w_1, w_2, \ldots , w_{T_2}\rangle \end{aligned}$$

represents the concatenation of \(\varvec{v}\) and \(\varvec{w}\).

In the PrefixSpan algorithm, the reduced database obtained by removing a specific sequence \(\varvec{z}\) as a prefix from the original database \(\mathcal{D}\) is called the projected database \(\mathcal{D}_{\varvec{z}}\), and is defined as

$$\begin{aligned} \mathcal{D}_{\varvec{z}} :=\{ \varvec{w}\mid \varvec{z}^\prime \in \mathcal{D}, \varvec{z}^\prime =\varvec{v}\diamond \varvec{w}\} ~~~ \mathrm{s.t.~}\varvec{z}\sqsubseteq \varvec{v}~\text {and}~\not \exists \varvec{v}^\prime , \varvec{z}\sqsubseteq \varvec{v}^\prime \sqsubset \varvec{v}, \end{aligned} \tag{2}$$

where \(\varvec{v}\) represents the smallest prefix including \(\varvec{z}\) in \(\varvec{z}^\prime \).

In general sequential pattern mining problems, the contiguity of patterns is not considered, and only the order of the symbols matters. For example, in general sequential pattern mining contexts, both of the sequences \(\varvec{z}_1= \langle s_1, s_2, s_3 \rangle \) and \(\varvec{z}_2= \langle s_1, s_4, s_5, \ldots , s_{100}, s_2, s_3 \rangle \) are considered to contain the sequential pattern \(\varvec{q} = \langle s_1, s_2, s_3 \rangle \). In this paper, however, we focus on finding contiguous patterns, and regard \(\varvec{z}_2\) as not containing \(\varvec{q}\) in the above example. To reflect this change, we need to slightly change the definition of the projected database to

$$\begin{aligned} \mathcal{D}_{\varvec{z}} =\{ \varvec{w}\mid \varvec{z}^\prime \in \mathcal{D}, \varvec{z}^\prime =\varvec{v}\diamond \varvec{w}\} ~~~ \mathrm{s.t.~}\varvec{z}\sqsubseteq \varvec{v}, |\varvec{z}|=|\varvec{v}|,~\text {and}~\not \exists \varvec{v}^\prime , \varvec{z}\sqsubseteq \varvec{v}^\prime \sqsubset \varvec{v}. \end{aligned} \tag{3}$$

In the example of Table 1, projected databases in our definitions are, e.g., given as

$$\begin{aligned} \mathcal{D}_{\langle e, a\rangle }&= \{ \langle d, f \rangle , \langle b, f, b, d, c \rangle \}, \\ \mathcal{D}_{\langle a, b \rangle }&= \{\langle f, b, d, c\rangle , \langle f \rangle , \langle f, b, b, c, e \rangle \}. \end{aligned}$$
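One way to read the example above is that each sequence containing the pattern as a contiguous run contributes the suffix that follows that run. The sketch below implements this reading. It is an interpretation rather than a transcription of Eq. (3): the function name `project` is hypothetical, the choice to use only the first contiguous occurrence in each sequence is an assumption, and the toy sequences are hypothetical and are not the contents of Table 1.

```python
from typing import List, Sequence, Tuple


def project(database: Sequence[Sequence[str]], z: Sequence[str]) -> List[Tuple[str, ...]]:
    """Suffixes that follow the first contiguous occurrence of z in each sequence."""
    z = tuple(z)
    projected = []
    for seq in database:
        seq = tuple(seq)
        for start in range(len(seq) - len(z) + 1):
            if seq[start:start + len(z)] == z:
                # Keep everything after the matched run; the suffix may be empty.
                projected.append(seq[start + len(z):])
                break  # assumption: only the first occurrence is used
    return projected


# Hypothetical toy database; each string is one symbol sequence.
toy_database = ["eadf", "eabfbdc", "cabf"]
print(project(toy_database, "ea"))
print(project(toy_database, "ab"))
```

Sequences that do not contain the pattern contiguously contribute nothing, so the size of the projected database equals the support of the pattern.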

The pseudo-code of the PrefixSpan algorithm is presented in Algorithm 1. The PrefixSpan algorithm is efficient because only the sequential patterns appearing in at least \(\mathtt{{min\_sup}}\) sequences of \(\mathcal{D}\) are selected and counted.

Algorithm 1. Pseudo-code of the PrefixSpan algorithm.
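A pattern-growth search of this kind for contiguous patterns can be sketched as follows. This is a minimal sketch and not a reproduction of Algorithm 1: instead of materializing projected databases, it stores for each pattern the positions immediately after every occurrence, which plays the same role. The function name, the optional `max_len` cap, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Pattern = Tuple[str, ...]


def prefixspan_contiguous(
    database: Sequence[Sequence[str]], min_sup: int, max_len: Optional[int] = None
) -> Dict[Pattern, int]:
    """Return {pattern: support} for all frequent contiguous patterns."""
    results: Dict[Pattern, int] = {}

    def grow(prefix: Pattern, occurrences: List[Tuple[int, int]]) -> None:
        # occurrences: (sequence index, position of the symbol that would
        # extend this occurrence of `prefix`), one entry per occurrence.
        if max_len is not None and len(prefix) >= max_len:
            return
        extensions = defaultdict(list)
        for i, pos in occurrences:
            if pos < len(database[i]):
                extensions[database[i][pos]].append((i, pos + 1))
        for symbol, occ in sorted(extensions.items()):
            # Support counts distinct sequences, not occurrences.
            sup = len({i for i, _ in occ})
            if sup >= min_sup:
                pattern = prefix + (symbol,)
                results[pattern] = sup
                grow(pattern, occ)  # infrequent extensions are never grown

    # The empty prefix "occurs" before every position of every sequence.
    start = [(i, t) for i, seq in enumerate(database) for t in range(len(seq))]
    grow((), start)
    return results


# Hypothetical toy sequences; each string is one symbol sequence.
toy_db = ["abca", "abcb", "cab"]
for q, sup in sorted(prefixspan_contiguous(toy_db, min_sup=2).items()):
    print("".join(q), sup)
```

Tracking every occurrence, rather than only the first one per sequence, matters for contiguous patterns: a later occurrence of a prefix may be extendable even when the first one is not.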

3.3 Finding Discriminative Patterns

As mentioned in the previous sections, it is important to find patterns that appear more frequently in one of the two groups than in the other. We call these discriminative patterns. In this paper, we consider a naive approach for finding discriminative patterns. When we want to find the patterns that appear more frequently in the positive group than in the negative group, we first conduct frequent sequential pattern mining with respect to the positive group. After we find the frequent sequential patterns \(F_{\mathcal{G}_+}(\mathtt{{min\_sup}})\), we naively count \(\mathrm{support}_{\mathcal{G}_-}(\varvec{q}_k)\) for all \(\varvec{q}_k \in F_{\mathcal{G}_+}(\mathtt{{min\_sup}})\), and find the top C patterns that have the largest \(\delta _+(\varvec{q}_k)\). Similarly, when we want to find the patterns that appear more frequently in the negative group than in the positive group, we first compute \(F_{\mathcal{G}_-}(\mathtt{{min\_sup}})\) and count \(\mathrm{support}_{\mathcal{G}_+}(\varvec{q}_k)\) for all \(\varvec{q}_k \in F_{\mathcal{G}_-}(\mathtt{{min\_sup}})\) to obtain the top C patterns that have the largest \(\delta _-(\varvec{q}_k)\).
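The naive procedure can be sketched end to end as follows. To keep the example short and self-contained, the frequent-pattern step uses a brute-force enumeration of contiguous substrings as a stand-in for PrefixSpan. The function names, the `max_len` cap, the tie-breaking rule (shorter patterns first, then lexicographic order), and the toy data are assumptions; the paper does not state how ties are broken.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Pattern = Tuple[str, ...]


def contiguous_supports(seqs: Sequence[Sequence[str]], max_len: int) -> Counter:
    """Support of every contiguous pattern of length <= max_len (brute force)."""
    counts: Counter = Counter()
    for seq in seqs:
        seq = tuple(seq)
        # A set ensures each sequence counts at most once per pattern.
        seen = {
            seq[i:j]
            for i in range(len(seq))
            for j in range(i + 1, min(len(seq), i + max_len) + 1)
        }
        counts.update(seen)
    return counts


def top_discriminative(
    target: Sequence[Sequence[str]],
    other: Sequence[Sequence[str]],
    min_sup: int,
    C: int,
    max_len: int = 5,
) -> List[Tuple[int, Pattern]]:
    """Top C patterns frequent in `target`, ranked by support difference."""
    sup_target = contiguous_supports(target, max_len)
    sup_other = contiguous_supports(other, max_len)  # missing patterns count as 0
    frequent = [q for q, s in sup_target.items() if s >= min_sup]
    scored = [(sup_target[q] - sup_other[q], q) for q in frequent]
    scored.sort(key=lambda item: (-item[0], len(item[1]), item[1]))
    return scored[:C]


# Hypothetical toy groups (not the contents of Table 1).
positive = ["eadf", "eabd", "fead"]
negative = ["abbf", "babf", "cabb"]
print(top_discriminative(positive, negative, min_sup=2, C=2))  # largest delta_+
print(top_discriminative(negative, positive, min_sup=2, C=2))  # largest delta_-
```

Because candidates are restricted to patterns that are frequent in the target group, a pattern with a large support difference but a support below \(\mathtt{{min\_sup}}\) would not be reported.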

4 Applications to Two Data Sets in Bio-Logging Studies

We applied the sequential pattern mining method described in the previous section to two data sets in bio-logging studies on Streaked Shearwater [14] and C. elegans [15, 16]. Table 2 shows the summary of each dataset.

4.1 Datasets

C. elegans. The nematode Caenorhabditis elegans (C. elegans) is a widely used model animal. The authors of [16] measured the avoidance behavior of C. elegans in response to the repulsive odor 2-nonanone. The research question discussed in [16] is whether a loss of function caused by a mutation in a certain gene can change the avoidance behavior. We apply the sequential pattern mining method discussed in the previous section to find patterns whose frequencies differ between the wild type and the mutant. Specifically, we compared the avoidance behaviors of the wild-type strain (N2) with those of each of three mutant strains called egl-21, egl-3, and dop-3. For the three comparisons, we analyzed \(n=72\), 86, and 154 avoidance behaviors, respectively. Each avoidance behavior is represented by a sequence of three-dimensional symbols encoding the moving direction (Forward, Backward), the behavioral state (ruN, Pirouette), and the putative odor concentration change (Up, Down). Here, the set of symbols \(\mathcal{S}\) consists of \(2 \times 2 \times 2 = 8\) symbols, each of which is a combination of the three features; for example, the combination of “Forward movement”, “Pirouette”, and “Down” is denoted as “\((\mathrm{F, P, D})\)”.

Table 2. Summary of each dataset

Streaked Shearwater. Birds are able to reach feeding destinations far away from their nests by using various kinds of environmental information. In [14], the authors recorded the navigation trajectories of a number of Streaked Shearwaters by using GPS loggers. Here, the goal of our analysis is to extract navigation patterns that differ between males and females. We defined a trip as a movement in which a bird leaves its nesting island for more than 8 h. In our analysis, we used \(n = 968~(n_+ = n_- = 484)\) trips. Each trip is represented by a two-dimensional time series of the longitude and the latitude taken every minute. For sequential pattern mining, we extracted the speed (Low or High), the seawater temperature (Low, Middle, or High), and the distance from the coastline (Large, Middle, or Small) at each location in a trajectory, and represented each trip by a sequence of symbols defined by the combination of these three features. Here, the set of symbols \(\mathcal{S}\) consists of \(2 \times 3 \times 3 = 18\) symbols, each of which is a combination of the three features; for example, the combination of “High speed”, “Middle seawater temperature”, and “Small distance from the coastline” is denoted as “\((\mathrm{H, M, S})\)”.
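The symbol set for this data set can be enumerated as a Cartesian product of the three feature levels. The short sketch below only illustrates how the count of 18 arises; the constant names are hypothetical, and the one-letter codes follow the “\((\mathrm{H, M, S})\)” notation used above.

```python
from itertools import product

SPEED = ("L", "H")             # Low, High
TEMPERATURE = ("L", "M", "H")  # Low, Middle, High
DISTANCE = ("L", "M", "S")     # Large, Middle, Small

# Each symbol is one (speed, temperature, distance) combination.
ALPHABET = list(product(SPEED, TEMPERATURE, DISTANCE))
print(len(ALPHABET))
print(("H", "M", "S") in ALPHABET)
```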

4.2 Parameter Setting

In this data analysis, we set the parameters of the sequential pattern mining method as follows:

  • Minimum sequence length = 10

  • Maximum sequence length = 90

  • Select the top 10 (\(C = 10\)) patterns which appear more frequently in one group than the other.

Table 3. Result of EGL-21.
Table 4. Result of EGL-3.
Table 5. Result of DOP-3.

4.3 Results

C. elegans. Tables 3, 4 and 5 show the results of EGL-21, EGL-3 and DOP-3, respectively. In the tables, P1, P2, \(\ldots \), P10 indicate the top 10 patterns that appear more frequently in the positive class than in the negative class, whereas M1, M2, \(\ldots \), M10 indicate the top 10 patterns that appear more frequently in the negative class than in the positive class. A notation such as “\(\langle (\mathrm{F, N, D}) \times 60 \rangle \)” indicates the pattern in which the symbol “\((\mathrm{F, N, D})\)” repeats 60 times.

The results in the tables indicate that patterns that are more frequent in one group than in the other are successfully extracted. Table 3 suggests that many patterns containing multiple “\((\mathrm{B, P, U})\)” appear frequently in egl-21, but hardly appear in N2. On the other hand, many patterns containing multiple “\((\mathrm{F, N, D})\)” frequently appear in N2, but hardly appear in egl-21. These results can be interpreted as indicating that the loss of function in egl-21 leads to frequent pirouette behavior, which suggests that egl-21 fails to escape directly from the odor as N2 does. Table 4 shows that egl-3 has a similar tendency to egl-21. However, in Table 4, P1, P2, P3, P5, P8, and P9 also contain “\((\mathrm{F, N, D})\)”, meaning that the behavior of egl-3 is closer to that of N2 than the behavior of egl-21 is. These results are reasonable because the function lost in egl-21 and egl-3 is the same, and the degree of the loss is larger in egl-21 than in egl-3. Table 5 shows that many patterns containing multiple “\((\mathrm{F, P, D})\)” appear more frequently in dop-3, which is not observed for egl-21 and egl-3, suggesting that the loss of function in dop-3 leads to behavior different from that of egl-21 and egl-3.

Table 6. Result of BIRD.

Streaked Shearwater. Table 6 shows the results of BIRD. The results show that patterns repeating “\((\mathrm{H, M, L})\)” appear more frequently in males, while patterns repeating “\((\mathrm{L, H, S})\)” appear more frequently in females. These results can be interpreted as suggesting that (1) males tend to fly faster than females, (2) males tend to be further away from the coastline than females, and (3) the region where males fly tends to have a medium seawater temperature.

Although further studies are needed to confirm the correctness of these interpretations, the results in this paper suggest that sequential pattern mining methods can be useful for extracting interesting and interpretable knowledge from bio-logging data.

5 Discussion

In this paper, we employed a naive approach for finding discriminative patterns between two groups of animals, simply by applying frequent sequential pattern mining to the sequences in one of the two groups. Although this approach successfully found interesting and interpretable discriminative patterns in the two bio-logging data analyses, there are a few problems to be addressed in the future. First, many similar patterns were extracted among the top 10 patterns with the largest \(\delta _+(\varvec{q}_k)\) and \(\delta _-(\varvec{q}_k)\) values. It would be desirable to find a more diverse set of discriminative patterns. Second, it is unclear whether the identified patterns are statistically significant. It would be desirable to provide statistical confidence measures for the extracted patterns. Third, the current approach requires several tuning parameters to be specified, such as \(\mathtt{{min\_sup}}\), the minimum and maximum sequence lengths, and the number of patterns C. It would be desirable to develop a method that works with fewer tuning parameters.