Abstract
Recent advances in bio-logging devices such as GPS sensors enable researchers in ecology to measure animal trajectories quantitatively. These animal trajectory data are often represented as multi-dimensional time-series. In this paper, we develop a method for extracting interesting animal behaviors from these multi-dimensional time-series. To this end, we represent a multi-dimensional time-series as a discrete symbol sequence and apply techniques developed in sequential pattern mining, which has been actively studied in the knowledge discovery and data mining literature. Animal behavior studies often call for comparative analyses that find behaviors differing between groups, e.g., between male and female animals. We use a sequential pattern mining method designed for finding so-called discriminative sequential patterns, i.e., sequential patterns that are useful for discriminating between different groups of animals. We apply the method to several animal trajectory datasets to demonstrate its effectiveness.
1 Introduction
Advances in data logging and mobile computing devices have enabled biologists to collect scientific data on the movement, behavior and physiology of moving animals. This approach is called bio-logging and is considered a promising methodology for acquiring knowledge and insights about animal behavior and the natural environment. Data collected in bio-logging studies are naturally formulated as multi-dimensional time-series. For example, with GPS loggers, the trajectory of a moving animal is represented as a two-dimensional time-series. In this paper, we study computational data analysis techniques for extracting biological knowledge from such multi-dimensional time-series data collected in bio-logging studies.
In bio-logging data analysis, it is important to extract knowledge that is both interesting and interpretable for biologists. Traditional descriptive statistics, such as averages and variances of certain animal behaviors, are often not very interesting and rarely lead to new scientific findings. On the other hand, highly complicated nonparametric and nonlinear data analysis methods such as artificial neural networks tend to produce results that are too complicated for biologists to interpret. The goal of this paper is to introduce sequential pattern mining into bio-logging data analysis and to demonstrate that it is useful for obtaining both interesting and interpretable knowledge about animal behaviors.
Sequential pattern mining has been studied in the data mining community [1] for extracting knowledge from discrete symbol sequences. The knowledge extracted by pattern mining methods is called patterns. In order to apply sequential pattern mining methods to bio-logging data analysis, we first represent a multi-dimensional time-series as a discrete symbol sequence. Then, by using a method called frequent sequential pattern mining [2,3,4,5,6,7,8,9,10,11,12,13], we can extract the set of frequent subsequences that appear in the bio-logging trajectory data. When biologists are interested in comparing two groups or conditions, we can also find patterns that appear frequently in one group or condition and do not appear in the other; this task is called discriminative sequential pattern mining.
The rest of the paper is organized as follows. First, we formulate the problem setup in Sect. 2. Then, in Sect. 3, we present the basic idea of frequent and discriminative sequential pattern mining and a well-known algorithm called PrefixSpan. In Sect. 4, we apply the sequential pattern mining method to two published animal movement datasets on the Streaked Shearwater and the nematode C. elegans. In the former dataset, we extract sequential patterns that frequently appear only in male or only in female birds. In the latter dataset, we extract sequential patterns that frequently appear only in a group of worms that have lost a specific function through a mutation in one gene. Section 5 concludes the paper.
Notations
We use the following notations in the rest of the paper. For any natural number n, we define \([n] := \{1, \ldots , n\}\). A sequence (an ordered list of discrete symbols) with length T is represented as \(\langle x_1, x_2, \ldots , x_T \rangle \).
2 Problem Setup
Data taken from a bio-logging study of an animal are generally represented as a multi-dimensional time-series. We assume that appropriate preprocessing operations such as outlier removal, missing value imputation, and noise reduction have been applied to the raw data before obtaining the multi-dimensional time-series. To apply sequence mining methods to bio-logging data, a multi-dimensional time-series is first transformed into a sequence by discretization (see Note 1). A sequence is an ordered list of discrete symbols. We denote the number of different symbols as m and denote the set of those symbols as \(\mathcal{S}:= \{s_1, \ldots , s_m\}\).
In this paper, we consider a bio-logging study of two groups of animals, such as male/female or infant/adult. Let n denote the total number of animals. We denote the first and the second group of animals as \(\mathcal{G}_{+}, \mathcal{G}_{-} \subseteq [n]\) and their sizes as \(n_+ := |\mathcal{G}_+|, n_- := |\mathcal{G}_-|\), respectively.
We denote a data set in a bio-logging study transformed into sequences of symbols as
\[\mathcal{D} := \{\varvec{x}_i\}_{i \in [n]},\]
where \(\varvec{x}_i\) represents the sequence of the i-th animal. Each sequence \(\varvec{x}_i\) is written as
\[\varvec{x}_i := \langle x_{i1}, x_{i2}, \ldots , x_{iT(i)} \rangle ,\]
where \(x_{it}\) represents the symbol of the i-th animal at the t-th time point, which takes one of the symbols in \(\mathcal{S}\), and T(i) indicates the length of the i-th sequence.
The goal of sequence mining is to extract a set of patterns \(\varvec{q}_1, \varvec{q}_2, \ldots \), each of which is also defined as a sequence of the form
\[\varvec{q}_k := \langle q_{k1}, q_{k2}, \ldots , q_{kL(k)} \rangle ,\]
where L(k) is the length of the pattern \(\varvec{q}_k\) for \(k = 1, 2, \ldots \). We say that a sequence \(\varvec{x}_i\) contains a pattern \(\varvec{q}_k\) if
and represent this relationship as \(\varvec{q}_k \sqsubseteq \varvec{x}_i\). We denote the set of all possible patterns contained in any one of the sequences \(\{\varvec{x}_i\}_{i \in [n]}\) as \(\mathcal{Q}\). Note that the size of \(\mathcal{Q}\) is quite large in general.
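For a set of sequences \(\mathcal{G}\subseteq [n]\), we define the support of the pattern \(\varvec{q}_k\) as
\[\mathrm{support}_\mathcal{G}(\varvec{q}_k) := \left| \{ i \in \mathcal{G} : \varvec{q}_k \sqsubseteq \varvec{x}_i \} \right| ,\]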
i.e., \(\mathrm{support}_\mathcal{G}(\varvec{q}_k)\) indicates the number of sequences in the set \(\mathcal{G}\) that contain the pattern \(\varvec{q}_k\).
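The containment relation and the support count can be sketched in a few lines of Python. This is a minimal illustration, assuming the contiguous notion of containment that Sect. 3 adopts; the function names `contains` and `support` and the toy sequences are hypothetical and are not taken from Table 1.

```python
def contains(x, q):
    """Return True if pattern q occurs in sequence x as a contiguous run.

    Assumption: contiguous containment, as used later in the paper.
    """
    L = len(q)
    q = tuple(q)
    # Slide a window of length L over x and compare it with q.
    return any(tuple(x[t:t + L]) == q for t in range(len(x) - L + 1))


def support(sequences, group, q):
    """Number of sequences x_i with i in `group` that contain q."""
    return sum(1 for i in group if contains(sequences[i], q))


# Hypothetical toy data: animal index -> symbol sequence.
sequences = {1: "abca", 2: "bbcd", 3: "eab", 4: "ccab"}
group = [1, 2, 3]

print(support(sequences, group, "bc"))  # counts sequences 1 and 2
print(support(sequences, group, "ab"))  # counts sequences 1 and 3
```

Because the support counts sequences rather than occurrences, a pattern that repeats several times inside one sequence still contributes only one to the count.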
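In this study, we are interested in finding patterns that appear more frequently in one of the two groups than in the other. We define the differences of the supports of a pattern \(\varvec{q}_k\) in the two groups as
\[\delta _+(\varvec{q}_k) := \mathrm{support}_{\mathcal{G}_+}(\varvec{q}_k) - \mathrm{support}_{\mathcal{G}_-}(\varvec{q}_k), \qquad \delta _-(\varvec{q}_k) := \mathrm{support}_{\mathcal{G}_-}(\varvec{q}_k) - \mathrm{support}_{\mathcal{G}_+}(\varvec{q}_k).\]
We then want to extract the top C patterns that have the largest \(\delta _+ (\varvec{q}_k)\) and the top C patterns that have the largest \(\delta _- (\varvec{q}_k)\).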
Table 1 shows an illustrative example of a data set with \(n=6\) (\(n_+ = n_- = 3\)), \(T(1)=4\), \(T(2)=4\), \(T(3)=7\), \(T(4)=4\), \(T(5)=4\), \(T(6)=7\), and \(\mathcal{S}= \{a, b, c, d, e, f\}\). In this illustrative example, the movement of each of the three male and three female animals is represented as a sequence of T(i) symbols chosen from \(\mathcal{S}\). The goal of our analysis is to extract patterns (subsequences) that are found more frequently in the male animals than in the female animals, or vice versa. When we set \(C=2\), the top 2 patterns appearing more frequently in the males than in the females are \(\langle d \rangle \) and \(\langle e, a \rangle \), while the top 2 patterns appearing more frequently in the females than in the males are \(\langle b \rangle \) and \(\langle b, b \rangle \).
3 Sequential Pattern Mining
Sequential pattern mining is a widely used approach for extracting frequent subsequences from a set of sequences. In this section, we describe the basic idea of sequential pattern mining, a well-known sequential pattern mining algorithm called PrefixSpan, and how to extract discriminative patterns between two groups. In this paper, we focus on finding contiguous patterns, i.e., patterns with no gaps between symbols. It is easy to extend the method to find non-contiguous sequential patterns.
3.1 Frequent Sequential Pattern Mining
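Given a subset of sequences \(\mathcal{G}\subseteq [n]\), the set of all patterns that are contained in at least \(\mathtt{{min\_sup}}\) sequences in \(\mathcal{G}\) is called the set of frequent sequential patterns and is denoted as
\[F_{\mathcal{G}}(\mathtt{{min\_sup}}) := \{ \varvec{q}_k \in \mathcal{Q} : \mathrm{support}_{\mathcal{G}}(\varvec{q}_k) \ge \mathtt{{min\_sup}} \}.\]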
In the context of pattern mining, the threshold value \(\mathtt{{min\_sup}}\) is called minimum support. A method that can find frequent sequential patterns is called a frequent sequential pattern mining method. For example, in Table 1, when \(\mathtt{{min\_sup}}=2\),
Since the number of possible patterns \(|\mathcal{Q}|\) is quite large in general, it is often infeasible to actually count the supports of all possible patterns. To circumvent this difficulty, sequential pattern mining methods exploit the fact that the support of a pattern is always smaller than or equal to the support of any of its subsequences. Consider two sequences \(\varvec{q}_{k^\prime }\) and \(\varvec{q}_k\) such that \(\varvec{q}_{k^\prime } \sqsubseteq \varvec{q}_k\), i.e., \(\varvec{q}_{k^\prime }\) is a subsequence of \(\varvec{q}_k\). Then, it is obvious that
\[\mathrm{support}_{\mathcal{G}}(\varvec{q}_{k^\prime }) \ge \mathrm{support}_{\mathcal{G}}(\varvec{q}_k). \tag{1}\]
Eq. (1) indicates that, when we consider a tree as in Fig. 1, the support of the pattern in a node is always greater than or equal to the supports of the patterns in its descendant nodes, and smaller than or equal to the supports of the patterns in its ancestor nodes. This anti-monotonicity of the support in the tree can be exploited for finding frequent sequential patterns. Namely, when we search over the tree, if the support of a node is already smaller than \(\mathtt{{min\_sup}}\), we can skip searching its subtree.
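The pruning rule can be illustrated with a short depth-first search. The following is a minimal sketch, assuming contiguous patterns and a tree in which each child appends one symbol to its parent; the names `support`, `frequent_patterns` and `max_len` and the toy sequences are illustrative assumptions, and supports are recounted from scratch at every node for clarity rather than speed.

```python
def support(seqs, q):
    """Number of sequences that contain q as a contiguous run."""
    L = len(q)
    return sum(
        any(tuple(x[t:t + L]) == q for t in range(len(x) - L + 1))
        for x in seqs
    )


def frequent_patterns(seqs, min_sup, max_len):
    """Depth-first search over the pattern tree with anti-monotone pruning."""
    symbols = sorted({s for x in seqs for s in x})
    found = {}

    def grow(prefix):
        for s in symbols:
            q = prefix + (s,)
            sup = support(seqs, q)
            if sup < min_sup:
                # Every extension of q has support <= sup, so skip the subtree.
                continue
            found[q] = sup
            if len(q) < max_len:
                grow(q)

    grow(())
    return found


# Hypothetical toy sequences (not the data of Table 1).
seqs = ["abab", "bab", "abc"]
for pattern, sup in frequent_patterns(seqs, min_sup=2, max_len=3).items():
    print(pattern, sup)
```

No frequent contiguous pattern is missed by this search, because every prefix of a frequent pattern is itself frequent and is therefore visited before the pattern.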
3.2 PrefixSpan
In the data mining literature, several types of sequential pattern mining methods have been proposed [12]. Among them, we use a pattern-growth type method and employ the most popular algorithm, PrefixSpan [2]. PrefixSpan explores the search space of sequential patterns by depth-first search. It starts from sequential patterns containing only a single symbol and explores longer patterns by recursively appending symbols to the existing ones.
To formulate the PrefixSpan algorithm, let us define the concatenation of a sequence and a symbol. Given a sequence \(\varvec{z}= \langle z_1, z_2, \ldots , z_T \rangle \) and a symbol \(s \in \mathcal{S}\), the notation
indicates the concatenation of \(\varvec{z}\) and s. The concatenation of two sequences is defined similarly. Given two sequences \( \varvec{v}= \langle v_1, v_2, \ldots , v_{T_1} \rangle \) and \(\varvec{w}= \langle w_1, w_2, \ldots , w_{T_2}\rangle \), the notation
represents the concatenation of \(\varvec{v}\) and \(\varvec{w}\).
In the PrefixSpan algorithm, the reduced database obtained by removing a specific sequence \(\varvec{z}\) as a prefix from the original database \(\mathcal{D}\) is called the projected database \(\mathcal{D}_{\varvec{z}}\) and is defined as
where \(\varvec{v}\) represents the smallest prefix including \(\varvec{z}\) in \(\varvec{z}^\prime \).
In general sequential pattern mining problems, the contiguity of patterns is not considered, and only the order of the symbols matters. For example, in general sequential pattern mining contexts, both of the sequences \(\varvec{z}_1= \langle s_1, s_2, s_3 \rangle \) and \(\varvec{z}_2= \langle s_1, s_4, s_5, \ldots , s_{100}, s_2, s_3 \rangle \) are considered to contain the sequential pattern \(\varvec{q} = \langle s_1, s_2, s_3 \rangle \). In this paper, however, we focus on finding contiguous patterns and regard \(\varvec{z}_2\) as not containing \(\varvec{q}\) in the above example. To reflect this change, we need to slightly change the definition of the projected database as
In the example of Table 1, projected databases in our definitions are, e.g., given as
The pseudo-code of the PrefixSpan algorithm is presented in Algorithm 1. The PrefixSpan algorithm is efficient since only the sequential patterns contained in at least \(\mathtt{{min\_sup}}\) sequences in \(\mathcal{D}\) are selected and extended.
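A pattern-growth search in the spirit of PrefixSpan, restricted to contiguous patterns, can be sketched as follows. This is a hedged illustration rather than a reproduction of Algorithm 1: the projected database is represented here as a list of (sequence index, position) pointers instead of copied suffixes, and the names `prefixspan_contiguous` and `max_len` as well as the toy sequences are assumptions made for this example.

```python
from collections import defaultdict


def prefixspan_contiguous(seqs, min_sup, max_len):
    """Pattern growth over projected databases, contiguous patterns only."""
    results = {}

    def mine(prefix, projected):
        # projected holds pointers (i, t): an occurrence of `prefix` in
        # seqs[i] ends just before position t.
        extensions = defaultdict(list)
        for i, t in projected:
            if t < len(seqs[i]):
                # Only the symbol immediately after the occurrence may
                # extend the pattern, which enforces contiguity.
                extensions[seqs[i][t]].append((i, t + 1))
        for s, occurrences in sorted(extensions.items()):
            sup = len({i for i, _ in occurrences})  # distinct sequences
            if sup < min_sup:
                continue  # infrequent: do not project further
            q = prefix + (s,)
            results[q] = sup
            if len(q) < max_len:
                mine(q, occurrences)

    # The empty prefix "occurs" before every position of every sequence.
    start = [(i, t) for i, x in enumerate(seqs) for t in range(len(x))]
    mine((), start)
    return results


# Hypothetical toy sequences (not the data of Table 1).
seqs = ["abab", "bab", "abc"]
print(prefixspan_contiguous(seqs, min_sup=2, max_len=3))
```

Keeping every occurrence of the prefix in the projected database, rather than only the first one in each sequence, is what distinguishes this contiguous variant from the usual gap-tolerant projection.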

3.3 Finding Discriminative Patterns
As mentioned in the previous sections, it is important to find patterns that appear more frequently in one of the two groups than in the other. We call these patterns discriminative patterns. In this paper, we consider a naive approach for finding discriminative patterns. When we want to find the patterns that appear more frequently in the positive group than in the negative group, we first conduct frequent sequential pattern mining with respect to the positive group. After we find the frequent sequential patterns \(F_{\mathcal{G}_+}(\mathtt{{min\_sup}})\), we naively count \(\mathrm{support}_{\mathcal{G}_-}(\varvec{q}_k)\) for all \(\varvec{q}_k \in F_{\mathcal{G}_+}(\mathtt{{min\_sup}})\), and find the top C patterns that have the largest \(\delta _+(\varvec{q}_k)\). Similarly, when we want to find the patterns that appear more frequently in the negative group than in the positive group, we first compute \(F_{\mathcal{G}_-}(\mathtt{{min\_sup}})\) and count \(\mathrm{support}_{\mathcal{G}_+}(\varvec{q}_k)\) for all \(\varvec{q}_k \in F_{\mathcal{G}_-}(\mathtt{{min\_sup}})\) to obtain the top C patterns that have the largest \(\delta _-(\varvec{q}_k)\).
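The naive two-step procedure can be sketched as below. This is an illustration under stated assumptions: patterns are contiguous, the score is the raw difference of supports, ties are broken by pattern length and then alphabetically (a choice made only for this example), and the function names and toy groups are hypothetical rather than taken from Table 1.

```python
def support(seqs, q):
    """Number of sequences that contain q as a contiguous run."""
    L = len(q)
    return sum(
        any(tuple(x[t:t + L]) == q for t in range(len(x) - L + 1))
        for x in seqs
    )


def frequent(seqs, min_sup, max_len):
    """Frequent contiguous patterns found by pruned depth-first search."""
    symbols = sorted({s for x in seqs for s in x})
    found = {}

    def grow(prefix):
        for s in symbols:
            q = prefix + (s,)
            sup = support(seqs, q)
            if sup >= min_sup:
                found[q] = sup
                if len(q) < max_len:
                    grow(q)

    grow(())
    return found


def top_discriminative(target, other, min_sup, max_len, C):
    """Step 1: mine the target group. Step 2: score against the other group."""
    scored = [
        (sup - support(other, q), q)
        for q, sup in frequent(target, min_sup, max_len).items()
    ]
    # Largest support difference first; ties broken by length, then symbols.
    scored.sort(key=lambda pair: (-pair[0], len(pair[1]), pair[1]))
    return scored[:C]


# Hypothetical toy groups.
positive = ["dea", "ead", "bea"]
negative = ["abb", "bbc", "cbb"]
print(top_discriminative(positive, negative, min_sup=2, max_len=2, C=2))
print(top_discriminative(negative, positive, min_sup=2, max_len=2, C=2))
```

Swapping the two arguments, as in the second call, gives the patterns with the largest support difference in favor of the other group.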
4 Applications to Two Data Sets in Bio-Logging Studies
We applied the sequential pattern mining method described in the previous section to two data sets in bio-logging studies on Streaked Shearwater [14] and C. elegans [15, 16]. Table 2 shows the summary of each dataset.
4.1 Datasets
C. elegans. The nematode Caenorhabditis elegans (C. elegans) is a widely used model animal. The authors of [16] measured the avoidance behavior of C. elegans in response to the repulsive odor 2-nonanone. The research question discussed in [16] is whether the loss of a function through a mutation in a certain gene changes the avoidance behavior. We apply the sequential pattern mining method discussed in the previous section to find patterns whose frequencies differ between the wild-type and the mutant. Specifically, we compared the avoidance behaviors of the wild-type strain (N2) with each of the three mutant strains called egl-21, egl-3, and dop-3. For the three comparisons, we analyzed \(n=72\), 86, and 154 avoidance behaviors, respectively. Each avoidance behavior is characterized by three-dimensional symbols representing the moving direction (Forward, Backward), the behavioral state (ruN, Pirouette), and the putative odor concentration change (Up, Down) (see Note 2). Here, the set of symbols \(\mathcal{S}\) consists of \(2 \times 2 \times 2 = 8\) symbols, each of which is a combination of the three features; for example, the combination of “Forward Movement”, “Pirouette” and “Down” is denoted as “\((\mathrm{F, P, D})\)”.
Streaked Shearwater. Birds are able to reach feeding destinations far away from their nests by using various kinds of environmental information. In [14], the authors recorded the navigation trajectories of a number of Streaked Shearwaters by using GPS loggers. Here, the goal of our analysis is to extract navigation patterns that differ between males and females. We defined a trip to be a movement in which a bird leaves the island of its nest for more than 8 h. In our analysis, we used \(n = 968~(n_+ = n_- = 484)\) trips. Each trip is represented by a two-dimensional time-series of the longitude and the latitude taken every minute. For sequential pattern mining, we extracted the speed (Low or High), the seawater temperature (Low, Medium, or High), and the distance from the coastline (Large, Medium, or Small) at each location in a trajectory, and represented each trip by a sequence of symbols defined by the combination of these three features (see Note 3). Here, the set of symbols \(\mathcal{S}\) consists of \(2 \times 3 \times 3 = 18\) symbols, each of which is a combination of the three features; for example, the combination of “High speed”, “Medium seawater temperature” and “Small distance from the coastline” is denoted as “\((\mathrm{H, M, S})\)”.
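The discretization of a trip into symbols can be sketched as follows. The thresholds are those listed in Note 3, read so that “Small” means a distance below the lower threshold; how values exactly on a threshold are assigned, the function names, and the sample records are assumptions of this illustration, and the speed, temperature and distance are assumed to have been computed from the GPS track beforehand.

```python
def discretize(speed_kmh, sea_temp, coast_dist_m):
    """Map one observation to a symbol (speed, temperature, distance)."""
    speed = "L" if speed_kmh < 10 else "H"

    if sea_temp < 20:
        temp = "L"
    elif sea_temp <= 25:
        temp = "M"
    else:
        temp = "H"

    if coast_dist_m < 12652:
        dist = "S"
    elif coast_dist_m <= 28362:
        dist = "M"
    else:
        dist = "L"

    return (speed, temp, dist)


def to_sequence(records):
    """Turn a list of (speed, temperature, distance) records into symbols."""
    return [discretize(*record) for record in records]


# Hypothetical per-minute records for a short stretch of one trip.
trip = [(35.0, 22.0, 500.0), (4.0, 26.5, 30000.0), (12.0, 18.0, 15000.0)]
print(to_sequence(trip))
```

Each resulting tuple is one of the 18 symbols in \(\mathcal{S}\), so the output list is a sequence that can be passed directly to a pattern mining routine.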
4.2 Parameter Setting
In this data analysis, we set the parameters of the sequential pattern mining method as follows:
- Minimum sequence length = 10
- Maximum sequence length = 90
- Select the top 10 (\(C = 10\)) patterns which appear more frequently in one group than the other.
4.3 Results
C. elegans. Tables 3, 4 and 5 show the results of EGL-21, EGL-3 and DOP-3, respectively. In the tables, P1, P2, \(\ldots \), P10 indicate the top 10 patterns which appear more frequently in the positive class than the negative class. On the other hand, M1, M2, \(\ldots \), M10 indicate the top 10 patterns which appear more frequently in the negative class than the positive class. The notation such as “\(\langle (\mathrm{F, N, D}) \times 60 \rangle \)” indicates the pattern in which the symbol “\((\mathrm{F, N, D})\)” repeats 60 times.
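Long patterns dominated by repeated symbols are easier to read in this run-length form. The snippet below is a small sketch of how a mined pattern could be rendered that way; the function names are hypothetical, the letter "x" stands in for the multiplication sign, and the comma-separated rendering of patterns with several distinct runs is an assumption of this example rather than the notation used in the tables.

```python
from itertools import groupby


def run_lengths(pattern):
    """Collapse consecutive repeats into (symbol, count) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(pattern)]


def render(pattern):
    """Format a pattern of feature tuples in run-length notation."""
    parts = []
    for symbol, count in run_lengths(pattern):
        label = "(" + ", ".join(symbol) + ")"
        parts.append(label if count == 1 else f"{label} x {count}")
    return "<" + ", ".join(parts) + ">"


pattern = [("F", "N", "D")] * 60
print(render(pattern))
```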
The results in the tables indicate that patterns which are more frequent in one group than in the other are clearly extracted. Table 3 suggests that many patterns containing multiple “\((\mathrm{B, P, U})\)” appear frequently in egl-21, but hardly appear in N2. On the other hand, many patterns containing multiple “\((\mathrm{F, N, D})\)” appear frequently in N2, but hardly appear in egl-21. These results can be interpreted as showing that the loss of function in egl-21 leads to frequent pirouette behavior, which suggests that egl-21 fails to escape directly from the odor as N2 does. Table 4 shows that egl-3 has a similar tendency to egl-21. However, in Table 4, P1, P2, P3, P5, P8, and P9 also contain “\((\mathrm{F, N, D})\)”, meaning that the behavior of egl-3 is closer to that of N2 than the behavior of egl-21 is. These results are reasonable because the function lost in egl-21 and egl-3 is the same, and the degree of the loss of function is larger in egl-21 than in egl-3. Table 5 shows that many patterns containing multiple “\((\mathrm{F, P, D})\)” appear more frequently in dop-3, which is not observed for egl-21 and egl-3, suggesting that the loss of function in dop-3 leads to behavior different from that of egl-21 and egl-3.
Streaked Shearwater. Table 6 shows the results for the BIRD dataset. The results show that patterns repeating “\((\mathrm{H, M, L})\)” appear more frequently in males, while patterns repeating “\((\mathrm{L, H, S})\)” appear more frequently in females. These results can be interpreted as follows: (1) males tend to fly faster than females, (2) males tend to be farther away from the coastline than females, and (3) the region where males fly tends to have medium seawater temperature.
Although further studies are needed to confirm the correctness of these interpretations, the current results suggest that sequential pattern mining methods can be useful for extracting interesting and interpretable knowledge from bio-logging data.
5 Discussion
In this paper, we employed a naive approach for finding discriminative patterns between two groups of animals, simply applying frequent sequential pattern mining to the sequences in one of the two groups. Although this approach successfully found interesting and interpretable discriminative patterns in the two bio-logging data analyses, there are a few problems to be addressed in the future. First, many similar patterns were extracted as the top 10 patterns with the largest \(\delta _+(\varvec{q}_k)\) and \(\delta _-(\varvec{q}_k)\) values. It would be preferable to find a more diverse set of discriminative patterns. Second, it is unclear whether the identified patterns are statistically significant. It would be desirable to provide statistical confidence measures for the extracted patterns. Third, the current approach requires several tuning parameters to be specified, such as \(\mathtt{{min\_sup}}\), the minimum and maximum sequence lengths, and the number of patterns C. It would be preferable to develop a method that works with fewer tuning parameters.
Notes
1. A time-series is an ordered list of numbers, whereas a sequence is an ordered list of nominal values (symbols).
2. The moving direction is defined as “Forward” if the angle of the movement is between \(-90^\circ \) and \(+90^\circ \) and the velocity is faster than 0.06 mm/s, and “Backward” otherwise. The pirouette states are defined in [17]. The putative odor concentration change is defined as “Up” when the worm experiences an increase in odor concentration, and “Down” otherwise.
3. The discretizations of the speed, the seawater temperature and the distance from the coastline are defined as follows. The speed is defined as “Low” if it is lower than 10 km/h, and “High” otherwise. The seawater temperature is defined as “Low” if it is lower than \(20^\circ \), “Medium” if it is in the range from \(20^\circ \) to \(25^\circ \), and “High” otherwise. The distance from the coastline is defined as “Small” if it is smaller than 12652 m, “Medium” if it is in the range from 12652 m to 28362 m, and “Large” otherwise, where the two thresholds are defined to be the empirical 33.3 and 66.6 percentiles.
References
1. Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, Amsterdam (2011)
2. Han, J., Pei, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu, M.: PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of the 17th International Conference on Data Engineering, pp. 215–224 (2001)
3. Wang, J., Han, J., Li, C.: Frequent closed sequence mining without candidate maintenance. IEEE Trans. Knowl. Data Eng. 19(8), 1042–1056 (2007)
4. Fu, T.: A review on time series data mining. Eng. Appl. Artif. Intell. 24(1), 164–181 (2011)
5. Srikant, R., Agrawal, R.: Mining sequential patterns: generalizations and performance improvements. In: Apers, P., Bouzeghoub, M., Gardarin, G. (eds.) EDBT 1996. LNCS, vol. 1057, pp. 1–17. Springer, Heidelberg (1996). https://doi.org/10.1007/BFb0014140
6. Fournier-Viger, P., Gomariz, A., Campos, M., Thomas, R.: Fast vertical mining of sequential patterns using co-occurrence information. In: Tseng, V.S., Ho, T.B., Zhou, Z.-H., Chen, A.L.P., Kao, H.-Y. (eds.) PAKDD 2014. LNCS (LNAI), vol. 8443, pp. 40–52. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-06608-0_4
7. Zaki, M.J.: SPADE: an efficient algorithm for mining frequent sequences. Mach. Learn. 42(1–2), 31–60 (2001)
8. Ayres, J., Flannick, J., Gehrke, J., Yiu, T.: Sequential pattern mining using a bitmap representation. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 429–435. ACM (2002)
9. Yang, Z., Kitsuregawa, M.: LAPIN-SPAM: an improved algorithm for mining sequential pattern. In: 21st International Conference on Data Engineering Workshops, pp. 1222–1222. IEEE (2005)
10. Gouda, K., Hassaan, M., Zaki, M.J.: Prism: an effective approach for frequent sequence mining via prime-block encoding. J. Comput. Syst. Sci. 76(1), 88–102 (2010)
11. Salvemini, E., Fumarola, F., Malerba, D., Han, J.: FAST sequence mining based on sparse id-lists. In: Kryszkiewicz, M., Rybinski, H., Skowron, A., Raś, Z.W. (eds.) ISMIS 2011. LNCS (LNAI), vol. 6804, pp. 316–325. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21916-0_35
12. Aggarwal, C.C., Han, J. (eds.): Frequent Pattern Mining. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07821-2
13. Fournier-Viger, P., Lin, J.C.W., Kiran, R.U., Koh, Y.S., Thomas, R.: A survey of sequential pattern mining. Data Sci. Pattern Recogn. 1(1), 54–77 (2017)
14. Matsumoto, S., Yamamoto, T., Yamamoto, M., Zavalaga, C.B., Yoda, K.: Sex-related differences in the foraging movement of streaked shearwaters Calonectris leucomelas breeding on Awashima Island in the Sea of Japan. Ornithol. Sci. 16(1), 23–32 (2017)
15. Yamazoe-Umemoto, A., Fujita, K., Iino, Y., Iwasaki, Y., Kimura, K.D.: Modulation of different behavioral components by neuropeptide and dopamine signalings in non-associative odor learning of Caenorhabditis elegans. Neurosci. Res. 99, 22–33 (2015)
16. Kimura, K., Fujita, K., Katsura, I.: Enhancement of odor avoidance regulated by dopamine signaling in Caenorhabditis elegans. J. Neurosci. 30, 16365–16375 (2010)
17. Pierce-Shimomura, J.T., Morse, T.M., Lockery, S.R.: The fundamental role of pirouettes in Caenorhabditis elegans chemotaxis. J. Neurosci. 19(21), 9557–9569 (1999)
Acknowledgement
This work was partially supported by MEXT KAKENHI (17H00758, 16H06538), JST CREST (JPMJCR1302, JPMJCR1502), RIKEN Center for Advanced Intelligence Project, and JST support program for starting up innovation-hub on materials research by information integration initiative.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Sakuma, T. et al. (2018). Finding Discriminative Animal Behaviors from Sequential Bio-Logging Trajectory Data. In: Streitz, N., Konomi, S. (eds) Distributed, Ambient and Pervasive Interactions: Technologies and Contexts. DAPI 2018. Lecture Notes in Computer Science(), vol 10922. Springer, Cham. https://doi.org/10.1007/978-3-319-91131-1_10
DOI: https://doi.org/10.1007/978-3-319-91131-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91130-4
Online ISBN: 978-3-319-91131-1