
Contextual anomaly detection on time series: a case study of metro ridership analysis

  • Original Article
  • Neural Computing and Applications

Abstract

The increase in the amount of data collected in the transport domain can greatly benefit mobility studies and create high value-added mobility information for passengers, data analysts, and transport operators. This work concerns the detection of the impact of disturbances on a transport network. It aims, through smart card data analysis, to finely quantify the impacts of known disturbances on transportation network usage and to reveal unexplained statistical anomalies that may be related to unknown disturbances. The mobility data studied take the form of a multivariate time series evolving in a dynamic environment with additional contextual attributes. The research mainly focuses on contextual anomaly detection using machine learning models. Our main goal is to build a robust anomaly score to highlight statistical anomalies (contextual extrema), considering the variability within the time series induced by the dynamic context. The robust anomaly score is built from normalized forecasting residuals, where the normalization is carried out using the estimated contextual variance. Indeed, there are complex dynamics in both the mean and the variance of the ridership time series, induced by the flexible transportation schedule, the variability in transport demand, and contextual factors such as the station location and the calendar information. These dynamics should therefore be considered by the anomaly detection approach to obtain a reliable anomaly score. We investigate several prediction models (including an LSTM encoder–decoder from the recurrent neural network deep learning family) and several variance estimators, either obtained through dedicated models or extracted from the prediction models. The proposed approaches are evaluated on synthetic data and on real data from the smart card ridership of the Montreal Metro network, together with a database of events and disturbances that have impacted the transport network. The experiments show the relevance of variance normalization of the prediction residuals to build a robust anomaly score under a dynamic context.


Availability of data and material

Data provided by the Société de Transport de Montréal (STM) are private, and the authors do not have the right to provide them to third parties. However, the synthetic data and their generation process can be shared.

Notes

  1. GitLab repository of the experiments on synthetic data: https://gitlab.com/Haroke/contextual-anomaly-detection.

References

  1. Chandola V (2009) Anomaly detection for symbolic sequences and time series data, Ph.D. thesis, University of Minnesota

  2. Hayes MA, Capretz MA (2014) Contextual anomaly detection in big sensor data. In: 2014 IEEE International Congress on Big Data, IEEE, pp 64–71

  3. Benkabou S-E, Benabdeslem K, Canitia B (2018) Unsupervised outlier detection for time series by entropy and dynamic time warping. Knowl Inf Syst 54:463–486


  4. Yeh C-CM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, Silva DF, Mueen A, Keogh E (2016) Matrix profile i: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: 2016 IEEE 16th international conference on data mining (ICDM), IEEE, pp 1317–1322

  5. Nakamura T, Imamura M, Mercer R, Keogh E (2020) Merlin: Parameter-free discovery of arbitrary length anomalies in massive time series archives. In: 2020 IEEE 16th international conference on data mining (ICDM), IEEE

  6. Ding Z, Fei M (2013) An anomaly detection approach based on isolation forest algorithm for streaming data using sliding window. IFAC Proc. Vol. 46:12–17


  7. Feremans L, Vercruyssen V, Cule B, Meert W, Goethals B (2019) Pattern-based anomaly detection in mixed-type time series, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp. 240–256

  8. Tonnelier E, Baskiotis N, Guigue V, Gallinari P (2018) Anomaly detection in smart card logs and distant evaluation with twitter: a robust framework. Neurocomputing 298:109–121


  9. Malhotra P, Vig L, Shroff G, Agarwal P (2015) Long short term memory networks for anomaly detection in time series. In: Proceedings, vol 89, Presses universitaires de Louvain

  10. Guo Y, Liao W, Wang Q, Yu L, Ji T, Li P (2018) Multidimensional time series anomaly detection: a gru-based gaussian mixture variational autoencoder approach. In: Asian Conference on Machine Learning, pp 97–112

  11. Pasini K, Khouadjia M, Same A, Ganansia F, Oukhellou L (2019) LSTM encoder-predictor for short-term train load forecasting. Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, pp 535–551

  12. Zhu L, Laptev N (2017) Deep and confident prediction for time series at uber. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 103–110

  13. Yu Y, Long J, Cai Z (2017) Network intrusion detection through stacking dilated convolutional autoencoders. Security and Communication Networks 2017

  14. Hundman K, Constantinou V, Laporte C, Colwell I, Soderstrom T (2018) Detecting spacecraft anomalies using LSTMS and nonparametric dynamic thresholding. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 387–395

  15. Schlegl T, Seeböck P, Waldstein SM, Schmidt-Erfurth U, Langs G (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: International conference on information processing in medical imaging, Springer, pp. 146–157

  16. Abdallah A, Maarof MA, Zainal A (2016) Fraud detection system: a survey. J Netw Comput Appl 68:90–113


  17. Choi E, Bahadori MT, Sun J, Kulas J, Schuetz A, Stewart W (2016) RETAIN: an interpretable predictive model for healthcare using reverse time attention mechanism. In: Advances in neural information processing systems, pp 3504–3512

  18. Cao N, Lin C, Zhu Q, Lin Y-R, Teng X, Wen X (2017) Voila: visual anomaly detection and monitoring with streaming spatiotemporal data. IEEE Trans visual Comput Graph 24:23–33


  19. Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv (CSUR) 41:1–58


  20. Habeeb RAA, Nasaruddin F, Gani A, Hashem IAT, Ahmed E, Imran M (2019) Real-time big data processing for anomaly detection: a survey. Int J Inf Manag 45:289–307


  21. Chalapathy R, Chawla S (2019) Deep learning for anomaly detection: a survey, arXiv preprint arXiv:1901.03407

  22. Cheng H, Tan P-N, Potter C, Klooster S (2009) Detection and characterization of anomalies in multivariate time series. In: Proceedings of the 2009 SIAM international conference on data mining, SIAM, pp 413–424

  23. Dimopoulos G, Barlet-Ros P, Dovrolis C, Leontiadis I (2017) Detecting network performance anomalies with contextual anomaly detection. In: 2017 IEEE international workshop on measurement and networking (M&N), IEEE, pp 1–6

  24. Liu FT, Ting KM, Zhou Z-H (2008) Isolation forest. In: 2008 Eighth IEEE international conference on data mining, IEEE, pp 413–422

  25. Liu FT, Ting KM, Zhou Z-H (2012) Isolation-based anomaly detection. ACM Trans Knowl Discov Data TKDD 6:1–39


  26. Yankov D, Keogh E, Rebbapragada U (2008) Disk aware discord discovery: finding unusual time series in terabyte sized datasets. Knowl Inf Syst 17:241–262


  27. Akouemo HN, Povinelli RJ (2014) Time series outlier detection and imputation. In: 2014 IEEE PES General Meeting, IEEE, pp 1–5

  28. Li J, Pedrycz W, Jamal I (2017) Multivariate time series anomaly detection: a framework of hidden Markov models. Appl Soft Comput 60:229–240


  29. Salem O, Guerassimov A, Mehaoua A, Marcus A, Furht B (2014) Anomaly detection in medical wireless sensor networks using svm and linear regression models. Int J E-Health Med Commun IJEHMC 5:20–45


  30. Kromanis R, Kripakaran P (2013) Support vector regression for anomaly detection from measurement histories. Adv Eng Inf 27:486–495


  31. Hasan MAM, Nasser M, Pal B, Ahmad S (2014) Support vector machine and random forest modeling for intrusion detection system (IDS). J Intell Learn Syst Appl

  32. Kasai H, Kellerer W, Kleinsteuber M (2016) Network volume anomaly detection and identification in large-scale networks based on online time-structured traffic tensor tracking. IEEE Trans Netw Serv Manag 13:636–650


  33. Malhotra P, Ramakrishnan A, Anand G, Vig L, Agarwal P, Shroff G (2016) Lstm-based encoder-decoder for multi-sensor anomaly detection. In: Anomaly Detection Workshop of the 33rd International Conference on Machine Learning (ICML 2016)

  34. Munir M, Siddiqui SA, Dengel A, Ahmed S (2018) Deepant: a deep learning approach for unsupervised anomaly detection in time series. IEEE Access 7:1991–2005


  35. Meinshausen N (2006) Quantile regression forests. J Mach Learn Res 7:983–999


  36. Carel L (2019) Big data analysis in the field of transportation, Ph.D. thesis, Université Paris-Saclay

  37. Kingma DP, Welling M (2014) Auto-encoding variational bayes. In: 2nd international conference on learning representations, ICLR 2014, Conference Track Proceedings

  38. Gal Y, Ghahramani Z (2016) Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp 1050–1059

  39. Toqué F, Côme E, Oukhellou L, Trépanier M (2018) Short-term multi-step ahead forecasting of railway passenger flows during special events with machine learning methods

  40. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830


  41. Abadi M et al (2015) TensorFlow: large-scale machine learning on heterogeneous systems. https://www.tensorflow.org

  42. Chollet F et al (2015) Keras. https://keras.io


Acknowledgements

This research is a part of the IVA Project, which aims to develop machine learning approaches to enhance traveler information. The project is carried out under the leadership of the Technological Research Institute SystemX, with the partnership and support of the transport organization authority Ile-De-France Mobilités (IDFM), SNCF, Université Gustave Eiffel and public funds under the scope of the French Program “ANR - Investissements d’Avenir.” The authors also wish to thank the Montreal Transit Corporation (STM) for providing ridership data and the database of events and disturbances.

Funding

This research is a part of the IVA Project, which aims to enhance traveler information. The project is carried out under the leadership of the Technological Research Institute SystemX, with the partnership and support of the transport organization authority Ile-De-France Mobilités (IDFM), the French railway operator SNCF, and public funds under the scope of the French Program “Investissements d’Avenir.”

Author information


Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Kevin PASINI. The first draft of the manuscript was written by Kevin PASINI and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Latifa Oukhellou.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Code availability

The code related to the experiments on real data is the property of the project partners. The part of the code related to the experiments on synthetic data can be shared: https://gitlab.com/Haroke/contextual-anomaly-detection.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendix

1.1 Bias-variance estimators

First, we propose two ways to learn and estimate the bias and variance based on the prediction residuals produced by the forecasting models.

1. EMP: Empirical estimation on a prior sampling.

The estimation model is based on prior knowledge. We segment the contextual attribute space \(\mathbf{c }\) into prior subspaces (subsamples) defined by a set of constraints (\(V^{inf}, V^{sup}\)) given by expert knowledge. The bias \({\hat{B}}\) and variance \({\hat{\sigma }}\) estimators are computed in three steps (code sketches for this estimator and the following one are given after the ML equations below):

  1. Extract from each prior subsample \(E_k\) its empirical bias and variance.

  2. Associate each time step t with its subsample \(E_k\).

  3. Return the bias \({\hat{B}}_t\) and variance \({\hat{\sigma }}_t\) for each time step t.

$$\begin{aligned}&E_k = \{\, t \;|\; V_{k}^{inf} \le \mathbf{c }(t) \le V_{k}^{sup} \,\}\\&{\hat{B}}_{E_k} = \sum _{t\in E_k}\frac{r_t}{\#E_k}={\hat{r}}_{E_k} \qquad \ {\hat{\sigma }}_{E_k} = \sqrt{\sum _{t\in E_k}\frac{(r_t-{\hat{B}}_{E_k})^2}{\#E_k}} \end{aligned}$$

2. ML: Machine learning-based estimation.

The estimation model can also be learned by a machine learning algorithm. We train two prediction models to learn the bias and the variance of the forecasting residuals from the contextual attributes.

The two models are similar in that each estimates a type of mean (absolute for the bias, quadratic-centered for the variance) over a learned contextual subsample.

$$\begin{aligned}&\text {Bias:}\quad \theta = {{\,\mathrm{argmin}\,}}_{\theta } \sum _{t} \left| M^{{\hat{B}}}_\theta (X_t)- r_t\right| ,\qquad {\hat{B}}(t) = M^{{\hat{B}}}_\theta (X_t) = {\hat{r}}_t \\&\text {Variance:}\quad \theta = {{\,\mathrm{argmin}\,}}_{\theta }\sqrt{\sum _{t} \left| M^{{\hat{\sigma }}}_\theta (X_t) - \big (r_t-{\hat{B}}(t)\big )^2\right| },\qquad {\hat{\sigma }}(t) = \sqrt{M_\theta ^{{\hat{\sigma }}}(X_t)} \end{aligned}$$
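A minimal sketch of the EMP estimator, assuming a single scalar contextual attribute and expert-chosen bounds (all names are illustrative, not taken from the paper's code):

```python
import numpy as np

def empirical_bias_variance(c, r, bounds):
    """EMP estimator: for each prior subsample E_k defined by expert bounds
    V_k^inf <= c(t) <= V_k^sup, assign its empirical bias and variance
    to every time step t it contains."""
    b_hat = np.full(len(r), np.nan)
    s_hat = np.full(len(r), np.nan)
    for v_inf, v_sup in bounds:              # one (V^inf, V^sup) pair per E_k
        mask = (c >= v_inf) & (c <= v_sup)   # membership of t in E_k
        if mask.any():
            b_hat[mask] = r[mask].mean()     # empirical bias B_hat over E_k
            s_hat[mask] = r[mask].std()      # empirical sigma_hat over E_k
    return b_hat, s_hat

# Toy usage: residuals whose spread grows with the context value.
rng = np.random.default_rng(0)
c = rng.uniform(0.0, 3.0, size=1000)
r = rng.normal(scale=1.0 + c)                # heteroscedastic residuals
b_hat, s_hat = empirical_bias_variance(c, r, bounds=[(0, 1), (1, 2), (2, 3)])
```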
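A similar sketch of the ML estimators, here with scikit-learn random forests as an assumed model family (the paper does not tie these estimators to a specific learner); `X_ctx` and `r` are hypothetical stand-ins for the contextual attributes and the forecasting residuals:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical inputs: contextual attributes and forecasting residuals.
rng = np.random.default_rng(0)
X_ctx = rng.normal(size=(1000, 5))
r = rng.normal(scale=1.0 + np.abs(X_ctx[:, 0]))

# Bias model: the absolute-error criterion matches the L1 loss above.
bias_model = RandomForestRegressor(criterion="absolute_error").fit(X_ctx, r)
b_hat = bias_model.predict(X_ctx)            # estimated bias B_hat(t)

# Variance model: regress the centered squared residuals, then take the root.
var_model = RandomForestRegressor().fit(X_ctx, (r - b_hat) ** 2)
sigma_hat = np.sqrt(np.clip(var_model.predict(X_ctx), 0.0, None))

score = (r - b_hat) / sigma_hat  # normalized residual, i.e., the anomaly score
```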

Second, we propose directly extracting an estimate of the bias and variance from a forecasting model itself. We explore this extraction for a random forest and for a deep neural network. Extracting the estimated bias from the model itself often yields zero, since the model has been optimized to minimize this bias.

  • RF: Random forest extraction

    In [35], the authors show that valuable information about the distribution learned by a random forest is often lost when only the mean of the subsamples is considered. Based on this observation, we propose extracting the variance from the learned subsampling of our random forest forecasting model (a code sketch is given after this list).

    Let M be a random forest composed of binary trees \((T^1,\ldots ,T^n)\). Each tree \(T^k\) is composed of a set of leaves \(L^k\). During the learning phase, values \(j_i\) are assigned to each leaf according to the attribute values \(X_i\) that reach it. We define a tree walk operator \(F^k(X_t)\) that takes attributes \(X_t\) and returns the set of values assigned to the associated leaf \(L^k_i\).

    $$\begin{aligned} M(X_t) = \frac{1}{n} \sum _{k=1}^{n} \left( \sum _{j \in F^k(X_t)} \frac{j}{\#F^k(X_t)}\right) = {\hat{y}}_t \end{aligned}$$

    The prediction of an element by an RF model is thus a weighted mean over a subsample formed by the elements sharing its leaves. The weighting depends on the number of elements in each shared leaf and on the number of trees in which a leaf is shared. Elements sharing a leaf can be considered contextual neighbors on the basis of their attributes. We can then extract the bias (equal to 0) and the variance from this contextual subsampling.

    $$\begin{aligned} {\hat{B}}(t) = 0\qquad {\hat{\sigma }}(t) = \sqrt{ \frac{1}{n} \sum _{k=1}^{n}\left( \sum _{j\in F^k(X_t)} \frac{(j- {\hat{y}}_t)^2}{\#F^k(X_t)}\right) } \end{aligned}$$
  • DEEP: Neural network extraction

    A second form of extraction is based on variational dropout [38], which approximates Bayesian behavior in a deterministic network. A study in [12] applies this technique to an LSTM neural network to extract the confidence of the prediction model. Following the same line of research, we use variational dropout to estimate the variance of our LSTM encoder-predictor model (a sketch is also given after this list).

    Let \(M_{{\hat{\theta }}}\) be a neural network that infers \(y_t\) from \(X_t\).

    $$\begin{aligned} \theta ={{\,\mathrm{argmin}\,}}_{\theta } \sum |M^{\theta }(X)-y|^2 \quad M_{\theta }(x_t) = \hat{y_t} \end{aligned}$$

    The neural network aims to capture the link between the attributes and the prediction targets through an embedding of the attribute space into the prediction space, via successive nonlinear projections into abstract spaces Z. These abstract spaces give abstract representations \(z_t\) of our elements that capture the topological structure of the data. We can exploit such spaces to perform contextual subsampling by defining a neighborhood in the Z space; the contextual subsampling is then based on the contextual information captured by M. The main issue lies in the definition of a neighborhood \({\mathcal {B}}(z_t)\) in the Z space.

    $$\begin{aligned}&{\mathcal {B}}(z_t) = \{\, k \;|\; z_k \in [z_t \pm \varepsilon ] \,\}\quad \text {with}\ z,\varepsilon \in {\mathcal {R}}^{\#Z} \\&{\hat{B}}(t) = \sum _{k \in {\mathcal {B}}(z_t)}\frac{|\hat{y_k}-y_t|}{\# {\mathcal {B}}(z_t)}={\hat{r}}_t\\&{\hat{\sigma }}(t) = \sqrt{\sum _{k \in {\mathcal {B}}(z_t)}\frac{\big ((\hat{y_k}-y_t)-{\hat{B}}(t)\big )^2}{\# {\mathcal {B}}(z_t)}} \end{aligned}$$

    This issue can be avoided with a variational neural network \(M_{\theta }^{var}\) based on an explicit (variational layer) or implicit (variational dropout) random drawing, which generates a virtual sampling that self-defines the neighborhood in the Z space.

    $$\begin{aligned} \theta = {{\,\mathrm{argmin}\,}}_{\theta } \sum | M_{\theta }^{var}(X)-y|^2 \quad \frac{1}{m}\sum _{i=1}^{m} M_{\theta }^{var}(x_t) = \hat{y_t} \end{aligned}$$

    The stochastic projections of model \(M_{\theta }^{var}\) transform the latent representations \(z_t\) into a collection of probabilistic points. We can access the probabilistic clouds of predictions for an element by making many predictions. This gives us a virtual contextual subsampling from which we can estimate the mean and variance.

    $$\begin{aligned} {\hat{B}}(t)=0 \quad {\hat{\sigma }}(t) =\sqrt{\frac{1}{m}\sum _{i=1}^{m}\big (M_{\theta }^{var}(x_t)-{\hat{y}}_t\big )^2} \end{aligned}$$
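A sketch of the random forest extraction described above, using scikit-learn's `apply` to recover the leaf subsamples \(F^k(X_t)\); the function name and setup are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_contextual_sigma(forest, X_train, y_train, X_query):
    """Average over the trees the within-leaf second moment of the
    training targets around the forest prediction y_hat_t."""
    y_hat = forest.predict(X_query)
    train_leaves = forest.apply(X_train)   # (n_train, n_trees) leaf indices
    query_leaves = forest.apply(X_query)   # (n_query, n_trees) leaf indices
    sigma2 = np.zeros(len(X_query))
    for i, yh in enumerate(y_hat):
        per_tree = [np.mean((y_train[train_leaves[:, k] == leaf] - yh) ** 2)
                    for k, leaf in enumerate(query_leaves[i])]  # one term per tree
        sigma2[i] = np.mean(per_tree)      # mean over the n trees
    return y_hat, np.sqrt(sigma2)          # forest prediction and sigma_hat(t)

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 4)), rng.normal(size=500)
forest = RandomForestRegressor(n_estimators=50).fit(X, y)
y_hat, sigma_hat = rf_contextual_sigma(forest, X, y, X[:10])
```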
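And a minimal sketch of the variational-dropout extraction, keeping dropout active at inference via Keras's `training=True` flag; the tiny stand-in model is illustrative, not the paper's LSTM encoder-predictor:

```python
import numpy as np
import tensorflow as tf

def mc_dropout_sigma(model, x, n_passes=100):
    """Run n stochastic forward passes with dropout left active and
    summarize the prediction cloud: the mean serves as y_hat_t (bias
    taken as 0) and the spread as sigma_hat(t)."""
    preds = np.stack([model(x, training=True).numpy() for _ in range(n_passes)])
    return preds.mean(axis=0), preds.std(axis=0)

# Tiny stand-in network with a dropout layer (illustrative only).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(5,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1),
])
y_hat, sigma_hat = mc_dropout_sigma(model, np.random.rand(8, 5).astype("float32"))
```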

1.2 Forecasting models

1.2.1 Encoding cyclical features

Cyclical encoding aims to encode continuous cyclic attributes while preserving their cyclic structure. Instead of a large one-hot vector per feature, the sine and cosine encodings project each attribute onto a two-dimensional plane. However, some contextual information contains more than one cyclic structure. Using several pairs of sines and cosines with different frequencies allows us to express meaningful and compact structures. For instance, we can express several pieces of periodic information (weekly, monthly, and seasonal) by encoding the position of the day in the year with several pairs of sines and cosines with well-chosen frequencies, i.e., 1/53 for a weekly structure, 1/31 for a monthly structure, and 1/1 for a yearly structure.

This technique yields compact and meaningful attributes, in contrast to bulky and sparse one-hot encoding.
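A minimal sketch of this multi-frequency encoding, assuming the attribute is a day-of-year index and using illustrative periods in days (weekly, monthly, yearly) rather than the frequencies quoted above:

```python
import numpy as np

def cyclic_encode(day_of_year, periods=(7.0, 30.5, 365.25)):
    """Project a cyclic attribute onto one (sin, cos) pair per period,
    so that nearby days stay nearby on each circle."""
    d = np.asarray(day_of_year, dtype=float)
    feats = [f(2.0 * np.pi * d / p) for p in periods for f in (np.sin, np.cos)]
    return np.stack(feats, axis=-1)    # shape (..., 2 * len(periods))

X_cal = cyclic_encode(np.arange(365))  # 365 days -> six compact features
```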

1.2.2 Random forest training

The forecasting and bias-variance estimation models are optimized with a mean-squared error (MSE) loss. We use a standard scikit-learn random forest regressor [40]. The random forest parameters are tuned through a random search with cross-validation, combined with early stopping, to control the number, size, and depth of the trees and avoid overfitting.
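A sketch of such a tuning loop with scikit-learn's RandomizedSearchCV; the sampled ranges are assumptions, not the settings used in the paper:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# MSE-optimized forest; the sampled ranges control the number, size,
# and depth of the trees (all values are illustrative).
search = RandomizedSearchCV(
    RandomForestRegressor(criterion="squared_error"),
    param_distributions={
        "n_estimators": randint(50, 500),
        "max_depth": randint(4, 30),
        "min_samples_leaf": randint(1, 20),
    },
    n_iter=30,
    cv=5,
    scoring="neg_mean_squared_error",
)
# search.fit(X_train, y_train); the tuned model is search.best_estimator_
```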

1.2.3 LSTM EP: architecture and training

In our previous research [11], we proposed an LSTM encoder-predictor for ridership forecasting that uses both long-term and short-term attributes.

Fig. 11: LSTM-EP architecture with the layer size for the real data

The model was designed to manage the structural variability in the data induced by the transport plan. A simplified version of the model (Fig. 11) is applied in the current work thanks to the regular structure of the data.

First, the long-term features are synthesized by a multilayer perceptron. Then, a pair of encoder–predictor LSTM layers captures the contextual influence and infers the short-term dynamics of the multivariate time series. Finally, another multilayer perceptron interprets the prediction embedding \(Z^p\) to produce the prediction \({\hat{y}}\). The model takes as input the contextual attributes \(X_t\) and the past horizon values \(y^p_t=[y_{t-p},\ldots ,y_t)\), and it aims to forecast the future horizon \([y_{t},y_{t+f})\). It first reconstructs the time step t and then infers the temporal evolution over the future horizon \([t+1,t+f]\). Dropout layers are placed after almost every layer to avoid overfitting and to allow variational dropout. The layer sizes are chosen manually as a compromise among model size, performance, and overfitting.
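A toy Keras sketch in the spirit of Fig. 11; the layer sizes, names, and exact wiring are illustrative assumptions rather than the published architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_lstm_ep(n_context, past_len, future_len, units=64, drop=0.2):
    ctx = tf.keras.Input(shape=(n_context,), name="context")        # X_t
    past = tf.keras.Input(shape=(past_len, 1), name="past_values")  # y^p_t
    # Long-term features synthesized by a small MLP.
    h = layers.Dropout(drop)(layers.Dense(units, activation="relu")(ctx))
    # Encoder LSTM summarizes the past horizon.
    enc = layers.LSTM(units, dropout=drop)(past)
    # Joint embedding, repeated over the future horizon for the predictor.
    z = layers.Concatenate()([h, enc])
    rep = layers.RepeatVector(future_len)(z)
    dec = layers.LSTM(units, dropout=drop, return_sequences=True)(rep)
    # Output MLP interprets the prediction embedding Z^p into y_hat.
    out = layers.TimeDistributed(layers.Dense(1), name="forecast")(dec)
    return tf.keras.Model([ctx, past], out)

model = build_lstm_ep(n_context=10, past_len=24, future_len=6)
```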

The encoder-predictor model is implemented in the TensorFlow [41] environment with Keras [42] as the high-level neural network API. Training is performed through three training loops with gradient reduction and early stopping. We use the adaptive gradient optimizer Adam, and we reduce the batch size between loops. The first training loop acts as an initialization in which only the reconstruction task is kept in the learning loss. Then, we add the multistep forecasting loss, with a higher weight on the t+1 prediction loss.
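Continuing the sketch above, an assumed rendering of the staged schedule with Adam, early stopping, and a shrinking batch size (the reconstruction/forecast loss weighting is omitted for brevity; the data arrays are toy placeholders):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

# Toy placeholder data matching the sketch above.
C_train = np.random.rand(256, 10).astype("float32")
Y_past = np.random.rand(256, 24, 1).astype("float32")
Y_future = np.random.rand(256, 6, 1).astype("float32")

model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
for batch_size in (128, 64, 32):       # reduce the batch size between loops
    model.fit(
        [C_train, Y_past], Y_future,
        validation_split=0.2, epochs=50, batch_size=batch_size,
        callbacks=[EarlyStopping(patience=5, restore_best_weights=True)],
        verbose=0,
    )
```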


About this article


Cite this article

Pasini, K., Khouadjia, M., Samé, A. et al. Contextual anomaly detection on time series: a case study of metro ridership analysis. Neural Comput & Applic 34, 1483–1507 (2022). https://doi.org/10.1007/s00521-021-06455-z
