Gate and common pathway detection in crowd scenes and anomaly detection using motion units and LSTM predictive models

Published in: Multimedia Tools and Applications

Abstract

In this paper, we propose two approaches to analyzing crowd scenes. The first is a motion-units and meta-tracking based approach (the MUDAM approach). In this approach, the scene is divided into a number of dynamic divisions with coherent motion dynamics, called motion units (MUs). By analyzing the relationships between these MUs using a proposed continuation likelihood, the scene entrance and exit gates are retrieved. A meta-tracking procedure is then applied, and the scene's dominant motion pathways are retrieved. To overcome the limitations of the MUDAM approach and detect some of the anomalies that may happen in these scenes, we propose a second, LSTM-based approach. In this approach, the scene is divided into a number of static, overlapping spatial regions named super regions (SRs), which cover the whole scene. Long Short-Term Memory (LSTM) networks are used to define a predictive model for each of the scene's SRs. Each LSTM predictive model is trained on its SR's tracklets so that it can capture the whole motion dynamics of that SR. Using a priori known scene entrance segments, the proposed LSTM predictive models are applied and the scene's dominant motion pathways are retrieved. An anomaly metric is formulated for use with the LSTM predictive models to detect scene anomalies. Prototypes of our proposed approaches were developed and evaluated on the challenging New York Grand Central station scene, in addition to four other crowded scenes. Four types of anomalies that may happen in crowded scenes were defined, and our proposed LSTM-based approach was used to detect them; experiments on anomaly detection were conducted on a number of datasets. Overall, the proposed approaches outperformed state-of-the-art methods in retrieving the scene gates and common pathways, as well as in detecting motion anomalies.



Notes

  1. This point is discussed further at the end of Section 3.2.2.

  2. By retrieving a crowd scene's entrance/exit gates, we mean retrieving the terminal points of that scene. These gates are shown for all of the scenes used in our experiments in Figs. 8a, c, and 9.

  3. This non-parametric clustering technique was selected because it does not presume a predetermined number of clusters, represents a state-of-the-art technique, and its code is publicly available. The final clusters are obtained while discarding the temporal information, as we are only interested in retrieving the overall scene motion dynamics (assuming such dynamics are statistically stationary).

  4. The basic intuition of dividing the scene into a group of SRs is explained later in Section 3.2.2.

  5. In our implementation, we used the first stage of the non-parametric clustering algorithm used by [13], but any other clustering algorithm can be used.

  6. More details about the MUs' spatial area, overall orientation, and the representative mean tracklets are discussed in [23].

  7. This section is explained in more detail in [23].

  8. Typical examples show the importance of automatic gate detection: (1) Outdoor scenes, where gates are typically not well defined. (2) Gates outside the camera's field of view, whose presence can nevertheless be recognized from the motion dynamics (for example, the gates in the bottom part of the Grand Central scene). (3) Dynamic scenes, e.g., due to construction work. (4) Alleviating the burden of manual annotation. Our proposed MUDAM approach can detect the gates before discovering the scene pathways. Therefore, in situations where the gates are a priori known, this information can be incorporated into our framework and pathways can be detected directly.

  9. The figure shows the scene MUs as circles only for clarification, but in reality, the MUs can take any irregular shape as shown in Fig. 2b.

  10. Mean shift was selected since it can automatically estimate the number of clusters. It is also computationally efficient when dealing with points that lie in a 2D Euclidean space (as in our case).
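As a concrete illustration of the idea in this note, here is a minimal flat-kernel mean-shift sketch for 2D points. It is not the paper's implementation (which follows the cited mean-shift literature); the `bandwidth` parameter and the mode-merging threshold are illustrative assumptions.

```python
import math

def mean_shift_2d(points, bandwidth=1.0, max_iter=50, tol=1e-4):
    """Flat-kernel mean shift on 2D points; returns (modes, labels)."""
    def shift(p):
        # Iteratively move p to the mean of its neighbors until convergence.
        for _ in range(max_iter):
            neigh = [q for q in points if math.dist(p, q) <= bandwidth]
            new = (sum(q[0] for q in neigh) / len(neigh),
                   sum(q[1] for q in neigh) / len(neigh))
            if math.dist(p, new) < tol:
                return new
            p = new
        return p

    modes, labels = [], []
    for p in points:
        m = shift(p)
        # Merge converged points whose modes are close (hypothetical threshold).
        for i, c in enumerate(modes):
            if math.dist(m, c) < bandwidth / 2:
                labels.append(i)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels
```

Note that the number of clusters (two, in the usage below) is discovered automatically, which is the property the note highlights.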

  11. The mean tracklet of an MU is defined as the average of all the tracklets contained in that MU; i.e., the ith point in the mean tracklet is the average of the ith points of all tracklets belonging to that MU.
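The pointwise averaging described in this note can be sketched as follows, assuming (as the note implies) that all tracklets in an MU have the same number of points:

```python
def mean_tracklet(tracklets):
    """Pointwise average of a list of equal-length 2D tracklets."""
    n = len(tracklets)
    length = len(tracklets[0])
    return [
        (sum(t[i][0] for t in tracklets) / n,   # mean x of the i-th points
         sum(t[i][1] for t in tracklets) / n)   # mean y of the i-th points
        for i in range(length)
    ]
```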

  12. In our experiments, for example, the New York Grand Central dataset resolution is 1920 × 1080 pixels, so \(d_{max} = \sqrt {{1920}^{2}+{1080}^{2}} \approx 2203\) pixels.
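The arithmetic in this note is simply the frame diagonal:

```python
import math

# Scene diagonal d_max for a 1920 x 1080 frame, as in the note.
d_max = math.hypot(1920, 1080)
print(round(d_max))  # 2203
```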

  13. Particle advection locations are defined by the tails of the obtained entrance tracklets.

  14. Considering the current MU's mean tracklet, we select its neighborhood from the set of mean tracklets of all MUs whose spatial layouts contain the current position of the particle; this gives us all the neighboring MUs of the current MU.

  15. Increasing the number of synthesized trajectories is done by increasing the number of particles advected in the scene. One possible solution is to advect particles at and around the obtained entrance points instead of using only the exact retrieved point locations.

  16. An entrance segment is a small part, of fixed length, at the start of an entrance tracklet. For example, a segment of length 4 means the starting four points of an entrance tracklet.

  17. Data points are normalized before training because features with widely different scales cause the LSTM model to weight them unequally, giving some features false priority over others. To avoid this false prioritization, all feature data are passed to the model in normalized form.
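A minimal sketch of such normalization, assuming simple per-feature min-max scaling to [0, 1]; the paper does not specify the exact scheme, so this is only one plausible choice:

```python
def normalize_columns(rows):
    """Min-max scale each feature column of a list of tuples to [0, 1]."""
    cols = list(zip(*rows))                 # transpose: one tuple per feature
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0   # guard constant columns
              for v, l, h in zip(row, lo, hi))
        for row in rows
    ]
```

After scaling, every feature contributes on the same [0, 1] range, so no feature dominates the LSTM weights merely because of its units.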

  18. In this paragraph, we suppose 4 points are used to predict the fifth one, because that guarantees keeping a good amount of the motion history before that point. Also, 4 is not a fixed number; it is only mentioned here for clarification.
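The sliding-window construction implied by this note can be sketched as follows; `window=4` matches the note's illustrative choice of predicting the fifth point from the preceding four:

```python
def make_windows(points, window=4):
    """Build (input, target) pairs: `window` consecutive points -> next point."""
    return [
        (points[i:i + window], points[i + window])
        for i in range(len(points) - window)
    ]
```

Each pair is one supervised training sample for an SR's predictive model.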

  19. The values of the optimizer, loss function, number of epochs, and batch size were selected after many experiments to find the most appropriate parameters for our problem over various datasets.

  20. The selected value for the overlap area is mentioned later in Section 4.6.

  21. The number of input locations used in our experiments is given and justified in Section 4.6.

  22. Each tracklet is tested using its SR LSTM model.

  23. The SR containing the tracklet tr_test is the one whose spatial extent contains all the points of that tracklet. If the tracklet spans multiple SRs, any of them can be used, mainly because of the proposed overlap criterion: in our experiments, we set the overlap distance to be larger than any of the scene tracklets, so a tracklet lying between two SRs falls entirely within the overlap area. That area is considered during the training of the LSTM models of all the SRs sharing it, so any of the containing SRs can be used.

  24. The process of computing \(Slope^{sr}_{avg}\) for each SR occurs offline, only once, for all the scene SRs.

  25. The JA and HA approaches were selected as our state-of-the-art baselines because they share our goal of discovering the basic structural elements of a crowd scene (entrance/exit gates and the common motion pathways).

  26. Any clustering approach can be used to cluster the tracklets, under the condition that it guarantees both the compactness of the obtained MUs and their coherency (small variations in orientation between tracklets in the same MU).

  27. These parameter values are discussed in detail in Section 4.6.

  28. The number written on a gate is the GT gate number to which this gate is matched, and ‘X’ means a false detection (there is no match between the detected ‘X’ gate and any of the GT gates).

  29. The number of SRs that the scene will be divided into is analyzed in Section 4.4.1.

  30. The chi-square (χ2) distance is used to measure the goodness of fit between two distributions [25].
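A common form of the χ2 histogram distance can be sketched as follows; note that [25] actually defines a more general quadratic-chi family, of which this is the basic symmetric variant:

```python
def chi_square_distance(h1, h2):
    """Symmetric chi-square distance between two histograms of equal length."""
    return 0.5 * sum(
        (a - b) ** 2 / (a + b)      # per-bin contribution
        for a, b in zip(h1, h2)
        if a + b > 0                # skip empty bins to avoid division by zero
    )
```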

  31. A pathway heat map is a spatial probability map constructed by overlaying all of the pathway's trajectories, accumulating them on top of each other, and then normalizing the whole map to the range [0,1], where 1 represents the highest motion dynamics at a point and 0 represents no motion.
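A minimal sketch of this accumulate-then-normalize construction, assuming integer pixel coordinates and an explicit grid size (both illustrative assumptions):

```python
def pathway_heatmap(trajectories, width, height):
    """Accumulate trajectory points on a grid, then scale the grid to [0, 1]."""
    grid = [[0.0] * width for _ in range(height)]
    for traj in trajectories:
        for x, y in traj:
            grid[int(y)][int(x)] += 1.0      # overlay trajectories
    peak = max(max(row) for row in grid)
    if peak > 0:                             # normalize so the hottest cell is 1
        grid = [[v / peak for v in row] for row in grid]
    return grid
```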

  32. As defined by [14], the motion orientation histogram of a specific PW is formed by computing the motion direction between each pair of consecutive points of that PW's trajectories or tracklets, and then quantizing these directions into one histogram. The full mathematical definition and clarification can be found in [14].
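The quantization described in this note can be sketched as follows, assuming equal-width angular bins over [0, 2π); the bin count is an illustrative choice, and [14] gives the exact formulation:

```python
import math

def orientation_histogram(trajectories, bins=8):
    """Histogram of motion directions between consecutive trajectory points."""
    hist = [0] * bins
    for traj in trajectories:
        for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
            ang = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)  # direction in [0, 2*pi)
            hist[int(ang / (2 * math.pi) * bins) % bins] += 1   # quantize into a bin
    return hist
```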

  33. The experiments on the Marathon dataset were run on a machine with a Core i7 processor and 8 GB of RAM.

  34. The processing time to obtain the scene pathways is the time needed for training the scene LSTM predictive models, plus the time needed to synthesize the complete trajectories from the entrance segments. Since the time needed to predict the trajectories is almost fixed and very small compared to the training time, we did not consider it in our time complexity comparison. Also, since all LSTM predictive models are trained in parallel, we report only the time of training one of these models (the largest time).

  35. Constrained by the available hardware resources, this number of SRs was empirically chosen to give good, smooth PW results in an appropriate processing time. In future work, we will investigate how to set this parameter automatically through a preprocessing stage that studies the complexity of the motion dynamics inside the scene.

  36. The scene entrance tracklets are those of the entrance gates. For our proposed LSTM-based approach to retrieve the scene's common PWs, it needs entrance segments of the entrance tracklets as input. These tracklets should either be a priori known or be obtained using our proposed MUDAM approach by discovering the scene gates first. In our experiments, we used the entrance tracklets of the gates retrieved by the MUDAM approach as input to the proposed LSTM approach.

  37. The richness of a PW is measured in the MUDAM, MT, and LSTM approaches and in the GT by counting the number of trajectories of that PW, while for the HA approach it is measured by counting the number of tracklets contained in that PW.

  38. The time comparison was run on a machine with an Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30 GHz (2 processors) and 256 GB of RAM.

  39. “Enough” means that there is no need for more divisions for these scenes. More details on choosing this value are given in Section 4.6.

  40. The motion map of a scene is a binary image created by marking all pixels that contain any motion dynamics with 1 and all other pixels with 0. As shown in Fig. 15b (the motion map of the Marathon dataset), all static areas of the scene that do not contain any kind of motion take the value zero. We use these motion maps to identify whether a tracklet is moving outside the active motion area of the scene.
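A minimal sketch of building such a motion map and of the containment check it enables, assuming integer pixel coordinates and an explicit grid size (illustrative assumptions):

```python
def build_motion_map(tracklets, width, height):
    """Binary map: 1 where any tracklet point falls, 0 elsewhere."""
    mmap = [[0] * width for _ in range(height)]
    for tr in tracklets:
        for x, y in tr:
            mmap[int(y)][int(x)] = 1
    return mmap

def leaves_motion_area(tracklet, mmap):
    """True if any point of the tracklet lies outside the active motion area."""
    return any(mmap[int(y)][int(x)] == 0 for x, y in tracklet)
```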

  41. The same values of 𝜃FOV and δ can be applied to the other four datasets (shown in Table 2) without any conflict. 𝜃FOV controls how much tolerance the system permits for a possible variation in the current orientation; since the GC scene is considered the most complicated scene, where any moving person is permitted to go in any direction, we can argue that parameters that handle the GC scene will also work effectively on other video scenes. The δ parameter specifies the distance within which each MU searches for its possible neighboring MUs. For the GC dataset, this parameter stabilized at 260 pixels, as shown in Fig. 17; beyond that value, the number of retrieved MUs remains stable and the only additional cost is computational. Relative to the GC scene, the other four datasets are smaller in dimension, so using the same parameters as for GC also achieves stability in the number of retrieved MUs.

References

  1. Ali S, Shah M (2008) Floor fields for tracking in high density crowd scenes. In: European conference on computer vision. Springer, Berlin, pp 1–14

  2. Alahi A, Goel K, Ramanathan V, Robicquet A, Fei-Fei L, Savarese S (2016) Social lstm: human trajectory prediction in crowded spaces. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 961–971

  3. Arias-Castro E, Mason D, Pelletier B (2016) On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. J Mach Learn Res 17(1):1487–1514


  4. Chacón JE, Monfort P (2013) A comparison of bandwidth selectors for mean shift clustering. arXiv:1310.7855

  5. Chen K, Kamarainen JK (2016) Pedestrian density analysis in public scenes with spatiotemporal tensor features. IEEE Trans Intell Transp Syst 17(7):1968–1977


  6. Chongjing W, Xu Z, Yi Z, Yuncai L (2013) Analyzing motion patterns in crowded scenes via automatic tracklets clustering. Chin Commun 10(4):144–154


  7. Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619


  8. Comaniciu D, Ramesh V, Meer P (2001) The variable bandwidth mean shift and data-driven scale selection. In: Eighth IEEE international conference on computer vision, 2001. Proceedings. ICCV 2001, vol 1. IEEE, pp 438–445

  9. Cong Y, Yuan J, Liu J (2013) Abnormal event detection in crowded scenes using sparse representation. Pattern Recogn 46(7):1851–1864


  10. Conte D, Foggia P, Sansone C, Vento M (2004) Thirty years of graph matching in pattern recognition. Int J Pattern Recogn Artif Intell 18(03):265–298


  11. Du Y, Wang W, Wang L (2015) Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1110–1118

  12. Graves A, Mohamed AR, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE International conference on Acoustics, speech and signal processing (ICASSP). IEEE, pp 6645–6649

  13. Hassanein AS, Hussein ME, Gomaa W (2016) Semantic analysis for crowded scenes based on non-parametric tracklet clustering. In: IJCAI, pp 3389–3395

  14. Hassanein AS, Hussein ME, Gomaa W, Makihara Y, Yagi Y (2018) Identifying motion pathways in highly crowded scenes: a non-parametric tracklet clustering approach. Computer Vision and Image Understanding

  15. Jodoin PM, Benezeth Y, Wang Y (2013) Meta-tracking for video scene understanding. In: 2013 10th IEEE International conference on advanced video and signal based surveillance (AVSS). IEEE, pp 1–6

  16. Kratz L, Nishino K (2009) Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models. In: 2009 IEEE Conference on computer vision and pattern recognition, 2009. CVPR 2009. IEEE, pp 1446–1453

  17. Kratz L, Nishino K (2012) Tracking pedestrians using local spatio-temporal motion patterns in extremely crowded scenes. IEEE Trans Pattern Anal Mach Intell 34 (5):987–1002


  18. Kuo CH, Huang C, Nevatia R (2010) Multi-target tracking by on-line learned discriminative appearance models. In: 2010 IEEE Conference on computer vision and pattern recognition (CVPR). IEEE, pp 685–692

  19. Li W, Mahadevan V, Vasconcelos N (2014) Anomaly detection and localization in crowded scenes. IEEE Trans Pattern Anal Mach Intell 36(1):18–32


  20. Li T, Chang H, Wang M, Ni B, Hong R, Yan S (2015) Crowded scene analysis: a survey. IEEE Trans Circ Syst Video Technol 25(3):367–386


  21. Mehran R, Oyama A, Shah M (2009) Abnormal crowd behavior detection using social force model. In: IEEE Conference on computer vision and pattern recognition, 2009. CVPR 2009. IEEE, pp 935–942

  22. Mehran R, Moore BE, Shah M (2010) A streakline representation of flow in crowded scenes. In: European conference on computer vision. Springer, Berlin, pp 439–452

  23. Moustafa AN, Hussein M, Gomaa W (2017) Gate and common pathway detection in crowd scenes using motion units and meta-tracking. In: 2017 International conference on digital image computing: techniques and applications (DICTA). IEEE, pp 1–8

  24. Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural networks. In: International conference on machine learning, pp 1310–1318

  25. Pele O, Werman M (2010) The quadratic-chi histogram distance family. In: European conference on computer vision. Springer, Berlin, pp 749–762

  26. Saleemi I, Hartung L, Shah M (2010) Scene understanding by statistical modeling of motion patterns. In: 2010 IEEE Conference on computer vision and pattern recognition (CVPR). IEEE, pp 2069–2076

  27. Shao J, Change Loy C, Wang X (2014) Scene-independent group profiling in crowd. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2219–2226

  28. Su H, Yang H, Zheng S, Fan Y, Wei S (2013) The large-scale crowd behavior perception based on spatio-temporal viscous fluid field. IEEE Trans Inform Forensics Secur 8(10):1575–1589


  29. Su H, Dong Y, Zhu J, Ling H, Zhang B (2016) Crowd scene understanding with coherent recurrent neural networks. IJCAI 1:2


  30. Topkaya IS, Erdogan H, Porikli F (2016) Tracklet clustering for robust multiple object tracking using distance dependent Chinese restaurant processes. SIViP 10(5):795–802


  31. Tripathi G, Singh K, Vishwakarma DK (2018) Convolutional neural networks for crowd behaviour analysis: a survey. Vis Comput, 1–24

  32. Tomasi C, Kanade T (1991) Detection and tracking of point features. School of Computer Science, Carnegie Mellon Univ. Pittsburgh

  33. UMN (2006) Unusual crowd activity dataset of University of Minnesota. http://mha.cs.umn.edu/movies/crowdactivity-all.avi. Accessed: 2010-09-30

  34. Wang X, Yang X, He X, Teng Q, Gao M (2014) A high accuracy flow segmentation method in crowded scenes based on streakline. Optik-Int J Light Electron Opt 125(3):924–929


  35. Wen ZQ, Cai ZX (2006) Mean shift algorithm and its application in tracking of objects. In: 2006 International conference on machine learning and cybernetics. IEEE, pp 4024–4028

  36. Wu S, Moore BE, Shah M (2010) Chaotic invariants of Lagrangian particle trajectories for anomaly detection in crowded scenes. In: 2010 IEEE Computer society conference on computer vision and pattern recognition, San Francisco, pp 2054–2060

  37. Xue H, Huynh DQ, Reynolds M (2017) Bi-prediction: pedestrian trajectory prediction based on bidirectional LSTM classification. In: 2017 International conference on digital image computing: techniques and applications (DICTA). IEEE, pp 1–8

  38. Yi S, Li H, Wang X (2015) Understanding pedestrian behaviors from stationary crowd groups. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3488–3496

  39. Zhou B, Wang X, Tang X (2011) Random field topic model for semantic region analysis in crowded scenes from tracklets

  40. Zhou B, Wang X, Tang X (2012) Understanding collective crowd behaviors: learning a mixture model of dynamic pedestrian-agents. In: 2012 IEEE Conference on computer vision and pattern recognition (CVPR). IEEE, pp 2871–2878

  41. Zhuang N, Ye J, Hua KA (2017) Convolutional DLSTM for crowd scene understanding. In: 2017 IEEE International symposium on multimedia (ISM). IEEE, pp 61–68

  42. Zou Y, Zhao X, Liu Y (2015) Detect coherent motions in crowd scenes based on tracklets association. In: 2015 IEEE International conference on image processing (ICIP). IEEE, pp 4456–4460


Acknowledgements

This work is funded by the Science and Technology Development Fund STDF 992 (Egypt); Project ID: 42519, “Automatic Video Surveillance System for Crowd Scenes”.


Corresponding author

Correspondence to Abdullah N. Moustafa.


Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Moustafa, A.N., Gomaa, W. Gate and common pathway detection in crowd scenes and anomaly detection using motion units and LSTM predictive models. Multimed Tools Appl 79, 20689–20728 (2020). https://doi.org/10.1007/s11042-020-08840-7
