Abstract
Action recognition has wide applications, from video surveillance and scene understanding to forensic investigation. While recent methods typically focus on recognizing a single action in a video clip, we investigate action recognition in crowds, which better reflects real video surveillance scenarios. We propose to recognize actions in crowds using an efficient coarse-to-fine multi-object tracking algorithm. With Faster R-CNN as our human detector, we apply a coarse-to-fine strategy for multi-object tracking in crowds, consisting of fast multi-object tracking and per-object fine tracking. The tracking results are used to extract action cuboids, from which spatial-temporal features are computed for action classification. We evaluate the proposed approach on a self-collected actions-in-crowd dataset and two public-domain databases (CMU and MOT2015). The results show the effectiveness of the proposed approach for multi-action recognition in crowds.
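The step from tracking results to action cuboids can be illustrated with a minimal sketch. The abstract does not say how the cuboids are constructed, so everything here is an assumption made for illustration: each track is taken to be a list of per-frame boxes (x1, y1, x2, y2) in pixels, the cuboid is taken to be the union of those boxes held fixed over time, and the names cuboid_window, extract_action_cuboid and margin are hypothetical.

```python
import math


def cuboid_window(boxes, width, height, margin=0):
    """Return one fixed crop window (x1, y1, x2, y2) covering every box in a track.

    Assumption: the window is the union of the per-frame boxes, optionally
    padded by `margin` pixels and clipped to the frame size.
    """
    xs1, ys1, xs2, ys2 = zip(*boxes)
    x1 = max(math.floor(min(xs1)) - margin, 0)
    y1 = max(math.floor(min(ys1)) - margin, 0)
    x2 = min(math.ceil(max(xs2)) + margin, width)
    y2 = min(math.ceil(max(ys2)) + margin, height)
    return x1, y1, x2, y2


def extract_action_cuboid(frames, boxes, margin=0):
    """Crop the same window from every frame of one tracked person.

    frames: list of T frames, each a list of rows of pixel values
    boxes:  list of T boxes (x1, y1, x2, y2), one per frame
    """
    if not frames or len(frames) != len(boxes):
        raise ValueError("need a non-empty track with one box per frame")
    height, width = len(frames[0]), len(frames[0][0])
    x1, y1, x2, y2 = cuboid_window(boxes, width, height, margin)
    # Slice rows (y) first, then columns (x), identically in every frame.
    return [[row[x1:x2] for row in frame[y1:y2]] for frame in frames]


if __name__ == "__main__":
    # Three blank 120x100 frames and a short hypothetical track.
    frames = [[[0] * 120 for _ in range(100)] for _ in range(3)]
    boxes = [(10, 20, 30, 60), (12, 22, 34, 62), (14, 18, 36, 58)]
    cuboid = extract_action_cuboid(frames, boxes)
    print(len(cuboid), len(cuboid[0]), len(cuboid[0][0]))
```

A fixed window keeps every slice of the cuboid the same size, which is convenient when spatial-temporal features are computed over the whole volume. A per-frame crop resized to a common size would be an equally plausible alternative.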
Acknowledgement
This research was partially supported by the 973 Program (grant No. 2015CB351802) and the Natural Science Foundation of China (grant No. 61672496). The authors would like to thank Xiaoyan Li for her proofreading of this paper. H. Han gratefully acknowledges the support of NVIDIA Corporation with the donation of the Titan X GPU used for his research.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Gong, S., Han, H., Shan, S., Chen, X. (2017). Actions Recognition in Crowd Based on Coarse-to-Fine Multi-object Tracking. In: Chen, CS., Lu, J., Ma, KK. (eds) Computer Vision – ACCV 2016 Workshops. ACCV 2016. Lecture Notes in Computer Science, vol 10118. Springer, Cham. https://doi.org/10.1007/978-3-319-54526-4_35
DOI: https://doi.org/10.1007/978-3-319-54526-4_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54525-7
Online ISBN: 978-3-319-54526-4
eBook Packages: Computer Science, Computer Science (R0)