
Action detection and classification in kitchen activities videos using graph decoding

  • Original article
  • Published in The Visual Computer

Abstract

In this work, we propose a hybrid system that combines deep networks with hidden Markov model (HMM) graph decoding to classify kitchen activities in the Actions for Cooking Eggs data set. We use and compare two deep learning architectures: a deep convolutional neural network (CNN) alone, and a long short-term memory (LSTM) network built on top of a CNN. We address the video classification problem both at the level of actions performed in individual frames and at the full-length video level. Our proposed system detects a sequence of cooking actions and outputs a menu class for the entire video. Our approach achieves the highest reported accuracy on the data set for identifying cooking actions, with an overall accuracy of 81% compared to the state of the art of 76%, and succeeds in assigning a menu label to a sequence of cooking actions with an accuracy of 100%, compared to the 10–30% range reported in previous work. We also explore the effects of processing a subset of the available frames and of imposing a state occupancy constraint during decoding. Our best results are achieved when using a common-sense dictionary grammar expansion, processing one frame out of every 35 frames, and restricting state transitions so that each state is occupied for at least five consecutive frames.
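To make the decoding step concrete, the sketch below shows Viterbi decoding over per-frame action posteriors with a minimum state-occupancy constraint, in the spirit of the graph decoding described above. It is a minimal Python illustration, assuming per-frame log-posteriors from a CNN or CNN-LSTM classifier; the sub-state expansion, the viterbi_min_duration name, and the uniform transition scores are illustrative assumptions rather than the paper's exact formulation.

import numpy as np

def viterbi_min_duration(log_probs, log_trans, min_dur=5):
    """Viterbi decoding over per-frame action log-posteriors with a
    minimum state-occupancy constraint.

    Each of the K actions is expanded into min_dur sub-states; an action
    can only be exited from its last sub-state, so every decoded segment
    spans at least min_dur frames (assumes T >= min_dur).

    log_probs: (T, K) log-posteriors from the frame classifier.
    log_trans: (K, K) log transition scores between actions.
    Returns a list of T decoded action labels.
    """
    T, K = log_probs.shape
    D = min_dur
    NEG = -np.inf

    # delta[k, d]: best score ending at the current frame in sub-state d
    # of action k; psi stores (action, sub-state) back-pointers.
    delta = np.full((K, D), NEG)
    delta[:, 0] = log_probs[0]
    psi = np.zeros((T, K, D, 2), dtype=int)

    for t in range(1, T):
        new = np.full((K, D), NEG)
        for k in range(K):
            for d in range(D):
                best, arg = NEG, (k, d)
                if d > 0 and delta[k, d - 1] > best:       # advance duration counter
                    best, arg = delta[k, d - 1], (k, d - 1)
                if d == D - 1 and delta[k, D - 1] > best:  # hold in last sub-state
                    best, arg = delta[k, D - 1], (k, D - 1)
                if d == 0:                                 # enter from a completed action
                    for j in range(K):
                        if j != k and delta[j, D - 1] + log_trans[j, k] > best:
                            best = delta[j, D - 1] + log_trans[j, k]
                            arg = (j, D - 1)
                new[k, d] = best + log_probs[t, k]
                psi[t, k, d] = arg
        delta = new

    # Backtrace from the best action whose final segment is complete.
    k, d = int(np.argmax(delta[:, D - 1])), D - 1
    path = [k]
    for t in range(T - 1, 0, -1):
        k, d = psi[t, k, d]
        path.append(int(k))
    return path[::-1]

# Toy usage: 3 actions, 40 frames of softmax outputs from a classifier.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(3), size=40)
trans = np.full((3, 3), 1.0 / 3.0)
labels = viterbi_min_duration(np.log(post), np.log(trans), min_dur=5)

Expanding each action into min_dur sub-states is a standard way to encode a duration floor in an HMM lattice; it suppresses spurious one- or two-frame action flickers in the per-frame classifier output, which is the effect the occupancy constraint described above is meant to capture.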

Data availability and materials

The data sets analyzed during the current study are available through the ICPR 2012 Contest on Kitchen Scene Context-based Gesture Recognition (KSCGR), online at http://www.murase.m.is.nagoya-u.ac.jp/KSCGR/

Abbreviations

HMM: Hidden Markov model
CNN: Convolutional neural network
LSTM: Long short-term memory
SVM: Support vector machines
NN: Neural networks
SBR: Symbolic behavior recognition
HOGV: Histogram of oriented gradient variation


Funding

This manuscript was prepared during MR’s work toward her self-funded PhD degree.

Author information

Contributions

MR processed and analyzed the data and results and was the major contributor in writing the manuscript. AE provided advice and guidance throughout the study. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Mona Ramadan.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Ramadan, M., El-Jaroudi, A. Action detection and classification in kitchen activities videos using graph decoding. Vis Comput 39, 799–812 (2023). https://doi.org/10.1007/s00371-021-02346-5

