
Action detection and classification in kitchen activities videos using graph decoding

  • Original article
  • Published in The Visual Computer

Abstract

In this work, we propose a hybrid system that combines deep networks with hidden Markov model (HMM) graph decoding to classify kitchen activities in the Actions for Cooking Eggs data set. We use and compare two deep learning architectures: a deep convolutional neural network (CNN) alone, and a long short-term memory (LSTM) network built on top of a CNN. We address the video classification problem both at the level of actions performed in individual frames and at the full-length video level. Our proposed system detects a sequence of cooking actions and outputs a menu class for the entire video. Our approach achieves the highest reported accuracy on the data set for identifying cooking actions, with an overall accuracy of 81% compared to the state of the art of 76%, and succeeds in assigning a menu label to a sequence of cooking actions with an accuracy of 100%, compared to the 10–30% range reported in previous work. We also explore the effects of processing a subset of the available frames and of imposing a state occupancy constraint during decoding. Our best results are achieved when using a common-sense dictionary grammar expansion, processing one frame out of every 35 frames, and restricting state transitions so that each state is occupied for at least five consecutive frames.
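To make the decoding step concrete, the sketch below shows Viterbi decoding over per-frame action posteriors with a minimum state-occupancy constraint, in the spirit of the graph decoding described above. It is a minimal Python illustration, assuming per-frame log-posteriors from a CNN or CNN-LSTM classifier; the sub-state expansion, the viterbi_min_duration name, and the uniform transition scores are illustrative assumptions rather than the paper's exact formulation.

import numpy as np

def viterbi_min_duration(log_probs, log_trans, min_dur=5):
    """Viterbi decoding over per-frame action log-posteriors with a
    minimum state-occupancy constraint.

    Each of the K actions is expanded into min_dur sub-states; an action
    can only be exited from its last sub-state, so every decoded segment
    spans at least min_dur frames (assumes T >= min_dur).

    log_probs: (T, K) log-posteriors from the frame classifier.
    log_trans: (K, K) log transition scores between actions.
    Returns a list of T decoded action labels.
    """
    T, K = log_probs.shape
    D = min_dur
    NEG = -np.inf

    # delta[k, d]: best score ending at the current frame in sub-state d
    # of action k; psi stores (action, sub-state) back-pointers.
    delta = np.full((K, D), NEG)
    delta[:, 0] = log_probs[0]
    psi = np.zeros((T, K, D, 2), dtype=int)

    for t in range(1, T):
        new = np.full((K, D), NEG)
        for k in range(K):
            for d in range(D):
                best, arg = NEG, (k, d)
                if d > 0 and delta[k, d - 1] > best:       # advance duration counter
                    best, arg = delta[k, d - 1], (k, d - 1)
                if d == D - 1 and delta[k, D - 1] > best:  # hold in last sub-state
                    best, arg = delta[k, D - 1], (k, D - 1)
                if d == 0:                                 # enter from a completed action
                    for j in range(K):
                        if j != k and delta[j, D - 1] + log_trans[j, k] > best:
                            best = delta[j, D - 1] + log_trans[j, k]
                            arg = (j, D - 1)
                new[k, d] = best + log_probs[t, k]
                psi[t, k, d] = arg
        delta = new

    # Backtrace from the best action whose final segment is complete.
    k, d = int(np.argmax(delta[:, D - 1])), D - 1
    path = [k]
    for t in range(T - 1, 0, -1):
        k, d = psi[t, k, d]
        path.append(int(k))
    return path[::-1]

# Toy usage: 3 actions, 40 frames of softmax outputs from a classifier.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(3), size=40)
trans = np.full((3, 3), 1.0 / 3.0)
labels = viterbi_min_duration(np.log(post), np.log(trans), min_dur=5)

Expanding each action into min_dur sub-states is a standard way to encode a duration floor in an HMM lattice; it suppresses spurious one- or two-frame action flickers in the per-frame classifier output, which is the effect the occupancy constraint described above is meant to capture.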

Data availability and materials

The data sets analyzed during the current study are available through the ICPR 2012 Contest on Kitchen Scene Context-based Gesture Recognition (KSCGR), online at http://www.murase.m.is.nagoya-u.ac.jp/KSCGR/

Abbreviations

HMM: Hidden Markov model
CNN: Convolutional neural network
LSTM: Long short-term memory
SVM: Support vector machines
NN: Neural networks
SBR: Symbolic behavior recognition
HOGV: Histogram of oriented gradient variation


Funding

This manuscript was prepared during MR’s work toward her self-funded PhD degree.

Author information

Contributions

MR processed and analyzed the data and results and was the major contributor in writing the manuscript. AE provided advice and guidance throughout the study. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Mona Ramadan.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Ramadan, M., El-Jaroudi, A. Action detection and classification in kitchen activities videos using graph decoding. Vis Comput 39, 799–812 (2023). https://doi.org/10.1007/s00371-021-02346-5

