Unsupervised Learning of Human Action Categories in Still Images with Deep Representations

Published: 16 December 2019

Abstract

In this article, we propose a novel method for unsupervised learning of human action categories in still images. In contrast to previous methods, the proposed method mines the distinctive information of actions directly from unlabeled image databases, attempting to learn discriminative deep representations in an unsupervised manner to distinguish different actions. Action image collections can therefore be used without manual annotations. Specifically, (i) because discriminative deep representations are difficult to learn without supervision, the proposed method builds a training dataset with surrogate labels from the unlabeled dataset, then learns discriminative representations by alternately updating the convolutional neural network (CNN) parameters and the surrogate training dataset; (ii) to exploit the discriminative information among different action categories, training batches for updating the CNN parameters are built from triplet groups, and a triplet loss function is used to update the CNN parameters; and (iii) to learn more discriminative deep representations, a Random Forest classifier is adopted to update the surrogate training dataset, so that more beneficial triplet groups can be built from the updated surrogate training dataset. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed method.
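
Step (ii) of the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact loss or sampling strategy, so the code below assumes the common hinge-style triplet loss on squared Euclidean distances and a naive random triplet sampler. The names `triplet_loss`, `build_triplets`, the margin of 0.2, and the toy embeddings are all hypothetical and chosen only for illustration.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (assumed form, not taken from the paper).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive, measured in squared distance.
    """
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )


def build_triplets(surrogate_labels, rng):
    """Form (anchor, positive, negative) index triplets from surrogate labels.

    For each sample, pick one other sample with the same surrogate label
    and one sample with a different label. Samples lacking either are skipped.
    """
    triplets = []
    indices = range(len(surrogate_labels))
    for a in indices:
        same = [i for i in indices
                if i != a and surrogate_labels[i] == surrogate_labels[a]]
        diff = [i for i in indices
                if surrogate_labels[i] != surrogate_labels[a]]
        if same and diff:
            triplets.append((a, rng.choice(same), rng.choice(diff)))
    return triplets


# Toy 2-D "embeddings" and hypothetical surrogate labels (not from the paper).
embeddings = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]
surrogate_labels = [0, 0, 1, 1]

rng = random.Random(0)
triplets = build_triplets(surrogate_labels, rng)
losses = [
    triplet_loss(embeddings[a], embeddings[p], embeddings[n])
    for a, p, n in triplets
]
mean_loss = sum(losses) / len(losses)
```

In this toy setting the two surrogate groups are already well separated, so every triplet satisfies the margin and the mean loss is zero. In the alternating scheme the abstract describes, the surrogate labels would be refreshed between rounds (by a Random Forest classifier in the paper) before new triplets are drawn.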

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 4
November 2019
322 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3376119
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 December 2019
Accepted: 01 September 2019
Revised: 01 June 2019
Received: 01 August 2018
Published in TOMM Volume 15, Issue 4

Author Tags

  1. Action categorization
  2. deep representations
  3. unsupervised learning

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Key Research Program of Frontier Sciences, CAS
  • National Key R&D Program of China
  • CAS “Light of West China” Program
  • National Natural Science Foundation of China
  • Young Top-notch Talent Program of Chinese Academy of Sciences

Article Metrics

  • Downloads (Last 12 months): 8
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 28 Feb 2025

Cited By

  • (2024) An efficient Meta-VSW method for ship behaviors recognition and application. Ocean Engineering 311 (118870). DOI: 10.1016/j.oceaneng.2024.118870. Online publication date: Nov-2024
  • (2023) Relation with Free Objects for Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20:2 (1-19). DOI: 10.1145/3617596. Online publication date: 26-Aug-2023
  • (2023) SSRT: A Sequential Skeleton RGB Transformer to Recognize Fine-Grained Human-Object Interactions and Action Recognition. IEEE Access 11 (51930-51948). DOI: 10.1109/ACCESS.2023.3278974. Online publication date: 2023
  • (2021) Unsupervised Machine Learning Algorithm to Solve Knight Covering Problem for 6 by 6 Board. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi 8:15 (414-426). DOI: 10.54365/adyumbd.980660. Online publication date: 31-Dec-2021
  • (2021) Human behaviour recognition with mid-level representations for crowd understanding and analysis. IET Image Processing 15:14 (3414-3424). DOI: 10.1049/ipr2.12147. Online publication date: 25-Feb-2021
