Abstract
We present a novel unsupervised method for learning human action categories. A video sequence is represented as a collection of spatial-temporal words obtained by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and of the intermediate topics corresponding to human action categories. This is achieved with latent topic models, namely probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA). Because it relies on probabilistic models, our approach can handle noisy feature points arising from dynamic backgrounds and moving cameras. Given a novel video sequence, the algorithm can categorize and localize the human action(s) contained in the video. We test our algorithm on three challenging datasets: the KTH human motion dataset, the Weizmann human action dataset, and a recent dataset of figure skating actions. Our results reflect the promise of such a simple approach. In addition, our algorithm can recognize and localize multiple actions in long and complex video sequences containing multiple motions.
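To make the topic-model step concrete, the following is a minimal sketch of pLSA fitted by EM on a video-by-codeword count matrix. It assumes the space-time interest points have already been extracted and quantized into a codebook, and that each video is summarized by its codeword counts. The function names (`normalize`, `plsa`), the toy counts, and the smoothing constant are illustrative assumptions, not the authors' implementation, and LDA is not shown.

```python
import random

def normalize(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def plsa(counts, n_topics, n_iter=50, seed=0, p_w_z=None):
    """EM for pLSA on a video-by-codeword count matrix (list of lists).

    Returns (p_w_z, p_z_d): P(word | topic) for each topic and
    P(topic | video) for each video. If p_w_z is supplied it is held
    fixed, which corresponds to "folding in" unseen videos.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    learn_words = p_w_z is None
    if learn_words:
        p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
                 for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    eps = 1e-12  # assumed tiny smoothing so no distribution becomes all zeros
    for _ in range(n_iter):
        acc_w_z = [[eps] * n_words for _ in range(n_topics)]
        acc_z_d = [[eps] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(w | z) * P(z | d).
                post = normalize([p_w_z[z][w] * p_z_d[d][z]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    acc_w_z[z][w] += counts[d][w] * post[z]
                    acc_z_d[d][z] += counts[d][w] * post[z]
        # M-step: renormalize expected counts into distributions.
        if learn_words:
            p_w_z = [normalize(row) for row in acc_w_z]
        p_z_d = [normalize(row) for row in acc_z_d]
    return p_w_z, p_z_d

# Hypothetical toy data: 4 training "videos" over a 6-word codebook.
train_counts = [
    [8, 6, 7, 0, 0, 1],
    [7, 9, 5, 1, 0, 0],
    [0, 1, 0, 9, 7, 8],
    [1, 0, 0, 6, 8, 9],
]
p_w_z, p_z_d = plsa(train_counts, n_topics=2)

# Fold-in: categorize an unseen video by its most probable topic.
_, p_z_new = plsa([[5, 7, 6, 0, 1, 0]], n_topics=2, p_w_z=p_w_z)
predicted_topic = max(range(2), key=lambda z: p_z_new[0][z])
```

In this sketch a new video is categorized by holding the learned word distributions fixed and estimating only its topic mixture. Because learning is unsupervised, topic indices are arbitrary and would still need to be matched to action names.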
Cite this article
Niebles, J.C., Wang, H. & Fei-Fei, L. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. Int J Comput Vis 79, 299–318 (2008). https://doi.org/10.1007/s11263-007-0122-4
DOI: https://doi.org/10.1007/s11263-007-0122-4