Learning Relevance Restricted Boltzmann Machine for Unstructured Group Activity and Event Understanding

Zhao, Fang; Huang, Yongzhen; Wang, Liang; Xiang, Tao; Tan, Tieniu

doi:10.1007/s11263-016-0896-3

Published: 15 March 2016

International Journal of Computer Vision, Volume 119, pages 329–345 (2016)

Fang Zhao¹, Yongzhen Huang¹,³, Liang Wang¹,³, Tao Xiang² & Tieniu Tan¹

Abstract

Analyzing unstructured group activities and events in uncontrolled web videos is a challenging task due to (1) the semantic gap between class labels and low-level visual features, (2) the demanding computational cost of high-dimensional low-level feature vectors, and (3) the lack of labeled training data. These difficulties can be overcome by learning a meaningful and compact mid-level video representation. To this end, this paper develops a novel supervised probabilistic graphical model, termed Relevance Restricted Boltzmann Machine (ReRBM), to learn a low-dimensional latent semantic representation for complex activities and events. Our model is a variant of the Restricted Boltzmann Machine (RBM) with a number of critical extensions: (1) sparse Bayesian learning is incorporated into the RBM to learn features which are relevant to video classes, i.e., discriminative; (2) binary stochastic hidden units in the RBM are replaced by rectified linear units in order to better explain complex video contents and make variational inference tractable for the proposed model; and (3) an efficient variational EM algorithm is formulated for model parameter estimation and inference. We conduct extensive experiments on two recent challenging benchmarks: the Unstructured Social Activity Attribute dataset and the Event Video dataset. The experimental results demonstrate that the relevant features learned by our model provide a better semantic and discriminative description of videos than a number of alternative supervised latent variable models, and achieve state-of-the-art performance in terms of classification accuracy and retrieval precision, particularly when only a few labeled training samples are available.

Acknowledgments

This work is jointly supported by the National Basic Research Program of China (2012CB316300), the National Natural Science Foundation of China (61525306, 61573354, 61135002, 61420106015), and the Strategic Priority Research Program of the CAS (XDB02070100).

Author information

Authors and Affiliations

Center for Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Fang Zhao, Yongzhen Huang, Liang Wang & Tieniu Tan
School of Electronic Engineering and Computer Science, Queen Mary, University of London, London, UK
Tao Xiang
Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Beijing, China
Yongzhen Huang & Liang Wang

Corresponding author

Correspondence to Yongzhen Huang.

Additional information

Communicated by Ivan Laptev, Josef Sivic, Deva Ramanan.

Appendix: Parameters of Free-Form Variational Posterior $\boldsymbol{q(t_{mj})}$

The parameters of $q(t_{mj})$ in Eq. (28) are given by:

$$\omega_{pos} = \mathcal{N}(\alpha \mid \beta, \gamma + 1), \quad \sigma_{pos}^2 = (\gamma^{-1} + 1)^{-1}, \tag{33}$$

$$\mu_{pos} = \sigma_{pos}^2 \left( \frac{\alpha}{\gamma} + \beta \right), \tag{34}$$

$$\omega_{neg} = \mathcal{N}(\alpha \mid 0, \gamma), \quad \sigma_{neg}^2 = 1, \quad \mu_{neg} = \beta, \tag{35}$$

$$Z = \frac{1}{2}\omega_{pos}\,\text{erfc}\left( \frac{-\mu_{pos}}{\sqrt{2\sigma_{pos}^2}} \right) + \frac{1}{2}\omega_{neg}\,\text{erfc}\left( \frac{\mu_{neg}}{\sqrt{2\sigma_{neg}^2}} \right), \tag{36}$$

where $\text{erfc}(\cdot)$ is the complementary error function and

$$\alpha = \left\langle \frac{\boldsymbol{\eta}_{\cdot j}\left( \mathbf{y}_m + \mathbf{s}_m - \sum\nolimits_{j' \ne j} \boldsymbol{\eta}_{\cdot j'} \mathbf{A} t_{mj'}^r \right)}{\boldsymbol{\eta}_{\cdot j} \mathbf{A} \boldsymbol{\eta}_{\cdot j}^T} \right\rangle_{q(\boldsymbol{\eta}) q(\mathbf{t})}, \tag{37}$$

$$\gamma = \left\langle \boldsymbol{\eta}_{\cdot j} \mathbf{A} \boldsymbol{\eta}_{\cdot j}^T \right\rangle_{q(\boldsymbol{\eta})}^{-1}, \quad \beta = \sum_{i=1}^{N} W_{ij} v_{mi} + K b_j. \tag{38}$$

We can see that $q(t_{mj})$ depends on expectations over $\boldsymbol{\eta}$ and $\{t_{mj'}\}_{j' \ne j}$, which is consistent with the graphical model representation of ReRBM in Fig. 3.
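
To make the closed-form expressions concrete, the following Python sketch evaluates Eqs. (33)–(36) for scalar inputs. It assumes that $\alpha$, $\beta$ and $\gamma$ from Eqs. (37) and (38) have already been computed and that the second argument of $\mathcal{N}(\cdot \mid \cdot, \cdot)$ is a variance. The names `gaussian_pdf` and `q_t_parameters` and the returned dictionary keys are illustrative and do not come from the paper. Since Eq. (28) is not reproduced in this appendix, reading the two terms of $Z$ as the masses of a positive and a negative component of $q(t_{mj})$ is an interpretation, not a statement of the authors' implementation. It relies only on the fact that $\frac{1}{2}\text{erfc}(-\mu/\sqrt{2\sigma^2})$ is the probability that a $\mathcal{N}(\mu, \sigma^2)$ variable is positive.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density N(x | mean, var); the last argument is assumed to be a variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def q_t_parameters(alpha, beta, gamma):
    """Evaluate Eqs. (33)-(36) for scalar alpha, beta and gamma > 0.

    Illustrative helper: alpha, beta, gamma are taken as given
    (Eqs. (37)-(38) are not computed here).
    """
    if gamma <= 0.0:
        raise ValueError("gamma must be positive")

    # Eq. (33): weight and variance of the positive component.
    omega_pos = gaussian_pdf(alpha, beta, gamma + 1.0)
    var_pos = 1.0 / (1.0 / gamma + 1.0)

    # Eq. (34): mean of the positive component.
    mu_pos = var_pos * (alpha / gamma + beta)

    # Eq. (35): weight, variance and mean of the negative component.
    omega_neg = gaussian_pdf(alpha, 0.0, gamma)
    var_neg = 1.0
    mu_neg = beta

    # Eq. (36): the two terms of Z, kept separate so their shares can be reported.
    mass_pos = 0.5 * omega_pos * math.erfc(-mu_pos / math.sqrt(2.0 * var_pos))
    mass_neg = 0.5 * omega_neg * math.erfc(mu_neg / math.sqrt(2.0 * var_neg))
    z = mass_pos + mass_neg

    return {
        "omega_pos": omega_pos, "mu_pos": mu_pos, "var_pos": var_pos,
        "omega_neg": omega_neg, "mu_neg": mu_neg, "var_neg": var_neg,
        "Z": z,
        # Assumed interpretation: fraction of Z contributed by each term.
        "share_pos": mass_pos / z,
        "share_neg": mass_neg / z,
    }


if __name__ == "__main__":
    # Arbitrary example values, chosen only for illustration.
    for key, value in q_t_parameters(alpha=0.5, beta=0.2, gamma=2.0).items():
        print(f"{key}: {value:.6f}")
```

With the arbitrary inputs $\alpha = 0.5$, $\beta = 0.2$, $\gamma = 2$, Eqs. (33) and (34) give $\sigma_{pos}^2 = 2/3$ and $\mu_{pos} = 0.3$, which is a quick way to check the function against a hand calculation.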

About this article

Cite this article

Zhao, F., Huang, Y., Wang, L. et al. Learning Relevance Restricted Boltzmann Machine for Unstructured Group Activity and Event Understanding. Int J Comput Vis 119, 329–345 (2016). https://doi.org/10.1007/s11263-016-0896-3

Received: 15 June 2014
Accepted: 19 February 2016
Published: 15 March 2016
Issue Date: September 2016
DOI: https://doi.org/10.1007/s11263-016-0896-3
