Learning Relevance Restricted Boltzmann Machine for Unstructured Group Activity and Event Understanding

International Journal of Computer Vision

Abstract

Analyzing unstructured group activities and events in uncontrolled web videos is challenging because of (1) the semantic gap between class labels and low-level visual features, (2) the high computational cost of working with high-dimensional low-level feature vectors, and (3) the lack of labeled training data. These difficulties can be overcome by learning a meaningful and compact mid-level video representation. To this end, we develop a novel supervised probabilistic graphical model, termed the Relevance Restricted Boltzmann Machine (ReRBM), to learn a low-dimensional latent semantic representation for complex activities and events. Our model is a variant of the Restricted Boltzmann Machine (RBM) with three critical extensions: (1) sparse Bayesian learning is incorporated into the RBM to learn features that are relevant to video classes, i.e., discriminative; (2) the binary stochastic hidden units of the RBM are replaced by rectified linear units, to better explain complex video content and to make variational inference tractable for the proposed model; and (3) an efficient variational EM algorithm is formulated for model parameter estimation and inference. We conduct extensive experiments on two recent challenging benchmarks: the Unstructured Social Activity Attribute dataset and the Event Video dataset. The experimental results demonstrate that the relevant features learned by our model provide a better semantic and discriminative description of videos than a number of alternative supervised latent variable models, and achieve state-of-the-art classification accuracy and retrieval precision, particularly when only a few labeled training samples are available.

Notes

  1. http://www.youtube.com/t/faq.

  2. http://www.hollywoodreporter.com/news/video-accounts-53-percent-internet-655203.

  3. http://www.eecs.qmul.ac.uk/~yf300/USAA/download.

  4. http://pascal.inrialpes.fr/data/evve.

Acknowledgments

This work is jointly supported by the National Basic Research Program of China (2012CB316300), the National Natural Science Foundation of China (61525306, 61573354, 61135002, 61420106015), and the Strategic Priority Research Program of the CAS (XDB02070100).

Author information

Corresponding author

Correspondence to Yongzhen Huang.

Additional information

Communicated by Ivan Laptev, Josef Sivic, Deva Ramanan.

Appendix: Parameters of Free-Form Variational Posterior \(\varvec{q(t_{mj})}\)

The parameters of \(q(t_{mj})\) in Eq. (28) are given by:

$$\omega_{pos} = \mathcal{N}(\alpha | \beta, \gamma + 1), \quad \sigma_{pos}^2 = (\gamma^{-1} + 1)^{-1}, \tag{33}$$

$$\mu_{pos} = \sigma_{pos}^2 \left( \frac{\alpha}{\gamma} + \beta \right), \tag{34}$$

$$\omega_{neg} = \mathcal{N}(\alpha | 0, \gamma), \quad \sigma_{neg}^2 = 1, \quad \mu_{neg} = \beta, \tag{35}$$

$$Z = \frac{1}{2} \omega_{pos} \, \text{erfc}\left( \frac{-\mu_{pos}}{\sqrt{2\sigma_{pos}^2}} \right) + \frac{1}{2} \omega_{neg} \, \text{erfc}\left( \frac{\mu_{neg}}{\sqrt{2\sigma_{neg}^2}} \right), \tag{36}$$

where \(\text {erfc}(\cdot )\) is the complementary error function and

$$\alpha = \left\langle \frac{\varvec{\upeta}_{\cdot j} \left( \mathbf{y}_m + \mathbf{s}_m - \sum\nolimits_{j' \ne j} \varvec{\upeta}_{\cdot j'} \mathbf{A} t_{mj'}^r \right)}{\varvec{\upeta}_{\cdot j} \mathbf{A} \varvec{\upeta}_{\cdot j}^T} \right\rangle_{q(\varvec{\upeta}) q(\mathbf{t})}, \tag{37}$$

$$\gamma = \left\langle \varvec{\upeta}_{\cdot j} \mathbf{A} \varvec{\upeta}_{\cdot j}^T \right\rangle_{q(\varvec{\upeta})}^{-1}, \quad \beta = \sum\limits_{i = 1}^N W_{ij} v_{mi} + K b_j. \tag{38}$$

We can see that \(q(t_{mj})\) depends on expectations over \(\varvec{\upeta }\) and \(\{ {t_{mj'}}\} _{j' \ne j}\), which is consistent with the graphical model representation of ReRBM in Fig. 3.
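As a rough numerical illustration of Eqs. (33)–(36), the following Python sketch evaluates the posterior parameters for given scalar values of \(\alpha\), \(\beta\) and \(\gamma\). Eq. (28) is not reproduced here, so the sketch rests on an assumption: that the unnormalized posterior has the form \(\mathcal{N}(\alpha | \max(0, t), \gamma)\,\mathcal{N}(t | \beta, 1)\), with the second argument of \(\mathcal{N}\) read as a variance. Under that reading, the positive half-line contributes a Gaussian with parameters \((\mu_{pos}, \sigma_{pos}^2)\) weighted by \(\omega_{pos}\), the negative half-line contributes one with \((\mu_{neg}, \sigma_{neg}^2)\) weighted by \(\omega_{neg}\), and \(Z\) is the total mass. The function names (`posterior_params`, `unnormalized_q`, `integrate`) and the test values are hypothetical and not taken from the paper; in the model, \(\alpha\), \(\beta\) and \(\gamma\) would come from Eqs. (37)–(38).

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior_params(alpha, beta, gamma):
    """Evaluate Eqs. (33)-(36) for scalar alpha, beta and gamma > 0."""
    w_pos = normal_pdf(alpha, beta, gamma + 1.0)      # Eq. (33)
    var_pos = 1.0 / (1.0 / gamma + 1.0)               # Eq. (33)
    mu_pos = var_pos * (alpha / gamma + beta)         # Eq. (34)
    w_neg = normal_pdf(alpha, 0.0, gamma)             # Eq. (35)
    var_neg, mu_neg = 1.0, beta                       # Eq. (35)
    mass_pos = 0.5 * w_pos * math.erfc(-mu_pos / math.sqrt(2.0 * var_pos))
    mass_neg = 0.5 * w_neg * math.erfc(mu_neg / math.sqrt(2.0 * var_neg))
    z = mass_pos + mass_neg                           # Eq. (36)
    return {
        "w_pos": w_pos, "mu_pos": mu_pos, "var_pos": var_pos,
        "w_neg": w_neg, "mu_neg": mu_neg, "var_neg": var_neg,
        "Z": z,
        "p_pos": mass_pos / z,  # posterior probability that t > 0
    }


def unnormalized_q(t, alpha, beta, gamma):
    """ASSUMED form of the unnormalized posterior (not quoted from the paper)."""
    return normal_pdf(alpha, max(0.0, t), gamma) * normal_pdf(t, beta, 1.0)


def integrate(f, lo, hi, n=40000):
    """Composite trapezoidal rule on [lo, hi] with n sub-intervals."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, n):
        total += f(lo + k * h)
    return total * h


if __name__ == "__main__":
    a, b, g = 0.7, -0.3, 0.5  # arbitrary illustrative values
    params = posterior_params(a, b, g)
    z_numeric = integrate(lambda t: unnormalized_q(t, a, b, g), -12.0, 12.0)
    print("closed-form Z:", params["Z"])
    print("numerical   Z:", z_numeric)
```

If the assumed form of the posterior is right, the two printed values should agree to within the integration error, which gives a quick sanity check on the signs inside the two \(\text{erfc}\) terms of Eq. (36).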

Cite this article

Zhao, F., Huang, Y., Wang, L. et al. Learning Relevance Restricted Boltzmann Machine for Unstructured Group Activity and Event Understanding. Int J Comput Vis 119, 329–345 (2016). https://doi.org/10.1007/s11263-016-0896-3
