Abstract
Analyzing unstructured group activities and events in uncontrolled web videos is a challenging task due to (1) the semantic gap between class labels and low-level visual features, (2) the demanding computational cost of high-dimensional low-level feature vectors and (3) the lack of labeled training data. These difficulties can be overcome by learning a meaningful and compact mid-level video representation. To this end, in this paper a novel supervised probabilistic graphical model termed Relevance Restricted Boltzmann Machine (ReRBM) is developed to learn a low-dimensional latent semantic representation for complex activities and events. Our model is a variant of the Restricted Boltzmann Machine (RBM) with a number of critical extensions: (1) sparse Bayesian learning is incorporated into the RBM to learn features which are relevant to video classes, i.e., discriminative; (2) binary stochastic hidden units in the RBM are replaced by rectified linear units in order to better explain complex video contents and make variational inference tractable for the proposed model; and (3) an efficient variational EM algorithm is formulated for model parameter estimation and inference. We conduct extensive experiments on two recent challenging benchmarks: the Unstructured Social Activity Attribute dataset and the Event Video dataset. The experimental results demonstrate that the relevant features learned by our model provide a better semantic and discriminative description of videos than a number of alternative supervised latent variable models, and achieve state-of-the-art performance in terms of classification accuracy and retrieval precision, particularly when only a few labeled training samples are available.
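As a rough illustration of extension (2), the following sketch contrasts a binary stochastic hidden layer with a rectified linear one. It is not the ReRBM itself: the function names, the toy weights `W`, biases `b` and input `v` are all hypothetical, and the sparse Bayesian prior and variational EM steps are omitted.

```python
import math
import random

def pre_activation(v, W, b):
    """Linear input to hidden unit j: sum_i v[i] * W[i][j] + b[j]."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) + b[j]
            for j in range(len(b))]

def binary_hidden_sample(v, W, b, rng):
    """Standard RBM hidden layer: h_j in {0, 1}, P(h_j = 1) = sigmoid(input)."""
    probs = [1.0 / (1.0 + math.exp(-a)) for a in pre_activation(v, W, b)]
    return [1 if rng.random() < p else 0 for p in probs]

def relu_hidden(v, W, b):
    """Rectified linear hidden layer: h_j = max(0, input), real-valued."""
    return [max(0.0, a) for a in pre_activation(v, W, b)]

# Toy values chosen only for illustration (assumptions, not from the paper).
v = [0.2, 0.0, 1.0]
W = [[0.5, -1.0], [0.3, 0.8], [-0.4, 2.0]]
b = [0.0, 0.1]

h_binary = binary_hidden_sample(v, W, b, random.Random(0))
h_relu = relu_hidden(v, W, b)
print(h_binary, h_relu)
```

The binary layer can only signal whether a latent feature is present, whereas the rectified response also carries its strength, which is one way to read the claim that rectified units better explain complex video contents.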
Acknowledgments
This work is jointly supported by the National Basic Research Program of China (2012CB316300), the National Natural Science Foundation of China (61525306, 61573354, 61135002, 61420106015), and the Strategic Priority Research Program of the CAS (XDB02070100).
Additional information
Communicated by Ivan Laptev, Josef Sivic, Deva Ramanan.
Appendix: Parameters of Free-Form Variational Posterior \(\varvec{q(t_{mj})}\)
The parameters of \(q(t_{mj})\) in (28) are as follows:
where \(\text {erfc}(\cdot )\) is the complementary error function and
We can see that \(q(t_{mj})\) depends on expectations over \(\varvec{\upeta }\) and \(\{ {t_{mj'}}\} _{j' \ne j}\), which is consistent with the graphical model representation of ReRBM in Fig. 3.
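For intuition about why \(\text {erfc}(\cdot )\) appears, the sketch below shows how it typically enters when a Gaussian is restricted to non-negative values. That \(q(t_{mj})\) has this truncated-Gaussian form is an assumption made here for illustration only; the function names and the parameters `mu` and `sigma` are hypothetical and are not the expressions of (28).

```python
import math

def positive_mass(mu, sigma):
    """P(X >= 0) for X ~ N(mu, sigma^2), expressed through erfc."""
    return 0.5 * math.erfc(-mu / (sigma * math.sqrt(2.0)))

def truncated_mean(mu, sigma):
    """E[X | X >= 0] for X ~ N(mu, sigma^2)."""
    # Standard normal density evaluated at alpha = -mu / sigma.
    phi_alpha = math.exp(-mu * mu / (2.0 * sigma * sigma)) / math.sqrt(2.0 * math.pi)
    return mu + sigma * phi_alpha / positive_mass(mu, sigma)

print(positive_mass(0.0, 1.0), truncated_mean(0.0, 1.0))
```

For strongly negative `mu / sigma` the value returned by `positive_mass` can underflow, so a practical implementation would likely need a more careful evaluation of the ratio.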
Cite this article
Zhao, F., Huang, Y., Wang, L. et al. Learning Relevance Restricted Boltzmann Machine for Unstructured Group Activity and Event Understanding. Int J Comput Vis 119, 329–345 (2016). https://doi.org/10.1007/s11263-016-0896-3
DOI: https://doi.org/10.1007/s11263-016-0896-3