Joint Estimation of Human Pose and Conversational Groups from Social Scenes

Varadarajan, Jagannadan; Subramanian, Ramanathan; Bulò, Samuel Rota; Ahuja, Narendra; Lanz, Oswald; Ricci, Elisa

doi:10.1007/s11263-017-1026-6


Published: 14 July 2017

Volume 126, pages 410–429 (2018)

International Journal of Computer Vision

Jagannadan Varadarajan¹ (ORCID: orcid.org/0000-0003-2519-4836),
Ramanathan Subramanian²,³,
Samuel Rota Bulò⁴,⁵,
Narendra Ahuja¹,⁶,
Oswald Lanz⁵ &
Elisa Ricci⁵,⁷


Abstract

Despite many attempts in the last few years, automatic analysis of social scenes captured by wide-angle camera networks remains a very challenging task due to the low resolution of targets, background clutter, and frequent and persistent occlusions. In this paper, we present a novel framework for jointly estimating (i) the head and body orientations of targets and (ii) conversational groups, called F-formations, from social scenes. In contrast to prior works that have (a) exploited the limited range of head and body orientations to jointly learn both, or (b) employed the mutual head (but not body) pose of interactors for deducing F-formations, we propose a weakly-supervised learning algorithm for joint inference. Our algorithm employs body pose as the primary cue for F-formation estimation, and an alternating optimization strategy is proposed to iteratively refine F-formation and pose estimates. We demonstrate the increased efficacy of joint inference over the state-of-the-art via extensive experiments on three social datasets.


Notes

We use the term pose to refer to orientation in the ground plane (pan) rather than the articulated spatial configuration of the human body. In line with several previous works (Benfold and Reid 2011; Chen and Odobez 2012), we will use the terms pose and orientation interchangeably.
The head and body angles are orientations in the ground plane.
Most available datasets on head and body pose estimation in low resolution settings only provide quantized pose annotations.
Details on tracking can be found in the supplementary material. Tracking data for Cocktail Party and SALSA datasets are made available at tev.fbk.eu/datasets/cp and tev.fbk.eu/datasets/salsa respectively.

References

Alameda-Pineda, X., Staiano, J., Subramanian, R., Batrinca, L., Ricci, E., Lepri, B., et al. (2016). Salsa: A novel dataset for multimodal group behavior analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8), 1707–1720.
Alameda-Pineda, X., Yan, Y., Ricci, E., Lanz, O., & Sebe, N. (2015). Analyzing free-standing conversational groups: A multimodal approach. In ACM multimedia.
Alletto, S., Serra, G., Calderara, S., Solera, F., & Cucchiara, R. (2014). From ego to nos-vision: Detecting social relationships in first-person views. In Workshop on egocentric vision.
Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: People detection and articulated pose estimation. In Computer vision and pattern recognition, pp. 1014–1021.
Ba, S., & Odobez, J. M. (2008). Multi-party focus of attention recognition in meetings from head pose and multimodal contextual cues. In IEEE international conference on acoustics, speech, and signal processing (ICASSP).
Ba, S. O., & Odobez, J. M. (2006). A study on visual focus of attention recognition from head pose in a meeting room. In Machine learning for multimodal interaction. Springer, Berlin, Heidelberg, pp. 75–87.
Bazzani, L., Tosato, D., Cristani, M., Farenzena, M., Pagetti, G., Menegaz, G., et al. (2013). Social interactions by visual focus of attention in a three-dimensional environment. Expert Systems, 30, 115–127.
Benfold, B., & Reid, I. (2011). Unsupervised learning of a scene-specific coarse gaze estimator. In International conference on computer vision.
Butko, T., Canton-Ferrer, C., Segura, C., Giró, X., Nadeu, C., Hernando, J., et al. (2011). Acoustic event detection based on feature-level fusion of audio and video modalities. Eurasip Journal on Advances in Signal Processing, 2011, 485738. doi:10.1155/2011/485738.
Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., et al. (2006). The AMI meeting corpus: A pre-announcement. In International conference on machine learning for multimodal interaction, pp. 28–39.
Chamveha, I., Sugano, Y., Sugimura, D., Siriteerakul, T., Okabe, T., Sato, Y., et al. (2013). Head direction estimation from low resolution images with scene adaptation. Computer Vision and Image Understanding, 117(10), 1502–1511.
Chen, C., Heili, A., & Odobez, J. M. (2011). A joint estimation of head and body orientation cues in surveillance video. In IEEE ICCV-SISM, international workshop on socially intelligent surveillance and monitoring.
Chen, C., & Odobez, J. M. (2012). We are not contortionists: Coupled adaptive learning for head and body orientation estimation in surveillance video. In Computer vision and pattern recognition.
Chi, E. C., & Lange, K. (2015). Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24(4), 994–1013.
Choi, W., Chao, Y. W., Pantofaru, C., & Savarese, S. (2014). Discovering groups of people in images. In European conference on computer vision.
Ciolek, T., & Kendon, A. (1980). Environment and the spatial arrangement of conversational encounters. Sociological Inquiry, 50, 237–271.
Cristani, M., Bazzani, L., Paggetti, G., Fossati, A., Tosato, D., Del Bue, A., et al. (2011). Social interaction discovery by statistical analysis of F-formations. In British machine vision conference.
Demirkus, M., Precup, D., Clark, J. J., & Arbel, T. (2014). Probabilistic temporal head pose estimation using a hierarchical graphical model. In European conference on computer vision.
Eichner, M., & Ferrari, V. (2010). We are family: Joint pose estimation of multiple persons. In European conference on computer vision.
Gan, T., Wong, Y., Zhang, D., & Kankanhalli, M. (2013). Temporal encoded F-formation system for social interaction detection. In ACM Multimedia.
Heili, A., Varadarajan, J., Ghanem, B., Ahuja, N., & Odobez, J. M. (2014). Improving head and body pose estimation through semi-supervised manifold alignment. In International conference on image processing.
Hocking, T. D., Joulin, A., Bach, F., & Vert, J. P. (2011). Clusterpath: An algorithm for clustering using convex fusion penalties. In International conference on machine learning.
Hu, T., Messelodi, S., & Lanz, O. (2015). Dynamic task decomposition for decentralized object tracking in complex scenes. Computer Vision and Image Understanding, 134, 89–104.
Krahnstoever, N., Chang, M. C., & Ge, W. (2011). Gaze and body pose estimation from a distance. In IEEE advanced video and signal-based surveillance (AVSS).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105.
Leal-Taixé, L., Fenzi, M., Kuznetsova, A., Rosenhahn, B., & Savarese, S. (2014). Learning an image-based motion context for multiple people tracking. In Computer vision and pattern recognition.
Liem, M. C., & Gavrila, D. M. (2014). Coupled person orientation estimation and appearance modeling using spherical harmonics. Image and Vision Computing, 32(10), 728–738.
Marin-Jimenez, M., Zisserman, A., Eichner, M., & Ferrari, V. (2014). Detecting people looking at each other in videos. International Journal of Computer Vision, 106(3), 282–296.
Mathias, M., Benenson, R., Timofte, R., & Gool, L. V. (2013). Handling occlusions with franken-classifiers. In International conference on computer vision.
Meyer, G. P., Gupta, S., Frosio, I., Reddy, D., & Kautz, J. (2015). Robust model-based 3d head pose estimation. In International conference on computer vision.
Murphy-Chutorian, E., & Trivedi, M. M. (2009). Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4), 607–626.
Patron-Perez, A., Marszalek, M., Reid, I., & Zisserman, A. (2012). Structured learning of human interactions in TV shows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12), 2441–2453.
Pellegrini, S., Ess, A., & Van Gool, L. (2010). Improving data association by joint modeling of pedestrian trajectories and groupings. In European conference on computer vision.
Rajagopal, A. K., Subramanian, R., Ricci, E., Vieriu, R. L., Lanz, O., & Sebe, N. (2014). Exploring transfer learning approaches for head pose classification from multi-view surveillance images. International Journal of Computer Vision, 109(1–2), 146–167.
Ricci, E., Varadarajan, J., Subramanian, R., Rota Bulo, S., Ahuja, N., & Lanz, O. (2015). Uncovering interactions and interactors: Joint estimation of head, body orientation and f-formations from surveillance videos. In International conference on computer vision (ICCV).
Robertson, N., & Reid, I. (2006). Estimating gaze direction from low-resolution faces in video. In European conference on computer vision.
Setti, F., Hung, H., & Cristani, M. (2013). Group detection in still images by F-formation modeling: A comparative study. In International workshop on image analysis for multimedia interactive services (WIAMIS).
Setti, F., Lanz, O., Ferrario, R., Murino, V., & Cristani, M. (2013). Multi-scale F-formation discovery for group detection. In International conference on image processing.
Setti, F., Russell, C., Bassetti, C., & Cristani, M. (2015). F-formation detection: Individuating free-standing conversational groups in images. PLoS ONE, 10(5), e0123783.
Smith, K., Ba, S. O., Odobez, J. M., & Gatica-Perez, D. (2008). Tracking the visual focus of attention for a varying number of wandering people. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(7), 1212–1229.
Tang, S., Andriluka, M., & Schiele, B. (2014). Detection and tracking of occluded people. International Journal of Computer Vision, 110, 58–69.
Tompson, J. J., Jain, A., LeCun, Y., & Bregler, C. (2014). Joint training of a convolutional network and a graphical model for human pose estimation. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Weinberger (Eds.), Advances in neural information processing systems (Vol. 27, pp. 1799–1807). Red Hook: Curran Associates.
Toshev, A., & Szegedy, C. (2014). Deeppose: Human pose estimation via deep neural networks. In Computer vision and pattern recognition.
Tran, K. N., Bedagkar-Gala, A., Kakadiaris, I. A., & Shah, S. K. (2013). Social cues in group formation and local interactions for collective activity analysis. In International joint conference on computer vision, imaging and computer graphics theory and applications (VISAPP).
Vascon, S., Mequanint, E. Z., Cristani, M., Hung, H., Pelillo, M., & Murino, V. (2014). A game theoretic probabilistic approach for detecting conversational groups. In Asian conference on computer vision.
Vascon, S., Mequanint, E. Z., Cristani, M., Hung, H., Pelillo, M., & Murino, V. (2016). Detecting conversational groups in images and sequences: A robust game-theoretic approach. Computer Vision and Image Understanding, 143, 11–24.
Voit, M., & Stiefelhagen, R. (2009). A system for probabilistic joint 3d head tracking and pose estimation in low-resolution, multi-view environments. In International conference on computer vision systems, pp. 415–424.
Wojek, C., Walk, S., Roth, S., & Schiele, B. (2011). Monocular 3d scene understanding with explicit occlusion reasoning. In Computer vision and pattern recognition.
Yan, S., Wang, H., Fu, Y., Yan, J., Tang, X., & Huang, T. (2009). Synchronized submanifold embedding for person-independent pose estimation and beyond. IEEE Transactions on Image Processing, 18(1), 202–210.
Yan, Y., Ricci, E., Subramanian, R., Lanz, O., & Sebe, N. (2013). No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion. In International conference on computer vision.
Yan, Y., Ricci, E., Subramanian, R., Liu, G., Lanz, O., & Sebe, N. (2016). A multi-task learning framework for head pose estimation under target motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(6), 1070–1083.
Zen, G., Lepri, B., Ricci, E., & Lanz, O. (2010). Space speaks: Towards socially and personality aware visual surveillance. In ACM multimedia workshop on multimodal pervasive video analysis.
Zhu, X. (2005). Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison.
Zhu, X., & Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1), 1–130.


Author information

Authors and Affiliations

Advanced Digital Sciences Center, Singapore, Singapore
Jagannadan Varadarajan & Narendra Ahuja
International Institute of Information Technology, Hyderabad, India
Ramanathan Subramanian
University of Glasgow, Glasgow, UK
Ramanathan Subramanian
Mapillary Research, Graz, Austria
Samuel Rota Bulò
Fondazione Bruno Kessler, Trento, Italy
Samuel Rota Bulò, Oswald Lanz & Elisa Ricci
University of Illinois Urbana Champaign, Champaign, IL, USA
Narendra Ahuja
Department of Engineering, University of Perugia, Perugia, Italy
Elisa Ricci


Corresponding author

Correspondence to Jagannadan Varadarajan.

Additional information

Communicated by Bernt Schiele.

This work is supported by the research grant for the Human-Centered Cyber-physical Systems Programme at the Advanced Digital Sciences Center from Singapore’s Agency for Science, Technology and Research (A*STAR). We thank NVIDIA for GPU donation.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1710 KB)

Appendix: Derivation of Update Rules for ${{\varvec{\varTheta }}}_{\mathtt {H}}$ and ${{\varvec{\varTheta }}}_{\mathtt {B}}$

Consider the body and head regressors defined in Sect. 3.4. The update rules for ${{\varvec{\varTheta }}}_{\mathtt {H}}$ and ${{\varvec{\varTheta }}}_{\mathtt {B}}$ that we provide in Sect. 3.5 are obtained by setting to zero the partial derivative of the objective function in (2) with respect to ${{\varvec{\varTheta }}}_\diamond $ with $\diamond \in \{{\mathtt {B}},{\mathtt {H}}\}$, and by solving the resulting equations, which are given by

$$\begin{aligned}&\frac{\partial }{\partial {{\varvec{\varTheta }}}_{\mathtt {H}}}L_{\mathtt {H}}(f_{\mathtt {H}}(\cdot ;{{\varvec{\varTheta }}}_{\mathtt {H}});\mathcal {T},\mathcal {S})+ \frac{\partial }{\partial {{\varvec{\varTheta }}}_{\mathtt {H}}}L_C(f_{\mathtt {B}}(\cdot ;{{\varvec{\varTheta }}}_{\mathtt {B}}), f_{\mathtt {H}}(\cdot ;{{\varvec{\varTheta }}}_{\mathtt {H}});\mathcal {S})=0\end{aligned} \tag{A}$$

$$\begin{aligned}&\frac{\partial }{\partial {{\varvec{\varTheta }}}_{\mathtt {B}}}L_{\mathtt {B}}(f_{\mathtt {B}}(\cdot ;{{\varvec{\varTheta }}}_{\mathtt {B}});\mathcal {T},\mathcal {S})+\frac{\partial }{\partial {{\varvec{\varTheta }}}_{\mathtt {B}}}L_C(f_{\mathtt {B}}(\cdot ;{{\varvec{\varTheta }}}_{\mathtt {B}}),f_{\mathtt {H}}(\cdot ;{{\varvec{\varTheta }}}_{\mathtt {H}});\mathcal {S}) \\&\quad +\frac{\partial }{\partial {{\varvec{\varTheta }}}_{\mathtt {B}}}L_F(f_{\mathtt {B}}(\cdot ;{{\varvec{\varTheta }}}_{\mathtt {B}}),{{\varvec{C}}};\mathcal {S})=0 \end{aligned} \tag{B}$$

where we have replaced $L_P$ in (2) with its definition in (3). The $L_\diamond $ term is given by

$$\begin{aligned} L_\diamond (f_\diamond (\cdot ;{{\varvec{\varTheta }}}_\diamond );\mathcal {T}_\diamond ,\mathcal {S})= & {} \sum _{i=1}^{N_\diamond }\Vert {{\varvec{\varTheta }}}_\diamond {{\varvec{v}}}_i^\diamond -{{\varvec{y}}}_i^\diamond \Vert ^2_{{{\varvec{M}}}}+\lambda _R\Vert {{\varvec{\varTheta }}}_\diamond \Vert _F^2\\&+\,\lambda _U\sum _{(i,j)\in \mathcal {E}_\diamond }\omega _{ij}^\diamond \Vert {{\varvec{\varTheta }}}_\diamond ({{\varvec{v}}}_i^\diamond -{{\varvec{v}}}_j^\diamond ) \Vert ^2_{{{\varvec{M}}}}, \end{aligned}$$

and its derivative with respect to ${{\varvec{\varTheta }}}_\diamond $ is

$$\begin{aligned}&\frac{\partial }{\partial {{\varvec{\varTheta }}}_\diamond }L_\diamond (f_\diamond (\cdot ;{{\varvec{\varTheta }}}_\diamond );\mathcal {T},\mathcal {S})\\&\quad =2{{\varvec{M}}}({{\varvec{\varTheta }}}_\diamond \hat{{{\varvec{X}}}}_\diamond -{{\varvec{Y}}}_\diamond )\hat{{{\varvec{X}}}}_\diamond ^\top +2\lambda _R {{\varvec{\varTheta }}}_\diamond +2\lambda _U{{\varvec{M}}}{{\varvec{\varTheta }}}_\diamond {{\varvec{V}}}_\diamond {{\varvec{L}}}_\diamond {{\varvec{V}}}_\diamond ^\top \\&\quad =2{{\varvec{M}}}{{\varvec{\varTheta }}}_\diamond (\hat{{{\varvec{X}}}}_\diamond \hat{{{\varvec{X}}}}_\diamond ^\top +\lambda _U{{\varvec{V}}}_\diamond {{\varvec{L}}}_\diamond {{\varvec{V}}}_\diamond ^\top )+2\lambda _R{{\varvec{\varTheta }}}_\diamond -2 {{\varvec{M}}}{{\varvec{Y}}}_\diamond \hat{{{\varvec{X}}}}_\diamond ^\top . \end{aligned}$$
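The closed-form gradient above can be sanity-checked numerically. The following is a minimal sketch, not the authors' code: it evaluates the three sums of $L_\diamond$ directly and compares a central finite difference against the matrix expression. It assumes that ${{\varvec{M}}}$ is symmetric, that the labelled features are the columns of `X_hat`, that each undirected edge of $\mathcal {E}_\diamond$ is counted once, and that ${{\varvec{L}}}_\diamond$ is the graph Laplacian of the weights $\omega _{ij}$. All sizes, names and values are illustrative.

```python
import numpy as np

def loss(Theta, M, X_hat, Y, V, W, lam_R, lam_U):
    """Evaluate the fit, ridge and graph-smoothness terms directly from their sums."""
    R = Theta @ X_hat - Y
    fit = np.trace(R.T @ M @ R)            # sum_i ||Theta x_i - y_i||^2_M
    ridge = lam_R * np.sum(Theta ** 2)     # lam_R * ||Theta||_F^2
    smooth = 0.0
    n = V.shape[1]
    for i in range(n):
        for j in range(i + 1, n):          # each undirected edge once (assumption)
            d = Theta @ (V[:, i] - V[:, j])
            smooth += W[i, j] * (d @ M @ d)
    return fit + ridge + lam_U * smooth

def grad(Theta, M, X_hat, Y, V, W, lam_R, lam_U):
    """Matrix form of the gradient, with L the Laplacian of the weight matrix W."""
    L = np.diag(W.sum(axis=1)) - W
    return (2 * M @ (Theta @ X_hat - Y) @ X_hat.T
            + 2 * lam_R * Theta
            + 2 * lam_U * M @ Theta @ V @ L @ V.T)

def numerical_grad(f, Theta, eps=1e-5):
    """Central finite differences, one entry of Theta at a time."""
    G = np.zeros_like(Theta)
    for idx in np.ndindex(*Theta.shape):
        step = np.zeros_like(Theta)
        step[idx] = eps
        G[idx] = (f(Theta + step) - f(Theta - step)) / (2 * eps)
    return G

# Synthetic data; every size and value here is an illustrative assumption.
rng = np.random.default_rng(0)
p, q, n_lab, n_unl = 2, 4, 6, 5
M = np.diag([1.0, 2.0])                    # symmetric positive definite metric
X_hat = rng.normal(size=(q, n_lab))
Y = rng.normal(size=(p, n_lab))
V = rng.normal(size=(q, n_unl))
W = rng.random((n_unl, n_unl))
W = (W + W.T) / 2                          # symmetric edge weights
np.fill_diagonal(W, 0.0)
Theta = rng.normal(size=(p, q))
args = (M, X_hat, Y, V, W, 0.1, 0.5)

analytic = grad(Theta, *args)
numeric = numerical_grad(lambda T: loss(T, *args), Theta)
max_err = np.abs(analytic - numeric).max()
print(max_err)
```

Because the loss is quadratic in ${{\varvec{\varTheta }}}_\diamond $, the central difference is exact up to rounding, so `max_err` should be close to machine precision relative to the gradient's magnitude.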

Term $L_C$ is given by

$$\begin{aligned} L_C(f_{\mathtt {B}}(\cdot ;{{\varvec{\varTheta }}}_{\mathtt {B}}),f_{\mathtt {H}}(\cdot ;{{\varvec{\varTheta }}}_{\mathtt {H}});\mathcal {S})=\lambda _C\sum _{k=1}^{N_K}\sum _{t=1}^{N_T}\Vert {{\varvec{\varTheta }}}_{\mathtt {B}}{{\varvec{x}}}^{\mathtt {B}}_{kt}-{{\varvec{\varTheta }}}_{\mathtt {H}}{{\varvec{x}}}^{\mathtt {H}}_{kt}\Vert ^2_{{{\varvec{M}}}}, \end{aligned}$$

and its derivative with respect to ${{\varvec{\varTheta }}}_\diamond $ is

$$\begin{aligned}&\frac{\partial }{\partial {{\varvec{\varTheta }}}_\diamond }L_C(f_{\mathtt {B}}(\cdot ; {{\varvec{\varTheta }}}_{\mathtt {B}}),f_{\mathtt {H}}(\cdot ;{{\varvec{\varTheta }}}_{\mathtt {H}});\mathcal {S})\\&\quad =2 \lambda _C{{\varvec{M}}}({{\varvec{\varTheta }}}_\diamond {{\varvec{X}}}_\diamond - {{\varvec{\varTheta }}}_\star {{\varvec{X}}}_\star ){{\varvec{X}}}_\diamond ^\top \\&\quad =2\lambda _C{{\varvec{M}}}{{\varvec{\varTheta }}}_\diamond {{\varvec{X}}}_\diamond {{\varvec{X}}}_\diamond ^\top -2\lambda _C{{\varvec{M}}}{{\varvec{\varTheta }}}_\star {{\varvec{X}}}_\star {{\varvec{X}}}_\diamond ^\top , \end{aligned}$$

where $(\diamond ,\star )\in \{({\mathtt {H}},{\mathtt {B}}),({\mathtt {B}},{\mathtt {H}})\}$.

Term $L_F$ is given by

$$\begin{aligned} L_F(f_{\mathtt {B}}(\cdot ;{{\varvec{\varTheta }}}_{\mathtt {B}}),{{\varvec{C}}};\mathcal {S})= & {} \lambda _F\sum _{k=1}^{N_K}\sum _{t=1}^{N_T}\Vert {{\varvec{c}}}_{kt}\\&-\,({{\varvec{p}}}_{kt}+D{{\varvec{A}}}{{\varvec{\varTheta }}}_{\mathtt {B}}{{\varvec{x}}}^{\mathtt {B}}_{kt} )\Vert ^2_2+\text {const}, \end{aligned}$$

where “$\text {const}$” indicates terms not depending on ${{\varvec{\varTheta }}}_{\mathtt {B}}$, and its derivative with respect to ${{\varvec{\varTheta }}}_{\mathtt {B}}$ is

$$\begin{aligned}&\frac{\partial }{\partial {{\varvec{\varTheta }}}_{\mathtt {B}}}L_F(f_{\mathtt {B}}(\cdot ;{{\varvec{\varTheta }}}_{\mathtt {B}}), {{\varvec{C}}};\mathcal {S})\\&\quad =2\lambda _FD{{\varvec{A}}}^\top (D{{\varvec{A}}}{{\varvec{\varTheta }}}_{\mathtt {B}}{{\varvec{X}}}_{\mathtt {B}}+{{\varvec{P}}}-{{\varvec{C}}}){{\varvec{X}}}_{\mathtt {B}}^\top \\&\quad =2\lambda _FD^2{{\varvec{A}}}^\top {{\varvec{A}}}{{\varvec{\varTheta }}}_{\mathtt {B}}{{\varvec{X}}}_{\mathtt {B}}{{\varvec{X}}}_{\mathtt {B}}^\top +2\lambda _F D{{\varvec{A}}} ^\top ({{\varvec{P}}}-{{\varvec{C}}}){{\varvec{X}}}_{\mathtt {B}}^\top . \end{aligned}$$

By substituting the computed gradient terms into (A), and after a few algebraic manipulations, we obtain

$$\begin{aligned} {{\varvec{M}}}{{\varvec{\varTheta }}}_{\mathtt {H}}(\hat{{{\varvec{X}}}}_{\mathtt {H}}\hat{{{\varvec{X}}}}_{\mathtt {H}}^\top +\lambda _U{{\varvec{V}}}_{\mathtt {H}}{{\varvec{L}}}_{\mathtt {H}}{{\varvec{V}}}_{\mathtt {H}}^\top +\lambda _C {{\varvec{X}}}_{\mathtt {H}}{{\varvec{X}}}_{\mathtt {H}}^\top )+\lambda _R{{\varvec{\varTheta }}}_{\mathtt {H}}-{{\varvec{F}}}_{\mathtt {H}}=0, \end{aligned}$$

and by vectorizing both sides we get

$$\begin{aligned} {{\varvec{E}}}_{\mathtt {H}}\text {vec}({{\varvec{\varTheta }}}_{\mathtt {H}})=\text {vec}({{\varvec{F}}}_{\mathtt {H}}) \qquad \implies \qquad \text {vec}({{\varvec{\varTheta }}}_{\mathtt {H}})={{\varvec{E}}}_{\mathtt {H}}^{-1} \text {vec}({{\varvec{F}}}_{\mathtt {H}}). \end{aligned}$$
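The vectorization step can be illustrated with a short sketch. It relies on the standard identity $\text {vec}({{\varvec{M}}}{{\varvec{\varTheta }}}{{\varvec{K}}})=({{\varvec{K}}}^\top \otimes {{\varvec{M}}})\text {vec}({{\varvec{\varTheta }}})$ for column-stacking $\text {vec}$. The explicit forms used below, ${{\varvec{E}}}_{\mathtt {H}}={{\varvec{K}}}_{\mathtt {H}}^\top \otimes {{\varvec{M}}}+\lambda _R{{\varvec{I}}}$ with ${{\varvec{K}}}_{\mathtt {H}}$ the bracketed matrix in the preceding equation, and ${{\varvec{F}}}_{\mathtt {H}}={{\varvec{M}}}{{\varvec{Y}}}_{\mathtt {H}}\hat{{{\varvec{X}}}}_{\mathtt {H}}^\top +\lambda _C{{\varvec{M}}}{{\varvec{\varTheta }}}_{\mathtt {B}}{{\varvec{X}}}_{\mathtt {B}}{{\varvec{X}}}_{\mathtt {H}}^\top $, are inferred here by collecting terms from the gradients above; they are assumptions of this sketch and may be written differently in Sect. 3.5. The function name, data and sizes are hypothetical.

```python
import numpy as np

def solve_theta(M, K, F, lam_R):
    """Solve M @ Theta @ K + lam_R * Theta = F for Theta.

    Uses vec(M Theta K) = (K^T kron M) vec(Theta), where vec stacks columns.
    """
    p, q = F.shape
    E = np.kron(K.T, M) + lam_R * np.eye(p * q)
    theta_vec = np.linalg.solve(E, F.reshape(-1, order="F"))
    return theta_vec.reshape((p, q), order="F")

# Synthetic stand-ins; all sizes and values are illustrative assumptions.
rng = np.random.default_rng(0)
p, q_h, q_b = 2, 5, 4                  # output dim, head and body feature dims
n_lab, n_unl, n_s = 8, 6, 7            # labelled, graph and scene sample counts
lam_R, lam_U, lam_C = 0.1, 0.5, 0.3

M = np.diag([1.0, 2.0])                # symmetric positive definite metric
X_hat = rng.normal(size=(q_h, n_lab))  # labelled head features
Y = rng.normal(size=(p, n_lab))        # labelled targets
V = rng.normal(size=(q_h, n_unl))      # features on the similarity graph
W = rng.random((n_unl, n_unl))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W         # graph Laplacian
X_H = rng.normal(size=(q_h, n_s))
X_B = rng.normal(size=(q_b, n_s))
Theta_B = rng.normal(size=(p, q_b))    # body regressor, held fixed in this step

# Terms collected from the gradients of L_H and L_C (inferred, see lead-in).
K_H = X_hat @ X_hat.T + lam_U * V @ L @ V.T + lam_C * X_H @ X_H.T
F_H = M @ Y @ X_hat.T + lam_C * M @ Theta_B @ X_B @ X_H.T

Theta_H = solve_theta(M, K_H, F_H, lam_R)
residual = M @ Theta_H @ K_H + lam_R * Theta_H - F_H
print(np.abs(residual).max())
```

The Kronecker system has $(pq)\times (pq)$ entries, so forming it explicitly is only practical for modest feature dimensions; for larger problems an iterative or Sylvester-type solver would be a natural alternative.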

By substituting the computed gradient terms into (B), and after a few algebraic manipulations, we obtain

$$\begin{aligned}&{{\varvec{M}}}{{\varvec{\varTheta }}}_{\mathtt {B}}(\hat{{{\varvec{X}}}}_{\mathtt {B}}\hat{{{\varvec{X}}}}_{\mathtt {B}}^\top +\lambda _U{{\varvec{V}}}_{\mathtt {B}}{{\varvec{L}}}_{\mathtt {B}}{{\varvec{V}}}_{\mathtt {B}}^\top + \lambda _C{{\varvec{X}}}_{\mathtt {B}}{{\varvec{X}}}_{\mathtt {B}}^\top )+\lambda _R{{\varvec{\varTheta }}}_{\mathtt {B}}\\&\quad +\,\lambda _FD^2{{\varvec{A}}}^\top {{\varvec{A}}}{{\varvec{\varTheta }}}_{\mathtt {B}}{{\varvec{X}}}_{\mathtt {B}}{{\varvec{X}}}_{\mathtt {B}}^\top -{{\varvec{G}}}=0, \end{aligned}$$

and by vectorizing both sides we get

$$\begin{aligned}&({{\varvec{E}}}_{\mathtt {B}}+\lambda _FD^2{{\varvec{X}}}_{\mathtt {B}}{{\varvec{X}}}_{\mathtt {B}}^\top \otimes {{\varvec{A}}}^\top {{\varvec{A}}})\text {vec}({{\varvec{\varTheta }}}_{\mathtt {B}}) =\text {vec}({{\varvec{G}}})\implies \\&\quad \text {vec}({{\varvec{\varTheta }}}_{\mathtt {B}})=({{\varvec{E}}}_{\mathtt {B}}+\lambda _FD^2{{\varvec{X}}}_{\mathtt {B}}{{\varvec{X}}}_{\mathtt {B}}^\top \otimes {{\varvec{A}}}^\top {{\varvec{A}}})^{-1}\text {vec}({{\varvec{G}}}). \end{aligned}$$
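The body update differs from the head update only by the extra Kronecker term contributed by $L_F$. A minimal sketch under stated assumptions follows: `K_B` and `G` are random stand-ins for the bracketed matrix and the constant term of the preceding equation, ${{\varvec{E}}}_{\mathtt {B}}$ is taken to be ${{\varvec{K}}}_{\mathtt {B}}^\top \otimes {{\varvec{M}}}+\lambda _R{{\varvec{I}}}$ by analogy with the head case (an inference, not a definition from this appendix), $D$ is treated as a scalar, and ${{\varvec{A}}}$ is a hypothetical $2\times p$ matrix mapping the regressor output to the ground plane.

```python
import numpy as np

# Synthetic stand-ins; all sizes and values are illustrative assumptions.
rng = np.random.default_rng(1)
p, q, n_s = 2, 4, 10                   # output dim, body feature dim, scene samples
lam_R, lam_F, D = 0.1, 0.7, 0.5

M = np.diag([1.0, 2.0])                # symmetric positive definite metric
A = rng.normal(size=(2, p))            # hypothetical map to ground-plane coordinates
X_B = rng.normal(size=(q, n_s))        # body features of the scene samples
S = rng.normal(size=(q, q))
K_B = S @ S.T                          # stand-in for the bracketed PSD matrix
G = rng.normal(size=(p, q))            # stand-in for the constant right-hand side

# vec(M Theta K_B) = (K_B^T kron M) vec(Theta)
E_B = np.kron(K_B.T, M) + lam_R * np.eye(p * q)
# vec(A^T A Theta X_B X_B^T) = (X_B X_B^T kron A^T A) vec(Theta)
E_F = lam_F * D ** 2 * np.kron(X_B @ X_B.T, A.T @ A)

theta_vec = np.linalg.solve(E_B + E_F, G.reshape(-1, order="F"))
Theta_B = theta_vec.reshape((p, q), order="F")

# Plug the solution back into the matrix equation that was vectorized.
residual = (M @ Theta_B @ K_B + lam_R * Theta_B
            + lam_F * D ** 2 * A.T @ A @ Theta_B @ X_B @ X_B.T - G)
print(np.abs(residual).max())
```

Since ${{\varvec{X}}}_{\mathtt {B}}{{\varvec{X}}}_{\mathtt {B}}^\top $ is symmetric, no transpose is needed in the second Kronecker product, which matches the form of the update rule above.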

About this article

Cite this article

Varadarajan, J., Subramanian, R., Bulò, S.R. et al. Joint Estimation of Human Pose and Conversational Groups from Social Scenes. Int J Comput Vis 126, 410–429 (2018). https://doi.org/10.1007/s11263-017-1026-6


Received: 15 March 2016
Accepted: 02 June 2017
Published: 14 July 2017
Issue Date: April 2018
DOI: https://doi.org/10.1007/s11263-017-1026-6
