Customer pose estimation using orientational spatio-temporal network from surveillance camera

  • Regular Paper
  • Multimedia Systems

Abstract

The analysis of customer pose is drawing increasing attention from retailers and researchers, because it can reveal customers’ habits and their level of interest in the merchandise. In a retail store, customers’ poses are closely related to their body orientations. For example, a customer picking an item from a merchandise shelf must face the shelf, whereas a customer whose body orientation is parallel to the shelf is probably just walking past. Based on this observation, we propose a customer pose estimation system that applies an orientational spatio-temporal deep neural network to surveillance camera footage. The system first generates initial joint heatmaps using a fully convolutional network. We then propose a set of novel orientational message-passing layers that refine these heatmaps by introducing body orientation information into conventional message-passing layers. In addition, we apply a bi-directional recurrent neural network on top of the system to improve estimation accuracy by considering both the forward and backward image sequences. The system thus jointly considers global body orientation, local joint connections, and temporal pose continuity. Finally, we conduct a series of comparison experiments to show the effectiveness of our system.
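To make two of the ideas in the abstract more concrete, the following is a minimal toy sketch in plain Python, not the authors' network. It assumes that an orientation-dependent message between two joints can be approximated by shifting the parent joint's heatmap by a hand-set offset, and that bi-directional temporal modelling can be approximated by averaging a forward and a backward running blend. The orientation labels, the offsets in `ORIENTATION_OFFSETS`, the blend weight, and all function names are hypothetical; in the paper these roles are played by learned layers.

```python
# Toy illustration only: hand-set offsets and a fixed blend weight stand in
# for the learned layers described in the abstract.

# Hypothetical expected displacement (rows, cols) from a parent joint to a
# child joint for each coarse body-orientation class.
ORIENTATION_OFFSETS = {
    "facing_shelf": (1, 0),
    "parallel_to_shelf": (0, 1),
}


def shift(heatmap, d_row, d_col):
    """Return a copy of `heatmap` shifted by (d_row, d_col), zero-padded."""
    rows, cols = len(heatmap), len(heatmap[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            src_r, src_c = r - d_row, c - d_col
            if 0 <= src_r < rows and 0 <= src_c < cols:
                out[r][c] = heatmap[src_r][src_c]
    return out


def pass_message(parent, child, orientation):
    """Refine `child` with a message from `parent` chosen by orientation."""
    d_row, d_col = ORIENTATION_OFFSETS[orientation]
    message = shift(parent, d_row, d_col)
    # Small constant keeps the child's evidence alive where the message is 0.
    refined = [[c * (m + 1e-3) for c, m in zip(c_row, m_row)]
               for c_row, m_row in zip(child, message)]
    total = sum(sum(row) for row in refined)
    return [[v / total for v in row] for row in refined]


def argmax2d(heatmap):
    """Return the (row, col) of the largest value in a 2D heatmap."""
    best = max((v, r, c) for r, row in enumerate(heatmap)
               for c, v in enumerate(row))
    return best[1], best[2]


def bidirectional_smooth(scores, weight=0.5):
    """Average a forward and a backward running blend over a sequence."""
    def run(seq):
        state, out = seq[0], []
        for x in seq:
            state = weight * state + (1.0 - weight) * x
            out.append(state)
        return out

    forward = run(scores)
    backward = run(scores[::-1])[::-1]
    return [(f + b) / 2.0 for f, b in zip(forward, backward)]


# Parent joint is confidently at the centre; the child joint is ambiguous
# between the cell below the centre and the cell to its right.
parent = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
child = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.5], [0.0, 0.5, 0.0]]
facing = argmax2d(pass_message(parent, child, "facing_shelf"))
walking = argmax2d(pass_message(parent, child, "parallel_to_shelf"))
smoothed = bidirectional_smooth([0.0, 0.0, 1.0, 0.0, 0.0])
```

In this toy setup the same ambiguous child heatmap resolves to a different location depending on the orientation label, which is the intuition behind conditioning message passing on body orientation. The smoothing function spreads a single-frame response symmetrically to earlier and later frames, which a forward-only recurrence would not do.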

Acknowledgements

The authors thank Haitao Wang, Yongjie Liu, and Qianlong Wang for their help with labeling data. The faces of customers are blurred in this paper to protect their privacy. This research was permitted by the Compliance Committee of the University of Tokyo.

Author information

Corresponding author

Correspondence to Jingwen Liu.

Additional information

Communicated by M. Cooper.

About this article

Cite this article

Liu, J., Gu, Y. & Kamijo, S. Customer pose estimation using orientational spatio-temporal network from surveillance camera. Multimedia Systems 24, 439–457 (2018). https://doi.org/10.1007/s00530-017-0570-9

  • DOI: https://doi.org/10.1007/s00530-017-0570-9
