
Adaptive Retraining of Visual Recognition-Model in Human Activity Recognition by Collaborative Humanoid Robots

  • Conference paper
  • In: Intelligent Systems and Applications (IntelliSys 2020)
  • Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1251)


Abstract

We present a vision-based activity recognition system for centrally connected humanoid robots. The robots interact with several human participants who differ in behavioral style and exhibit inter-activity variability. A cloud server provides and updates the recognition model on all robots. The server continuously fetches the new activity videos recorded by the robots, along with the corresponding recognition results and the ground truths provided by the humans interacting with the robots. The decision on when to retrain the recognition model is made by an evolving, performance-based logic. In this article, we present this adaptive recognition system with special emphasis on the partitioning logic used to divide the new videos into the training, cross-validation, and test groups of the next retraining instance. This partitioning logic is driven by the class-wise recognition inaccuracies of the existing model. We compare this approach with a probabilistic partitioning approach in which the videos are partitioned without any performance considerations.
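To make the workflow in the abstract concrete, the following minimal Python sketch illustrates one cloud-side update cycle. It is an editorial illustration only: every argument (fetch_new_samples, should_retrain, partition, retrain, deploy) is a hypothetical callable standing in for a server-side component, not part of the authors' system.

    def update_cycle(fetch_new_samples, should_retrain, partition, retrain, deploy, robots, model):
        """Hedged sketch of one cloud-side update cycle; all interfaces are assumptions."""
        # 1. Collect new activity videos, recognition results and human-provided
        #    ground truths from every connected robot.
        new_samples = [sample for robot in robots for sample in fetch_new_samples(robot)]

        # 2. Ask the evolving performance-based logic whether to retrain (cf. Note 2).
        if should_retrain(model, new_samples):
            # 3. Split the new videos into training, cross-validation and test
            #    groups, driven by class-wise inaccuracies (cf. Note 3).
            tr_group, cv_group, ts_group = partition(model, new_samples)
            model = retrain(model, tr_group, cv_group, ts_group)
            # 4. Push the updated recognition model to all robots.
            for robot in robots:
                deploy(model, robot)
        return model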


Notes

  1.

    A mechanism exists to generate a robot-specific recognition model, tailored with greater emphasis on the videos recorded by that particular robot in its environment.

  2.

    The server records meta-data on model performance across history (current/past/cumulative), class (specific/cumulative), group (Database/TR/CV/TS) and many other parameters, such as the time since the last retraining and the run-time addition or deletion of an activity class. Depending on the objective of an experiment, one of these or a combination is used as the trigger for retraining. For the experiment presented in this article, retraining was triggered whenever the current F score (TS group, class-cumulative) dropped below the cumulative F score (TS group, class-cumulative), producing a simple, class-neutral mechanism (a hedged sketch of this trigger appears after these notes).

  3.

    Class-wise recognition inaccuracies of the existing model are used for performance-based partitioning. This should not be confused with the performance measure used to trigger retraining: partitioning is part of the retraining mechanism, not the mechanism that triggers it (see the partitioning sketch after these notes).

  4.

    Nao robot (specifications): 25 DoF; two face-mounted cameras (up to 1280 × 960 resolution at 30 fps) pointing to the front and the floor; animated LEDs; speakers; 4 microphones; voice recognition over a predefined dictionary; human face identification; infrared, pressure and tactile sensors; wireless LAN; battery operation; Intel Atom 1.6 GHz processor; Linux kernel.

  5.

    In experiments other than the ones presented in this article, the initial number of classes may be a subset of the total 22 classes, with the remaining classes introduced as activities previously unknown to the recognition model.

  6.

    22 ADLs: Walk [×4] (Right ⇒)(Left ⇐)(Towards)(Away); Open door [×2] ( )(⊠); Close door [×2] ( )(□); Sit and Stand; Human enacts gestures [×6] (Clap hands)(Pick up the phone)(Pick up the glass to drink)(Thumbs up)(Wave hands)(Italian gesture); Human enacts gestures looking towards the robot [×6] (Come closer)(Go away)(Stand up)(Sit down)(Move towards my left)(Move towards my right).

  7.

    Video used by the robot for recognition (Signal): Camera: one (front-facing); Resolution: 160 × 120 × 1 (grayscale); Duration: 2 s (24 frames); Frame rate: 12 fps; Scales: 4-scale dense sampling. Outside the scope of this article, we use other signals as well.

  8.

    Video recorded by the robot (Record): Camera: one (front-facing); Resolution: 1280 × 960 × 3 (RGB); Duration: 5 s (2 s of activity, 1.5 s of pre-activity and 1.5 s of post-activity recording); Frame rate: 12 fps; Scales: 8-scale dense sampling. We store all original videos in a secondary database for future use, once robots with greater computational and memory capacity become available (a sketch of the conversion from this record format to the signal of Note 7 follows after these notes).

  9.

    The F score, also known as the F1 score or F measure, is an accuracy measure for classification results that accounts for both precision and recall. It is the harmonic mean of precision and recall, i.e. 2 × (precision × recall)/(precision + recall).
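Notes 2 and 3 describe, respectively, the class-neutral retraining trigger used in this experiment and the performance-based partitioning of new videos. The Python sketch below is only one plausible reading of those notes, not the authors' implementation; the function names, the pooled-count representation, and the exact way class-wise inaccuracy biases the training share are all assumptions.

    import random


    def f_score(tp, fp, fn):
        """F score (Note 9): harmonic mean of precision and recall, from pooled counts."""
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0


    def should_retrain(current_counts, cumulative_counts):
        """Class-neutral trigger of Note 2: retrain when the current F score on the
        TS group (class-cumulative, i.e. pooled over all classes) drops below the
        F score pooled over all evaluations so far. Counts are (tp, fp, fn) tuples."""
        return f_score(*current_counts) < f_score(*cumulative_counts)


    def performance_based_partition(new_videos, class_error, base_tr_share=0.6):
        """One plausible reading of Note 3: classes the current model recognises
        poorly contribute a larger share of their new videos to the training (TR)
        group; the remainder is split evenly between CV and TS. `class_error`
        maps each class to its current recognition inaccuracy in [0, 1]."""
        tr, cv, ts = [], [], []
        for cls, videos in new_videos.items():
            videos = list(videos)
            random.shuffle(videos)
            tr_share = min(0.9, base_tr_share + 0.3 * class_error.get(cls, 0.0))
            n_tr = round(tr_share * len(videos))
            n_cv = round((len(videos) - n_tr) / 2)
            tr += videos[:n_tr]
            cv += videos[n_tr:n_tr + n_cv]
            ts += videos[n_tr + n_cv:]
        return tr, cv, ts

The probabilistic split used as a baseline in the article would correspond to applying a fixed share to every class, i.e. ignoring class_error altogether.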
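Notes 7 and 8 specify the stored recording (1280 × 960 RGB, 5 s at 12 fps) and the lower-resolution signal actually used for recognition (160 × 120 grayscale, 24 frames covering the central 2 s). The OpenCV sketch below is a hedged illustration of that conversion, assuming the recording is an ordinary video file; the 4-scale dense sampling applied afterwards is omitted.

    import cv2


    def record_to_signal(record_path, fps=12, pre_s=1.5, activity_s=2.0):
        """Hedged sketch (not the authors' code): derive the recognition signal of
        Note 7 from the stored recording of Note 8 by skipping the 1.5 s of
        pre-activity frames, keeping the 24 activity frames, converting to
        grayscale and resizing to 160 x 120."""
        cap = cv2.VideoCapture(record_path)
        start = int(round(pre_s * fps))        # frames to skip (pre-activity)
        keep = int(round(activity_s * fps))    # 24 frames of activity
        frames, index = [], 0
        while len(frames) < keep:
            ok, frame = cap.read()
            if not ok:
                break
            if index >= start:
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                frames.append(cv2.resize(gray, (160, 120)))
            index += 1
        cap.release()
        return frames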

Abbreviations

ADLs: Activities of daily living
BOW: Bag of words
CV/cv Group: Cross-validation group
EADLs: Enhanced ADLs
GMM: Gaussian mixture model
HAR: Human activity recognition
HOF: Histogram of optical flow
HOG: Histogram of oriented gradients
HSV: Hue, saturation and value
IADLs: Instrumental ADLs
IKSVM: Intersection kernel based SVM
IP: Interest point
LSTM: Long short-term memory
MBH: Motion boundary histogram
MBHx: MBH in x orientation
MBHy: MBH in y orientation
NLP: Natural language processing
P-CS/CS: Probabilistic contribution split
P-RS/RS: Probabilistic ratio split
RNN: Recurrent neural network
STIPs: Space-time interest points
ST-LSTM: Spatio-temporal LSTM
SVM: Support vector machine
TR/tr Group: Training group
TS/ts Group: Test group


Acknowledgment

The authors would like to thank CARNOT MINES-TSN for funding this work through the ‘Robot apprenant’ project.

We are thankful to the Service Robotics Research Center at Technische Hochschule Ulm (SeRoNet project) for supporting the consolidation period of this article.

Author information

Corresponding author: Vineet Nagrath.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Nagrath, V., Hariz, M., Yacoubi, M.A.E. (2021). Adaptive Retraining of Visual Recognition-Model in Human Activity Recognition by Collaborative Humanoid Robots. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol 1251. Springer, Cham. https://doi.org/10.1007/978-3-030-55187-2_12
