Abstract
We present a vision-based activity recognition system for centrally connected humanoid robots. The robots interact with several human participants whose behavioral styles and inter-activity variability differ. A cloud server provides and updates the recognition model on all robots: it continuously fetches the new activity videos recorded by the robots, together with the corresponding recognition results and the ground-truth labels provided by the interacting humans. An evolving performance-based logic decides when to retrain the recognition model. In this article, we present this adaptive recognition system with special emphasis on the partitioning logic used to divide new videos into the training, cross-validation, and test groups of the next retraining instance. This logic is driven by the class-wise recognition inaccuracies of the existing model. We compare it with a probabilistic partitioning approach in which videos are partitioned without any performance considerations.
Notes
1. A mechanism exists to generate a robot-specific recognition model, tailored with greater emphasis on the videos recorded by that particular robot in its environment.
2. The server records metadata on model performance across history (current/past/cumulative), class (specific/cumulative) and group (Database/TR/CV/TS), along with other parameters such as the time since the last retraining and the run-time addition or deletion of an activity class. Depending on the objective of an experiment, one of these, or a combination, is used as the trigger for retraining. For the experiment presented in this article, retraining was triggered whenever the current F score (TS group, class-cumulative) dropped below the cumulative F score (TS group, class-cumulative), producing a simple class-neutral mechanism (a minimal sketch of this trigger follows these notes).
3. Class-wise recognition inaccuracies of the existing model drive the performance-based partitioning (see the partitioning sketch after these notes). This should not be confused with the performance measure used to trigger retraining: partitioning is part of the retraining mechanism, not the mechanism that triggers it.
4. Nao robot (specifications): 25 DoF; two face-mounted cameras (960p HD, maximum 1280×960 resolution at 30 fps) pointing to the front and the floor; animated LEDs; speakers; four microphones; voice recognition on a predefined dictionary; human face identification; infrared, pressure and tactile sensors; wireless LAN; battery operation; Intel Atom 1.6 GHz processor; Linux kernel.
5. In experiments other than the ones presented in this article, the initial set of classes may be a subset of the 22 total classes, with the remaining classes introduced later as activities previously unknown to the recognition model.
6. 22 ADLs: Walk [×4] (right)(left)(towards)(away); Open door [×2]; Close door [×2]; Sit and Stand; Human enacts gestures [×6] (clap hands)(pick up the phone)(pick up the glass to drink)(thumbs up)(wave hands)(Italian gesture); Human enacts gestures looking towards the robot [×6] (come closer)(go away)(stand up)(sit down)(move towards my left)(move towards my right).
7. Video used by the robot for recognition (Signal): camera: one (front-facing); resolution: 160 × 120 × 1 (grayscale); duration: 2 s (24 frames at a frame rate of 12 fps); scales: 4-scale dense sampling. Outside the scope of this article, we use other signals as well.
8. Video recorded by the robot (Record): camera: one (front-facing); resolution: 1280 × 960 × 3 (RGB); duration: 5 s (2 s of activity, 1.5 s of pre-activity and 1.5 s of post-activity recording); frame rate: 12 fps; scales: 8-scale dense sampling (both video configurations are sketched after these notes). All original videos are stored in a secondary database for future use, when robots with better computational and memory capabilities become available.
9. The F score, also known as the F1 score or F-measure, is an accuracy measure for classification results that accounts for both precision and recall. It is their harmonic mean, i.e. \(F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\) (a one-line implementation follows these notes).
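The following is a minimal sketch, not the authors' implementation, of the class-neutral retraining trigger described in note 2: retraining fires when the current TS-group F score drops below the cumulative one. How "cumulative" is computed is our assumption here (a running mean of past scores); the server could equally pool predictions across rounds before scoring.

```python
class RetrainTrigger:
    """Class-neutral trigger (note 2): retrain when current performance
    falls below the historical cumulative performance."""

    def __init__(self):
        self.history = []  # F scores recorded after each evaluation round

    def update(self, current_f: float) -> bool:
        """Record the current F score and return True if retraining should fire."""
        cumulative_f = (sum(self.history) / len(self.history)
                        if self.history else current_f)
        self.history.append(current_f)
        return current_f < cumulative_f
```

For example, after two rounds scoring 0.8, a current score of 0.7 falls below the cumulative 0.8 and triggers retraining.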
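Note 3's performance-based partitioning could look like the sketch below. The specific weighting, routing a larger share of a poorly recognized class's new videos to the training group, is our assumption for illustration; the paper's exact rule is not reproduced here, and the `base` split ratios are hypothetical.

```python
import random
from collections import defaultdict

def partition(new_videos, class_inaccuracy, base=(0.6, 0.2, 0.2)):
    """Split new videos into TR/CV/TS groups, biased by class-wise error.

    new_videos: list of (video_id, class_label) pairs.
    class_inaccuracy: dict mapping class_label -> error rate in [0, 1].
    base: default (TR, CV, TS) proportions when a class has zero error.
    """
    groups = {"TR": [], "CV": [], "TS": []}
    by_class = defaultdict(list)
    for vid, label in new_videos:
        by_class[label].append(vid)
    for label, vids in by_class.items():
        err = class_inaccuracy.get(label, 0.0)
        # Shift probability mass towards TR in proportion to the class error.
        tr = base[0] + err * (1.0 - base[0])
        cv = (1.0 - tr) * base[1] / (base[1] + base[2])
        for vid in vids:
            r = random.random()
            if r < tr:
                groups["TR"].append((vid, label))
            elif r < tr + cv:
                groups["CV"].append((vid, label))
            else:
                groups["TS"].append((vid, label))
    return groups
```

Setting `class_inaccuracy` to all zeros recovers a purely probabilistic split, which is essentially the performance-agnostic baseline the article compares against.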
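Notes 7 and 8 specify two distinct video configurations. The container below is purely illustrative; the `VideoConfig` type and its field names are our own and do not come from the authors' code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VideoConfig:
    width: int
    height: int
    channels: int
    duration_s: float
    fps: int
    dense_sampling_scales: int

SIGNAL = VideoConfig(160, 120, 1, 2.0, 12, 4)   # grayscale clip used for recognition (note 7)
RECORD = VideoConfig(1280, 960, 3, 5.0, 12, 8)  # RGB clip archived by the robot (note 8)
```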
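The F score of note 9 is a one-liner; the zero guard for the degenerate precision = recall = 0 case is a common convention we assume here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (note 9)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

assert abs(f1_score(0.5, 1.0) - 2 / 3) < 1e-9  # worked example
```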
Abbreviations
- ADLs: Activities of daily living
- BOW: Bag of words
- CV/cv group: Cross-validation group
- EADLs: Enhanced ADLs
- GMM: Gaussian mixture model
- HAR: Human activity recognition
- HOF: Histograms of optical flow
- HOG: Histograms of oriented gradients
- HSV: Hue, saturation and value
- IADLs: Instrumental ADLs
- IKSVM: Intersection kernel based SVM
- IP: Interest point
- LSTM: Long short-term memory
- MBH: Motion boundary histogram
- MBHx: MBH in x orientation
- MBHy: MBH in y orientation
- NLP: Natural language processing
- P-CS/CS: Probabilistic contribution split
- P-RS/RS: Probabilistic ratio split
- RNN: Recurrent neural network
- STIPs: Space-time interest points
- ST-LSTM: Spatio-temporal LSTM
- SVM: Support vector machine
- TR/tr group: Training group
- TS/ts group: Test group
Acknowledgment
The authors thank CARNOT MINES-TSN for funding this work through the ‘Robot apprenant’ project. We also thank the Service Robotics Research Center at Technische Hochschule Ulm (SeRoNet project) for supporting the consolidation period of this article.