Abstract
To enable a natural and fluent human–robot collaboration flow, it is critical for a robot to comprehend its human peers’ ongoing actions, predict their behavior in the near future, and plan its own actions accordingly. In particular, the capability to make early predictions is important, so that the robot can foresee the precise timing of a turn-taking event and start motion planning and execution early enough to smooth the turn-taking transition. Such proactive behavior reduces the human’s waiting time, increases efficiency, and enhances naturalness in collaborative tasks. To that end, this paper presents the design and implementation of an early turn-taking prediction algorithm tailored to physical human–robot collaboration scenarios. Specifically, a robotic scrub nurse system is presented that comprehends the surgeon’s multimodal communication cues and performs turn-taking prediction. The developed algorithm was tested on a data set of simulated surgical procedures collected from a surgeon–nurse tandem. The proposed turn-taking prediction algorithm is found to be significantly superior to its algorithmic counterparts, and is more accurate than the human baseline when little partial input is given (less than 30% of the full action). After observing more information, the algorithm achieves performance comparable to humans, with an F1 score of 0.90.
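The evaluation protocol described above — querying a turn-taking predictor on progressively larger prefixes of each observed action and scoring with F1 — can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's implementation: the threshold-based predictor and the toy sequences are stand-ins for a trained multimodal model and real surgical recordings.

```python
# Sketch of early-prediction evaluation: the predictor sees only the
# first fraction of each action sequence, and F1 is computed per fraction.

def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def truncate(sequence, fraction):
    """Keep only the first `fraction` of an observed action sequence."""
    n = max(1, int(len(sequence) * fraction))
    return sequence[:n]

def predict_turn(observed, threshold=0.5):
    """Toy stand-in for a trained predictor: flags a turn-taking event
    when the mean of the observed 1-D signal crosses a threshold."""
    return 1 if sum(observed) / len(observed) > threshold else 0

# Toy data: four action sequences with ground-truth turn-taking labels.
sequences = [[0.9] * 10, [0.1] * 10, [0.8] * 10, [0.2] * 10]
labels = [1, 0, 1, 0]

# Evaluate at increasing observation ratios (10%, 30%, 50%, 100%).
for fraction in (0.1, 0.3, 0.5, 1.0):
    preds = [predict_turn(truncate(s, fraction)) for s in sequences]
    print(f"observed {fraction:.0%}: F1 = {f1_score(labels, preds):.2f}")
```

Sweeping the observation ratio in this way is what allows comparing the algorithm against a human baseline at matched amounts of partial input, as reported in the abstract.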
Acknowledgements
The authors would like to thank Dr. Rashid Mazhar and Dr. Carlos Velasquez from Hamad Medical Corporation (Qatar) for their collaboration and discussion of the project. The authors would also like to thank all the members of the ISAT lab for their inspiring discussions.
Additional information
This is one of several papers published in Autonomous Robots comprising the Special Issue on Learning for Human–Robot Collaboration.
Research supported by the NPRP award (NPRP 6-449-2-181) from the Qatar National Research Fund (a member of The Qatar Foundation). The statements made herein are solely the responsibility of the authors.
Zhou, T., Wachs, J.P. Early prediction for physical human robot collaboration in the operating room. Auton Robot 42, 977–995 (2018). https://doi.org/10.1007/s10514-017-9670-9