Multi-modal Sign Language Spotting by Multi/One-Shot Learning

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13808)

Abstract

The sign spotting task aims to identify whether, and where, an isolated sign of interest occurs in a continuous sign language video. It has recently received substantial attention as a promising tool for annotating large-scale sign language data. Previous methods draw on multiple sources of supervision to localize sign actions, but operate solely in the RGB domain. They therefore overlook the complementary nature of other modalities, namely optical flow and pose, which benefit the sign spotting task. To this end, we propose a framework that merges multiple modalities for multiple-shot supervised learning. Furthermore, we explore sign spotting in the one-shot setting, which requires fewer annotations and thus has broader applications. To evaluate our approach, we participated in the Sign Spotting Challenge organized at ECCV 2022. The competition comprises two tracks: multiple-shot supervised learning (MSSL, track 1) and one-shot learning with weak labels (OSLWL, track 2). In track 1, our method achieves an F1-score of about 0.566 and ranks 2nd. In track 2, we rank 1st with an F1-score of 0.6. These results demonstrate the effectiveness of the proposed method, and we hope our solution will provide insight for future research in the community.
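
The paper's full text is not included on this page, so the following is only a minimal sketch of the fusion idea the abstract describes: per-frame RGB, optical-flow, and pose features are projected into a shared space, fused, and classified frame by frame. Everything here (feature dimensions, the additive fusion, the BiGRU temporal model, the class count, and all names) is an assumption for illustration, not the authors' architecture.

    # Minimal PyTorch sketch of multi-modal fusion for sign spotting.
    # All dimensions and design choices are hypothetical, not from the paper.
    import torch
    import torch.nn as nn

    class MultiModalSpotter(nn.Module):
        def __init__(self, rgb_dim=1024, flow_dim=1024, pose_dim=256,
                     hidden=512, num_signs=60):
            super().__init__()
            # Project each modality into a shared embedding space.
            self.rgb_proj = nn.Linear(rgb_dim, hidden)
            self.flow_proj = nn.Linear(flow_dim, hidden)
            self.pose_proj = nn.Linear(pose_dim, hidden)
            # Model temporal context over the fused frame sequence.
            self.temporal = nn.GRU(hidden, hidden, batch_first=True,
                                   bidirectional=True)
            # Per-frame scores: num_signs classes plus a background class.
            self.head = nn.Linear(2 * hidden, num_signs + 1)

        def forward(self, rgb, flow, pose):
            # Each input: (batch, time, feature_dim) pre-extracted features.
            fused = (self.rgb_proj(rgb) + self.flow_proj(flow)
                     + self.pose_proj(pose))   # simple additive fusion
            context, _ = self.temporal(fused)  # (batch, time, 2 * hidden)
            return self.head(context)          # (batch, time, num_signs + 1)

    if __name__ == "__main__":
        model = MultiModalSpotter()
        logits = model(torch.randn(2, 64, 1024),   # RGB features
                       torch.randn(2, 64, 1024),   # optical-flow features
                       torch.randn(2, 64, 256))    # pose features
        print(logits.shape)                        # torch.Size([2, 64, 61])

For the one-shot track (OSLWL), a plausible reading, again an assumption rather than the paper's stated method, is that the same encoder embeds the isolated query sign and candidate windows in the continuous video are scored by cosine similarity against that query embedding, rather than by a fixed classification head.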

Notes

  1. https://chalearnlap.cvc.uab.cat/dataset/42/description/.

Acknowledgement

This work was supported by the National Natural Science Foundation of China under Contract U20A20183. It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC.

Author information

Corresponding authors

Correspondence to Wengang Zhou or Houqiang Li.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, L., Zhou, W., Zhao, W., Hu, H., Li, H. (2023). Multi-modal Sign Language Spotting by Multi/One-Shot Learning. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13808. Springer, Cham. https://doi.org/10.1007/978-3-031-25085-9_15

  • DOI: https://doi.org/10.1007/978-3-031-25085-9_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25084-2

  • Online ISBN: 978-3-031-25085-9

  • eBook Packages: Computer Science, Computer Science (R0)
