Multi-modal Sign Language Spotting by Multi/One-Shot Learning

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13808)

Abstract

The sign spotting task aims to identify whether, and where, an isolated sign of interest occurs in a continuous sign language video. It has recently received substantial attention as a promising tool for annotating large-scale sign language data. Previous methods draw on multiple sources of supervision to localize sign actions, but operate solely in the RGB domain. They therefore overlook the complementary nature of other modalities, namely optical flow and pose, which benefit the sign spotting task. To this end, we propose a framework that merges multiple modalities for multiple-shot supervised learning. Furthermore, we explore sign spotting in the one-shot setting, which requires fewer annotations and thus has broader applications. To evaluate our approach, we participated in the Sign Spotting Challenge organized at ECCV 2022. The competition comprises two tracks: multiple-shot supervised learning (MSSL, track 1) and one-shot learning with weak labels (OSLWL, track 2). In track 1, our method achieves an F1-score of about 0.566 and ranks 2nd. In track 2, we rank 1st with an F1-score of 0.6. These results demonstrate the effectiveness of the proposed method, and we hope our solution will provide insight for future research in the community.
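
The paper's full text is not included on this page, so the following is only a minimal sketch of the fusion idea the abstract describes: per-frame RGB, optical-flow, and pose features are projected into a shared space, fused, and classified frame by frame. Everything here (feature dimensions, the additive fusion, the BiGRU temporal model, the class count, and all names) is an assumption for illustration, not the authors' architecture.

    # Minimal PyTorch sketch of multi-modal fusion for sign spotting.
    # All dimensions and design choices are hypothetical, not from the paper.
    import torch
    import torch.nn as nn

    class MultiModalSpotter(nn.Module):
        def __init__(self, rgb_dim=1024, flow_dim=1024, pose_dim=256,
                     hidden=512, num_signs=60):
            super().__init__()
            # Project each modality into a shared embedding space.
            self.rgb_proj = nn.Linear(rgb_dim, hidden)
            self.flow_proj = nn.Linear(flow_dim, hidden)
            self.pose_proj = nn.Linear(pose_dim, hidden)
            # Model temporal context over the fused frame sequence.
            self.temporal = nn.GRU(hidden, hidden, batch_first=True,
                                   bidirectional=True)
            # Per-frame scores: num_signs classes plus a background class.
            self.head = nn.Linear(2 * hidden, num_signs + 1)

        def forward(self, rgb, flow, pose):
            # Each input: (batch, time, feature_dim) pre-extracted features.
            fused = (self.rgb_proj(rgb) + self.flow_proj(flow)
                     + self.pose_proj(pose))   # simple additive fusion
            context, _ = self.temporal(fused)  # (batch, time, 2 * hidden)
            return self.head(context)          # (batch, time, num_signs + 1)

    if __name__ == "__main__":
        model = MultiModalSpotter()
        logits = model(torch.randn(2, 64, 1024),   # RGB features
                       torch.randn(2, 64, 1024),   # optical-flow features
                       torch.randn(2, 64, 256))    # pose features
        print(logits.shape)                        # torch.Size([2, 64, 61])

For the one-shot track (OSLWL), a plausible reading, again an assumption rather than the paper's stated method, is that the same encoder embeds the isolated query sign and candidate windows in the continuous video are scored by cosine similarity against that query embedding, rather than by a fixed classification head.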

Notes

  1. https://chalearnlap.cvc.uab.cat/dataset/42/description/.

Acknowledgement

This work was supported by the National Natural Science Foundation of China under Contract U20A20183. It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC.

Author information

Corresponding authors

Correspondence to Wengang Zhou or Houqiang Li.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, L., Zhou, W., Zhao, W., Hu, H., Li, H. (2023). Multi-modal Sign Language Spotting by Multi/One-Shot Learning. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13808. Springer, Cham. https://doi.org/10.1007/978-3-031-25085-9_15

  • DOI: https://doi.org/10.1007/978-3-031-25085-9_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25084-2

  • Online ISBN: 978-3-031-25085-9

  • eBook Packages: Computer Science, Computer Science (R0)
