CMT-Co: Contrastive Learning with Character Movement Task for Handwritten Text Recognition

  • Conference paper
  • In: Computer Vision – ACCV 2022 (ACCV 2022)

Abstract

Mainstream handwritten text recognition (HTR) approaches require large-scale labeled data to achieve satisfactory performance. Recently, contrastive learning has been introduced to perform self-supervised training on unlabeled data to improve representational capacity: it minimizes the distance between positive pairs while maximizing their distance to negative ones. Previous studies typically treat each frame, or a fixed window of frames, in a sequential feature map as a separate instance for contrastive learning. However, owing to the arbitrariness of handwriting and the diversity of word lengths, such an instance may contain information from multiple consecutive characters or from an over-segmented sub-character, which can confuse the model's perception of semantic cues. To address this issue, we design a character-level pretext task, termed the Character Movement Task, to complement word-level contrastive learning; we call the combined framework CMT-Co. The task moves characters within a word to generate artifacts and guides the model to perceive the text content by using the moving direction and distance as supervision. In addition, we customize a data augmentation strategy specifically for handwritten text, which contributes significantly to the construction of training pairs for contrastive learning. Experiments show that the proposed CMT-Co achieves competitive or even superior performance compared to previous methods on public handwriting benchmarks.
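
A minimal sketch of the word-level contrastive objective described above, written in PyTorch. The InfoNCE-style loss form, the temperature value, and the tensor shapes are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: an InfoNCE-style contrastive loss over two augmented views
# of the same batch of word images. The exact loss used by CMT-Co may differ.
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two views of the same word images."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # pairwise cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives lie on the diagonal
    # Pull the two views of each word together; push apart all other words in the batch.
    return F.cross_entropy(logits, targets)

# Usage (encoder and aug are placeholders for the backbone and the customized
# handwriting augmentation mentioned in the abstract):
# z1, z2 = encoder(aug(images)), encoder(aug(images))
# loss = info_nce_loss(z1, z2)
```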
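
The character-level pretext task can likewise be sketched as an image edit plus a classification target: cut out a roughly character-sized slice of the word image, shift it horizontally, and supervise the model with the direction and distance of the move. The slice width, distance bins, and white-background assumption below are hypothetical choices, not the paper's exact design.

```python
# Hedged sketch of the Character Movement Task: move a character-sized slice
# and encode (direction, distance) as a joint class label. All constants here
# are illustrative assumptions.
import random
import torch

DISTANCES = [4, 8, 12]  # assumed candidate move distances, in pixels

def move_character(img: torch.Tensor, slice_frac: float = 0.2):
    """img: (C, H, W) word image with a white (=1.0) background, assumed wider
    than the slice plus the largest move distance.
    Returns (edited image, class label in [0, 2 * len(DISTANCES)))."""
    _, _, w = img.shape
    slice_w = max(1, int(w * slice_frac))     # rough width of one character
    direction = random.choice([-1, 1])        # -1: move left, +1: move right
    d_idx = random.randrange(len(DISTANCES))
    shift = direction * DISTANCES[d_idx]
    # Pick a source position so the shifted slice stays inside the image.
    x0 = random.randint(max(0, -shift), w - slice_w - max(0, shift))
    patch = img[:, :, x0:x0 + slice_w].clone()
    out = img.clone()
    out[:, :, x0:x0 + slice_w] = 1.0          # blank the original location
    out[:, :, x0 + shift:x0 + shift + slice_w] = patch  # paste the moved slice
    label = (0 if direction < 0 else 1) * len(DISTANCES) + d_idx
    return out, label
```

In a full pipeline, a small classification head over the encoder's pooled features (e.g., a linear layer with 2 * len(DISTANCES) outputs trained with cross-entropy on these labels) would supply the character-level supervision alongside the word-level contrastive loss; that head is an assumed detail here.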



Acknowledgment

This research is supported in part by NSFC (Grant No. 61936003), GD-NSF (Grant Nos. 2017A030312006 and 2021A1515011870), the Zhuhai Industry Core and Key Technology Research Project (No. ZH22044702200058PJL), and the Science and Technology Foundation of Guangzhou Huangpu Development District (Grant No. 2020GH17).

Author information

Corresponding author

Correspondence to Lianwen Jin.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 447 KB)

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, X., Wang, J., Jin, L., Ren, Y., Xue, Y. (2023). CMT-Co: Contrastive Learning with Character Movement Task for Handwritten Text Recognition. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13847. Springer, Cham. https://doi.org/10.1007/978-3-031-26293-7_37

  • DOI: https://doi.org/10.1007/978-3-031-26293-7_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-26292-0

  • Online ISBN: 978-3-031-26293-7

  • eBook Packages: Computer Science, Computer Science (R0)
