Abstract
Mainstream handwritten text recognition (HTR) approaches require large-scale labeled data to achieve satisfactory performance. Recently, contrastive learning has been introduced to perform self-supervised training on unlabeled data and improve representational capacity: it minimizes the distance between positive pairs while maximizing their distance to negative ones. Previous studies typically treat each frame, or a fixed window of frames, in a sequential feature map as a separate instance for contrastive learning. However, owing to the arbitrariness of handwriting and the diversity of word lengths, such an instance may span multiple consecutive characters or cover only an over-segmented fragment of a character, which can confuse the model's perception of semantic clues. To address this issue, we design a character-level pretext task, termed the Character Movement Task, to assist word-level contrastive learning; we call the resulting framework CMT-Co. The task moves characters within a word image to generate artifacts and guides the model to perceive the text content by using the movement direction and distance as supervision. In addition, we customize a data augmentation strategy specifically for handwritten text, which contributes significantly to constructing training pairs for contrastive learning. Experiments show that the proposed CMT-Co achieves competitive or even superior performance compared with previous methods on public handwriting benchmarks.
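To make the two training signals concrete, the sketch below illustrates, in PyTorch-style Python, (i) a standard InfoNCE-style contrastive loss of the kind the abstract describes and (ii) one plausible way to generate a character-movement sample with its direction-and-distance supervision. This is a minimal sketch under our own assumptions: the function names (`info_nce_loss`, `make_cmt_sample`), the use of character column spans, the zero-fill of the vacated region, and the distance binning are hypothetical and do not come from the authors' implementation.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(z1, z2, temperature=0.1):
    """Word-level contrastive loss: pull the two augmented views of the
    same word image together and push apart views of different words in
    the batch (positives lie on the diagonal of the similarity matrix)."""
    z1 = F.normalize(z1, dim=1)               # (B, D) view-1 embeddings
    z2 = F.normalize(z2, dim=1)               # (B, D) view-2 embeddings
    logits = z1 @ z2.t() / temperature        # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)


def make_cmt_sample(word_img, char_box, max_shift=8, num_bins=4):
    """Shift one character strip horizontally to create an artifact image,
    returning the image plus (direction, distance-bin) labels for the
    pretext head. `word_img` is a (C, H, W) tensor and `char_box = (x0, x1)`
    is the column span of the chosen character; the vacated region is
    zero-filled (an assumption of this sketch)."""
    x0, x1 = char_box
    width = x1 - x0
    shift = int(torch.randint(1, max_shift + 1, (1,)))
    direction = int(torch.randint(0, 2, (1,)))     # 0 = left, 1 = right
    signed = shift if direction == 1 else -shift

    moved = word_img.clone()
    moved[..., x0:x1] = 0                          # erase original position
    new_x0 = max(0, min(word_img.shape[-1] - width, x0 + signed))
    moved[..., new_x0:new_x0 + width] = word_img[..., x0:x1]

    # Quantize the shift magnitude into `num_bins` distance classes.
    dist_bin = min(num_bins - 1, (shift - 1) * num_bins // max_shift)
    return moved, direction, dist_bin
```

In CMT-Co these two signals are combined during self-supervised pre-training; the exact shift granularity, the class discretization, and how character spans are obtained are design choices detailed in the paper, so the sketch above should be read only as an outline of the idea.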
Acknowledgment
This research is supported in part by NSFC (Grant No. 61936003), GD-NSF (Grant Nos. 2017A030312006 and 2021A1515011870), the Zhuhai Industry Core and Key Technology Research Project (Grant No. ZH22044702200058PJL), and the Science and Technology Foundation of Guangzhou Huangpu Development District (Grant No. 2020GH17).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, X., Wang, J., Jin, L., Ren, Y., Xue, Y. (2023). CMT-Co: Contrastive Learning with Character Movement Task for Handwritten Text Recognition. In: Wang, L., Gall, J., Chin, T.-J., Sato, I., Chellappa, R. (eds.) Computer Vision – ACCV 2022. Lecture Notes in Computer Science, vol. 13847. Springer, Cham. https://doi.org/10.1007/978-3-031-26293-7_37
DOI: https://doi.org/10.1007/978-3-031-26293-7_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26292-0
Online ISBN: 978-3-031-26293-7