Abstract
Onomatopoeia texts in Japanese comics, with their arbitrary shapes, diverse backgrounds, and complex layouts, are a challenging and worthwhile subject of study. On the one hand, existing mainstream text recognition methods often fail to achieve the expected results on onomatopoeia text images, because these methods do not take into account the unique characteristics of onomatopoeia words. On the other hand, a truncated text, i.e., a part of a complete onomatopoeia word that is not adjacent to the other parts on a comic page, carries no meaning on its own; the original meaning can be understood only when the truncated texts of a complete onomatopoeia word are linked together. A method named M4C-COO was previously proposed by researchers to predict such links, but it ignored the class imbalance between truncated and non-truncated texts. To address these problems, this paper devises a new recognition method that exploits the characteristics of onomatopoeia texts, introduces focal loss (FL) to predict the links, and further proposes a completely novel loss function based on focal loss (FB). Finally, experiments demonstrate the effectiveness of this work, achieving state-of-the-art performance.
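The paper's FB loss is not reproduced here, but the standard focal loss it builds on (Lin et al., "Focal loss for dense object detection", cited in the references) can be sketched for the binary case relevant to link prediction. This is an illustrative, minimal implementation, not the authors' code; the function name and default hyperparameters (gamma=2.0, alpha=0.25, the values from the original focal loss paper) are assumptions for the example.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction (illustrative sketch).

    p:     predicted probability of the positive class, in (0, 1)
    y:     ground-truth label, 1 (positive, e.g. "linked") or 0 (negative)
    gamma: focusing parameter; gamma = 0 reduces to alpha-weighted cross-entropy
    alpha: class-balance weight assigned to the positive class
    """
    # p_t is the probability assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t)^gamma factor down-weights easy, well-classified examples,
    # so abundant easy negatives (non-truncated texts) do not dominate the
    # gradient over the rare positives (truncated-text links).
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.95, 1)  # confident correct positive: small loss
hard = focal_loss(0.30, 1)  # misclassified positive: much larger loss
```

Compared with plain cross-entropy, the modulating factor shrinks the contribution of examples the model already classifies well, which is the property that makes it attractive when one class (here, truncated texts) is rare.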
References
Almazán, J., Gordo, A., Fornés, A., Valveny, E.: Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell. 36(12), 2552–2566 (2014)
Baek, J., et al.: What is wrong with scene text recognition model comparisons? Dataset and model analysis. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4715–4723. IEEE (2019)
Baek, J., Matsui, Y., Aizawa, K.: COO: comic onomatopoeia dataset for recognizing arbitrary or truncated texts. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13688, pp. 267–283. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19815-1_16
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Bookstein, F.L.: Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell. 11(6), 567–585 (1989)
Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., Zhou, S.: Focusing attention: towards accurate text recognition in natural images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5076–5084. IEEE (2017)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Du, Y., et al.: SVTR: scene text recognition with a single visual model. arXiv preprint arXiv:2205.00159 (2022)
Fang, S., Xie, H., Wang, Y., Mao, Z., Zhang, Y.: Read like humans: autonomous, bidirectional and iterative language modeling for scene text recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7098–7107. IEEE (2021)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. IEEE (2016)
Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9992–10002. IEEE (2020)
Huang, Y., Sun, Z., Jin, L., Luo, C.: EPAN: effective parts attention network for scene text recognition. Neurocomputing 376, 202–213 (2020)
Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems, vol. 28, pp. 2017–2025 (2015)
Lee, J., Park, S., Baek, J., Oh, S.J., Kim, S., Lee, H.: On recognizing texts of arbitrary shapes with 2D self-attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 546–547 (2020)
Li, H., Wang, P., Shen, C., Zhang, G.: Show, attend and read: a simple and strong baseline for irregular text recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8610–8617. AAAI (2019)
Liao, M., Pang, G., Huang, J., Hassner, T., Bai, X.: Mask TextSpotter v3: segmentation proposal network for robust scene text spotting. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 706–722. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_41
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988. IEEE (2017)
Louis, J.B., Burie, J.C.: Detection of buried complex text. Case of onomatopoeia in comics books. In: Coustaty, M., Fornés, A. (eds.) ICDAR 2023. LNCS, vol. 14193, pp. 177–191. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-41498-5_13
Louis, J.B., Burie, J.C., Revel, A.: Can deep learning approaches detect complex text? Case of onomatopoeia in comics albums. In: Rousseau, J.J., Kapralos, B. (eds.) ICPR 2022. Lecture Notes in Computer Science, vol. 13644, pp. 48–60. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-37742-6_4
Lu, N., et al.: Master: multi-aspect non-local network for scene text recognition. Pattern Recogn. 117, 107980 (2021)
Matsui, Y., et al.: Sketch-based manga retrieval using manga109 dataset. Multimedia Tools Appl. 76, 21811–21838 (2017)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28, pp. 91–99 (2015)
Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(11), 2298–2304 (2016)
Shi, B., Wang, X., Lyu, P., Yao, C., Bai, X.: Robust scene text recognition with automatic rectification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4168–4176. IEEE (2016)
Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., Bai, X.: ASTER: an attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell. 41(9), 2035–2048 (2018)
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, vol. 27, pp. 3104–3112 (2014)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008 (2017)
Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Advances in Neural Information Processing Systems, vol. 28, pp. 2692–2700 (2015)
Acknowledgements
This study is supported by the Project for Science and Technology of Inner Mongolia Autonomous Region under Grant 2019GG281, the Natural Science Foundation of Inner Mongolia Autonomous Region under Grant 2019ZD14, the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region under Grant NJYT-20-A05, the fund of supporting the reform and development of local universities (Disciplinary Construction) and construction project of “Inner Mongolia Science and Technology Achievement Transfer and Transformation Demonstration Zone, University Collaborative Innovation Base, and University Entrepreneurship Training Base” (Supercomputing Power Project: 21300-231510).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ma, J., Wei, H., Wang, Y. (2024). Recognition and Link Prediction of Onomatopoeia Texts with Arbitrary Shapes. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14806. Springer, Cham. https://doi.org/10.1007/978-3-031-70543-4_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70542-7
Online ISBN: 978-3-031-70543-4
eBook Packages: Computer Science (R0)