Pattern Recognition Letters

Volume 136, August 2020, Pages 134-141

An attention-based row-column encoder-decoder model for text recognition in Japanese historical documents

https://doi.org/10.1016/j.patrec.2020.05.026

Highlights

  • Recognition of multiple text lines without segmentation.

  • Row-column encoder encodes features in both vertical and horizontal directions.

  • Residual LSTM adds context from all past attentions to the decoder.

  • State-of-the-art model for recognizing Japanese historical documents.

Abstract

This paper presents an attention-based row-column encoder-decoder (ARCED) model for recognizing an input image of multiple text lines from Japanese historical documents without explicit segmentation of lines. The recognition system has three main parts: a feature extractor, a row-column encoder, and a decoder. We introduce a row-column BLSTM in the encoder and a residual LSTM network in the decoder. The whole system is trained end-to-end with a standard cross-entropy loss function, requiring only document images and their ground-truth text. We experimentally evaluate the performance of ARCED on a dataset of Japanese historical documents, Kana-PRMU. The results of the experiments show that ARCED outperforms the state-of-the-art recognition methods on this dataset. Furthermore, we demonstrate that the row-column BLSTM in the encoder and the residual LSTM in the decoder improve the performance of the encoder-decoder model for the recognition of Japanese historical documents.

Introduction

Up to the Edo period (1603–1868), Japanese documents were written vertically with a brush or were wood-block printed. Characters, especially Kanji of Chinese origin and Kana (a set of 46 phonetic characters derived from Kanji), were deformed and cursively written, so that even experts have difficulty reading them. For this reason, the recognition of Japanese historical documents remains a challenging problem and has received much attention from numerous researchers [1], [2], [3], [4]. With the support of the Center for Open Data in the Humanities (CODH) in Japan, the technical committee on Pattern Recognition and Media Understanding (PRMU) of IEICE Japan held a contest on reading deformed Kana in 2017 [5]. The tasks are divided into three levels according to the number of characters in a circumscribed rectangle: level 1, single characters; level 2, sequences of three vertically written Kana characters; and level 3, unrestricted sets of three or more characters, possibly in multiple lines. The contest dataset, consisting of three sub-datasets for the three levels, has been published; we call it Kana-PRMU. In this contest, we proposed a combination of a pre-trained CNN and an LSTM with CTC, named the Deep Convolutional Recurrent Network (DCRN), for level 2, and the DCRN combined with a vertical line segmentation method for level 3 [6]. Here, CNN stands for Convolutional Neural Network, LSTM for Long Short-Term Memory, and CTC for Connectionist Temporal Classification. These methods won the best award with a 12.88% character error rate (CER) for level 2 and 26.70% for level 3. After the contest, we presented end-to-end trained versions of these methods, which achieved new state-of-the-art results of 10.90% CER for level 2 and 18.50% for level 3 [7].
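
For reference, CER here denotes the character error rate as commonly defined: the Levenshtein edit distance between the recognized text and the ground truth, divided by the ground-truth length. The paper does not restate the formula, so the following is only a minimal sketch of the standard metric, with function names of our own choosing:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences (one-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,             # deletion
                      d[j - 1] + 1,         # insertion
                      prev + (r != h))      # substitution (free on match)
            prev, d[j] = d[j], cur
    return d[len(hyp)]

def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

# e.g. cer("かなふみ", "かなうみ") == 0.25  (one substitution in four characters)
```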

This paper introduces an attention-based row-column encoder-decoder (ARCED) model for recognizing images of multiple text lines in Japanese historical documents. Since Japanese historical documents were written cursively through entire text lines, with neighboring lines often touching each other, a segmentation-free approach is sought. We propose a model consisting of three main parts: a feature extractor, a row-column encoder, and a decoder. Given an input image, the feature extractor extracts a feature grid from it by a CNN. The row-column encoder applies a row bidirectional LSTM (BLSTM) and a column BLSTM to encode the feature grid in the horizontal and vertical directions, respectively. The decoder applies an attention-based LSTM to generate the final target text based on the attended pertinent features. In this model, we incorporate a row-column BLSTM in the encoder to capture sequential order information in both the vertical and horizontal directions, and a residual LSTM network in the decoder to take advantage of the entire past attention information.
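
To make the row-column encoding step concrete, the sketch below is our own PyTorch approximation of the idea: a BLSTM pass over every row of the CNN feature grid, followed by a BLSTM pass over every column. The class name, the dimensions, and the sequential chaining of the two passes are assumptions on our part; the paper's exact configuration may differ.

```python
import torch.nn as nn

class RowColumnEncoder(nn.Module):
    """Encodes a CNN feature grid with a row BLSTM, then a column BLSTM."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.row_blstm = nn.LSTM(feat_dim, hidden_dim,
                                 bidirectional=True, batch_first=True)
        self.col_blstm = nn.LSTM(2 * hidden_dim, hidden_dim,
                                 bidirectional=True, batch_first=True)

    def forward(self, grid):                      # grid: (b, h, w, feat_dim)
        b, h, w, d = grid.shape
        # Horizontal pass: treat each of the b*h rows as a length-w sequence.
        rows, _ = self.row_blstm(grid.reshape(b * h, w, d))
        rows = rows.reshape(b, h, w, -1)
        # Vertical pass: treat each of the b*w columns as a length-h sequence.
        cols = rows.permute(0, 2, 1, 3).reshape(b * w, h, -1)
        cols, _ = self.col_blstm(cols)
        return cols.reshape(b, w, h, -1).permute(0, 2, 1, 3)  # (b, h, w, 2*hidden)
```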

Experiments on level 2 and level 3 of the Kana-PRMU dataset show that the ARCED model drastically reduces the error rates for single text lines (level 2) and for three or more characters possibly in multiple lines (level 3) compared with the previous methods [7]. The experiments also show that the row-column BLSTM in the encoder and the residual LSTM network in the decoder improve the performance of the attention-based encoder-decoder model for text recognition in Japanese historical documents.

The contributions of this paper are threefold. First, we present an attention-based encoder-decoder model for recognizing multiple text lines in Japanese historical documents. Second, we propose a row-column BLSTM in the encoder to encode the feature grid in both the vertical and horizontal directions. Third, we introduce a residual LSTM network in the decoder to take advantage of the entire past attention information.
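
This preview does not reproduce the residual LSTM's equations, so the following is only one plausible reading of the third contribution: stacked decoder LSTM layers with additive skip connections, letting information carried by earlier (attention) inputs bypass each layer rather than be replaced by it. All names and dimensions here are illustrative, not the authors' code.

```python
import torch.nn as nn

class ResidualLSTM(nn.Module):
    """Two stacked LSTMs with additive skip connections, so each layer
    refines rather than replaces the running summary of past inputs."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(input_dim, hidden_dim)  # match dims for the skips
        self.lstm1 = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.lstm2 = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, x):                  # x: (b, steps, input_dim)
        x = self.proj(x)
        h1, _ = self.lstm1(x)
        h2, _ = self.lstm2(h1 + x)         # skip into the second layer
        return h2 + h1                     # skip over the second layer
```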

The rest of this paper is organized as follows. Section 2 describes the related work. Section 3 presents an overview of the ARCED model. The dataset is described in Section 4 and our experimental results are reported in Section 5. Finally, conclusions are presented in Section 6.

Section snippets

Related work

Document recognition consists of two main parts: layout analysis [8] and text recognition, which we survey briefly here. For most western languages, a segmentation-free approach that applies fixed-width sliding windows to extract features has often been used [9]. However, for languages of Chinese origin, and also for some western languages, a segmentation-based approach that segments the input text into characters and fragments and merges the fragments in the recognition stage has commonly been used.

The proposed model

We propose an attention-based row-column encoder-decoder (ARCED) model consisting of three main parts: a feature extractor, a row-column encoder, and a decoder, as shown in Fig. 1. From the bottom of the ARCED, the feature extractor extracts a feature grid from the input image by a deep CNN (DCNN). Then, the row-column encoder applies a row BLSTM and a column BLSTM to encode the feature grid in the horizontal and vertical directions, respectively. At the top of ARCED, the decoder applies an attention-based LSTM to generate the final target text.
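
As an illustration of the decoding step described above, here is a simplified single-step sketch of our own: additive (Bahdanau-style) attention over the flattened encoder grid, followed by an LSTM cell that consumes the previous character and the attention context. The paper's decoder additionally routes past attention context through the residual LSTM sketched earlier; that detail is omitted here, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """One decoding step: attend over encoder features, then update the LSTM
    state and emit character logits. Names and dimensions are illustrative."""
    def __init__(self, enc_dim, emb_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attn_enc = nn.Linear(enc_dim, hidden_dim)
        self.attn_dec = nn.Linear(hidden_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)
        self.cell = nn.LSTMCell(emb_dim + enc_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, state, enc_feats):
        # enc_feats: (b, h*w, enc_dim); state = (h_t, c_t), each (b, hidden)
        h_t, c_t = state
        # Additive attention scores over all grid positions.
        scores = self.attn_score(torch.tanh(
            self.attn_enc(enc_feats) + self.attn_dec(h_t).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)          # (b, h*w, 1)
        context = (alpha * enc_feats).sum(dim=1)      # (b, enc_dim)
        x = torch.cat([self.embed(prev_token), context], dim=-1)
        h_t, c_t = self.cell(x, (h_t, c_t))
        return self.out(h_t), (h_t, c_t), alpha       # logits, state, attention map
```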

Datasets

The PRMU contest consists of three tasks as follows. Level 1: single Kana characters; level 2: sequences of three vertically written Kana characters; and level 3: three or more Kana characters, possibly in multiple lines. All tasks are to recognize Kana characters of 46 categories, while Kanji characters are excluded. The Kana-PRMU dataset is compiled from 2,222 scanned pages of 15 Japanese historical books and consists of three subsets for the three levels.

Experiments

To verify the effectiveness of each part of the ARCED model and to compare its performance with other methods, we conducted experiments on the level 2 and level 3 subsets of Kana-PRMU. The experimental details are described in Section 5.1, the results of the experiments are presented in Section 5.2, and the analysis of recognized and misrecognized samples is given in Section 5.3.

Conclusion

This paper presented an attention-based row-column encoder-decoder model named ARCED for recognizing multiple text lines of deformed Kana sequences in Japanese historical documents. We introduced the row-column BLSTM in the encoder and the residual LSTM in the decoder. In the experiments on the level 2 and level 3 subsets of the Kana-PRMU dataset, the proposed ARCED model achieved character error rates of 4.15% and 12.69% on the test sets of levels 2 and 3, respectively.

CRediT authorship contribution statement

Nam Tuan Ly: Conceptualization, Formal analysis, Data curation. Cuong Tuan Nguyen: Conceptualization, Formal analysis, Data curation, Writing - original draft. Masaki Nakagawa: Conceptualization, Formal analysis, Data curation, Writing - review & editing.

Declaration of Competing Interest

This manuscript has not been submitted to, nor is it under review at, another journal or other publishing venue.

The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.

Acknowledgments

This work was supported by Grants-in-Aid for Scientific Research (S) 18H05221 and (A) 18H03597, and partially supported by a Grant-in-Aid for Early-Career Scientists 18K18068. We would also like to thank Prof. Bipin Indurkhya for advising us on improving the presentation.

References (36)

  • H. Bunke, Recognition of cursive Roman handwriting: past, present and future.

  • Q.-F. Wang et al., Handwritten Chinese text recognition by integrating multiple contexts, IEEE Trans. Pattern Anal. Mach. Intell. (2012).

  • R. Plamondon et al., On-line and off-line handwriting recognition: a comprehensive survey, IEEE Trans. Pattern Anal. Mach. Intell. (2000).

  • B. Shi et al., An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2017).

  • A. Graves et al., A novel connectionist system for unconstrained handwriting recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2009).

  • A. Graves et al., Offline handwriting recognition with multidimensional recurrent neural networks.

  • N.-T. Ly et al., Deep convolutional recurrent network for segmentation-free offline handwritten Japanese text recognition.

  • N.T. Ly et al., Training an end-to-end model for offline handwritten Japanese text recognition by generated synthetic patterns.