Pattern Recognition Letters

Volume 136, August 2020, Pages 134-141

An attention-based row-column encoder-decoder model for text recognition in Japanese historical documents

https://doi.org/10.1016/j.patrec.2020.05.026

Highlights

  • Recognition of multiple text lines without segmentation.

  • Row-column encoder encodes features in both vertical and horizontal directions.

  • Residual LSTM adds context from all past attentions to the decoder.

  • State-of-the-art model for recognizing Japanese historical documents.

Abstract

This paper presents an attention-based row-column encoder-decoder (ARCED) model for recognizing an input image of multiple text lines from Japanese historical documents without explicit segmentation of lines. The recognition system has three main parts: a feature extractor, a row-column encoder, and a decoder. We introduce a row-column BLSTM in the encoder and a residual LSTM network in the decoder. The whole system is trained end-to-end with a standard cross-entropy loss function, requiring only document images and their ground-truth text. We experimentally evaluate the performance of ARCED on a dataset of Japanese historical documents, Kana-PRMU. The results of the experiments show that ARCED outperforms the state-of-the-art recognition methods on this dataset. Furthermore, we demonstrate that the row-column BLSTM in the encoder and the residual LSTM in the decoder improve the performance of the encoder-decoder model for the recognition of Japanese historical documents.

Introduction

Up to the Edo period (1603–1868), Japanese documents were written vertically with a brush or were wood-block printed. Characters, especially Kanji of Chinese origin and Kana (a set of 46 phonetic characters derived from Kanji), were deformed and cursively written, so that even experts have difficulty reading them. For this reason, the recognition of Japanese historical documents remains a challenging problem and has received much attention from numerous researchers [1], [2], [3], [4]. With the support of the Center for Open Data in the Humanities (CODH) in Japan, the technical committee on Pattern Recognition and Media Understanding (PRMU) of IEICE Japan held a contest on reading deformed Kana in 2017 [5]. The tasks are divided into three levels according to the number of characters in a circumscribed rectangle: level 1, single characters; level 2, sequences of three vertically written Kana characters; and level 3, unrestricted sets of three or more characters, possibly in multiple lines. The contest dataset, consisting of three sub-datasets for the three levels, has been published; we call it Kana-PRMU. In this contest, we proposed a combination of a pre-trained CNN and an LSTM with CTC, named the Deep Convolutional Recurrent Network (DCRN), for level 2, and the DCRN combined with a vertical line segmentation method for level 3 [6]. Here, CNN stands for Convolutional Neural Network, LSTM for Long Short-Term Memory, and CTC for Connectionist Temporal Classification. These methods won the best award with a 12.88% character error rate (CER) for level 2 and 26.70% for level 3. After the contest, we presented end-to-end trained versions of these methods, which achieved new state-of-the-art results of 10.90% CER for level 2 and 18.50% for level 3 [7].
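
For reference, CER here denotes the character error rate as commonly defined: the Levenshtein edit distance between the recognized text and the ground truth, divided by the ground-truth length. The paper does not restate the formula, so the following is only a minimal sketch of the standard metric, with function names of our own choosing:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences (one-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,             # deletion
                      d[j - 1] + 1,         # insertion
                      prev + (r != h))      # substitution (free on match)
            prev, d[j] = d[j], cur
    return d[len(hyp)]

def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

# e.g. cer("かなふみ", "かなうみ") == 0.25  (one substitution in four characters)
```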

This paper introduces an attention-based row-column encoder-decoder (ARCED) model for recognizing images of multiple text lines in Japanese historical documents. Since Japanese historical documents were written cursively through entire text lines, with neighboring lines often touching each other, a segmentation-free approach is sought. We propose a model consisting of three main parts: a feature extractor, a row-column encoder, and a decoder. Given an input image, the feature extractor extracts a feature grid from it by a CNN. The row-column encoder applies a row bidirectional LSTM (BLSTM) and a column BLSTM to encode the feature grid in the horizontal and vertical directions, respectively. The decoder applies an attention-based LSTM to generate the final target text based on the attended pertinent features. In this model, we incorporate a row-column BLSTM in the encoder to capture sequential order information in both the vertical and horizontal directions, and a residual LSTM network in the decoder to take advantage of the entire past attention information.
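
To make the row-column encoding step concrete, the sketch below is our own PyTorch approximation of the idea: a BLSTM pass over every row of the CNN feature grid, followed by a BLSTM pass over every column. The class name, the dimensions, and the sequential chaining of the two passes are assumptions on our part; the paper's exact configuration may differ.

```python
import torch.nn as nn

class RowColumnEncoder(nn.Module):
    """Encodes a CNN feature grid with a row BLSTM, then a column BLSTM."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.row_blstm = nn.LSTM(feat_dim, hidden_dim,
                                 bidirectional=True, batch_first=True)
        self.col_blstm = nn.LSTM(2 * hidden_dim, hidden_dim,
                                 bidirectional=True, batch_first=True)

    def forward(self, grid):                      # grid: (b, h, w, feat_dim)
        b, h, w, d = grid.shape
        # Horizontal pass: treat each of the b*h rows as a length-w sequence.
        rows, _ = self.row_blstm(grid.reshape(b * h, w, d))
        rows = rows.reshape(b, h, w, -1)
        # Vertical pass: treat each of the b*w columns as a length-h sequence.
        cols = rows.permute(0, 2, 1, 3).reshape(b * w, h, -1)
        cols, _ = self.col_blstm(cols)
        return cols.reshape(b, w, h, -1).permute(0, 2, 1, 3)  # (b, h, w, 2*hidden)
```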

Experiments on level 2 and level 3 of the Kana-PRMU dataset show that the ARCED model drastically reduces the error rates for single text lines (level 2) and for three or more characters possibly in multiple lines (level 3) compared with the previous methods [7]. The experiments also show that the row-column BLSTM in the encoder and the residual LSTM network in the decoder improve the performance of the attention-based encoder-decoder model for text recognition in Japanese historical documents.

The contributions of this paper are threefold. First, we present an attention-based encoder-decoder model for recognizing multiple text lines in Japanese historical documents. Second, we propose a row-column BLSTM in the encoder to encode the feature grid in both the vertical and horizontal directions. Third, we introduce a residual LSTM network in the decoder to take advantage of the entire past attention information.
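
This preview does not reproduce the residual LSTM's equations, so the following is only one plausible reading of the third contribution: stacked decoder LSTM layers with additive skip connections, letting information carried by earlier (attention) inputs bypass each layer rather than be replaced by it. All names and dimensions here are illustrative, not the authors' code.

```python
import torch.nn as nn

class ResidualLSTM(nn.Module):
    """Two stacked LSTMs with additive skip connections, so each layer
    refines rather than replaces the running summary of past inputs."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(input_dim, hidden_dim)  # match dims for the skips
        self.lstm1 = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.lstm2 = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, x):                  # x: (b, steps, input_dim)
        x = self.proj(x)
        h1, _ = self.lstm1(x)
        h2, _ = self.lstm2(h1 + x)         # skip into the second layer
        return h2 + h1                     # skip over the second layer
```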

The rest of this paper is organized as follows. Section 2 describes the related work. Section 3 presents an overview of the ARCED model. The dataset is described in Section 4 and our experimental results are reported in Section 5. Finally, conclusions are presented in Section 6.

Section snippets

Related work

Document recognition consists of two main parts: layout analysis [8] and text recognition, which we survey briefly here. For most western languages, a segmentation-free approach that applies fixed-width sliding windows to extract features has often been used [9]. However, for languages of Chinese origin, and also for some western languages, a segmentation-based approach that segments the input text into characters and fragments and merges the fragments in the recognition stage has commonly been used.

The proposed model

We propose an attention-based row-column encoder-decoder (ARCED) model consisting of three main parts: a feature extractor, a row-column encoder, and a decoder, as shown in Fig. 1. From the bottom of the ARCED, the feature extractor extracts a feature grid from the input image by a deep CNN (DCNN). Then, the row-column encoder applies a row BLSTM and a column BLSTM to encode the feature grid in the horizontal and vertical directions, respectively. At the top of ARCED, the decoder applies an attention-based LSTM to generate the final target text.
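
As an illustration of the decoding step described above, here is a simplified single-step sketch of our own: additive (Bahdanau-style) attention over the flattened encoder grid, followed by an LSTM cell that consumes the previous character and the attention context. The paper's decoder additionally routes past attention context through the residual LSTM sketched earlier; that detail is omitted here, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """One decoding step: attend over encoder features, then update the LSTM
    state and emit character logits. Names and dimensions are illustrative."""
    def __init__(self, enc_dim, emb_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attn_enc = nn.Linear(enc_dim, hidden_dim)
        self.attn_dec = nn.Linear(hidden_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)
        self.cell = nn.LSTMCell(emb_dim + enc_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, state, enc_feats):
        # enc_feats: (b, h*w, enc_dim); state = (h_t, c_t), each (b, hidden)
        h_t, c_t = state
        # Additive attention scores over all grid positions.
        scores = self.attn_score(torch.tanh(
            self.attn_enc(enc_feats) + self.attn_dec(h_t).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)          # (b, h*w, 1)
        context = (alpha * enc_feats).sum(dim=1)      # (b, enc_dim)
        x = torch.cat([self.embed(prev_token), context], dim=-1)
        h_t, c_t = self.cell(x, (h_t, c_t))
        return self.out(h_t), (h_t, c_t), alpha       # logits, state, attention map
```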

Datasets

The PRMU contest consists of three tasks as follows. Level 1: single Kana characters; level 2: sequences of three vertically written Kana characters; and level 3: three or more Kana characters, possibly in multiple lines. All tasks are to recognize Kana characters of 46 categories, while Kanji characters are excluded. The Kana-PRMU dataset is compiled from 2,222 scanned pages of 15 Japanese historical books and consists of three subsets for the three levels.

Experiments

To verify the effectiveness of each part of the ARCED model and to compare its performance with other methods, we conducted experiments on the level 2 and level 3 subsets of Kana-PRMU. The experimental details are described in Section 5.1, the results of the experiments are presented in Section 5.2, and the analysis of recognized and misrecognized samples is given in Section 5.3.

Conclusion

This paper presented an attention-based row-column encoder-decoder model named ARCED for recognizing multiple text lines of deformed Kana sequences in Japanese historical documents. We introduced the row-column BLSTM in the encoder and the residual LSTM in the decoder. In the experiments on the level 2 and level 3 subsets of the Kana-PRMU dataset, the proposed ARCED model achieved character error rates of 4.15% and 12.69% on the test sets of levels 2 and 3, respectively.

CRediT authorship contribution statement

Nam Tuan Ly: Conceptualization, Formal analysis, Data curation. Cuong Tuan Nguyen: Conceptualization, Formal analysis, Data curation, Writing - original draft. Masaki Nakagawa: Conceptualization, Formal analysis, Data curation, Writing - review & editing.

Declaration of Competing Interest

This manuscript has not been submitted to, nor is it under review at, another journal or other publishing venue.

The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.

Acknowledgments

This work was supported by Grants-in-Aid for Scientific Research (S) 18H05221 and (A) 18H03597, and partially supported by a Grant-in-Aid for Early-Career Scientists 18K18068. We would also like to thank Prof. Bipin Indurkhya for advising us on improving the presentation.

References (36)

  • H. Bunke, Recognition of cursive Roman handwriting: past, present and future.

  • Q.-F. Wang et al., Handwritten Chinese text recognition by integrating multiple contexts, IEEE Trans. Pattern Anal. Mach. Intell. (2012).

  • R. Plamondon et al., On-line and off-line handwriting recognition: a comprehensive survey, IEEE Trans. Pattern Anal. Mach. Intell. (2000).

  • B. Shi et al., An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2017).

  • A. Graves et al., A novel connectionist system for unconstrained handwriting recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2009).

  • A. Graves et al., Offline handwriting recognition with multidimensional recurrent neural networks.

  • N.-T. Ly et al., Deep convolutional recurrent network for segmentation-free offline handwritten Japanese text recognition.

  • N.T. Ly et al., Training an end-to-end model for offline handwritten Japanese text recognition by generated synthetic patterns.