Abstract
Ancient printed documents are an invaluable source of knowledge, but digital uses are usually complicated by the age and the quality of the print. The Linguistic Atlas of France (ALF) maps are composed of printed phonetic words used to record how words were pronounced across the country. Those words were printed using the Rousselot-Gilliéron alphabet (an extension of the Latin alphabet), which raises character recognition problems due to the large number of diacritics. In this paper, we propose a phonetic character recognition process based on a space-filling curve approach. We propose an original method adapted to this particular dataset, able to finely classify, with more than 70% accuracy, noisy and specific characters.
1 Introduction
The historical heritage largely contributes to the culture of every country around the world. This legacy generally appears as historical documents or ancient maps in which graphical elements are often present. In this paper, we take a specific interest in graphical documents named linguistic maps. Those maps transcribe the way a language is spoken in each area and help to understand the evolution of the language over time. In our research, we consider the Linguistic Atlas of France (ALF)Footnote 1, an atlas created between 1896 and 1900, then printed and published between 1902 and 1910. The ALF is an influential dialect atlas which presents an instantaneous picture of the dialect situation of France at the end of the 19th century. It was published in 35 booklets gathered into 13 volumes, representing 1920 geolinguistic maps. The Swiss linguist Jules Gilliéron and the French businessman Edmond Edmont carried out the surveys for the ALF by travelling by rail, car and on foot through the 639 survey points of the Gallo-Romanic territory, to spread the investigations as widely as possible. Thanks to its data, homogeneously transcribed using the Rousselot-Gilliéron alphabet and published in raw form on its maps, the ALF can be regarded as a first-generation atlas: it gathers more than one million reliable lexical data points and inspired many other linguistic atlases in Europe.
The ALF maps are mainly composed of four kinds of information: names of French departments (always surrounded by a rectangle), survey point numbers (identifying a city where a survey was done), words in phonetics (pronunciation of the word written in the Rousselot-Gilliéron phonetic alphabet), and borders. An illustration of these components is given in Fig. 1. Note that each map gathers the different pronunciations of a given word on a single map. For example, Fig. 1 shows a sample of the map made for the word “balance”.
Our research aims at automatically extracting the ALF information and generating maps with selected elements (currently, this process is done manually and it takes weeks to build a single map). In a previous work [5], we proposed to separate each type of information into layers (see Fig. 1) in order to prepare the data for subsequent analysis. Building on those results, this paper focuses on the classification of characters in the phonetic layer.
2 Dataset Specifications
Edmond Edmont and Jules Gilliéron used in the ALF the phonetic notation developed and disseminated by Abbé Rousselot and Jules Gilliéron himself. The conventions that define the Rousselot-Gilliéron alphabet are recorded in the “Revue des patois gallo-romans” (no. 4, 1891, p. 5–6) and repeated in the explanatory notice that accompanies the ALF maps. This alphabet is mainly made up of the letters of the Gallo-Roman languages (like French), on which diacritics (accents and notations) may be placed to symbolize more faithfully the way of pronouncing the letter or a part of the word (lemma). There are 1920 maps in the ALF, all written uniformly with the Rousselot-Gilliéron alphabet for the transcription of phonetic words. Fig. 2 shows an example of a word transcribed into phonetics.
An inventory of the different characters used on all the maps has been made. The protocol was to insert a character as new in the inventory whenever its diacritic was different. Since diacritics can be superimposed, the number of variations of a given basic character (a, e, ...) can be large: the basic character “e”, for example, offers a range of 60 variations. Each variation is considered a character of the inventory. From this inventory, a dataset has been created, consisting of one image of each character of the inventory found in the ALF maps. We chose to extract only one image per character because finding even one representation of some specific characters, among the 1920 maps, is quite complex; finding a second representation would have required considerable effort. To date, there is no search tool within the maps, so the work has been done manually. This work brought together a collection of 251 different character images (181 vowels, 61 consonants, 9 legend symbols) (Fig. 3).
Note that 389 characters have been listed in this alphabet, but only 251 of them were found printed on the various maps. The images of this dataset were extracted directly from the maps, which also brings a lot of noise into them. Indeed, the noise is either related to the maps themselves (texture degradations such as holes, ink smudges, partially erased or slightly rotated characters) or created when scanning them (artifacts, low resolution, blur). This is why our dataset is a reduced dataset showing wide disparities in image quality (Fig. 4).
The image annotation consists in associating with each thumbnail a class (an index in the Latin alphabet) and its transcription in the Rousselot-Gilliéron alphabet. Table 1 shows a sample of the correspondence file of the dataset for the character “a”.
The transcription of accented (phonetic) characters in the correspondence file was made possible by the Symbola font, which includes almost the entire Rousselot-Gilliéron alphabet. The font is available online and is also provided with the dataset. Depending on the editor in which the font is used, a problem of diacritic overlay can sometimes be observed, but this phenomenon does not impact the transcription result.
The dataset is available online and can be accessed for freeFootnote 2. It includes images of the characters before and after pre-processing, its correspondence file (filename, class, transcription), the previously presented Symbola font, and some learning logs.
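As an illustration, reading such a correspondence file can be sketched as follows; the column names, separator and sample rows below are hypothetical, chosen only to mirror the (filename, class, transcription) triplet described above and not taken from the distributed dataset:

```python
import csv
import io

# Hypothetical sample mirroring the described correspondence file;
# the real column order and separator may differ.
SAMPLE = """filename,class,transcription
a_001.png,a,à
a_002.png,a,â
e_001.png,e,é
"""

def load_correspondence(fp):
    """Group (filename, transcription) pairs by their Latin-alphabet class."""
    classes = {}
    for row in csv.DictReader(fp):
        classes.setdefault(row["class"], []).append(
            (row["filename"], row["transcription"]))
    return classes

classes = load_correspondence(io.StringIO(SAMPLE))
```

Grouping by the Latin-alphabet class is convenient later on, since training operates per class.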
All the experiments proposed on this dataset are presented in the next sections, where we describe our first attempt to cluster those letters and to map them to their transcriptions.
3 Related Work
The principle of image classification is a fast search for object similarity within a large collection of already identified objects. In the literature, several approaches have been proposed to define this similarity.
A few of them [4, 7] match basic image properties such as colors, textures, shapes, or edges. However, the maps in our project are ancient printed documents: time degradation (variable texture and colors) and the printing tools of the period (variable shapes and edges) make it difficult to exploit this kind of map with these approaches.
Another increasingly popular solution is to use local invariant features, such as SIFT [10] or SURF [2]. In our case, the Rousselot-Gilliéron alphabet raises additional classification problems compared to Latin scripts because of its diacritics. Moreover, since our maps were printed with ink using the tools of that time, no two letters (and especially no two diacritics) are ever strictly identical across our data, which makes matching-based comparison unsuitable.
Recently, Artificial Neural Networks (ANN) have been widely used for image classification, more precisely Convolutional Neural Networks (CNN) [3], which are inspired by human vision. A CNN is composed of specific layers computing results from small parts of the image. The main drawback of this kind of technique is the complexity of training: in our case, the network would not converge efficiently because of the lack of training images and the high variety of the characters (roots and diacritics). There are methodologies based on few samples, or even zero samples, called one-shot learning; these methods rely on transfer learning or on a mapping between several representation spaces, but no existing base offers characteristics similar to the studied characters (roots and diacritics).
In contrast, we propose a new technique based on space-filling curves. This technique takes advantage of the automatic discovery of key points and of a compact representation of the class models. These features match our requirements, and we explore them in the remainder of this paper.
4 Space-Filling Curve for Characters Classification
4.1 An Overview of Space-Filling Curves
Space-filling curves (SFCs) are historically a mathematical curiosity: they are continuous non-differentiable functions. The first one was discovered by G. Peano in 1890 [13]. One year later, D. Hilbert proposed a different curve [8], commonly called the Hilbert curve. This curve is originally defined for dimension \(D=2\), but multidimensional versions are defined using the Reflected Binary Gray code (RBG). Fig. 5 shows the Hilbert space-filling curve for dimension \(D=2\) and orders \(n= 1, 2, 3\).
In this paper, \(S_n^D\) denotes the space-filling curve function transforming a D-dimensional point into an integer, called index, at order n using the multidimensional Hilbert curve; \(\bar{S}_n^D\) denotes the inverse function.
The main property of the Hilbert curve is neighborhood preservation: two D-dimensional points separated by a short distance have a high probability of being mapped by \(S_n^D\) to indices separated by a short distance, i.e.:

$$m(x_a, x_b) \le \varepsilon _D \implies m\big (S_n^D(x_a), S_n^D(x_b)\big ) \le \varepsilon _1 \qquad (1)$$

where m is a distance function, with \(\varepsilon _D\) and \(\varepsilon _1\) small. According to [6, 11], the curve that best preserves locality is the multidimensional extension of the Hilbert curve, which is why it is used in this paper. Complementary information on space-filling curves can be found in [14], with more applications in [1].
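For intuition, the 2-D mapping can be sketched with the classic bit-manipulation routine converting grid coordinates to a Hilbert index (a generic textbook algorithm, not the implementation used in this paper):

```python
def xy2d(side, x, y):
    """Map a point (x, y) on a side x side grid (side a power of two)
    to its index along the Hilbert curve of matching order."""
    d = 0
    s = side // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d
```

On a \(2^3 \times 2^3\) grid, the plane neighbors (0, 0) and (0, 1) receive the consecutive indices 0 and 1, illustrating the neighborhood preservation described above.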
Over the years, many applications have been built on SFCs, for example image storing and retrieval [15], derivative-free global optimization [9] and image encryption [12]. Here, SFCs are used in a new framework to recognize the characters of the Rousselot-Gilliéron phonetic alphabet.
4.2 Characters Image Classification
When dealing with character image classification with the proposed approach, several requirements have to be met. Firstly, images have to be pre-processed in order to standardize them: for example, changing the color scheme to black and white (binarization), deleting small noisy components, and resizing the images. Secondly, the classification technique has to be translation and rotation invariant, or at least not very sensitive to these transformations.
In this section, the SFC character image classification is explained, with a focus on the pre-processing and on the translation and rotation sensitivity.
Data Pre-processing. As mentioned in Sect. 2, images are degraded, so we propose to standardize them with the five steps presented in Fig. 6:

1. Transform the color image to black and white:
   - divide each pixel of the image by the standard deviation,
   - then set each pixel to white or black depending on whether its value is lower or higher than the mean pixel value.

2. Delete small black components to reduce noise and binarization artefacts.

3. Find the Region of Interest (ROI) as the smallest box embracing the black pixels.

4. Center the main component by adding white padding.

5. Resize the image.
As a result of this pre-processing, 64 \(\times \) 64 pixel images are obtained, with a gray-scale color scheme and the main component centered.
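Under the assumptions that the binarization threshold is the mean pixel value (dividing by a positive standard deviation does not change a mean-threshold comparison, so it is folded in) and that resizing uses nearest-neighbour sampling, the pipeline above can be sketched as follows; step 2 (small-component removal) is omitted for brevity:

```python
import numpy as np

def preprocess(img, size=64):
    """Standardize a character image following the five steps of the paper
    (small-component removal omitted in this sketch)."""
    img = np.asarray(img, dtype=float)
    # Step 1: binarize -- pixels above the mean become white (1), others black (0).
    binary = (img > img.mean()).astype(np.uint8)
    # Step 3: region of interest -- smallest box containing the black pixels.
    ys, xs = np.where(binary == 0)
    roi = binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # Step 4: center the component with white padding to a square.
    h, w = roi.shape
    side = max(h, w)
    padded = np.ones((side, side), dtype=np.uint8)
    y0, x0 = (side - h) // 2, (side - w) // 2
    padded[y0:y0 + h, x0:x0 + w] = roi
    # Step 5: resize to size x size with nearest-neighbour sampling
    # (the authors' resampling method is not specified).
    idx = (np.arange(size) * side / size).astype(int)
    return padded[np.ix_(idx, idx)]
```

The output is a binary 64 \(\times \) 64 array with the root character centered, matching the input expected by the SFC classifier.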
Training the SFC Classifier. As shown in Eq. 1, it is possible to transform a data point into an index using the SFC function. But to represent an image, we need as many indices as pixels. In this work, an image is represented by a distribution histogram whose bin lengths are not regular. These are calculated from the distribution of all indices produced by all images of a training dataset. The irregularity of the bin lengths allows the classifier to adapt to the particularities of the dataset. Once the bins are known, we create a histogram for each image and then merge all histograms of a given character into a single one using the mean. To summarize, the training part creates two linked objects: the irregular bin lengths and one histogram per character class.
Fig. 7. Training of the SFC classifier. Two objects are learned: the average histograms and the bin lengths. For each image, a vector of indices is created, then transformed into a histogram using the irregular bins. Training ends with the creation of an average histogram for each class.
As mentioned in Fig. 7, to create the bin lengths we first compute an index from every non-white pixel. The polar coordinates \((r,\theta )\) are used to ensure low rotation sensitivity. Low translation sensitivity is likewise ensured: the SFC assigns close indices to close points, so two points \(x_a\) and \(x_b\) with the same color c but slightly different polar coordinates r and \(\theta \) will have close indices. The same reasoning holds for identical polar coordinates with slightly different colors. The three-dimensional extension of the Hilbert curve, based on the Reflected Binary Gray code, is used. Our assumption is that the curve must create a bijection between the triplets \((\theta, r, c)\), the pixels and the indices: the order n is then 8, to be able to store the maximum value of c (255). The maximum index value is \(2^{nD}-1\) (16,777,215). The bin lengths are computed using a dichotomous search that creates the smallest histogram bins where the frequency of appearance is high, following Algorithm 1.

The histogram can then be interpreted as a low-dimensional representation of each image.
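A sketch of the index computation and of the dichotomous bin construction is given below. Two simplifications are assumptions of ours: a Morton (bit-interleaving) code stands in for the 3-D Hilbert mapping (both yield 24-bit indices for order \(n=8\), \(D=3\), though the Hilbert curve preserves locality better), and a simple `max_count` splitting threshold replaces Algorithm 1's exact stopping rule, which is not reproduced here:

```python
import numpy as np

def morton3(a, b, c, bits=8):
    """Interleave three 8-bit values into one 24-bit index
    (a stand-in for the 3-D Hilbert index of the paper)."""
    idx = 0
    for i in range(bits):
        idx |= ((a >> i) & 1) << (3 * i)
        idx |= ((b >> i) & 1) << (3 * i + 1)
        idx |= ((c >> i) & 1) << (3 * i + 2)
    return idx

def image_indices(img):
    """One index per non-white pixel from its quantized (theta, r, color)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.nonzero(img < 255)
    r = np.hypot(ys - cy, xs - cx)
    theta = np.arctan2(ys - cy, xs - cx)
    rq = (np.round(r / r.max() * 255).astype(int)
          if r.max() > 0 else np.zeros(len(r), int))
    tq = np.round((theta + np.pi) / (2 * np.pi) * 255).astype(int)
    cq = img[ys, xs].astype(int)
    return [morton3(t, rr, c) for t, rr, c in zip(tq, rq, cq)]

def irregular_bins(indices, lo=0, hi=2 ** 24, max_count=50):
    """Dichotomous split: halve any interval holding more than max_count
    indices, yielding narrow bins where index density is high.
    Returns the sorted lower bounds of the bins."""
    count = sum(lo <= i < hi for i in indices)
    if count <= max_count or hi - lo <= 1:
        return [lo]
    mid = (lo + hi) // 2
    return (irregular_bins(indices, lo, mid, max_count)
            + irregular_bins(indices, mid, hi, max_count))
```

Once the bin bounds are fixed, each image's index list is turned into a histogram over those bins, and the per-class mean of these histograms gives the trained centroids.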
Fig. 8. Classification of a phonetic character based on the space-filling curve. The average histograms and the bin lengths are determined by training (cf. Fig. 7); the comparison between the image histogram and the average histograms is computed with the cosine distance.
Classification of Characters. The classification process is similar to k-means, where the distribution histograms can be seen as centroids: the class of a new image is the class of the nearest centroid. Given the sparsity of the representation space (\(D \approx 300\)), the cosine distance is used to compare two histograms. Fig. 8 shows the classification process:

1. A new image is pre-processed (cf. Fig. 6).

2. The image is transformed into a histogram using the irregular bins previously computed during training (cf. Fig. 7).

3. The determined class is the one whose mean distribution histogram is nearest.
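The nearest-centroid decision under the cosine distance can be sketched as follows (the mean histograms here are toy values, not trained ones):

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; well suited to sparse histograms."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def classify(hist, mean_histograms):
    """Return the class whose mean histogram is nearest to hist."""
    return min(mean_histograms,
               key=lambda cls: cosine_distance(hist, mean_histograms[cls]))

# Toy example: two classes with clearly different index distributions.
means = {"a": np.array([5.0, 1.0, 0.0]),
         "e": np.array([0.0, 1.0, 5.0])}
```

Because the cosine distance depends only on the angle between histograms, it is insensitive to the total number of black pixels, which varies with character size and ink thickness.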
In this section, a framework able to classify noisy and low-quality images was presented. The input data are standardized to square binarized images in which the root character is centered. The SFC classification is based on an automatic selection of zones of interest using indices, reduced to histograms with variable bin lengths. The method is specialized by learning the bin lengths of these histograms; this is done on the training dataset using dichotomous separation. The method is somewhat robust to translation and rotation through the use of polar coordinates and SFCs, making the classification process well adapted to the dataset presented in Sect. 2.
4.3 Results
As described earlier, the SFC-based classifier follows a training/test principle. This requires setting aside all the characters that appear only once in the dataset, because with a single image we cannot both train and test the model. As a precaution, any class with fewer than four images is also set aside for these tests. Table 2 shows the repartition of the characters used within the dataset.
The tests were therefore carried out on a total of 209 character images after removing the small classes. To evaluate the robustness of the model, three different test runs were carried out, each with a random selection per character class: 75% of the images assigned to training and 25% to testing. The results are presented in Table 3.
Concerning the results, the proposed method correctly classifies slightly more than 70% of the characters in each experiment. These results are understandable: the standardization of the images reduces the variance of the numerous noises present in the images of the basic dataset and thus offers a certain stability, but it is not perfect either. Indeed, the standardization sometimes:

- cuts letters during binarization,

- causes a zoom effect on the character, to which the model is sensitive,

- adds or fails to add padding when a diacritic is present.
We also notice that classes with too few images give variable recognition rates, as can be expected with learning models. Table 4 illustrates this last point by showing the classification accuracy for each character in each experiment.
Results can still be improved. On the one hand, a next step could be to increase the number of images per class, either by capturing more occurrences of a character or by generating slightly different images by adding small amounts of noise to those already extracted. On the other hand, the pre-processing that standardizes images by binarizing them, removing noise and putting them at the same resolution generates defects that should be fixed, because some of them produce missing diacritics or missing sections of characters.
5 Conclusion
This paper presents an original approach for classifying images of phonetic characters. The proposed method represents characters by irregular histograms, created with a space-filling curve, that are able to differentiate characters from one another. Note that our method correctly classifies 70% of the subset of our phonetic character image dataset.
The obtained results are quite encouraging because they show that, from few and imperfect data, we are able to obtain reliable performance. An interesting strategy could be, for example, to inject these results into another algorithm focused on identifying the diacritics in an optimized way. This may ultimately provide an alternative path toward an OCR for this kind of phonetic alphabet.
However, these results will have to be compared with the related works presented in Sect. 3, and can still be improved as described in Sect. 4.3. The SFC technique presented here considers only the color and the position (polar coordinates) of a pixel to produce an index; it would be worthwhile to also use neighboring pixels to increase the quality of the characterization.
To go further, it might also be interesting to apply the SFC to images of phonetic words. This would result in groups of similar words and, by definition, isoglosses (regions where people say exactly the same thing), which are of real value to dialectologists.
Notes
- 1.
Maps dataset available at http://lig-tdcge.imag.fr/cartodialect5.
- 2.
The phonetics characters images dataset is available here: http://l3i-share.univ-lr.fr/datasets/Dataset_CharRousselotGillerion.zip.
References
Bader, M.: Space-Filling Curves. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-31046-1
Bay, H., Ess, A., Tuytelaars, T., Gool, L.V.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008). Similarity Matching in Computer Vision and Multimedia
Chang, O., Constante, P., Gordon, A., Singaña, M.: A novel deep neural network that uses space-time features for tracking and recognizing a moving object. J. Artif. Intell. Soft Comput. Res. 7(2), 125–136 (2016)
Cheng, Y.C., Chen, S.Y.: Image classification using color, texture and regions. Image Vis. Comput. 21(9), 759–776 (2003)
Drapeau, J., et al.: Extraction of ancient map contents using trees of connected components. In: Fornés, A., Lamiroy, B. (eds.) GREC 2017. LNCS, vol. 11009, pp. 115–130. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-02284-6_9
Faloutsos, C., Roseman, S.: Fractals for secondary key retrieval. In: Proceedings of the Eighth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS 1989, pp. 247–252. ACM (1989)
Fredembach, C., Schröder, M., Süsstrunk, S.: Region-Based Image Classification for Automatic Color Correction, January 2003
Hilbert, D.: Ueber die stetige Abbildung einer Line auf ein Flächenstück. Math. Ann. 38(3), 459–460 (1891)
Lera, D., Sergeyev, Y.D.: GOSH: derivative-free global optimization using multi-dimensional space-filling curves. J. Glob. Optim. 71(1), 193–211 (2018)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
Moon, B., Jagadish, H.V., Faloutsos, C., Saltz, J.H.: Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Trans. Knowl. Data Eng. 13(1), 124–141 (2001)
Murali, P., Sankaradass, V.: An efficient space filling curve based image encryption. Multimedia Tools Appl. 78(2), 2135–2156 (2018). https://doi.org/10.1007/s11042-018-6234-8
Peano, G.: Sur une courbe, qui remplit toute une aire plane. Math. Ann. 36, 157–160 (1890). https://doi.org/10.1007/BF01199438
Sagan, H.: Space-Filling Curves. Springer, New York (1994). https://doi.org/10.1007/978-1-4612-0871-6
Song, Z., Roussopoulos, N.: Using Hilbert curve in image storing and retrieving. Inf. Syst. 27(8), 523–536 (2002)
Acknowledgment
This work is carried out in the framework of the ECLATS project and supported by the French National Research Agency (ANR) under the grant number ANR-15-CE38-0002.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Owczarek, V. et al. (2020). Classification of Phonetic Characters by Space-Filling Curves. In: Bai, X., Karatzas, D., Lopresti, D. (eds) Document Analysis Systems. DAS 2020. Lecture Notes in Computer Science(), vol 12116. Springer, Cham. https://doi.org/10.1007/978-3-030-57058-3_7
Print ISBN: 978-3-030-57057-6
Online ISBN: 978-3-030-57058-3