Abstract
Ancient printed documents are an invaluable source of knowledge, but digital uses are usually complicated by the age and the quality of the print. The Linguistic Atlas of France (ALF) maps are composed of printed phonetic words used to record how words were pronounced across the country. Those words were printed using the Rousselot-Gilliéron alphabet (an extension of the Latin alphabet), which raises character recognition problems due to the large number of diacritics. In this paper, we propose a phonetic character recognition process based on a space-filling curve approach. We propose an original method adapted to this particular dataset, able to finely classify, with more than 70% accuracy, noisy and specific characters.
1 Introduction
The historical heritage largely contributes to the culture of every country around the world. This legacy generally appears as historical documents or ancient maps in which graphical elements are often present. In this paper, we take a specific interest in graphical documents named linguistic maps. Those maps transcribe the way a language is spoken in each area and help to understand the evolution of the language over time. In our research, we consider the Linguistic Atlas of France (ALF)Footnote 1, an atlas created between 1896 and 1900, then printed and published between 1902 and 1910. The ALF is an influential dialect atlas which presents an instantaneous picture of the dialect situation of France at the end of the 19th century. It was published in 35 booklets gathered into 13 volumes, representing 1920 geolinguistic maps. The Swiss linguist Jules Gilliéron and the French businessman Edmond Edmont carried out the surveys for the ALF by travelling by rail, car and on foot through the 639 survey points of the Gallo-Romanic territory, to spread the investigations as widely as possible. Thanks to its data, homogeneously transcribed using the Rousselot-Gilliéron alphabet and published in raw form on its maps, the ALF can be regarded as a first-generation atlas: it gathers more than one million reliable lexical data points and inspired many other linguistic atlases in Europe.
The ALF maps are mainly composed of four kinds of information: names of French departments (always surrounded by a rectangle), survey point numbers (identifying a city where a survey was done), words in phonetics (pronunciation of the word written in the Rousselot-Gilliéron phonetic alphabet), and borders. An illustration of these components is given in Fig. 1. Note that each map gathers the different pronunciations of a given word on a single map. For example, Fig. 1 shows a sample of the map made for the word “balance”.
Our research aims at automatically extracting the ALF information and generating maps with selected elements (currently, this process is done manually and it takes weeks to build a single map). In a previous work [5], we proposed to separate each type of information into layers (see Fig. 1) in order to prepare the data for subsequent analysis. Building on those results, this paper focuses on the classification of characters in the phonetic layer.
2 Dataset Specifications
Edmond Edmont and Jules Gilliéron used in the ALF the phonetic notation developed and disseminated by Abbé Rousselot and Jules Gilliéron himself. The conventions that define the Rousselot-Gilliéron alphabet are recorded in the “Revue des patois gallo-romans” (no. 4, 1891, p. 5–6) and repeated in the explanatory notice that accompanies the ALF maps. This alphabet is mainly made up of the letters of the Gallo-Roman languages (like French), on which diacritics (accents and notations) may be placed to symbolize more faithfully the way of pronouncing the letter or a part of the word (lemma). There are 1920 maps in the ALF, all written uniformly with the Rousselot-Gilliéron alphabet for the transcription of phonetic words. Fig. 2 shows an example of a word transcribed into phonetics.
An inventory of the different characters used on all the maps has been made. The protocol was to insert a character as new in the inventory whenever its diacritic was different. Since diacritics can be superimposed, the number of variations of a given basic character (a, e, ...) can be large: the basic character “e”, for example, offers a range of 60 variations. Each variation is considered a character of the inventory. From this inventory, a dataset has been created, consisting of one image of each character of the inventory found in the ALF maps. We chose to extract only one image per character because finding even one representation of some specific characters, among the 1920 maps, is quite complex; finding a second representation would have required considerable effort. To date, there is no search tool within the maps, so the work has been done manually. This work brought together a collection of 251 different character images (181 vowels, 61 consonants, 9 legend symbols) (Fig. 3).
Note that 389 characters have been listed in this alphabet, but only 251 of them were found printed on the various maps. The images of this dataset were extracted directly from the maps, which also brings a lot of noise into them. Indeed, the noise is either related to the maps themselves (texture degradations such as holes, ink smudges, partially erased or slightly rotated characters) or created when scanning them (artifacts, low resolution, blur). This is why our dataset is a reduced dataset showing wide disparities in image quality (Fig. 4).
The image annotation consists in associating with each thumbnail a class (an index in the Latin alphabet) and its transcription in the Rousselot-Gilliéron alphabet. Table 1 shows a sample of the correspondence file of the dataset for the character “a”.
The transcription of accented (phonetic) characters in the correspondence file was made possible by the Symbola font, which includes almost the entire Rousselot-Gilliéron alphabet. The font is available online and is also provided with the dataset. Depending on the editor in which the font is used, a problem of diacritic overlay can sometimes be observed, but this phenomenon does not impact the transcription result.
The dataset is available online and can be accessed for freeFootnote 2. It includes images of the characters before and after pre-processing, its correspondence file (filename, class, transcription), the previously presented Symbola font, and some learning logs.
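As an illustration, reading such a correspondence file can be sketched as follows; the column names, separator and sample rows below are hypothetical, chosen only to mirror the (filename, class, transcription) triplet described above and not taken from the distributed dataset:

```python
import csv
import io

# Hypothetical sample mirroring the described correspondence file;
# the real column order and separator may differ.
SAMPLE = """filename,class,transcription
a_001.png,a,à
a_002.png,a,â
e_001.png,e,é
"""

def load_correspondence(fp):
    """Group (filename, transcription) pairs by their Latin-alphabet class."""
    classes = {}
    for row in csv.DictReader(fp):
        classes.setdefault(row["class"], []).append(
            (row["filename"], row["transcription"]))
    return classes

classes = load_correspondence(io.StringIO(SAMPLE))
```

Grouping by the Latin-alphabet class is convenient later on, since training operates per class.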
All the experiments proposed on this dataset are presented in the next sections, where we describe our first attempt to cluster those letters and to map them to their transcriptions.
3 Related Work
The principle of image classification is a fast search for object similarity within a large collection of already identified objects. In the literature, several approaches have been proposed to define this similarity.
A few of them [4, 7] match basic image properties such as colors, textures, shapes, or edges. However, the maps in our project are ancient printed documents: time degradation (variable texture and colors) and the printing tools of the period (variable shapes and edges) make it difficult to exploit this kind of map with these approaches.
Another increasingly popular solution is to use local invariant features, such as SIFT [10] or SURF [2]. In our case, the Rousselot-Gilliéron alphabet raises additional classification problems compared to Latin scripts because of its diacritics. Moreover, since our maps were printed with ink using the tools of that time, no two letters (and especially no two diacritics) are ever strictly identical across our data, which makes matching-based comparison unsuitable.
Recently, Artificial Neural Networks (ANN) have been widely used for image classification, more precisely Convolutional Neural Networks (CNN) [3], which are inspired by human vision. A CNN is composed of specific layers computing results from small parts of the image. The main drawback of this kind of technique is the complexity of training: in our case, the network would not converge efficiently because of the lack of training images and the high variety of the characters (roots and diacritics). There are methodologies based on few samples, or even zero samples, called one-shot learning; these methods rely on transfer learning or on a mapping between several representation spaces, but no existing base offers characteristics similar to the studied characters (roots and diacritics).
In contrast, we propose a new technique based on space-filling curves. This technique takes advantage of the automatic discovery of key points and of a compact representation of the class models. These features match our requirements, and we explore them in the remainder of this paper.
4 Space-Filling Curve for Characters Classification
4.1 An Overview of Space-Filling Curves
Space-filling curves (SFCs) are historically a mathematical curiosity: they are continuous non-differentiable functions. The first one was discovered by G. Peano in 1890 [13]. One year later, D. Hilbert proposed a different curve [8], commonly called the Hilbert curve. This curve is originally defined for dimension \(D=2\), but multidimensional versions are defined using the Reflected Binary Gray code (RBG). Fig. 5 shows the Hilbert space-filling curve for dimension \(D=2\) and orders \(n= 1, 2, 3\).
In this paper, \(S_n^D\) denotes the space-filling curve function transforming a D-dimensional point into an integer, called index, at order n using the multidimensional Hilbert curve; \(\bar{S}_n^D\) denotes the inverse function.
The main property of the Hilbert curve is neighborhood preservation: two D-dimensional points separated by a short distance have a high probability of being mapped by \(S_n^D\) to indices separated by a short distance, i.e.:

$$m(x_a, x_b) \le \varepsilon _D \implies m\big (S_n^D(x_a), S_n^D(x_b)\big ) \le \varepsilon _1 \qquad (1)$$

where m is a distance function, with \(\varepsilon _D\) and \(\varepsilon _1\) small. According to [6, 11], the curve that best preserves locality is the multidimensional extension of the Hilbert curve, which is why it is used in this paper. Complementary information on space-filling curves can be found in [14], with more applications in [1].
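For intuition, the 2-D mapping can be sketched with the classic bit-manipulation routine converting grid coordinates to a Hilbert index (a generic textbook algorithm, not the implementation used in this paper):

```python
def xy2d(side, x, y):
    """Map a point (x, y) on a side x side grid (side a power of two)
    to its index along the Hilbert curve of matching order."""
    d = 0
    s = side // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d
```

On a \(2^3 \times 2^3\) grid, the plane neighbors (0, 0) and (0, 1) receive the consecutive indices 0 and 1, illustrating the neighborhood preservation described above.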
Over the years, many applications have been built on SFCs, for example image storing and retrieval [15], derivative-free global optimization [9] and image encryption [12]. Here, SFCs are used in a new framework to recognize the characters of the Rousselot-Gilliéron phonetic alphabet.
4.2 Characters Image Classification
When dealing with character image classification with the proposed approach, several requirements have to be met. Firstly, images have to be pre-processed in order to standardize them: for example, changing the color scheme to black and white (binarization), deleting small noisy components, and resizing the images. Secondly, the classification technique has to be translation and rotation invariant, or at least not very sensitive to these transformations.
In this section, the SFC character image classification is explained, with a focus on the pre-processing and on the translation and rotation sensitivity.
Data Pre-processing. As mentioned in Sect. 2, images are degraded, so we propose to standardize them with the five steps presented in Fig. 6:

1. Transform the color image to black and white:
   - divide each pixel of the image by the standard deviation,
   - then set each pixel to white or black depending on whether its value is lower or higher than the mean pixel value.

2. Delete small black components to reduce noise and binarization artefacts.

3. Find the Region of Interest (ROI) as the smallest box embracing the black pixels.

4. Center the main component by adding white padding.

5. Resize the image.
As a result of this pre-processing, 64 \(\times \) 64 pixel images are obtained, with a gray-scale color scheme and the main component centered.
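Under the assumptions that the binarization threshold is the mean pixel value (dividing by a positive standard deviation does not change a mean-threshold comparison, so it is folded in) and that resizing uses nearest-neighbour sampling, the pipeline above can be sketched as follows; step 2 (small-component removal) is omitted for brevity:

```python
import numpy as np

def preprocess(img, size=64):
    """Standardize a character image following the five steps of the paper
    (small-component removal omitted in this sketch)."""
    img = np.asarray(img, dtype=float)
    # Step 1: binarize -- pixels above the mean become white (1), others black (0).
    binary = (img > img.mean()).astype(np.uint8)
    # Step 3: region of interest -- smallest box containing the black pixels.
    ys, xs = np.where(binary == 0)
    roi = binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # Step 4: center the component with white padding to a square.
    h, w = roi.shape
    side = max(h, w)
    padded = np.ones((side, side), dtype=np.uint8)
    y0, x0 = (side - h) // 2, (side - w) // 2
    padded[y0:y0 + h, x0:x0 + w] = roi
    # Step 5: resize to size x size with nearest-neighbour sampling
    # (the authors' resampling method is not specified).
    idx = (np.arange(size) * side / size).astype(int)
    return padded[np.ix_(idx, idx)]
```

The output is a binary 64 \(\times \) 64 array with the root character centered, matching the input expected by the SFC classifier.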
Training the SFC Classifier. As shown in Eq. 1, it is possible to transform a data point into an index using the SFC function. But to represent an image, we need as many indices as pixels. In this work, an image is represented by a distribution histogram whose bin lengths are not regular. These are calculated from the distribution of all indices produced by all images of a training dataset. The irregularity of the bin lengths allows the classifier to adapt to the particularities of the dataset. Once the bins are known, we create a histogram for each image and then merge all histograms of a given character into a single one using the mean. To summarize, the training part creates two linked objects: the irregular bin lengths and one histogram per character class.
Fig. 7. Training of the SFC classifier. Two objects are learned: the average histograms and the bin lengths. For each image, a vector of indices is created, then transformed into a histogram using the irregular bins. Training ends with the creation of an average histogram for each class.
As mentioned in Fig. 7, to create the bin lengths we first compute an index from every non-white pixel. The polar coordinates \((r,\theta )\) are used to ensure low rotation sensitivity. Low translation sensitivity is likewise ensured: the SFC assigns close indices to close points, so two points \(x_a\) and \(x_b\) with the same color c but slightly different polar coordinates r and \(\theta \) will have close indices. The same reasoning holds for identical polar coordinates with slightly different colors. The three-dimensional extension of the Hilbert curve, based on the Reflected Binary Gray code, is used. Our assumption is that the curve must create a bijection between the triplets \((\theta, r, c)\), the pixels and the indices: the order n is then 8, to be able to store the maximum value of c (255). The maximum index value is \(2^{nD}-1\) (16,777,215). The bin lengths are computed using a dichotomous search that creates the smallest histogram bins where the frequency of appearance is high, following Algorithm 1.

The histogram can then be interpreted as a low-dimensional representation of each image.
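A sketch of the index computation and of the dichotomous bin construction is given below. Two simplifications are assumptions of ours: a Morton (bit-interleaving) code stands in for the 3-D Hilbert mapping (both yield 24-bit indices for order \(n=8\), \(D=3\), though the Hilbert curve preserves locality better), and a simple `max_count` splitting threshold replaces Algorithm 1's exact stopping rule, which is not reproduced here:

```python
import numpy as np

def morton3(a, b, c, bits=8):
    """Interleave three 8-bit values into one 24-bit index
    (a stand-in for the 3-D Hilbert index of the paper)."""
    idx = 0
    for i in range(bits):
        idx |= ((a >> i) & 1) << (3 * i)
        idx |= ((b >> i) & 1) << (3 * i + 1)
        idx |= ((c >> i) & 1) << (3 * i + 2)
    return idx

def image_indices(img):
    """One index per non-white pixel from its quantized (theta, r, color)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.nonzero(img < 255)
    r = np.hypot(ys - cy, xs - cx)
    theta = np.arctan2(ys - cy, xs - cx)
    rq = (np.round(r / r.max() * 255).astype(int)
          if r.max() > 0 else np.zeros(len(r), int))
    tq = np.round((theta + np.pi) / (2 * np.pi) * 255).astype(int)
    cq = img[ys, xs].astype(int)
    return [morton3(t, rr, c) for t, rr, c in zip(tq, rq, cq)]

def irregular_bins(indices, lo=0, hi=2 ** 24, max_count=50):
    """Dichotomous split: halve any interval holding more than max_count
    indices, yielding narrow bins where index density is high.
    Returns the sorted lower bounds of the bins."""
    count = sum(lo <= i < hi for i in indices)
    if count <= max_count or hi - lo <= 1:
        return [lo]
    mid = (lo + hi) // 2
    return (irregular_bins(indices, lo, mid, max_count)
            + irregular_bins(indices, mid, hi, max_count))
```

Once the bin bounds are fixed, each image's index list is turned into a histogram over those bins, and the per-class mean of these histograms gives the trained centroids.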
Fig. 8. Classification of a phonetic character based on the space-filling curve. The average histograms and the bin lengths are determined by training (cf. Fig. 7); the comparison between the image histogram and the average histograms is computed with the cosine distance.
Classification of Characters. The classification process is similar to k-means, where the distribution histograms can be seen as centroids: the class of a new image is the class of the nearest centroid. Given the sparsity of the representation space (\(D \approx 300\)), the cosine distance is used to compare two histograms. Fig. 8 shows the classification process:

1. A new image is pre-processed (cf. Fig. 6).

2. The image is transformed into a histogram using the irregular bins previously computed during training (cf. Fig. 7).

3. The determined class is the one whose mean distribution histogram is nearest.
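The nearest-centroid decision under the cosine distance can be sketched as follows (the mean histograms here are toy values, not trained ones):

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; well suited to sparse histograms."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def classify(hist, mean_histograms):
    """Return the class whose mean histogram is nearest to hist."""
    return min(mean_histograms,
               key=lambda cls: cosine_distance(hist, mean_histograms[cls]))

# Toy example: two classes with clearly different index distributions.
means = {"a": np.array([5.0, 1.0, 0.0]),
         "e": np.array([0.0, 1.0, 5.0])}
```

Because the cosine distance depends only on the angle between histograms, it is insensitive to the total number of black pixels, which varies with character size and ink thickness.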
In this section, a framework able to classify noisy and low-quality images was presented. The input data are standardized to square binarized images in which the root character is centered. The SFC classification is based on an automatic selection of zones of interest using indices, reduced to histograms with variable bin lengths. The method is specialized by learning the bin lengths of these histograms; this is done on the training dataset using dichotomous separation. The method is somewhat robust to translation and rotation through the use of polar coordinates and SFCs, making the classification process well adapted to the dataset presented in Sect. 2.
4.3 Results
As described earlier, the SFC-based classifier follows a training/test principle. This requires setting aside all the characters that appear only once in the dataset, because with a single image we cannot both train and test the model. As a precaution, any class with fewer than four images is also set aside for these tests. Table 2 shows the repartition of the characters used within the dataset.
The tests were therefore carried out on a total of 209 character images after removing the small classes. To evaluate the robustness of the model, three different test runs were carried out, each with a random selection per character class: 75% of the images assigned to training and 25% to testing. The results are presented in Table 3.
Concerning the results, the proposed method correctly classifies slightly more than 70% of the characters in each experiment. These results are understandable: the standardization of the images reduces the variance of the numerous noises present in the images of the basic dataset and thus offers a certain stability, but it is not perfect either. Indeed, the standardization sometimes:

- cuts letters during binarization,

- causes a zoom effect on the character, to which the model is sensitive,

- adds or fails to add padding when a diacritic is present.
We also notice that classes with too few images give variable recognition rates, as can be expected with learning models. Table 4 illustrates this last point by showing the classification accuracy for each character in each experiment.
Results can still be improved. On the one hand, a next step could be to increase the number of images per class, either by capturing more occurrences of a character or by generating slightly different images by adding small amounts of noise to those already extracted. On the other hand, the pre-processing that standardizes images by binarizing them, removing noise and putting them at the same resolution generates defects that should be fixed, because some of them produce missing diacritics or missing sections of characters.
5 Conclusion
This paper presents an original approach for classifying images of phonetic characters. The proposed method represents characters by irregular histograms, created with a space-filling curve, that are able to differentiate characters from one another. Note that our method correctly classifies 70% of the subset of our phonetic character image dataset.
The obtained results are quite encouraging because they show that, from few and imperfect data, we are able to obtain reliable performance. An interesting strategy could be, for example, to inject these results into another algorithm focused on identifying the diacritics in an optimized way. This may ultimately provide an alternative path toward an OCR for this kind of phonetic alphabet.
However, these results will have to be compared with the related works presented in Sect. 3, and can still be improved as described in Sect. 4.3. The SFC technique presented here considers only the color and the position (polar coordinates) of a pixel to produce an index; it would be worthwhile to also use neighboring pixels to increase the quality of the characterization.
To go further, it might also be interesting to apply the SFC to images of phonetic words. This would result in groups of similar words and, by definition, isoglosses (regions where people say exactly the same thing), which are of real value to dialectologists.
Notes
- 1.
Maps dataset available at http://lig-tdcge.imag.fr/cartodialect5.
- 2.
The phonetics characters images dataset is available here: http://l3i-share.univ-lr.fr/datasets/Dataset_CharRousselotGillerion.zip.
References
Bader, M.: Space-Filling Curves. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-31046-1
Bay, H., Ess, A., Tuytelaars, T., Gool, L.V.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008). Similarity Matching in Computer Vision and Multimedia
Chang, O., Constante, P., Gordon, A., Singaña, M.: A novel deep neural network that uses space-time features for tracking and recognizing a moving object. J. Artif. Intell. Soft Comput. Res. 7(2), 125–136 (2016)
Cheng, Y.C., Chen, S.Y.: Image classification using color, texture and regions. Image Vis. Comput. 21(9), 759–776 (2003)
Drapeau, J., et al.: Extraction of ancient map contents using trees of connected components. In: Fornés, A., Lamiroy, B. (eds.) GREC 2017. LNCS, vol. 11009, pp. 115–130. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-02284-6_9
Faloutsos, C., Roseman, S.: Fractals for secondary key retrieval. In: Proceedings of the Eighth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS 1989, pp. 247–252. ACM (1989)
Fredembach, C., Schröder, M., Süsstrunk, S.: Region-Based Image Classification for Automatic Color Correction, January 2003
Hilbert, D.: Ueber die stetige Abbildung einer Line auf ein Flächenstück. Math. Ann. 38(3), 459–460 (1891)
Lera, D., Sergeyev, Y.D.: GOSH: derivative-free global optimization using multi-dimensional space-filling curves. J. Glob. Optim. 71(1), 193–211 (2018)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
Moon, B., Jagadish, H.V., Faloutsos, C., Saltz, J.H.: Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Trans. Knowl. Data Eng. 13(1), 124–141 (2001)
Murali, P., Sankaradass, V.: An efficient space filling curve based image encryption. Multimedia Tools Appl. 78(2), 2135–2156 (2018). https://doi.org/10.1007/s11042-018-6234-8
Peano, G.: Sur une courbe, qui remplit toute une aire plane. Math. Ann. 36, 157–160 (1890). https://doi.org/10.1007/BF01199438
Sagan, H.: Space-Filling Curves. Springer, New York (1994). https://doi.org/10.1007/978-1-4612-0871-6
Song, Z., Roussopoulos, N.: Using Hilbert curve in image storing and retrieving. Inf. Syst. 27(8), 523–536 (2002)
Acknowledgment
This work is carried out in the framework of the ECLATS project and supported by the French National Research Agency (ANR) under the grant number ANR-15-CE38-0002.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Owczarek, V. et al. (2020). Classification of Phonetic Characters by Space-Filling Curves. In: Bai, X., Karatzas, D., Lopresti, D. (eds) Document Analysis Systems. DAS 2020. Lecture Notes in Computer Science(), vol 12116. Springer, Cham. https://doi.org/10.1007/978-3-030-57058-3_7
Print ISBN: 978-3-030-57057-6
Online ISBN: 978-3-030-57058-3