1 Introduction

A good image analysis procedure largely depends on relevant choices for the image representation and the features used. In our case, we are concerned with images of documents, whose number exchanged worldwide grows exponentially for multiple usages. The semantic content of such images is mainly carried by the text, which can be recovered by optical character recognition (OCR) software, but also by other media such as graphics or images. Besides, the layout of the image helps the reader to interpret the message carried by the document. Unfortunately, in most OCR software, the presence of a table may have a negative impact on text recognition. If tables are extracted first, the text within each cell can be processed separately, leading to better results. Table extraction thus presents a two-fold interest: first, to improve the text recognition rate, and second, to interpret the information contained in the table, thereby improving global document understanding.

Tables are among the most specific structures of the image layout, and they are present in a wide variety of documents such as administrative documents, invoices, scientific articles, question forms, etc. By their nature, the semantic aspect of tables may be lost in the raster representation of the image. With modern office software, several table displays are used: cells materialized and surrounded by straight line segments, different colors indicating columns or rows, or well-chosen alignments. In this paper we limit ourselves to fully materialized tables, composed and surrounded by straight lines that intersect at right angles and are completely closed. This kind of table is the most prevalent in classic commercial text editors. Acquisition noise due to sequential printings/scannings and deteriorations of the document may damage the table structure in raster images by generating “broken” segments. This leads us to propose a straight line recovery method, which we evaluate on the task of table extraction in degraded documents where the lines are no longer fully visible.

In this article, we address the problem of table extraction in binary images based on a straight-line representation of images. Indeed, we have chosen to replace the classical pixel representation of images by a single feature, the straight line segment. Such segments can cover both the foreground and the background of the document. A pixel is a one-pixel-long straight line segment, and we will see in the proposed method that an area is considered as a union of straight line segments characterized by their lengths, their orientations and their positions. This image representation allows us to reason in a novel space that represents the image content differently, a space particularly suited for table structure extraction.

The paper is organized as follows. Section 2 recalls the main approaches from the literature for table extraction. Section 3 presents the proposed approach for table extraction, based on a transform previously defined in [1] to extract straight lines, which is involved in the global process of table reconstruction. In the experimental study of Sect. 4, the approach is evaluated both from quality and stability points of view on two datasets. Finally, conclusions and future work are provided in Sect. 5.

2 Related Works

Several approaches have been proposed in the literature for table extraction. They vary according to the input document format (HTML, image, PDF) [5, 10]. We focus on approaches dedicated to extracting tables from raster images of documents. Two categories of methods co-exist: those based on the presence of lines (i.e. table separators) considered as primitives, and those not based on their presence.

Main Approaches for Table Extraction. As a pioneering work in the first category, the approach proposed in [19] looks for blocks of individual elements delimited by vertical and horizontal line segments. Line segments are first detected and the points that form the corners are determined. The connectivity relationships between the extracted corners, and thus the individual blocks, are interpreted using global and local tree structures. The strategy in [7] also relies on detecting lines by considering all their intersections. In [14] the arrangement of the detected lines is compared with that of the text blocks in the same area. However, the results showed a sensitivity to the resolution of the scanned page and to noise. A top-down approach is proposed in [9], based on a hierarchical characterization of the physical cells. Horizontal/vertical lines and spaces are used as features to extract the regions of the table. A recursive partitioning produces the regions of interest, and the hierarchy is used to determine the cells. The authors of [4] presented a table localization system using a recursive analysis of a modified X-Y tree to identify regions delimited by horizontal and vertical lines. The search is refined by looking for parallel lines in the deepest levels of the tree, which requires that at least two parallel lines be present.

A second category of methods for table extraction does not rely on the presence of lines but uses textual information. In [15], tables are supposed to have distinct columns, with the hypothesis that the spaces between the fields are larger than the spaces between words in normal text lines. The table extraction relies on the formation of word blobs in text lines and then on finding the set of consecutive text lines that may form a table. The system in [13] takes the information from the bounding boxes of the input words and outputs the corresponding logical text block units, for example the cells in a table environment. Starting with an arbitrary word as block seed, the algorithm recursively extends this block to all the words that interlace with their vertical neighbors. Since even the smallest gaps in the table columns prevent their words from interlacing with each other, this segmentation is able to isolate such columns, though it is sensitive to the alignment of the columns. The method presented in [17] is based on the concept of tab-stops in documents. Tab-stops provide information about text alignments (left, right, center, etc.), and they are used to extract the blocks composing the layout of a document. Extracted blocks are then evaluated as table candidates using a specific strategy.

Despite their interest, methods from this second category strongly depend on textual information, generally extracted beforehand using an OCR. However, the presence of a table may have a negative impact on the character recognition rate. In this context, methods based on straight lines appear better suited to deal with fully materialized tables in potentially noisy documents with complex layouts. In the next section, we focus on methods for straight line detection, since we consider lines as the primitives from which tables are extracted.

Main Approaches for Straight Line Detection. Classical approaches derive from the Hough or Radon transforms. The Hough transform [2, 16, 20] can be applied to binary raster images in order to detect lines through a parametric representation. The approach uses an accumulator array, where each cell models a physical line. For every foreground point, all the lines the point may belong to are considered and the associated cells are incremented. The highest values in the array, representing the most salient lines, are then extracted.
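
For illustration, a minimal sketch of such an accumulator-based detector (a simplified version written for this paper, not the implementation of [2, 16, 20]; parameter values are arbitrary):

```python
import numpy as np

def hough_lines(binary, n_theta=180, top_k=10):
    """Accumulator-based line detection on a binary image.

    Each foreground pixel votes for every (theta, rho) line it may lie
    on; the top_k highest accumulator cells give the most salient lines.
    """
    h, w = binary.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((n_theta, 2 * diag + 1), dtype=np.int64)
    ys, xs = np.nonzero(binary)
    for t, theta in enumerate(thetas):
        # rho = x cos(theta) + y sin(theta), shifted so indices are >= 0
        rhos = np.round(xs * np.cos(theta) + ys * np.sin(theta)).astype(int)
        np.add.at(acc[t], rhos + diag, 1)
    flat = np.argpartition(acc.ravel(), -top_k)[-top_k:]
    t_idx, r_idx = np.unravel_index(flat, acc.shape)
    return [(thetas[t], r - diag) for t, r in zip(t_idx, r_idx)]
```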

Another way of detecting straight lines in image content is to adopt a more local point of view than global Hough methods. The patch-based approach proposed in [18] uses a one-dimensional accumulator array. A density map is locally computed from the input image and used to segment the image into areas of high and low contour density.

In discrete geometry, the concept of discrete (blurred) segments has also been studied [6, 11], but these works are more focused on building novel representations, decompositions or analyses of discrete curves and, more generally, shapes. The approach presented in [3] relies on the definition of specific regions, named line support regions, which contain line segments. The input is a gradient image. Detected edges are first combined into regions using both the gradient magnitude and direction of neighboring pixels. The main line of each region is finally obtained from a statistical study of the gradient magnitude in the region.

Some of the methods mentioned previously assume that the lines present in the images are correctly and completely materialized, which makes them difficult to apply to noisy documents where portions of the lines materializing tables may be disrupted. On the other hand, Hough-based methods generally consider the image content globally, while local approaches do not take into account the geometry and topology of the image regions. In this work, we propose a new method for extracting tables based on a new transform (called RLDT [1]) that we employ for line detection. This transform operates at a local “region” level, adapting the region of interest to the local context, and gives more information about the spatial organization of the segments than traditional global transforms. “Broken” lines are reconstructed using line seeds obtained via this transform. Finally, the obtained straight line segments are used to reconstruct potential table structures.

Fig. 1. Flowchart of our approach for table extraction in images of documents.

3 Table Extraction

As we consider only fully materialized tables, our approach relies on a line extraction strategy. The lines extracted from the document content can be part of a table, simple separators, or part of a graphic or any other image medium. The global flowchart (Fig. 1) contains two main stages: (1) the extraction of the horizontal/vertical lines coupled with their (potential) reconstruction, and (2) the table reconstruction based on the lines delimiting the table cells.

The first step of the line extraction is based on a recent transform defined in [1] that enables the definition, in a novel image representation space, of primitives useful for the extraction of text, images and separators. To deal with “broken” lines and to ensure better stability, the line seeds extracted via this transform are then prolonged. Then, in the line verification step, a theoretical model of a line confirms or rejects the presence of segments among the line segment candidates, especially in degraded conditions.

3.1 Relative Local Diameter Transform [1]

Classical methods in document analysis are based on binary images, as they lead in a fast and efficient way to the extraction of the foreground and background. These dual sources of information can be exploited to extract lines. The following notations will be used: an image of a document page D is a function that associates each point \((x,y) \in \mathbb {Z}^2\) with a value of the set V. We assume that for the image I, \(V = [0,1]\), and for the binary version \(I_b\), \(V = \{0,1\}\) (white is marked with 0, black with 1). The foreground is assumed to be black. Both images I and \(I_b\) share the same image definition set \(\varDelta _{I}\).

We define in this section some transforms (originally introduced in [1]) that give more local information about the spatial organization of the segments contained in a document than traditional global transforms. We propose to adapt the neighborhood level around each pixel according to the image content. As we focus on the straight lines contained in the image, we refer to the Radon transform. This transform was modified in order to obtain more localized information, leading to the local Radon Transform (LR) defined in [1].

At each point, we define a local Radon Transform by

$$\begin{aligned} LR(I_b)(\theta ,x_0,y_0)= \int _{\mathbb {R}^2} I_b(x,y)\, \delta _{0}\big ((x-x_0)\sin \theta -(y-y_0)\cos \theta \big )\, dx\, dy \end{aligned}$$

(1)

where \(\delta _{0}\) is the Dirac distribution in 0. LR gives the maximum length of the segment passing through the point \((x_0,y_0)\) in the direction \(\theta \).

Local Diameter Transform. The length of a segment is not relevant here in an absolute way but only relative to the size of the considered binary image \(I_b\). Based on the LR, the local diameter at a point is measured by evaluating the length of the largest segment passing through this point and contained in the set of (foreground) pixels labeled 1. Thus only one value is kept at each point, corresponding to the maximum length of the segment, independently of its direction. This is what we call the Local Diameter Transform (LDT), defined as \(LDT(I_b)(x,y)= \max _{\theta \in [0,\pi ]} LR(I_b)(\theta ,x,y)\). Obviously, depending on the application, the importance of a segment must be estimated relative to the dimensions of the document D. Denoting by \(diam(\varDelta _{I},\theta )\) the diameter of \(\varDelta _{I}\) in the direction \(\theta \), we then define at each pixel of \(I_b\) the relative local diameter by:

$$\begin{aligned} RLDT(I_b)(x,y)= \max _{\theta \in [0,\pi ]} \frac{LR(I_b)(\theta ,x,y)}{diam(\varDelta _{I},\theta )} \end{aligned}$$
(2)

and the Relative Local Diameter Transformation of \(I_b\) is denoted \(RLDT(I_b)\). An example of applying the RLDT to a toy-case image is shown in Fig. 2.
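
To make the transform concrete, here is a minimal numpy sketch restricted to the horizontal and vertical orientations used in the sequel; the full transform of [1] samples \(\theta \) over \([0,\pi ]\) (e.g. the 8 orientations of Fig. 2), and the function names are ours:

```python
import numpy as np

def run_lengths_rows(b):
    """For every pixel, the length of the maximal horizontal run of 1s
    containing it (0 on background pixels): a discrete LR at theta = 0."""
    out = np.zeros(b.shape)
    for i, row in enumerate(b.astype(bool)):
        j, n = 0, row.size
        while j < n:
            if row[j]:
                k = j
                while k < n and row[k]:
                    k += 1
                out[i, j:k] = k - j   # every pixel of the run gets its length
                j = k
            else:
                j += 1
    return out

def rldt_hv(binary):
    """RLDT restricted to theta in {0, pi/2}: each run length is divided
    by the image diameter in that direction (width for horizontal runs,
    height for vertical ones), and the maximum over theta is kept."""
    h, w = binary.shape
    lr_h = run_lengths_rows(binary) / w            # relative LR, theta = 0
    lr_v = run_lengths_rows(binary.T).T / h        # relative LR, theta = pi/2
    return np.maximum(lr_h, lr_v), lr_h, lr_v
```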

Fig. 2. Illustration of the Relative Local Diameter Transform: (a) binary image, (b) RLDT calculated on the image (a) by taking into account 8 orientations.

The union of all the lines covers the image content. Here, we are interested only in the long lines composing the table structure: a threshold has been fixed at 2%, following a previous study showing that lines shorter than 2% of the size of the document are associated with text zones [1]. We then consider only the points whose maximal segment is horizontal or vertical (\(\theta = 0\) or \(\theta = \frac{\pi }{2}\)). The connected components (CC) contained in this image correspond to the set of straight line segments denoted \(L(I_{b})\). Figure 3(c) illustrates this step. Note that the union of small straight lines makes it possible to tolerate a slight skew and thus capture lines that are not purely horizontal or vertical.

The set \(L(I_{b})\) comprises the separators belonging to tables, some other separators, and also some short straight lines that may correspond to configurations of text or logos, for example, and need to be eliminated from the final set of straight lines belonging to materialized tables. The problem to be faced is the robustness of the straight line detection in degraded contexts, addressed in the next section.
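
Building on the rldt_hv sketch above, the selection of \(L(I_{b})\) can be expressed as follows (the 2% threshold is the one reported above; the use of scipy.ndimage for connected components is our implementation choice):

```python
from scipy import ndimage

def extract_line_set(binary, thr=0.02):
    """Build L(I_b): keep the pixels whose horizontal or vertical relative
    run length exceeds thr (2% of the document size), then group them
    into connected components, one per straight line segment."""
    rldt_map, _, _ = rldt_hv(binary)
    mask = rldt_map >= thr
    labels, n_segments = ndimage.label(mask)
    return labels, n_segments
```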

Fig. 3. Results of the different steps of the table extraction process: (a) initial image I, (b) binarized image \(I_b\), (c) straight lines \(L(I_b)\) from the RLDT, (d) candidate straight segments, (e) positively tested long segments, (f) long segments \(I_{s}\) (in white), (g) potential cells \(R^{c}\), (h) potential tables.

3.2 Line Extraction and Reconstruction

Line Prolongation. As the previous step is based on a binary image \(I_b\), the segment detection is sensitive to the binarization method and to the quality of the initial image. One assumption of our work is that this first step of the process, even if it has weak points, provides seeds of real straight lines that may be incomplete with respect to the content of the document. The final goal is to find cells involved in tables. These cells are delimited by four segments defining a rectangle. Thus, the seeds of the real straight lines contained in \(L(I_{b})\) are extended across the whole image (Fig. 3(d)), and the presence of a segment on each portion delimited by consecutive orthogonal segments is tested.

Line Verification. Lines extracted in the previous step do not always fit the borders of table cells. Some information in the original gray-level image I can be useful to make an unambiguous decision. Along the candidate straight lines, a local working zone is defined, where a theoretical model of a line is applied in order to confirm or reject the presence of a line. This model is based on the local behavior of 1D sets of pixels and a global behavior over the working zone. The evolution of the gray levels along the sections of the zone (black line in Fig. 4(a)) is compared to the theoretical behavior in the presence of a line, where from left to right the gray levels should begin to increase, then (potentially) stabilize, and finally decrease. Otherwise, if there is no line at this place, the evolution is negligible. In Fig. 4(a), the red zone corresponds to an increase of gray levels, the green zone to a decrease, and the blue zone to stability in gray levels relative to I. A global vision of the row processing leads to the segmentation of the working zone according to three different behaviors, Zi, Zs and Zd. This makes it possible to reject the hypothesis of a straight line in some parts.

The result is depicted in Fig. 4(b), where the working zones are colored. As illustrated, a vertical black line corresponds well to the adjacency of red, blue and green zones. On a hypothetical segment, for each section containing no seed point, the presence of adjacent pixels of (Zi, Zs and Zd) or (Zi and Zd) maintains the hypothesis of a straight line, and a line is then rebuilt as the projection of the seed on the section. A dilation in the direction of the line is performed. From the seeds, i.e. the elements of \(L(I_{b})\), we thus build a set of longer segments, denoted \(Lr(I,L(I_{b}))\), to which some new segments may even be added. It sometimes happens that \(Lr(I,L(I_{b}))\) comprises small segments associated with text that occupy the same vertical position as a horizontal segment in another part of the page, as illustrated in Fig. 3(e).
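
As an illustration only, the per-section test could be transcribed as below (the tolerance eps is a hypothetical parameter, not taken from the paper; gray levels follow the convention of Sect. 3.1, 0 for white and 1 for black):

```python
import numpy as np

def classify_section(gray_section, eps=0.05):
    """Label each step of a 1D gray-level section, read from left to
    right across a candidate vertical line, as increasing (Zi), stable
    (Zs) or decreasing (Zd)."""
    d = np.diff(np.asarray(gray_section, dtype=float))
    return np.where(d > eps, "Zi", np.where(d < -eps, "Zd", "Zs"))

def section_supports_line(gray_section, eps=0.05):
    """A section supports the line hypothesis if an increase is followed,
    possibly after a plateau, by a decrease: pattern (Zi, Zs, Zd) or
    (Zi, Zd) in order. Otherwise the hypothesis is rejected there."""
    saw_increase = False
    for label in classify_section(gray_section, eps):
        if label == "Zi":
            saw_increase = True
        elif label == "Zd" and saw_increase:
            return True
    return False
```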

Fig. 4. Behavior of gray levels along a segment hypothesis: (a) proposed line model; (b) illustration of the different zones: Zi in red, Zd in green, Zs in blue (segments of \(L(I_{b})\) are in black). (Color figure online)

Suppression of Lines Associated with Text. The last step is to discriminate between isolated lines and virtual lines associated with text. Let s denote an element of \(Lr(I,L(I_{b}))\). For a real straight line, Zi(s) is a long ribbon along the line. In a text context, the Zi(s) zone is more complex: such a zone may actually be composed of several connected components, denoted \(\lbrace z_{s,a} \rbrace _{a=1}^{n_z}\). Let \(proj_{s}(x)\) be the orthogonal projection of a set x onto the principal direction of s. It is assumed that the \(z_{s,a}\) have been sorted by decreasing length of their projections. Let \(n_0\) be defined as \(0.1*length(proj_{s} (s))\).

The segment s is then considered as a straight line segment if:

$$\begin{aligned} \exists \, n\le n_0 \;:\; length\big (\cup _{a=1}^{n}\, proj_{s}(z_{s,a})\big ) \geqslant 0.9 \cdot length(proj_{s}(s)) \end{aligned}$$
(3)

making it possible to discard non-significant straight lines. We retain only these selected long segments in a final set \(I_{s}\) comprising segments that may be part of a table. An example of the result is illustrated in Fig. 3(f). For simplicity, we use the same name for the set of segments and the associated raster image.
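
For clarity, the criterion of Eq. (3) can be transcribed directly on the 1D projection intervals (the helper names and the interval representation are ours):

```python
def union_length(intervals):
    """Length of the union of 1D intervals given as (a, b) pairs."""
    total, cur = 0.0, None
    for a, b in sorted(intervals):
        if cur is None or a > cur[1]:
            if cur is not None:
                total += cur[1] - cur[0]
            cur = [a, b]
        else:
            cur[1] = max(cur[1], b)
    if cur is not None:
        total += cur[1] - cur[0]
    return total

def is_straight_line(seg_proj_len, component_projs):
    """Eq. (3): s is kept as a straight line if its n0 = 0.1 * |proj(s)|
    longest Zi components already cover at least 90% of proj(s).
    By monotonicity of the union, testing n = n0 is equivalent to
    testing every n <= n0."""
    n0 = int(0.1 * seg_proj_len)
    projs = sorted(component_projs, key=lambda ab: ab[1] - ab[0], reverse=True)
    return union_length(projs[:n0]) >= 0.9 * seg_proj_len
```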

3.3 Table Reconstruction

Since we have considered the connected components in the binary image containing long segments, the components of \(I_{s}\) can be noisy if some text characters touch a straight line, which has to be taken into account. Let \(E^{c}\) be the complement (background) of a binary image content E (foreground). The unbounded CCs of \(I_{s}^{c}\) cannot be part of a table and form a region R in the image domain. The complement of R, \(R^{c}\), comprises pixels of lines S and background pixels B (Fig. 3(g)). The CCs of \(R^{c}\) are labeled as tables if they contain pixels of B (Fig. 3(h)). Otherwise, they correspond to isolated separator lines.
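
A sketch of this labeling, where the unbounded connected components of \(I_{s}^{c}\) are approximated as the background components touching the image border (scipy.ndimage is an implementation choice of ours):

```python
import numpy as np
from scipy import ndimage

def reconstruct_tables(segments):
    """Label potential tables in the binary long-segment image I_s.

    R   : union of the unbounded CCs of I_s^c, approximated here as the
          background components touching the image border;
    R^c : line pixels S plus enclosed background pixels B;
    a CC of R^c is labeled as a table iff it contains pixels of B."""
    comp = ~segments.astype(bool)                     # I_s^c (background)
    labels, _ = ndimage.label(comp)
    border = np.zeros(comp.shape, dtype=bool)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    unbounded = np.unique(labels[border & comp])
    R = np.isin(labels, unbounded) & comp
    rc_labels, m = ndimage.label(~R)                  # CCs of R^c
    tables = np.zeros(comp.shape, dtype=bool)
    for lab in range(1, m + 1):
        cc = rc_labels == lab
        if np.any(cc & comp):                         # contains B pixels
            tables |= cc
    return tables
```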

4 Experimental Study

The evaluation of our approach is achieved from two points of view:

  • Quality: in a classical way, the recall and precision of the predicted tables give information on the accuracy of the process;

  • Stability: nowadays more and more documents are hybrid documents, that is to say, after a document is created it can, along its life, alternately be printed and scanned. The resulting document images are associated with the same document content but may differ in resolution, capture noise or any degradation due to natural use. We want to test the stability of the method on several images representing the same initial document at different dates.

To achieve this we have considered two different datasets presented in the next section before describing the obtained results.

4.1 Datasets

In order to obtain quantitative evaluations, two datasets were considered: one to evaluate quality and the other stability. For quality, we considered the dataset proposed in an ICDAR 2013 competition [8]. However, on the one hand the data are not images but PDF files, and on the other hand the proposed tables are not all materialized. We therefore chose to rasterize the documents at 150 dpi and removed the images containing non-fully materialized tables. The resulting subset contains 179 images of documents, 69 of which contain tables. Consequently, our results cannot be directly compared to those of the competition participants, which were computed on the whole dataset.

For stability evaluation, no public dataset was available. We therefore designed an annotated dataset, called SETSTABLE, containing 293 document images associated with 14 hybrid documents (see its composition in Table 2, lines 1–3).

For each hybrid document, the results associated with all pairs of occurrences of the document are analyzed in order to check whether they are similar. The identity of two results can be made binary, in particular when the quality of the extraction is measured by comparing an extracted table to the ground truth. For two occurrences \(D_{1}\) and \(D_{2}\) of a hybrid document D, \(\lbrace T_{1}^{i} \rbrace _{i=1}^{N_1}\) and \(\lbrace T_{2}^{i} \rbrace _{i=1}^{N_2}\) are the tables respectively extracted from the two document images. The covering level between two tables is computed as:

$$\begin{aligned} CL(T_{1}^{i}, T_{2}^{j}) = \dfrac{A( T_{1}^{i}\cap T_{2}^{j})}{A( T_{1}^{i}\cup T_{2}^{j})} \end{aligned}$$
(4)

where A(X) is the area of X. Accuracy is measured using the binary version of Eq. 4, with risk parameter p:

$$\begin{aligned} CL_p(T_{1}^{i}, T_{2}^{j}) = {\left\{ \begin{array}{ll} 1 &{} \text {if } CL(T_{1}^{i}, T_{2}^{j}) \geqslant p\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(5)

The stability of a table extraction in \(D_{1}\) is then measured with respect to the processing of a second document \(D_{2}\) by:

$$\begin{aligned} ST(T_{1}^{i}, D_{2}) = {\left\{ \begin{array}{ll} 0 &{} \text {if } N_{2} = 0\\ \max \nolimits _{j\in [1,N_2]} \lbrace CL( T_{1}^{i}, T_{2}^{j})\rbrace &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)

Finally, the stability of the processing of \(D_{1}\) and \(D_{2}\) is measured by:

$$\begin{aligned} Stab(D_{1}, D_{2}) = {\left\{ \begin{array}{ll} 1 \quad \text {if } N_{1}=N_{2}=0 \\ 0 \quad \text {if } N_{1}\cdot N_{2}=0 \text { and } N_{1}+N_{2} \ne 0 \\ \dfrac{1}{2} \left( \dfrac{1}{N_{1}} \sum \nolimits _{i=1}^{N_{1}} ST(T_{1}^{i}, D_{2})+ \dfrac{1}{N_{2}} \sum \nolimits _{i=1}^{N_{2}} ST(T_{2}^{i}, D_{1}) \right) \text { otherwise} \end{array}\right. } \end{aligned}$$
(7)

For a hybrid document D for which we have N (\(N > 1\)) different images, we can compute a global stability score SC(D) defined by:

$$\begin{aligned} SC(D) = \dfrac{2}{N^{2}-N}\sum _{i=1}^{N}\sum _{j=1}^{i-1} Stab(D_{i}, D_{j}) \end{aligned}$$
(8)

SC takes values in [0, 1]; full stability is obtained when \(SC=1\). Replacing CL by \(CL_{p}\), a more binary stability score, denoted \(SC_{p}(D)\), can be computed.
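
For reference, a compact transcription of Eqs. (4)–(8), where tables are simplified to axis-aligned bounding boxes (x0, y0, x1, y1), a simplification of ours since the extracted tables are rectangular regions:

```python
def cl(t1, t2):
    """Eq. (4): covering level (intersection over union) of two boxes."""
    x0, y0 = max(t1[0], t2[0]), max(t1[1], t2[1])
    x1, y1 = min(t1[2], t2[2]), min(t1[3], t2[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    a1 = (t1[2] - t1[0]) * (t1[3] - t1[1])
    a2 = (t2[2] - t2[0]) * (t2[3] - t2[1])
    return inter / (a1 + a2 - inter)

def cl_p(t1, t2, p=0.80):
    """Eq. (5): binarized covering level with risk parameter p."""
    return 1.0 if cl(t1, t2) >= p else 0.0

def st(t1, tables2):
    """Eq. (6): best covering of a table of D1 by the tables of D2."""
    return max((cl(t1, t2) for t2 in tables2), default=0.0)

def stab(tables1, tables2):
    """Eq. (7): symmetric stability of the extractions in D1 and D2."""
    n1, n2 = len(tables1), len(tables2)
    if n1 == 0 and n2 == 0:
        return 1.0
    if n1 == 0 or n2 == 0:
        return 0.0
    return 0.5 * (sum(st(t, tables2) for t in tables1) / n1
                  + sum(st(t, tables1) for t in tables2) / n2)

def sc(extractions):
    """Eq. (8): mean pairwise stability over the N (> 1) instances of a
    hybrid document; extractions[i] lists the tables found in instance i."""
    n = len(extractions)
    pairs = [(i, j) for i in range(n) for j in range(i)]
    return sum(stab(extractions[i], extractions[j]) for i, j in pairs) / len(pairs)
```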

4.2 Results

Quality of Table Extraction. The NICK method [12] was used to binarize all the document images. The table extraction quality is computed at the object (table) level with the classical precision (P), recall (R) and \(F_{1}\)-measure (F) indexes. An extracted table is considered a true positive if Eq. 5 is satisfied with risk parameter \(p=0.80\) relative to the ground truth. The ICDAR 2013 ground truth contains 88 table bounding boxes. Among this set, our method correctly identified 86 tables, and 18 extra tables were found. The latter happens, for example, when a graph comprises a histogram and horizontal lines (Fig. 5(b)). A further processing step should fix these problems when hypothesizing a table.
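
For completeness, the object-level scores can be computed as sketched below, reusing the cl function above (matching is done independently per table, ignoring one-to-one assignment for simplicity):

```python
def object_level_prf(pred_tables, gt_tables, p=0.80):
    """Object-level precision/recall/F1: a predicted table counts as a
    true positive if it covers some ground-truth table with CL >= p."""
    tp = sum(1 for t in pred_tables if any(cl(t, g) >= p for g in gt_tables))
    precision = tp / len(pred_tables) if pred_tables else 0.0
    recall = tp / len(gt_tables) if gt_tables else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```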

As a comparative study, we compared our results to those obtained with the method proposed in [17], which relies on an OCR to extract tables. We used the open-source implementation of this algorithm provided as part of the Tesseract OCR engine, with the recommended parameter values.

In order to evaluate the efficiency of our line reconstruction step (LRS), we compared the results with and without this step on the SETSTABLE dataset.

Table 1 presents the quantitative results obtained on the datasets, while Fig. 5(a) and (b) illustrate some qualitative results.

From Table 1 (left), one can note that our proposal outperforms the results provided by the Tesseract [17] method on the ICDAR 2013 subset. The high precision of our method can be explained by the PDF nature of the initial data, leading to high-quality images. Experiments showed that under these conditions the LRS step does not improve the results.

The lower precision values on the SETSTABLE dataset (Table 1 (right)) can be explained by logos detected as tables. We can see that the global quality improves with the LRS step; in particular, the recall value increases by 43%.

Table 1. Quality of fully materialized table extraction.

Stability of the Method. We compared our results to those obtained with Tesseract [17] according to the accuracy criterion \(CL_{0.80}\) of Eq. 5 (leading to \(SC_{0.80}\)). This criterion does not tolerate the small variations due to image acquisition, so we also used \(CL_{0.70}\). This relaxes the constraints, and the score value should increase since the offsets due to rotations and translations between the different versions of the hybrid documents are absorbed. To be independent of the accuracy measurement, we also used the pure covering level CL of Eq. 4 to compute SC.

The obtained results are presented in Table 2 and illustrated in Fig. 5(c) and (d). Note that the stability results from this table cannot be directly compared to the quality results (Table 1): a lack of stability reflects an inconstancy of the errors, not their number.

Table 2. Evaluation of the stability of our method from the SETSTABLE dataset.

The results in Table 2 suggest that our strategy is more stable than Tesseract [17]. The documents with full stability are documents 6, 7 and 13, which contain no table. Unfortunately, we found 15 tables in the other table-free images; they correspond to logo zones.

As mentioned earlier, \(SC_{0.70}\) gives a more optimistic view of the stability of the results (and improves the overall scores) than \(SC_{0.80}\), for which the identity criterion between two tables is stricter (i.e. only smaller shifts and rotations are allowed). The stricter criterion is particularly penalizing for small tables. Figure 5(c, d) shows the projection of all the tables extracted from the instances of a hybrid document. They all seem correct but differ slightly. The final stability result thus depends on the choice of the risk parameter in the covering level definition.

Fig. 5. Table extraction results: (a) and (b) documents from the ICDAR 2013 dataset; (c) illustrates a stability result where each color represents a result of the table extraction in a version of the hybrid document ((d) is a zoom). (Color figure online)

5 Conclusion

Table recognition in documents is still a hot topic, and we have proposed here an approach for table extraction in document images, in particular when acquisition noise can disrupt the recovery of the table structures. It relies on the search for straight line segments in documents. As the sequential printings and scannings of a document and its deterioration can lead to “broken” lines, the cornerstone of our approach is to first reconstruct the degraded lines using a new image transform, before extracting potential table structures. The proposed approach has been evaluated on two datasets from two points of view: the stability of the extraction, by analyzing several images of the same (hybrid) document, and the accuracy of the results. The obtained results are encouraging and highlight the ability of our approach to extract tables, even in quite noisy documents.

This work opens up several perspectives. One limit of this approach is that its output strongly depends on the initial binarization of the page, which can lead to potentially different results across the instances of a hybrid document, decreasing the overall stability. We plan to consider (and to couple) different binarization methods to design a more stable process. Another direction could be to generalize the proposed image transforms to gray-level or color images in order to avoid binarization.