Abstract:
In this paper we investigate the importance of individual features for the task of document layout analysis, in particular for the classification of document pixels. The feature set consists of numerous state-of-the-art features, including color, gradient, and local binary patterns (LBP). To deal with the high dimensionality of the feature set, we propose a cascade of an adapted forward selection and a genetic selection. We evaluated our feature selection method on three historical document datasets. For classification we used machine learning methods that assign each pixel to one of four classes: periphery, background, text block, or decoration. The proposed cascading feature selection method reduced the number of features significantly while preserving the cross-validation performance. Furthermore, it selected fewer features than conventional feature selection methods at comparable performance. In our analysis we found that LBP features are consistently selected by all feature selection methods on all three datasets. This indicates that LBP features correlate with the pixel classes much more strongly than any other feature type. These findings offer a clue toward a paradigm for document layout analysis in general.
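The cascade described in the abstract can be illustrated with a minimal sketch: a greedy forward selection first shrinks the feature set, then a small genetic search prunes the surviving pool. This is not the authors' implementation. The abstract does not say how the forward selection was adapted, so the synthetic two-class data, the nearest-centroid scorer, the `min_gain` stopping rule, the size `penalty`, and all function names below are assumptions made for illustration only.

```python
import random

def make_data(n=200, n_features=12, informative=(0, 3, 7), seed=0):
    """Synthetic two-class data (assumption); only `informative` columns carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        for j in informative:
            row[j] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y

def score(X, y, subset):
    """Hold-out accuracy of a nearest-centroid classifier (train: even rows, test: odd rows)."""
    if not subset:
        return 0.0
    train, test = range(0, len(X), 2), range(1, len(X), 2)
    centroids = {}
    for c in (0, 1):
        rows = [X[i] for i in train if y[i] == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in subset]
    correct = 0
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(subset, centroids[c]))
                for c in (0, 1)}
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(test)

def forward_selection(X, y, n_features, min_gain=0.005):
    """Stage 1: greedily add the feature with the largest score gain; stop when gain is small."""
    selected, best = [], 0.0
    while len(selected) < n_features:
        candidates = [j for j in range(n_features) if j not in selected]
        top_score, top_j = max((score(X, y, selected + [j]), j) for j in candidates)
        if top_score - best < min_gain:
            break
        selected.append(top_j)
        best = top_score
    return selected

def genetic_selection(X, y, pool, generations=20, pop_size=16, penalty=0.01, seed=1):
    """Stage 2: evolve boolean masks over `pool`; fitness rewards accuracy and small subsets."""
    rng = random.Random(seed)

    def fitness(mask):
        subset = [f for f, keep in zip(pool, mask) if keep]
        return score(X, y, subset) - penalty * len(subset)

    population = [[rng.random() < 0.5 for _ in pool] for _ in range(pop_size)]
    population[0] = [True] * len(pool)  # seed with the full stage-1 result
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # elitism: best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(0, len(pool))  # one-point crossover
            child = [not g if rng.random() < 0.1 else g for g in a[:cut] + b[cut:]]
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return [f for f, keep in zip(pool, best) if keep]

X, y = make_data()
pool = forward_selection(X, y, n_features=12)
final = genetic_selection(X, y, pool)
print("after forward selection:", pool)
print("after genetic selection:", final)
```

Running the cheap greedy stage first keeps the genetic search space small, which is the point of cascading the two. A real pipeline would replace the scorer with cross-validated accuracy of the actual pixel classifier over the four classes named in the abstract.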
Published in: 2014 4th International Conference on Image Processing Theory, Tools and Applications (IPTA)
Date of Conference: 14-17 October 2014
Date Added to IEEE Xplore: 08 January 2015