18 January 2010
Semi-supervised learning for detecting text-lines in noisy document images
Zongyi Liu, Hanning Zhou
Proceedings Volume 7534, Document Recognition and Retrieval XVII; 75340C (2010) https://doi.org/10.1117/12.837362
Event: IS&T/SPIE Electronic Imaging, 2010, San Jose, California, United States
Abstract
Document layout analysis is a key step in document image understanding, with wide applications in document digitization and reformatting. Identifying the correct layout from noisy scanned images is especially challenging. In this paper, we introduce a semi-supervised learning framework to detect text-lines in noisy document images. Our framework consists of three steps. The first step is an initial segmentation that extracts text-lines and images using simple morphological operations. The second step is a grouping-based layout analysis that identifies text-lines, image zones, column separators, and vertical border noise. It efficiently removes vertical border noise from multi-column pages. The third step is an online classifier that is trained on the high-confidence line detection results from Step Two and filters out noise from the low-confidence lines. The classifier effectively removes speckle noise embedded inside the content zones. We compare the performance of our algorithm with state-of-the-art work in the field on the UW-III database, using the results reported by Image Understanding Pattern Recognition Research (IUPR) and by ScanSoft OmniPage SDK 15.5. We evaluate performance at both the page-frame level and the text-line level. The results show that our system has a much lower false-alarm rate while maintaining a similar content detection rate. In addition, we show that our online training model generalizes better than algorithms that depend on offline training.
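The third step described above (train on the page's own high-confidence lines, then filter the low-confidence ones) can be illustrated with a minimal sketch. This is not the authors' classifier: the abstract does not specify the model or the features, so the `Candidate` fields, the confidence threshold, and the simple per-feature z-score test below are all assumptions chosen only to show the per-page, self-trained filtering idea.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Candidate:
    """A candidate text-line region (all fields are hypothetical)."""
    height: float       # bounding-box height in pixels
    ink_density: float  # fraction of dark pixels inside the box
    confidence: float   # score assumed to come from the layout-analysis step


def features(c):
    # Feature vector used by the toy model (an assumption, not from the paper).
    return (c.height, c.ink_density)


def fit_page_model(high_conf):
    """Estimate per-feature mean and spread from this page's confident lines."""
    columns = list(zip(*(features(c) for c in high_conf)))
    # Floor the spread so identical training values do not cause division by zero.
    return [(mean(col), max(pstdev(col), 1e-6)) for col in columns]


def looks_like_text(c, model, max_z=3.0):
    """Accept a candidate if every feature lies within max_z deviations."""
    return all(abs(x - mu) / sigma <= max_z
               for x, (mu, sigma) in zip(features(c), model))


def filter_candidates(candidates, conf_threshold=0.8, max_z=3.0):
    """Keep confident lines; keep uncertain ones only if they resemble them."""
    high = [c for c in candidates if c.confidence >= conf_threshold]
    low = [c for c in candidates if c.confidence < conf_threshold]
    if len(high) < 2:
        # Too little evidence to fit a page model; leave the page unfiltered.
        return list(candidates)
    model = fit_page_model(high)
    return high + [c for c in low if looks_like_text(c, model, max_z)]


if __name__ == "__main__":
    page = [
        Candidate(20, 0.30, 0.95),
        Candidate(21, 0.32, 0.90),
        Candidate(19, 0.28, 0.92),
        Candidate(20, 0.31, 0.97),
        Candidate(20, 0.29, 0.50),  # faint but plausible text line
        Candidate(3, 0.90, 0.20),   # small, dense blob: likely speckle
    ]
    for kept in filter_candidates(page):
        print(kept)
```

Because the model is refit on every page, it adapts to that page's font size and scan quality without offline training data, which is the property the abstract attributes to the online approach.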
© (2010) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Zongyi Liu and Hanning Zhou "Semi-supervised learning for detecting text-lines in noisy document images", Proc. SPIE 7534, Document Recognition and Retrieval XVII, 75340C (18 January 2010); https://doi.org/10.1117/12.837362
KEYWORDS
Image segmentation
Databases
Detection and tracking algorithms
Speckle
Image processing algorithms and systems
Image understanding
Data modeling