DOI: 10.1145/1815330.1815375
research-article

A histogram-based technique for automatic threshold assessment in a run length smoothing-based algorithm

Published: 09 June 2010

Abstract

Document layout analysis is crucial in the automatic document processing workflow because its outcome affects all subsequent processing steps. A first problem is the need to handle not only documents with simple layouts but also so-called non-Manhattan layout documents. Another problem is that most available techniques can be applied only to scanned documents, because previous decades emphasized the digitization of legacy documents. Nowadays, by contrast, most documents are produced directly in digital format, and thus new techniques must be developed. A well-known approach proposed in the literature for layout analysis is RLSA, which is suitable for scanned black-and-white images and is based on the application of Run Length Smoothing and the AND logical operator. A recent variant is based on the application of the OR operator, for which reason it has been called RLSO. It exploits a bottom-up approach that proved able to handle even non-Manhattan layouts, on both scanned and natively digital documents. Like RLSA, it relies on thresholds for the smoothing operator, but its different approach requires criteria for choosing proper values that differ from those that work in RLSA. Since this is a hard and unnatural task for a user, even an expert one, this paper proposes a technique to automatically define such thresholds for each single document, based on the distribution of spacing therein. Application to selected sample documents, chosen to cover a significant range of real cases, showed that the approach is satisfactory for documents that use a uniform text font size. It can also provide a useful basis for handling more complex cases.
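To make the distinction between the two operators concrete, the following is a minimal Python sketch of run length smoothing on a small binary image (1 = black, 0 = white), with the horizontal and vertical results combined by AND (as in RLSA) or by OR (as in RLSO). It is an illustration only, not the authors' implementation: all function names are hypothetical, the choice to fill only white runs bounded by black pixels on both sides is an assumption, and any post-processing used by published variants is left out. The helper `white_run_histogram` merely collects the distribution of horizontal white-run lengths, the kind of spacing data a threshold could be derived from; the abstract does not state the actual selection rule, so none is implemented here.

```python
from collections import Counter


def smooth_row(row, threshold):
    """Fill white runs (0s) of length <= threshold lying between two black pixels (1s)."""
    out = list(row)
    last_black = None
    for i, px in enumerate(row):
        if px == 1:
            if last_black is not None and 0 < i - last_black - 1 <= threshold:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out


def smooth_horizontal(image, threshold):
    return [smooth_row(r, threshold) for r in image]


def smooth_vertical(image, threshold):
    # Transpose, smooth each column as if it were a row, transpose back.
    cols = [list(c) for c in zip(*image)]
    smoothed = [smooth_row(c, threshold) for c in cols]
    return [list(r) for r in zip(*smoothed)]


def combine(image, th_h, th_v, op):
    h = smooth_horizontal(image, th_h)
    v = smooth_vertical(image, th_v)
    return [[op(a, b) for a, b in zip(rh, rv)] for rh, rv in zip(h, v)]


def rlsa(image, th_h, th_v):
    """AND of the two smoothed images (assumed simplification of RLSA)."""
    return combine(image, th_h, th_v, lambda a, b: a & b)


def rlso(image, th_h, th_v):
    """OR of the two smoothed images (assumed simplification of RLSO)."""
    return combine(image, th_h, th_v, lambda a, b: a | b)


def white_run_histogram(image):
    """Count lengths of horizontal white runs bounded by black on both sides."""
    hist = Counter()
    for row in image:
        last_black = None
        for i, px in enumerate(row):
            if px == 1:
                if last_black is not None and i - last_black - 1 > 0:
                    hist[i - last_black - 1] += 1
                last_black = i
    return hist


if __name__ == "__main__":
    image = [[1, 0, 1],
             [0, 0, 0],
             [1, 0, 1]]
    print(rlsa(image, 1, 1))
    print(rlso(image, 1, 1))
    print(white_run_histogram(image))
```

On the toy image, AND keeps only pixels that both directions agree on, while OR merges everything either direction connects, which is why the two variants need differently chosen thresholds.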



Published In

DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
June 2010
490 pages
ISBN:9781605587738
DOI:10.1145/1815330

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. layout analysis
  2. segmentation

Qualifiers

  • Research-article

Conference

DAS '10
