Text graphic separation in Indian newspapers

Authors:
Ritu Garg (IIT Delhi, India), Anukriti Bansal (IIT Delhi, India), Santanu Chaudhury (IIT Delhi, India), Sumantra Dutta Roy (IIT Delhi, India)

MOCR '13: Proceedings of the 4th International Workshop on Multilingual OCR, August 2013, Article No. 13, Pages 1–5
https://doi.org/10.1145/2505377.2505393

Published: 24 August 2013

ABSTRACT

Digitization of newspaper articles is important for recording historical events. Layout analysis of Indian newspapers is a challenging task because of the variety of font sizes and font styles and the irregular placement of text and non-text regions. In this paper we propose a novel framework for learning optimal parameters for text-graphic separation in the presence of complex layouts. The learning problem is formulated as an optimization problem and solved with the EM algorithm, which learns optimal parameters that depend on the nature of the document content.
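The abstract does not state which model or which parameters the EM algorithm is applied to, so the following is only a generic illustration of how EM alternates between soft assignment and parameter re-estimation. It assumes, purely for the sake of the example, that each page region is summarized by one hypothetical scalar feature and that the feature values come from a mixture of two Gaussians; the function names are illustrative and are not taken from the paper.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM.

    ASSUMPTION: `values` holds one hypothetical scalar feature per region;
    the paper's actual features and parameters are not given in the abstract.
    Returns (weights, means, variances), each a list of length 2.
    """
    n = len(values)
    overall_mean = sum(values) / n
    overall_var = sum((v - overall_mean) ** 2 for v in values) / n or 1e-6

    # Initialise the two components at the extremes of the data.
    means = [min(values), max(values)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            p = [weights[k] * gaussian_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300  # guard against numerical underflow
            resp.append([pk / total for pk in p])

        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component received no mass; keep its old parameters
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var_k = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var_k, 1e-6)  # floor to avoid a collapsed component

    return weights, means, variances
```

Calling `em_two_gaussians(feature_values)` on a list of floats returns the estimated mixture weights, means, and variances; thresholding the responsibilities would then give a two-way labelling of regions under these assumptions.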

Index Terms

Text graphic separation in Indian newspapers
1. Applied computing
  1. Document management and text processing
2. Computing methodologies
  1. Machine learning

Published in
MOCR '13: Proceedings of the 4th International Workshop on Multilingual OCR
August 2013
99 pages
ISBN: 9781450321143
DOI: 10.1145/2505377
General Chairs: Venu Govindaraju (University at Buffalo), Prem Natarajan (Information Sciences Institute), Santanu Chaudhury (IIT Delhi, India), Daniel Lopresti (Lehigh University)
Program Chairs: Srirangaraj Setlur (University at Buffalo), Huaigu Cao (Raytheon BBN Technologies)
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Indian newspaper
complex layout
parameter estimation
text document image classification system
text graphic separation
Qualifiers
- research-article
Acceptance Rates
MOCR '13 Paper Acceptance Rate: 17 of 34 submissions, 50%
Overall Acceptance Rate: 17 of 34 submissions, 50%