ABSTRACT
Digitization of newspaper articles is important for registering historical events. Layout analysis of Indian newspapers is a challenging task because of varying font sizes and font styles and the random placement of text and non-text regions. In this paper, we propose a novel framework for learning optimal parameters for text-graphic separation in the presence of complex layouts. The learning problem is formulated as an optimization problem, using the EM algorithm to learn optimal parameters that depend on the nature of the document content.
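The abstract does not say which parameters or features the EM step operates on. Purely as a generic illustration of EM, and not the authors' method, the sketch below fits a two-component one-dimensional Gaussian mixture to a hypothetical feature (connected-component height in pixels), where one component might correspond to text and the other to graphics. The feature choice, the synthetic data, and the function names `em_two_gaussians` and `synthetic_heights` are all assumptions made for this example.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM.

    Returns (weights, means, variances), each a list of length 2.
    Component 0 starts at the smallest value, component 1 at the largest.
    """
    n = len(values)
    overall_mean = sum(values) / n
    overall_var = sum((v - overall_mean) ** 2 for v in values) / n

    # Crude initialisation (an assumption of this sketch).
    means = [min(values), max(values)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            p = [weights[k] * gaussian_pdf(v, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            if total == 0.0:
                resp.append([0.5, 0.5])  # guard against numerical underflow
            else:
                resp.append([p[0] / total, p[1] / total])

        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var_k = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var_k, 1e-6)  # keep the variance strictly positive

    return weights, means, variances


def synthetic_heights(seed=0):
    """Hypothetical data: 200 small 'text' heights and 50 large 'graphic' heights."""
    rng = random.Random(seed)
    text = [rng.gauss(12.0, 2.0) for _ in range(200)]
    graphics = [rng.gauss(80.0, 10.0) for _ in range(50)]
    return text + graphics


if __name__ == "__main__":
    w, m, s2 = em_two_gaussians(synthetic_heights())
    print("weights:", w)
    print("means:", m)
    print("variances:", s2)
```

On well-separated synthetic data like this, the fitted means should land near the two generating means, and a component can then be assigned to each region by comparing responsibilities. Real newspaper pages would need richer features and a model chosen to match the paper's actual formulation.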
Index Terms
- Text graphic separation in Indian newspapers