A Study on the Classification of Layout Components for Newspapers

Ferilli, Stefano; Esposito, Floriana; Redavid, Domenico

doi:10.1007/978-3-319-56300-8_15

Part of the book series: Communications in Computer and Information Science (CCIS, volume 701)

Included in the following conference series:

Italian Research Conference on Digital Libraries

Abstract

While most newspapers today are born-digital (typeset directly in PDF), until a few years ago they were available only in printed form. Digitizing the paper artifact to make it available in digital libraries yields a sequence of raster images of the pages that make up each document. Such images are just matrices of pixels and carry no explicit information about their organization into meaningful higher-level components. So, to automatically extract useful information from newspapers and index them for future retrieval, a necessary preliminary task is to identify the layout components that are meaningful from a human interpretation viewpoint.

Unfortunately, approaches proposed in the literature for automatic layout analysis are often ineffective on newspapers, whose layout is much more complex than that of, e.g., books and scientific papers. This work focuses specifically on the classification of layout blocks according to their content type. It investigates how an existing approach, which has been successfully applied to documents with a standard layout, can be adapted to newspapers by revising the description features and the set of classes. The modified approach was implemented and embedded in the DoMInUS system for document processing and management. Experimental results evaluating it are reported and discussed.
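
To make the idea of classifying layout blocks by content type from descriptive features more concrete, the following is a minimal sketch and not the approach evaluated in the chapter. The feature set (size, aspect ratio, black-pixel density, colour transitions), the class labels, the thresholds, and the names `describe_block` and `classify_block` are all assumptions introduced for illustration; hand-set rules stand in here for whatever classifier is actually learned from data.

```python
from dataclasses import dataclass
from typing import List

# A block is a binary bitmap: 1 = black (ink) pixel, 0 = white pixel.
Bitmap = List[List[int]]


@dataclass
class BlockFeatures:
    width: int
    height: int
    aspect_ratio: float         # width / height
    black_density: float        # fraction of black pixels in the block
    transitions_per_row: float  # mean number of colour changes along a row


def describe_block(block: Bitmap) -> BlockFeatures:
    """Compute a few simple descriptive features (hypothetical feature set)."""
    height = len(block)
    width = len(block[0])
    black = sum(sum(row) for row in block)
    transitions = sum(
        sum(1 for a, b in zip(row, row[1:]) if a != b) for row in block
    )
    return BlockFeatures(
        width=width,
        height=height,
        aspect_ratio=width / height,
        black_density=black / (width * height),
        transitions_per_row=transitions / height,
    )


def classify_block(f: BlockFeatures) -> str:
    """Toy rules with illustrative thresholds (assumptions, not from the chapter)."""
    if f.black_density > 0.9 and f.aspect_ratio >= 10:
        return "horizontal_line"
    if f.black_density > 0.9 and f.aspect_ratio <= 0.1:
        return "vertical_line"
    if f.transitions_per_row >= 4:
        return "text"   # many ink/background alternations, as in printed characters
    return "image"      # fallback class for everything else


if __name__ == "__main__":
    rule = [[1] * 20]              # a thin, fully black horizontal rule
    striped = [[1, 0] * 4] * 4     # alternating columns, loosely text-like
    print(classify_block(describe_block(rule)))
    print(classify_block(describe_block(striped)))
```

In a realistic setting the rules in `classify_block` would be replaced by a model trained on labelled blocks, and both the features and the set of classes would need to be revised for newspaper layouts, which is the kind of adaptation the abstract describes.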

Acknowledgments

The authors would like to thank Vincenzo Raimondi for his help in implementing the prototype. This work was partially funded by the Italian PON 2007-2013 project PON02_00563_3489339 ‘Puglia@Service’.

Author information

Authors and Affiliations

University of Bari, Bari, Italy
Stefano Ferilli & Floriana Esposito
Artificial Brain S.r.l., Bari, Italy
Domenico Redavid

Corresponding author

Correspondence to Stefano Ferilli.

Editor information

Editors and Affiliations

Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Padova, Padua, Italy
Maristella Agosti
Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Firenze, Florence, Italy
Marco Bertini
Dipartimento di Informatica, Università degli Studi di Bari, Bari, Italy
Stefano Ferilli
Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Firenze, Florence, Italy
Simone Marinai
Dipartimento dei Beni Culturali, Università degli Studi di Padova, Padua, Italy
Nicola Orio

About this paper

Cite this paper

Ferilli, S., Esposito, F., Redavid, D. (2017). A Study on the Classification of Layout Components for Newspapers. In: Agosti, M., Bertini, M., Ferilli, S., Marinai, S., Orio, N. (eds) Digital Libraries and Multimedia Archives. IRCDL 2016. Communications in Computer and Information Science, vol 701. Springer, Cham. https://doi.org/10.1007/978-3-319-56300-8_15

DOI: https://doi.org/10.1007/978-3-319-56300-8_15
Published: 08 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56299-5
Online ISBN: 978-3-319-56300-8
eBook Packages: Computer Science, Computer Science (R0)
