Using colour information to understand censorship cards of film archives

  • Original Paper
International Journal of Document Analysis and Recognition (IJDAR)

Abstract

Many European film archives are involved in the digitization of 20th-century historical paper documents. In the context of the IST project COLLATE, three of them were interested in the semi-automatic annotation of censorship cards and their subsequent retrieval on the basis of both annotations and content. Processing censorship cards, which is the main subject of this paper, poses a number of challenges for many document image analysis (DIA) systems. Problems arise from the low layout quality and low standard of such material, which introduces a considerable amount of noise into its description. The layout quality is often degraded by stamps, signatures, ink specks, manual annotations and similar marks that overlap the layout components involved in the understanding or annotation processes. To reduce the presence and effect of this noise, we propose an improved version of the knowledge-based DIA system WISDOM++ that takes full advantage of colour information in all processing steps, namely image segmentation, layout analysis, document image classification and understanding. Experiments have been conducted on a corpus of multi-format documents concerning rare historic film censorships provided by the three film archives involved in the COLLATE project.


Author information

Corresponding author

Correspondence to Michelangelo Ceci.

About this article

Cite this article

Altamura, O., Berardi, M., Ceci, M. et al. Using colour information to understand censorship cards of film archives. IJDAR 9, 281–297 (2007). https://doi.org/10.1007/s10032-006-0021-1

  • DOI: https://doi.org/10.1007/s10032-006-0021-1
