ABSTRACT
In this work we propose a system for automatic document segmentation that extracts graphical elements from historical manuscripts and then identifies the significant pictures among them, discarding floral and abstract decorations. The system performs a block-based analysis using color and texture features. We propose the Gradient Spatial Dependency Matrix, a new texture operator that is particularly effective for this task. The feature vectors are processed by an embedding procedure that improves the performance of the subsequent SVM classification. Results for both feature extraction and embedding-based classification are reported and support the effectiveness of the proposal.
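To make the idea of block-based analysis concrete, the following is a minimal sketch under stated assumptions, not the system described in the abstract. It tiles an RGB page image into non-overlapping blocks and computes, for each block, a mean color and a normalized histogram of gradient magnitudes. The function name `block_features`, the block size, the bin count, and the choice of features are all hypothetical. In particular, the gradient histogram is only a generic stand-in for a texture descriptor; it is not the Gradient Spatial Dependency Matrix, which the abstract does not define.

```python
import numpy as np


def block_features(image, block=32, bins=8):
    """Return one feature vector per non-overlapping block of an RGB image.

    Hypothetical illustration: each vector holds the mean R, G, B values
    (scaled to [0, 1]) followed by a normalized histogram of gradient
    magnitudes inside the block.
    """
    h, w, _ = image.shape
    gray = image.astype(np.float64).mean(axis=2)
    gy, gx = np.gradient(gray)           # derivatives along rows, columns
    mag = np.hypot(gx, gy)               # gradient magnitude per pixel
    max_mag = 255.0 * np.sqrt(2.0)       # upper bound for 8-bit input

    feats = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = image[y:y + block, x:x + block].reshape(-1, 3)
            mean_rgb = patch.mean(axis=0) / 255.0
            hist, _ = np.histogram(mag[y:y + block, x:x + block],
                                   bins=bins, range=(0.0, max_mag))
            hist = hist / hist.sum()     # each block has block*block pixels
            feats.append(np.concatenate([mean_rgb, hist]))
    return np.array(feats)


# Usage: a blank 64x96 page yields 2 x 3 = 6 blocks of 3 + 8 features each.
page = np.zeros((64, 96, 3), dtype=np.uint8)
features = block_features(page)
```

Per-block vectors of this kind are what a later stage, such as an embedding followed by an SVM, would take as input; that stage is not sketched here.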
Index Terms
- Picture extraction from digitized historical manuscripts