
Document analysis applied to fragments: feature set for the reconstruction of torn documents

Published: 09 June 2010

Abstract

Document analysis is typically used to analyze entire forms (e.g. intelligent form analysis, table detection) or to describe the layout and structure of a document. In this paper, document analysis is applied to snippets of torn documents to calculate features that can be used for reconstruction. The main goal is to handle snippets of varying size and content (e.g. handwritten or printed text). Documents may be destroyed deliberately to make their content unavailable (e.g. in cases of business crime), or they may degrade over time, as with ancient documents kept under poor storage conditions. Current reconstruction methods for manually torn documents rely on fragment shape or on techniques such as inpainting and texture synthesis. This paper shows the potential of applying document analysis techniques to snippets so that a reconstruction algorithm can draw on additional features. These comprise a rotational analysis, a color analysis, a line detection, a paper type analysis (checked, lined, blank) and a classification of the text (printed or handwritten). Preliminary results show that these features can be determined reliably on a real dataset consisting of 690 snippets.
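To make the idea of a rotational analysis more concrete, the following minimal sketch estimates the skew of a binarized snippet with a projection-profile search. This is a common, generic skew-estimation technique and is not claimed to be the method used in the paper; the function name `estimate_skew`, the angle grid and the sum-of-squares score are illustrative assumptions.

```python
import numpy as np

def estimate_skew(binary, angles=np.arange(-45.0, 45.5, 0.5)):
    """Return the rotation (degrees) that makes text lines most horizontal.

    `binary` is a 2-D array whose non-zero entries are ink pixels.
    Hypothetical helper: a projection-profile search, not the paper's method.
    """
    ys, xs = np.nonzero(binary)              # coordinates of ink pixels
    if ys.size == 0:
        raise ValueError("snippet contains no foreground pixels")
    # Centre the coordinates so the rotation is about the ink centroid.
    ys = ys - ys.mean()
    xs = xs - xs.mean()
    best_angle, best_score = 0.0, -np.inf
    for angle in angles:
        theta = np.deg2rad(angle)
        # Row coordinate of every ink pixel after rotating by theta.
        rotated_y = xs * np.sin(theta) + ys * np.cos(theta)
        bins = np.round(rotated_y).astype(int)
        profile = np.bincount(bins - bins.min())    # ink count per row
        # Aligned lines give a peaky profile, so the sum of squares is large.
        score = float(np.sum(profile.astype(float) ** 2))
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle
```

The returned value is the correcting rotation, so lines that slope by +10 degrees in image coordinates yield an estimate near -10. Because only ink-pixel coordinates are rotated, the sketch works on snippets of arbitrary shape and needs no image resampling, although very small fragments with few text lines may give an unreliable profile.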



Published In

DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
June 2010
490 pages
ISBN:9781605587738
DOI:10.1145/1815330

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. document reconstruction
  2. layout analysis
  3. skew

Qualifiers

  • Research-article


