Datasets and Annotations for Document Analysis and Recognition

  • Reference work entry
Handbook of Document Image Processing and Recognition

Abstract

Standard frameworks for performance evaluation are key to advancing the state of the art in any field of document analysis, since they permit a fair and objective comparison of different methods under a common scenario. For that reason, a large number of public datasets have emerged in recent years. However, creating such datasets raises several challenges: the collection must be sufficiently large and representative, and it must be easy for researchers to use. In this chapter we review the approaches the document analysis community has followed to address some of these challenges, such as collecting representative data, annotating it with ground-truth information, and representing it in common, accepted formats. We also provide a comprehensive list of existing public datasets for each area of document analysis.

Author information

Corresponding author

Correspondence to Ernest Valveny.

Copyright information

© 2014 Springer-Verlag London

About this entry

Cite this entry

Valveny, E. (2014). Datasets and Annotations for Document Analysis and Recognition. In: Doermann, D., Tombre, K. (eds) Handbook of Document Image Processing and Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-859-1_32
