Word shape descriptor-based document image indexing: a new DBH-based approach

  • Original Paper
  • Published in: International Journal on Document Analysis and Recognition (IJDAR)

Abstract

In this paper, we propose a novel feature representation for binary patterns that exploits object shape information. The representation is first evaluated on character classification for the Bengali and Gujarati scripts and is then extended to word images. Combined with distance-based hashing (DBH), the proposed representation is used to define a novel word image-based framework for document image indexing and retrieval. Hierarchical hashing is used to reduce retrieval time complexity. In addition, to reduce the size of the hashing data structure, the concept of multi-probe hashing is extended to binary mapping functions. An exhaustive experimental evaluation of the framework on a collection of documents in the Devanagari, Bengali and English scripts has yielded encouraging results.
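To make the hashing terminology in the abstract concrete, the following is a minimal Python sketch of generic distance-based hashing with binary mapping functions, plus a simple multi-probe lookup. It is not the authors' implementation. The function names, the quartile-based threshold choice, the use of Euclidean distance on plain tuples (standing in for the word shape descriptor, which the abstract does not specify) and the bit-flip probing order are all assumptions made for illustration.

```python
import itertools
import math
import random
from collections import defaultdict


def line_projection(x, p1, p2, dist):
    """Project x onto the 'line' through pivots p1 and p2 using distances only."""
    d12 = dist(p1, p2)
    return (dist(x, p1) ** 2 + d12 ** 2 - dist(x, p2) ** 2) / (2.0 * d12)


def make_binary_function(sample, dist, rng):
    """Build one binary mapping function from two random pivots.

    Assumption: the interval [t1, t2] is taken from the sample quartiles so
    that roughly half of the sample maps to 1. The sample must contain at
    least two distinct objects, otherwise no valid pivot pair exists.
    """
    while True:
        p1, p2 = rng.sample(sample, 2)
        if dist(p1, p2) > 0:
            break
    values = sorted(line_projection(s, p1, p2, dist) for s in sample)
    t1 = values[len(values) // 4]
    t2 = values[(3 * len(values)) // 4]
    return lambda x: 1 if t1 <= line_projection(x, p1, p2, dist) <= t2 else 0


def hash_key(x, functions):
    """Concatenate the binary outputs into one bucket key."""
    return tuple(h(x) for h in functions)


def build_index(items, functions):
    """Map each bucket key to the indices of the items that fall into it."""
    table = defaultdict(list)
    for idx, item in enumerate(items):
        table[hash_key(item, functions)].append(idx)
    return table


def probe_keys(key, max_flips=1):
    """Yield the key itself, then every key within max_flips bit flips of it."""
    yield key
    for r in range(1, max_flips + 1):
        for positions in itertools.combinations(range(len(key)), r):
            flipped = list(key)
            for p in positions:
                flipped[p] ^= 1
            yield tuple(flipped)


def query(q, items, table, functions, dist, max_flips=1):
    """Return the index of the closest candidate in the probed buckets, or None."""
    candidates = set()
    for k in probe_keys(hash_key(q, functions), max_flips):
        candidates.update(table.get(k, ()))
    if not candidates:
        return None
    return min(candidates, key=lambda i: dist(q, items[i]))


def demo_index(n_items=200, n_bits=8, seed=0):
    """Build a toy index over random 4-D points (hypothetical stand-in data)."""
    rng = random.Random(seed)
    items = [tuple(rng.random() for _ in range(4)) for _ in range(n_items)]
    functions = [make_binary_function(items, math.dist, rng) for _ in range(n_bits)]
    return items, functions, build_index(items, functions)
```

Probing neighbouring buckets lets a single table recover near neighbours that fall just across a threshold, which is the usual motivation for multi-probe schemes: fewer tables are needed for the same recall. Whether the paper probes by bit flips in this order is not stated in the abstract, so the probing rule here should be read as one plausible choice.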


Author information

Corresponding author

Correspondence to Ehtesham Hassan.

About this article

Cite this article

Hassan, E., Chaudhury, S. & Gopal, M. Word shape descriptor-based document image indexing: a new DBH-based approach. IJDAR 16, 227–246 (2013). https://doi.org/10.1007/s10032-012-0187-7

  • DOI: https://doi.org/10.1007/s10032-012-0187-7
