Abstract
In this paper, we propose a novel feature representation for binary patterns that exploits object shape information. The representation is first evaluated on Bengali and Gujarati script character classification and is then extended to word images. Combined with distance-based hashing, the proposed representation is used to define a novel word image-based framework for document image indexing and retrieval. Hierarchical hashing is used to reduce retrieval time complexity. In addition, to reduce the size of the hashing data structure, multi-probe hashing is extended to binary mapping functions. Exhaustive experimental evaluation of the proposed framework on a collection of documents in the Devanagari, Bengali and English scripts has yielded encouraging results.
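As a rough illustration of the indexing idea summarized above, the following minimal sketch builds binary hash functions from pairwise distances alone and probes neighbouring buckets at query time. It is not the authors' implementation. The pivot-based line projection is the commonly described form of distance-based hashing; the median threshold, the one-bit-flip probing rule, and all function names (`line_projection`, `make_binary_function`, `build_index`, `multi_probe_query`) are simplifying assumptions made for this example. Hierarchical hashing and the word shape descriptor itself are not shown.

```python
import random
from collections import defaultdict


def line_projection(x, p1, p2, dist):
    """Project x onto the 'line' through pivots p1 and p2 using distances only."""
    d12 = dist(p1, p2)
    return (dist(x, p1) ** 2 + d12 ** 2 - dist(x, p2) ** 2) / (2.0 * d12)


def make_binary_function(sample, dist, rng):
    """Build one binary mapping function from two randomly chosen pivots.

    Assumption: the bit is set by comparing the projection with the sample
    median, so the bit splits the sample roughly in half. The sample must
    contain at least two distinct objects.
    """
    while True:
        p1, p2 = rng.sample(sample, 2)
        if dist(p1, p2) > 0:
            break
    values = sorted(line_projection(x, p1, p2, dist) for x in sample)
    threshold = values[len(values) // 2]
    return lambda x: 1 if line_projection(x, p1, p2, dist) >= threshold else 0


def build_index(items, functions):
    """Store each item's position under its binary key (a tuple of bits)."""
    table = defaultdict(list)
    for idx, x in enumerate(items):
        key = tuple(h(x) for h in functions)
        table[key].append(idx)
    return table


def multi_probe_query(q, table, functions):
    """Return candidates from the query's bucket and all one-bit-flip buckets.

    Probing nearby keys in a single table stands in for keeping several
    independent tables, which is how the data structure stays small.
    """
    key = tuple(h(q) for h in functions)
    probes = [key]
    for i in range(len(key)):
        probes.append(key[:i] + (1 - key[i],) + key[i + 1:])
    candidates = []
    for p in probes:
        candidates.extend(table.get(p, []))
    return candidates
```

To try it, pass any list of objects together with a distance function, for example 2-D points with `math.dist`, create a few functions with `make_binary_function(items, math.dist, random.Random(0))`, build the table with `build_index`, and rank the candidates returned by `multi_probe_query` by their true distance to the query.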
Cite this article
Hassan, E., Chaudhury, S. & Gopal, M. Word shape descriptor-based document image indexing: a new DBH-based approach. IJDAR 16, 227–246 (2013). https://doi.org/10.1007/s10032-012-0187-7
DOI: https://doi.org/10.1007/s10032-012-0187-7