Impact of online handwriting recognition performance on text categorization

  • Original Paper
  • International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Today, there is an increasing demand for efficient archival and retrieval methods for online handwritten data. For such tasks, text categorization is of particular interest. The textual data in online documents can be extracted through online handwriting recognition; however, this process introduces errors into the resulting text. This work reports experiments on the categorization of online handwritten documents based on their textual content. We analyze the effect of word recognition errors on categorization performance by comparing a categorization system's results on texts obtained through online handwriting recognition with its results on the same texts available as ground truth. Two well-known categorization algorithms (kNN and SVM) are compared. A subset of the Reuters-21578 corpus consisting of more than 2,000 handwritten documents was collected for this study. Results show that the loss in classification rate is not significant, and that the loss in precision is significant only for recall values of 60–80%, depending on the noise level.
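The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: it assumes a toy bag-of-words kNN classifier with cosine similarity, a tiny invented corpus, and hand-made recognition errors ("export" read as "expert", "bank" read as "bark"). All names (`TRAIN`, `CLEAN`, `NOISY`, `knn_predict`, `word_error_rate`) are hypothetical, and the word error rate assumes substitution errors only, so reference and recognized texts have the same number of words.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Term-frequency vector of a whitespace-tokenized, lowercased text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train, text, k=3):
    """Majority label among the k training documents most similar to text."""
    query = bag_of_words(text)
    ranked = sorted(train, key=lambda d: cosine(bag_of_words(d[0]), query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def accuracy(train, test):
    """Fraction of test documents assigned their true label."""
    hits = sum(knn_predict(train, text) == label for text, label in test)
    return hits / len(test)

def word_error_rate(reference, recognized):
    """Share of substituted words; assumes texts are aligned word by word."""
    errors = total = 0
    for (ref, _), (hyp, _) in zip(reference, recognized):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += sum(r != h for r, h in zip(ref_words, hyp_words))
        total += len(ref_words)
    return errors / total

# Invented toy data; categories and texts are assumptions for illustration.
TRAIN = [
    ("wheat corn harvest export tonnes", "grain"),
    ("corn crop yield wheat farmers", "grain"),
    ("grain shipment wheat tonnes harvest", "grain"),
    ("bank interest rate dollar loan", "money"),
    ("dollar exchange rate bank currency", "money"),
    ("central bank loan interest currency", "money"),
]
CLEAN = [  # ground-truth transcriptions
    ("wheat harvest tonnes export", "grain"),
    ("bank dollar interest rate", "money"),
]
NOISY = [  # same documents with simulated recognition errors
    ("wheat harvest tonnes expert", "grain"),
    ("bark dollar interest rate", "money"),
]

print("word error rate:", word_error_rate(CLEAN, NOISY))
print("accuracy on ground truth:", accuracy(TRAIN, CLEAN))
print("accuracy on recognized text:", accuracy(TRAIN, NOISY))
```

The difference between the two accuracy figures is the kind of classification rate loss the study measures. In a toy setting like this one, a document can keep its label despite misrecognized words as long as enough of its other terms still match the right category.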

Author information

Corresponding author

Correspondence to Sebastián Peña Saldarriaga.

Additional information

This work was partially funded by grant ANR-06-TLOG-009 from the French National Research Agency.

About this article

Cite this article

Peña Saldarriaga, S., Viard-Gaudin, C. & Morin, E. Impact of online handwriting recognition performance on text categorization. IJDAR 13, 159–171 (2010). https://doi.org/10.1007/s10032-009-0108-6

  • DOI: https://doi.org/10.1007/s10032-009-0108-6
