Abstract
Today, there is an increasing demand for efficient archival and retrieval methods for online handwritten data. For such tasks, text categorization is of particular interest. The textual data in online documents can be extracted through online handwriting recognition; however, this process introduces errors into the resulting text. This work reports experiments on the categorization of online handwritten documents based on their textual content. We analyze the effect of word recognition errors on categorization performance by comparing the results of a categorization system on texts obtained through online handwriting recognition with its results on the same texts available as ground truth. Two well-known categorization algorithms, kNN and SVM, are compared. A subset of the Reuters-21578 corpus consisting of more than 2,000 handwritten documents was collected for this study. Results show that the loss in classification rate is not significant, and that the loss in precision is significant only at recall values of 60–80%, depending on the noise level.
Additional information
This work was partially funded by grant ANR-06-TLOG-009 from the French National Research Agency.
About this article
Cite this article
Peña Saldarriaga, S., Viard-Gaudin, C. & Morin, E. Impact of online handwriting recognition performance on text categorization. IJDAR 13, 159–171 (2010). https://doi.org/10.1007/s10032-009-0108-6
DOI: https://doi.org/10.1007/s10032-009-0108-6