Impact of online handwriting recognition performance on text categorization

Peña Saldarriaga, Sebastián; Viard-Gaudin, Christian; Morin, Emmanuel

doi:10.1007/s10032-009-0108-6

Original Paper
Published: 16 January 2010

Volume 13, pages 159–171 (2010)

International Journal on Document Analysis and Recognition (IJDAR)

Sebastián Peña Saldarriaga¹,
Christian Viard-Gaudin² &
Emmanuel Morin¹

Abstract

Today, there is an increasing demand for efficient archival and retrieval methods for online handwritten data. For such tasks, text categorization is of particular interest. The textual data available in online documents can be extracted through online handwriting recognition; however, this process introduces errors into the resulting text. This work reports experiments on the categorization of online handwritten documents based on their textual contents. We analyze the effect of word recognition errors on categorization performance by comparing the performance of a categorization system on texts obtained through online handwriting recognition with its performance on the same texts available as ground truth. Two well-known categorization algorithms (kNN and SVM) are compared in this work. A subset of the Reuters-21578 corpus consisting of more than 2,000 handwritten documents has been collected for this study. Results show that the loss in classification rate is not significant, and that precision loss is only significant for recall values of 60–80%, depending on the noise level.
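The comparison described in the abstract can be illustrated with a small sketch. It is not the authors' system: the toy training texts, the category labels, the helper names (`word_error_rate`, `cosine`, `knn_predict`) and the choice of raw term counts with cosine similarity are all assumptions made for illustration. It measures the word-level error between a ground-truth text and a hypothetical recognizer output, then checks whether a simple kNN categorizer assigns both versions to the same category.

```python
from collections import Counter
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_predict(train, text, k=3):
    """Majority vote among the k training texts most similar to `text`."""
    query = Counter(text.lower().split())
    ranked = sorted(
        train,
        key=lambda item: cosine(Counter(item[0].lower().split()), query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, not taken from the paper's corpus.
TRAIN = [
    ("oil prices rose as crude supply fell", "crude"),
    ("crude oil output cut by producers", "crude"),
    ("opec agreed on oil production quotas", "crude"),
    ("wheat harvest and grain exports increased", "grain"),
    ("grain shipments of wheat and corn delayed", "grain"),
    ("corn and wheat crop estimates lowered", "grain"),
]
GROUND_TRUTH = "crude oil prices fell sharply"
RECOGNIZED = "crude oil prizes fell sharply"  # one simulated recognition error

if __name__ == "__main__":
    print("word error rate:", word_error_rate(GROUND_TRUTH, RECOGNIZED))
    print("ground truth ->", knn_predict(TRAIN, GROUND_TRUTH))
    print("recognized   ->", knn_predict(TRAIN, RECOGNIZED))
```

In this toy case a single misrecognized word leaves enough shared terms for the categorizer to reach the same decision on both versions, which is the kind of robustness the abstract reports at the level of classification rate.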

Author information

Authors and Affiliations

LINA UMR CNRS 6241, Université de Nantes, Nantes, France
Sebastián Peña Saldarriaga & Emmanuel Morin
IRCCyN UMR CNRS 6597, Université de Nantes, Nantes, France
Christian Viard-Gaudin

Corresponding author

Correspondence to Sebastián Peña Saldarriaga.

Additional information

This work was partially funded by grant ANR-06-TLOG-009 from the French National Research Agency.

About this article

Cite this article

Peña Saldarriaga, S., Viard-Gaudin, C. & Morin, E. Impact of online handwriting recognition performance on text categorization. IJDAR 13, 159–171 (2010). https://doi.org/10.1007/s10032-009-0108-6

Received: 06 April 2009
Revised: 30 October 2009
Accepted: 04 December 2009
Published: 16 January 2010
Issue Date: June 2010
DOI: https://doi.org/10.1007/s10032-009-0108-6
