Abstract
The spread of scanners has rapidly advanced the digitization of paper documents. However, tagging or sorting a large number of scanned documents one by one costs considerable time and effort. A system that automatically extracts features from the document text obtained by OCR, and then classifies or retrieves documents, would therefore be useful. Latent Dirichlet Allocation (LDA), one of the most popular topic models, is a known method for extracting the features of each document and the relationships between documents. However, the performance of LDA has been reported to decline as OCR recognition accuracy falls. This paper considers the application of LDA to Japanese OCR documents and studies a method for improving topic inference. It defines the reliability of each recognized word using N-grams and proposes a weighted LDA method based on that reliability. The adequacy of the reliability measure is confirmed in a preliminary experiment on detecting falsely recognized words, and an experiment on classifying practical OCR documents is then carried out. The experimental results show that the proposed method improves classification performance compared with conventional methods.
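The abstract does not give the reliability formula or the exact weighting scheme, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a character bigram model with add-one smoothing, a reliability score defined as the geometric mean of bigram probabilities, and a collapsed Gibbs sampler in which each token contributes its reliability instead of a count of 1. The names `train_bigram`, `reliability`, and `weighted_lda` are hypothetical, and the toy data uses ASCII strings rather than Japanese text.

```python
import math
import random
from collections import Counter

def train_bigram(clean_words):
    """Count character bigrams (with ^/$ boundary markers) in trusted text."""
    bigrams, unigrams = Counter(), Counter()
    for w in clean_words:
        chars = ["^"] + list(w) + ["$"]
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    return bigrams, unigrams

def reliability(word, bigrams, unigrams, vocab_size):
    """Assumed score: geometric mean of add-one-smoothed bigram probabilities."""
    chars = ["^"] + list(word) + ["$"]
    logp = 0.0
    for a, b in zip(chars, chars[1:]):
        logp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return math.exp(logp / (len(chars) - 1))

def weighted_lda(docs, weights, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling where each token adds weights[word], not 1."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    idx = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)
    n_dk = [[0.0] * n_topics for _ in docs]            # document-topic mass
    n_kw = [[0.0] * n_vocab for _ in range(n_topics)]  # topic-word mass
    n_k = [0.0] * n_topics                             # total mass per topic
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)  # random initial topic
            z[d].append(k)
            r = weights[w]
            n_dk[d][k] += r
            n_kw[k][idx[w]] += r
            n_k[k] += r
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, r, v = z[d][i], weights[w], idx[w]
                # Remove the token's weighted contribution before resampling.
                n_dk[d][k] -= r
                n_kw[k][v] -= r
                n_k[k] -= r
                p = [(n_dk[d][t] + alpha) * (n_kw[t][v] + beta)
                     / (n_k[t] + n_vocab * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=p)[0]
                z[d][i] = k
                n_dk[d][k] += r
                n_kw[k][v] += r
                n_k[k] += r
    # Per-document topic proportions from the weighted counts.
    return [[(c + alpha) / (sum(row) + n_topics * alpha) for c in row]
            for row in n_dk]

# Toy usage: "m0de1" stands in for a misrecognized word.
clean = ["topic", "model", "document"]
bg, ug = train_bigram(clean)
V = len({c for w in clean for c in w}) + 2  # characters plus ^ and $
docs = [["topic", "model", "m0de1"], ["document", "model"]]
weights = {w: reliability(w, bg, ug, V) for doc in docs for w in doc}
theta = weighted_lda(docs, weights, n_topics=2)
```

In this sketch a word containing unseen character bigrams receives a lower score, so it adds less mass to the topic counts and has less influence on the inferred topic proportions in `theta`.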
Keywords
- Topic Model
- Latent Dirichlet Allocation
- False Recognition
- Optical Character Recognition
- Dirichlet Distribution
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tamura, K., Yoshikawa, T., Furuhashi, T. (2013). A Study on Document Retrieval System Based on Visualization to Manage OCR Documents. In: Kurosu, M. (eds) Human-Computer Interaction. Interaction Modalities and Techniques. HCI 2013. Lecture Notes in Computer Science, vol 8007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39330-3_80
DOI: https://doi.org/10.1007/978-3-642-39330-3_80
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39329-7
Online ISBN: 978-3-642-39330-3
eBook Packages: Computer Science, Computer Science (R0)