A Study on Document Retrieval System Based on Visualization to Manage OCR Documents

Tamura, Kazuki; Yoshikawa, Tomohiro; Furuhashi, Takeshi

doi:10.1007/978-3-642-39330-3_80

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8007)

Included in the following conference series:

International Conference on Human-Computer Interaction

Abstract

Recently, the digitization of paper-based documents has advanced rapidly with the spread of scanners. However, tagging or sorting a huge number of scanned documents one by one is costly in time and effort. A system that automatically extracts features from the document text obtained by OCR, and then classifies and retrieves the documents, would therefore be useful. Latent Dirichlet Allocation (LDA), one of the most popular topic models, is a known method for extracting the features of each document and the relationships between documents. However, the performance of LDA has been reported to decline as OCR recognition accuracy falls. This paper considers the application of LDA to Japanese OCR documents and studies a method to improve the performance of topic inference. It defines the reliability of recognized words using N-grams and proposes a weighted LDA method based on that reliability. The adequacy of the reliability measure is confirmed through a preliminary experiment on detecting falsely recognized words, and an experiment on classifying practical OCR documents is then carried out. The experimental results show that the proposed method improves classification performance compared with conventional methods.
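
The abstract does not give the exact definition of word reliability or of the weighting scheme, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a character-bigram model trained on trusted text, takes the geometric mean of smoothed bigram probabilities as a word's reliability, and accumulates reliability-weighted word counts of the kind a term-weighted LDA could consume. All function names, the smoothing constant `alpha`, and the boundary markers `^` and `$` are hypothetical choices for illustration, not the authors' method.

```python
import math
from collections import Counter


def train_bigram_model(clean_texts):
    """Count characters and character bigrams in trusted (error-free) text.

    Assumption: '^' and '$' mark word boundaries and do not occur in the text.
    """
    unigrams, bigrams = Counter(), Counter()
    for text in clean_texts:
        padded = "^" + text + "$"
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def word_reliability(word, unigrams, bigrams, alpha=1.0):
    """Geometric mean of add-alpha smoothed bigram probabilities, in (0, 1]."""
    vocab_size = len(unigrams) + 1  # one extra bucket for unseen characters
    padded = "^" + word + "$"
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        log_prob += math.log(p)
    return math.exp(log_prob / (len(padded) - 1))


def weighted_counts(tokens, unigrams, bigrams):
    """Replace each unit word count by the word's reliability (hypothetical weighting)."""
    weights = Counter()
    for tok in tokens:
        weights[tok] += word_reliability(tok, unigrams, bigrams)
    return dict(weights)


if __name__ == "__main__":
    uni, bi = train_bigram_model(["document", "retrieval", "system", "topic", "model"])
    # A plausibly misrecognized token should score lower than a well-formed one.
    for token in ["document", "dqcwmxnt"]:
        print(token, round(word_reliability(token, uni, bi), 4))
    print(weighted_counts(["topic", "topic", "tqpic"], uni, bi))
```

Because the model works on characters rather than on whitespace-delimited words, the same sketch applies to Japanese text without modification; in a weighted topic model, the fractional counts would stand in for the integer counts normally accumulated during inference.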

Author information

Authors and Affiliations

Nagoya University, Furo-cho, Chikusa-ku, Nagoya, 464-8603, Japan
Kazuki Tamura, Tomohiro Yoshikawa & Takeshi Furuhashi

Editor information

Editors and Affiliations

The Open University of Japan, 2-11 Wakaba, 261-8586, Mihama-ku, Chiba-shi, Japan
Masaaki Kurosu

About this paper

Cite this paper

Tamura, K., Yoshikawa, T., Furuhashi, T. (2013). A Study on Document Retrieval System Based on Visualization to Manage OCR Documents. In: Kurosu, M. (eds) Human-Computer Interaction. Interaction Modalities and Techniques. HCI 2013. Lecture Notes in Computer Science, vol 8007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39330-3_80

DOI: https://doi.org/10.1007/978-3-642-39330-3_80
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39329-7
Online ISBN: 978-3-642-39330-3
eBook Packages: Computer Science, Computer Science (R0)
