Integrating natural language processing with image document analysis: what we learned from two real-world applications

Chen, Jinying; Cao, Huaigu; Natarajan, Premkumar

doi:10.1007/s10032-015-0247-x

Original Paper
Published: 28 May 2015

Volume 18, pages 235–247 (2015)

International Journal on Document Analysis and Recognition (IJDAR)

Jinying Chen¹,
Huaigu Cao² &
Premkumar Natarajan³

Abstract

Automatically accessing information from unconstrained image documents has important applications in business and government operations. These real-world applications typically combine optical character recognition (OCR) with language and information technologies, such as machine translation (MT) and keyword spotting. OCR output contains errors and presents unique challenges to late-stage processing. This paper addresses two of these challenges: (1) translating the output of Arabic handwriting OCR, which lacks reliable sentence boundary markers, and (2) searching for named entities that do not exist in the OCR vocabulary and are therefore completely missing from Arabic handwriting OCR output. We address these challenges by leveraging natural language processing technologies, specifically conditional random field-based sentence boundary detection and out-of-vocabulary (OOV) name detection. Sentence boundary detection significantly improves our state-of-the-art MT system and achieves MT scores close to those achieved with human segmentation. The output of OOV name detection was used as a novel feature for discriminative reranking, which significantly reduced the false alarm rate of OOV name search on OCR output. Our experiments also show substantial performance gains from integrating a variety of features from multiple resources, such as linguistic analysis, image layout analysis, and image text recognition.

Author information

Authors and Affiliations

Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA, USA
Jinying Chen
Department of Speech, Language and Multimedia, Raytheon BBN Technologies, Cambridge, MA, USA
Huaigu Cao
Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292, USA
Premkumar Natarajan

Corresponding author

Correspondence to Jinying Chen.

Additional information

This paper is based upon work supported by the DARPA MADCAT program (Approved for Public Release, Distribution Unlimited). The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the US Government.

About this article

Cite this article

Chen, J., Cao, H. & Natarajan, P. Integrating natural language processing with image document analysis: what we learned from two real-world applications. IJDAR 18, 235–247 (2015). https://doi.org/10.1007/s10032-015-0247-x

Received: 12 November 2013
Revised: 25 April 2015
Accepted: 07 May 2015
Published: 28 May 2015
Issue Date: September 2015
DOI: https://doi.org/10.1007/s10032-015-0247-x
