Integrating natural language processing with image document analysis: what we learned from two real-world applications

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Automatically accessing information from unconstrained image documents has important applications in business and government operations. These real-world applications typically combine optical character recognition (OCR) with language and information technologies, such as machine translation (MT) and keyword spotting. OCR output contains errors and presents unique challenges to later-stage processing. This paper addresses two of these challenges: (1) translating the output of Arabic handwriting OCR, which lacks reliable sentence boundary markers, and (2) searching for named entities that do not exist in the OCR vocabulary and are therefore completely missing from Arabic handwriting OCR output. We address these challenges by leveraging natural language processing technologies, specifically conditional random field-based sentence boundary detection and out-of-vocabulary (OOV) name detection. The sentence boundary detection significantly improves our state-of-the-art MT system and achieves MT scores close to those achieved with human segmentation. The output from OOV name detection was used as a novel feature for discriminative reranking, which significantly reduced the false alarm rate of OOV name search on OCR output. Our experiments also show substantial performance gains from integrating a variety of features from multiple resources, such as linguistic analysis, image layout analysis, and image text recognition.
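As a rough illustration of how sentence boundary detection feeds translation, the sketch below shows only the final step: cutting an unsegmented OCR token stream into sentence-sized units once a sequence labeler (for example, a conditional random field) has assigned one label per token. The function name and the "B"/"O" label scheme are assumptions made for this example and are not taken from the paper.

```python
from typing import List


def split_on_boundaries(tokens: List[str], labels: List[str]) -> List[List[str]]:
    """Group tokens into sentences.

    Assumed (hypothetical) label scheme: "B" marks the last token of a
    sentence, "O" marks every other token.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    sentences: List[List[str]] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == "B":
            # A predicted boundary closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens with no predicted boundary form a final unit.
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    ocr_tokens = ["w1", "w2", "w3", "w4", "w5"]
    predicted = ["O", "B", "O", "O", "B"]
    for sentence in split_on_boundaries(ocr_tokens, predicted):
        print(" ".join(sentence))
```

Each resulting unit could then be passed to an MT system separately, which is the setting in which the abstract compares automatic segmentation with human segmentation.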

Notes

  1. http://taku910.github.io/crfpp/

  2. http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html

  3. http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm

Author information

Corresponding author

Correspondence to Jinying Chen.

Additional information

This paper is based upon work supported by the DARPA MADCAT program (Approved for Public Release, Distribution Unlimited). The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the US Government.

About this article

Cite this article

Chen, J., Cao, H. & Natarajan, P. Integrating natural language processing with image document analysis: what we learned from two real-world applications. IJDAR 18, 235–247 (2015). https://doi.org/10.1007/s10032-015-0247-x

  • DOI: https://doi.org/10.1007/s10032-015-0247-x
