Successfully detecting and correcting false friends using channel profiles

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

The detection and correction of false friends (also called real-word errors) is a notoriously difficult problem. On realistic data, the break-even point for automatic correction has so far not been reached: the infelicitous corrections introduced outnumbered the useful ones. We present a new approach in which we first compute a profile of the error channel for the given text. During the correction process, the profile (1) helps to restrict attention to a small set of “suspicious” lexical tokens of the input text for which it is “plausible” to assume that the token represents a false friend. In this way, recognition of false friends is improved. Furthermore, the profile (2) helps to isolate the “most promising” correction suggestion for each “suspicious” token. Using conventional word trigram statistics for disambiguation, we obtain a correction method that can be successfully applied to unrestricted text. In experiments on OCR documents, we show significant accuracy gains from fully automatic correction of false friends.
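
To make the two roles of the profile concrete, the following is a minimal Python sketch of the general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the profile entries, the lexicon, the trigram counts, the threshold `min_prob`, and the function names `candidates`, `context_score`, and `correct` are all hypothetical, and the add-one scoring is a stand-in for real trigram statistics.

```python
from typing import Dict, List, Tuple

# Hypothetical channel profile: (intended, observed) -> estimated frequency
# of that OCR substitution in the given text. The values are made up.
PROFILE: Dict[Tuple[str, str], float] = {
    ("m", "rn"): 0.03,
    ("e", "c"): 0.02,
}

# Toy lexicon and toy trigram counts standing in for real resources.
LEXICON = {"the", "modern", "modem", "art", "is", "slow", "museum", "of"}
TRIGRAMS: Dict[Tuple[str, ...], int] = {
    ("the", "modem", "is"): 12,
    ("the", "modern", "is"): 1,
    ("modem", "is", "slow"): 8,
}


def candidates(token: str, min_prob: float = 0.01) -> List[Tuple[str, float]]:
    """Lexicon words that a profiled substitution could have turned into `token`."""
    found = []
    for (intended, observed), prob in PROFILE.items():
        if prob < min_prob:
            continue  # substitution too rare in this text to be plausible
        start = token.find(observed)
        while start != -1:
            # Undo the substitution at this position.
            word = token[:start] + intended + token[start + len(observed):]
            if word in LEXICON and word != token:
                found.append((word, prob))
            start = token.find(observed, start + 1)
    return found


def context_score(tokens: List[str], i: int, word: str) -> float:
    """Product of add-one trigram counts over the three trigrams covering position i."""
    padded = ["<s>", "<s>"] + tokens[:i] + [word] + tokens[i + 1:] + ["</s>", "</s>"]
    j = i + 2  # index of `word` inside the padded sequence
    score = 1.0
    for k in range(j - 2, j + 1):
        score *= TRIGRAMS.get(tuple(padded[k:k + 3]), 0) + 1
    return score


def correct(tokens: List[str]) -> List[str]:
    out = list(tokens)
    for i, token in enumerate(tokens):
        cands = candidates(token)
        if not cands:
            continue  # role (1): token is not "suspicious", leave it alone
        # Role (2): keep only the candidate the profile rates most likely.
        best_word, _ = max(cands, key=lambda c: c[1])
        # Trigram context decides between the token and that candidate.
        if context_score(tokens, i, best_word) > context_score(tokens, i, token):
            out[i] = best_word
    return out
```

With these toy tables, `correct(["the", "modern", "is", "slow"])` replaces "modern" with "modem", because the profile marks the token as a plausible "m" to "rn" confusion and the trigram context favours "modem". In `["museum", "of", "modern", "art"]` the same token is also suspicious, but the context gives no support for a change, so it is left untouched.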

Author information

Corresponding author

Correspondence to Christoph Ringlstetter.

About this article

Cite this article

Reffle, U., Gotscharek, A., Ringlstetter, C. et al. Successfully detecting and correcting false friends using channel profiles. IJDAR 12, 165–174 (2009). https://doi.org/10.1007/s10032-009-0091-y

  • DOI: https://doi.org/10.1007/s10032-009-0091-y