Digitization, Coded Character Sets, and Optical Character Recognition for Multi-script Information Resources: The Case of the Letopis’ Zhurnal’nykh Statei

  • Conference paper
Research and Advanced Technology for Digital Libraries (ECDL 2001)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2163))

Abstract

Multilingual information resources whose texts span more scripts than a single 8-bit encoding scheme can represent are currently best represented with the Unicode multi-byte character-encoding scheme. However, using Unicode can lead to a decrease in the accuracy of Optical Character Recognition (OCR) software, because glyphs in certain scripts closely resemble one another. This decrease in OCR accuracy can dramatically increase the time needed to proofread the resulting electronic texts. An Indiana University Digital Library Program project to digitize a 20-year portion of the Letopis’ Zhurnal’nykh Statei is presented as an example of a digital library project that deals with a multi-script information resource for which Unicode has been used.
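The glyph-similarity problem can be illustrated with a minimal Python sketch. It is not taken from the paper: the helper names `script_of` and `mixed_script_words` are hypothetical, and the script label is only approximated from the first word of each character's Unicode name. The sketch flags words that mix scripts, which is one plausible symptom of an OCR engine substituting a Latin letter for a look-alike Cyrillic one.

```python
import unicodedata


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name.

    Assumption for illustration only: this is a rough heuristic, not the
    Unicode Script property.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def mixed_script_words(text):
    """Return the words in `text` whose letters come from more than one script."""
    flagged = []
    for word in text.split():
        scripts = {script_of(c) for c in word if c.isalpha()}
        if len(scripts) > 1:
            flagged.append(word)
    return flagged


# Latin "A" (U+0041) and Cyrillic "А" (U+0410) look alike but are distinct characters.
latin_a = "\u0041"
cyrillic_a = "\u0410"

# "Москва" written entirely in Cyrillic, and the same word with a Latin "M" in front.
clean_word = "\u041c\u043e\u0441\u043a\u0432\u0430"
suspect_word = "M\u043e\u0441\u043a\u0432\u0430"

if __name__ == "__main__":
    print(latin_a == cyrillic_a)
    print(mixed_script_words(clean_word + " " + suspect_word + " Moscow"))
```

In a proofreading workflow, a check of this kind could narrow manual review to the flagged words. It would not catch a word in which every letter was replaced by a look-alike from another script.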

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Spencer, G.A. (2001). Digitization, Coded Character Sets, and Optical Character Recognition for Multi-script Information Resources: The Case of the Letopis’ Zhurnal’nykh Statei. In: Constantopoulos, P., Sølvberg, I.T. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2001. Lecture Notes in Computer Science, vol 2163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44796-2_36

  • DOI: https://doi.org/10.1007/3-540-44796-2_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42537-3

  • Online ISBN: 978-3-540-44796-2

  • eBook Packages: Springer Book Archive
