Autotag: A tool for creating structured document collections from printed materials

  • Part III: EP'98
  • Conference paper
Electronic Publishing, Artistic Imaging, and Digital Typography (RIDT 1998)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1375)

Abstract

We report on the design and implementation of a system that automates the capture of structured documents from the optically recognized form of printed materials. The system is intended to convert printed collections into their corresponding tagged electronic versions with little or no manual intervention. This conversion process poses some unique problems, which we discuss along with our attempts to solve them. The system also establishes a mapping between the bitmap image and its corresponding ASCII representation, which can be used to design flexible image-based interfaces for IR-related applications.

Editor information

Roger D. Hersch, Jacques André, Heather Brown

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Taghva, K., Condit, A., Borsack, J. (1998). Autotag: A tool for creating structured document collections from printed materials. In: Hersch, R.D., André, J., Brown, H. (eds) Electronic Publishing, Artistic Imaging, and Digital Typography. RIDT 1998. Lecture Notes in Computer Science, vol 1375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0053288

  • DOI: https://doi.org/10.1007/BFb0053288

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-64298-5

  • Online ISBN: 978-3-540-69718-3

  • eBook Packages: Springer Book Archive
