Abstract
We report on the design and implementation of a system that automates the capture of structured documents from the optically recognized form of printed materials. The system is intended to convert printed collections into corresponding tagged electronic versions with little or no manual intervention. This conversion process poses some unique problems, which we discuss along with our attempts to solve them. The system also establishes a mapping between the bitmap image and its corresponding ASCII representation, which can be used to design flexible image-based interfaces for IR-related applications.
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Taghva, K., Condit, A., Borsack, J. (1998). Autotag: A tool for creating structured document collections from printed materials. In: Hersch, R.D., André, J., Brown, H. (eds) Electronic Publishing, Artistic Imaging, and Digital Typography. RIDT 1998. Lecture Notes in Computer Science, vol 1375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0053288
DOI: https://doi.org/10.1007/BFb0053288
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64298-5
Online ISBN: 978-3-540-69718-3
eBook Packages: Springer Book Archive