Abstract
Methods and tools for finding documents relevant to a user's needs in document corpora can be found in the information retrieval, library science, and hypertext communities. Typically, these systems provide retrieval capabilities for fairly static corpora, their algorithms depend on the language for which they were written (e.g., English), and they do not perform well when presented with misspelled words or text that has been degraded by OCR (optical character recognition). In this chapter, we present the TELLTALE system. TELLTALE is a dynamic hypertext environment that provides full-text search through a hypertext-style user interface. It handles text corpora that may be garbled by OCR or transmission errors and that may contain languages other than English, using several techniques based on n-grams (sequences of n characters of text). We identify methods and techniques that we have applied to the n-gram data structures, and we discuss algorithms that we used to enhance the scalability of the TELLTALE Dynamic Hypertext System.
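To make the n-gram idea concrete, the following is a minimal sketch of how character n-grams can support matching that tolerates garbled text. It is not TELLTALE's implementation: the function names `ngram_profile` and `cosine_similarity`, the choice of n = 5, the whitespace and case normalization, and the use of plain cosine similarity are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 5) -> Counter:
    """Count overlapping character n-grams in text.

    Assumption: lowercase the text and collapse runs of whitespace
    before sliding an n-character window over it.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical inputs: a clean phrase, the same phrase with simulated
# OCR errors ("m" read as "rn", "l" read as "1"), and unrelated text.
clean = ngram_profile("information retrieval from hypertext")
garbled = ngram_profile("inforrnation retrieva1 from hypertext")
unrelated = ngram_profile("the quick brown fox jumps over the lazy dog")

print(cosine_similarity(clean, garbled))
print(cosine_similarity(clean, unrelated))
```

Because a single character error only disturbs the few n-grams that overlap it, the garbled phrase still shares most of its n-grams with the clean one, whereas a word-based index would treat "inforrnation" as an entirely different term. No dictionary or stemmer is involved, which is also why the approach does not depend on the language of the text.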
Copyright information
© 1997 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Pearce, C., Miller, E. (1997). The TELLTALE dynamic hypertext environment: Approaches to scalability. In: Nicholas, C., Mayfield, J. (eds) Intelligent Hypertext. WIH WIH 1994 1993. Lecture Notes in Computer Science, vol 1326. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0023962
DOI: https://doi.org/10.1007/BFb0023962
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63637-3
Online ISBN: 978-3-540-69622-3
eBook Packages: Springer Book Archive