Abstract
The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This paper reviews some recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text, and puts it in the context of digital library applications. We describe three areas: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to generic entity extraction, word segmentation, and acronym extraction; and keyphrase extraction.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Witten, I.H. (2002). Learning Structure from Sequences, with Applications in a Digital Library. In: Cesa-Bianchi, N., Numao, M., Reischuk, R. (eds) Algorithmic Learning Theory. ALT 2002. Lecture Notes in Computer Science, vol 2533. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36169-3_6
DOI: https://doi.org/10.1007/3-540-36169-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00170-6
Online ISBN: 978-3-540-36169-5