Learning Structure from Sequences, with Applications in a Digital Library

Witten, Ian H.

doi:10.1007/3-540-36169-3_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2533)

Included in the following conference series:

International Conference on Algorithmic Learning Theory

Abstract

The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This paper reviews some recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text, and puts it in the context of digital library applications. We describe three areas: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to generic entity extraction, word segmentation, and acronym extraction; and keyphrase extraction.

Author information

Authors and Affiliations

Department of Computer Science, University of Waikato, Hamilton, New Zealand
Ian H. Witten

Editor information

Editors and Affiliations

Dipartimento di Tecnologie dell’Informazione, Università degli Studi di Milano, via Bramante 65, 26013, Crema (CR), Italy
Nicolò Cesa-Bianchi
Department of Computer Science, Tokyo Institute of Technology, 2-12-1, Ohokayama Meguro Ward, 152-8552, Tokyo, Japan
Masayuki Numao
Institut für Theoretische Informatik, Universität zu Lübeck, Wallstr. 40, 23560, Lübeck, Germany
Rüdiger Reischuk

Cite this paper

Witten, I.H. (2002). Learning Structure from Sequences, with Applications in a Digital Library. In: Cesa-Bianchi, N., Numao, M., Reischuk, R. (eds) Algorithmic Learning Theory. ALT 2002. Lecture Notes in Computer Science, vol 2533. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36169-3_6

DOI: https://doi.org/10.1007/3-540-36169-3_6
Published: 08 November 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00170-6
Online ISBN: 978-3-540-36169-5
eBook Packages: Springer Book Archive
