Learning Structure from Sequences, with Applications in a Digital Library

  • Conference paper
Algorithmic Learning Theory (ALT 2002)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2533)

Abstract

The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This paper reviews some recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text, and puts it in the context of digital library applications. We describe three areas: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to generic entity extraction, word segmentation, and acronym extraction; and keyphrase extraction.

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Witten, I.H. (2002). Learning Structure from Sequences, with Applications in a Digital Library. In: Cesa-Bianchi, N., Numao, M., Reischuk, R. (eds) Algorithmic Learning Theory. ALT 2002. Lecture Notes in Computer Science, vol 2533. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36169-3_6

  • DOI: https://doi.org/10.1007/3-540-36169-3_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00170-6

  • Online ISBN: 978-3-540-36169-5

  • eBook Packages: Springer Book Archive
