An Approach to Indexing and Clustering News Stories Using Continuous Language Models

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6177)

Abstract

Within the vocabulary used in a set of news stories, a minority of terms will be topic-specific, in that they occur largely or solely within those stories belonging to a common event. When applying unsupervised learning techniques such as clustering, it is useful to determine which words are event-specific and which topic they relate to. Continuous language models are used to model the generation of news stories over time, and from these models two measures are derived: bendiness, which indicates whether a word is event-specific, and shape distance, which indicates whether two terms are likely to relate to the same topic. These measures are used to construct a new clustering technique that identifies and characterises the underlying events within the news stream.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bache, R., Crestani, F. (2010). An Approach to Indexing and Clustering News Stories Using Continuous Language Models. In: Hopfe, C.J., Rezgui, Y., Métais, E., Preece, A., Li, H. (eds) Natural Language Processing and Information Systems. NLDB 2010. Lecture Notes in Computer Science, vol 6177. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13881-2_11

  • DOI: https://doi.org/10.1007/978-3-642-13881-2_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13880-5

  • Online ISBN: 978-3-642-13881-2

  • eBook Packages: Computer Science, Computer Science (R0)
