An Approach to Indexing and Clustering News Stories Using Continuous Language Models

Bache, Richard; Crestani, Fabio

doi:10.1007/978-3-642-13881-2_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6177)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

Within the vocabulary used in a set of news stories, a minority of terms will be topic-specific, in that they occur largely or solely within those stories belonging to a common event. When applying unsupervised learning techniques such as clustering, it is useful to determine which words are event-specific and which topic they relate to. Continuous language models are used to model the generation of news stories over time, and from these models two measures are derived: bendiness, which indicates whether a word is event-specific, and shape distance, which indicates whether two terms are likely to relate to the same topic. These measures are used to construct a new clustering technique that identifies and characterises the underlying events within the news stream.

Author information

Authors and Affiliations

University of Glasgow, Glasgow, Scotland
Richard Bache
University of Lugano, Lugano, Switzerland
Fabio Crestani

Editor information

Editors and Affiliations

School of Engineering, Cardiff University, UK
Christina J. Hopfe & Haijiang Li
Informatics Research Institute, University of Salford, M5 4WT, Greater Manchester, UK
Yacine Rezgui
Centre National des Arts et Métiers
Elisabeth Métais
School of Computer Science, Cardiff University, UK
Alun Preece

Cite this paper

Bache, R., Crestani, F. (2010). An Approach to Indexing and Clustering News Stories Using Continuous Language Models. In: Hopfe, C.J., Rezgui, Y., Métais, E., Preece, A., Li, H. (eds) Natural Language Processing and Information Systems. NLDB 2010. Lecture Notes in Computer Science, vol 6177. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13881-2_11

DOI: https://doi.org/10.1007/978-3-642-13881-2_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13880-5
Online ISBN: 978-3-642-13881-2
eBook Packages: Computer Science, Computer Science (R0)