Abstract
Within the vocabulary used in a set of news stories, a minority of terms will be topic-specific in that they occur largely or solely within those stories belonging to a common event. When applying unsupervised learning techniques such as clustering, it is useful to determine which words are event-specific and which topic they relate to. Continuous language models are used to model the generation of news stories over time, and from these models two measures are derived: bendiness, which indicates whether a word is event-specific, and shape distance, which indicates whether two terms are likely to relate to the same topic. These measures are used to construct a new clustering technique which identifies and characterises the underlying events within the news stream.
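The abstract does not define bendiness or shape distance, so the following is only a minimal sketch of the underlying intuition, not the authors' method. It assumes a simple binned estimate of how often a term appears in stories over time, and uses two stand-in quantities: `flatness_deviation` (how far a term's time profile departs from a constant rate) and `profile_distance` (how different two terms' normalised profiles are). All function names, the binning scheme, and the toy stories are hypothetical.

```python
from math import sqrt


def term_profile(stories, term, n_bins, t_min, t_max):
    """Fraction of stories in each time bin that contain `term`.

    `stories` is a list of (timestamp, set_of_words) pairs (an assumed format).
    """
    totals = [0] * n_bins
    hits = [0] * n_bins
    width = (t_max - t_min) / n_bins
    for timestamp, words in stories:
        # Clamp so that a story at exactly t_max falls in the last bin.
        b = min(int((timestamp - t_min) / width), n_bins - 1)
        totals[b] += 1
        if term in words:
            hits[b] += 1
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]


def normalise(profile):
    """Scale a profile to sum to 1 so only its shape matters, not its level."""
    total = sum(profile)
    return [p / total for p in profile] if total else list(profile)


def flatness_deviation(profile):
    """Stand-in for an event-specificity score (NOT the paper's bendiness).

    L1 distance between the normalised profile and a uniform one:
    0 for a term used at a constant rate, larger for a term
    concentrated in a few time bins.
    """
    shape = normalise(profile)
    if sum(shape) == 0:
        return 0.0
    uniform = 1 / len(shape)
    return sum(abs(s - uniform) for s in shape)


def profile_distance(p, q):
    """Stand-in for a same-topic score (NOT the paper's shape distance).

    Euclidean distance between the two normalised profiles.
    """
    a, b = normalise(p), normalise(q)
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy news stream: (day, words in story). Entirely made up for illustration.
stories = [
    (0, {"the", "election", "vote"}),
    (1, {"the", "election", "ballot"}),
    (2, {"the", "market"}),
    (3, {"the", "market"}),
    (4, {"the", "earthquake", "rescue"}),
    (5, {"the", "earthquake", "rescue"}),
]

profiles = {
    term: term_profile(stories, term, n_bins=3, t_min=0, t_max=6)
    for term in ("the", "election", "earthquake", "rescue")
}

for term, profile in profiles.items():
    print(term, profile, round(flatness_deviation(profile), 3))

print(profile_distance(profiles["earthquake"], profiles["rescue"]))
print(profile_distance(profiles["earthquake"], profiles["election"]))
```

In this toy stream a function word such as "the" has a flat profile, while "earthquake" and "rescue" are concentrated in the same time bin and so sit close together, which is the kind of signal a clustering step could exploit. The paper's continuous language models presumably replace the crude binning used here.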
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bache, R., Crestani, F. (2010). An Approach to Indexing and Clustering News Stories Using Continuous Language Models. In: Hopfe, C.J., Rezgui, Y., Métais, E., Preece, A., Li, H. (eds) Natural Language Processing and Information Systems. NLDB 2010. Lecture Notes in Computer Science, vol 6177. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13881-2_11
DOI: https://doi.org/10.1007/978-3-642-13881-2_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13880-5
Online ISBN: 978-3-642-13881-2
eBook Packages: Computer Science, Computer Science (R0)