Abstract
We adapt the Suffix Tree Clustering method to a corpus of Norwegian news articles. Specifically, we replace suffixes with n-grams and propose a new measure of cluster similarity as well as a scoring function for base clusters. These modifications lead to substantial improvements in effectiveness and efficiency compared to the original algorithm.
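To make the n-gram idea concrete, the following is a minimal sketch, not the chapter's implementation. It assumes that a base cluster is the set of documents sharing a word n-gram, by analogy with the shared phrases of Suffix Tree Clustering. The function names `word_ngrams` and `base_clusters`, the parameters `n` and `min_docs`, and the toy documents are all hypothetical. The chapter's similarity measure and scoring function are not specified in the abstract and are therefore not shown.

```python
from collections import defaultdict


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def base_clusters(docs, n=2, min_docs=2):
    """Map each word n-gram to the set of document ids that contain it.

    Assumption for illustration: an n-gram forms a base cluster only if
    at least `min_docs` documents share it.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenisation by whitespace; a real system would use a
        # proper tokenizer for Norwegian text.
        tokens = text.lower().split()
        for gram in word_ngrams(tokens, n):
            index[gram].add(doc_id)
    return {gram: ids for gram, ids in index.items() if len(ids) >= min_docs}


# Toy documents (invented for this example).
docs = {
    "a": "the government presented the national budget today",
    "b": "opposition parties criticised the national budget",
    "c": "heavy snow closed the mountain road",
}
clusters = base_clusters(docs, n=2, min_docs=2)
```

In this toy input, documents "a" and "b" share two bigrams and so appear together in two base clusters, while "c" shares none and appears in no cluster. Fixed-length n-grams bound the number of phrases indexed per document, which is one plausible reason for the efficiency gain the abstract reports, though the abstract does not state the cause.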
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Moe, R.E. (2014). Clustering in a News Corpus. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_37
DOI: https://doi.org/10.1007/978-3-319-10816-2_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10815-5
Online ISBN: 978-3-319-10816-2
eBook Packages: Computer Science (R0)