Clustering in a News Corpus

  • Conference paper
Text, Speech and Dialogue (TSD 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8655)

Abstract

We adapt the Suffix Tree Clustering method to a corpus of Norwegian news articles. Specifically, we replace suffixes with n-grams and propose a new measure of cluster similarity as well as a scoring function for base clusters. These modifications lead to substantial improvements in effectiveness and efficiency compared to the original algorithm.
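
To make the idea of n-gram base clusters concrete, the following is a minimal sketch, not the paper's implementation. It assumes that a base cluster is the set of documents sharing a word n-gram, that tokenization is a plain whitespace split, and that n-grams up to length 3 are considered. The names `word_ngrams`, `base_clusters`, `max_n` and `min_docs` are hypothetical, and the paper's similarity measure and scoring function are not reproduced because the abstract does not specify them.

```python
from collections import defaultdict


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def base_clusters(docs, max_n=3, min_docs=2):
    """Map each word n-gram (n = 1..max_n) to the documents containing it.

    Assumption: a base cluster is a phrase shared by at least
    `min_docs` documents; phrases seen in fewer documents are dropped.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive tokenization, for illustration only
        for n in range(1, max_n + 1):
            for gram in word_ngrams(tokens, n):
                index[gram].add(doc_id)
    return {gram: ids for gram, ids in index.items() if len(ids) >= min_docs}


# Toy documents (invented for this example, not taken from the corpus).
docs = {
    "a": "the storm hit the coast",
    "b": "the storm hit bergen",
    "c": "election results in oslo",
}
clusters = base_clusters(docs)
```

In this toy input, documents "a" and "b" share the phrase "the storm hit" and so fall into a common base cluster, while "c" shares no phrase with the others. A full pipeline would still need to score the base clusters and merge similar ones, which this sketch leaves out.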

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Moe, R.E. (2014). Clustering in a News Corpus. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_37

  • DOI: https://doi.org/10.1007/978-3-319-10816-2_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10815-5

  • Online ISBN: 978-3-319-10816-2

  • eBook Packages: Computer Science, Computer Science (R0)