Clustering in a News Corpus

Moe, Richard Elling

doi:10.1007/978-3-319-10816-2_37

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8655)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

We adapt the Suffix Tree Clustering method for application to a corpus of Norwegian news articles. Specifically, we replace suffixes with n-grams and propose a new measure of cluster similarity as well as a scoring function for base clusters. These modifications lead to substantial improvements in effectiveness and efficiency compared to the original algorithm.
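
To make the idea of n-gram base clusters concrete, the following is a minimal sketch rather than the paper's method. It assumes word-level n-grams, whitespace tokenisation, and the overlap criterion of the original Suffix Tree Clustering algorithm by Zamir and Etzioni (two base clusters are similar when each shares more than half of its documents with the other). The new similarity measure and scoring function proposed in the chapter are not described in this abstract and are not reproduced here. All function names and parameters (`word_ngrams`, `base_clusters`, `similar`, `merge_clusters`, `n`, `min_docs`, `threshold`) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def base_clusters(docs, n=2, min_docs=2):
    """Map each n-gram to the ids of the documents that contain it.

    Every n-gram shared by at least `min_docs` documents is one base cluster.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in word_ngrams(text, n):
            index[gram].add(doc_id)
    return {gram: ids for gram, ids in index.items() if len(ids) >= min_docs}


def similar(a, b, threshold=0.5):
    """Overlap test assumed from the original STC algorithm, not from this chapter."""
    shared = len(a & b)
    return shared / len(a) > threshold and shared / len(b) > threshold


def merge_clusters(bases, threshold=0.5):
    """Merge base clusters that are connected in the similarity graph."""
    grams = list(bases)
    parent = {g: g for g in grams}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for g1, g2 in combinations(grams, 2):
        if similar(bases[g1], bases[g2], threshold):
            parent[find(g1)] = find(g2)

    merged = defaultdict(set)
    for g in grams:
        merged[find(g)] |= bases[g]
    return sorted(sorted(ids) for ids in merged.values())
```

Calling `merge_clusters(base_clusters(docs))` on a list of article texts returns lists of document indices, one list per final cluster. The pairwise comparison is quadratic in the number of base clusters, which is acceptable for a sketch but not for a full news corpus.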

Author information

Authors and Affiliations

Department of Information Science and Media Studies, University of Bergen, Norway
Richard Elling Moe

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Department of Information Technologies, Masaryk University, 602 00, Brno, Czech Republic
Aleš Horák, Ivan Kopeček & Karel Pala

About this paper

Cite this paper

Moe, R.E. (2014). Clustering in a News Corpus. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_37

DOI: https://doi.org/10.1007/978-3-319-10816-2_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10815-5
Online ISBN: 978-3-319-10816-2
eBook Packages: Computer Science, Computer Science (R0)
