Abstract
We study several techniques for representing, fusing and comparing the content of news documents. As underlying models we consider the vector space model (both in a term setting and in a latent semantic analysis setting) and probabilistic topic models based on latent Dirichlet allocation. Content terms can be classified as topical terms or named entities, yielding several models for content fusion and comparison. All methods used are fully unsupervised. We find that simple methods can still outperform current state-of-the-art techniques.
Additional information
Responsible editor: R. Bayardo.
About this article
Cite this article
De Smet, W., Moens, MF. Representations for multi-document event clustering. Data Min Knowl Disc 26, 533–558 (2013). https://doi.org/10.1007/s10618-012-0270-1
DOI: https://doi.org/10.1007/s10618-012-0270-1