DOI: 10.1145/3269206.3269309

W2E: A Worldwide-Event Benchmark Dataset for Topic Detection and Tracking

Published: 17 October 2018

Abstract

Topic detection and tracking in document streams is a critical task in many important applications and has therefore attracted research interest in recent decades. Given the large size of data streams, a number of works have proposed automatic methods for the task using different approaches. However, only a few small benchmark datasets are publicly available for evaluating the proposed methods. The lack of large datasets with fine-grained ground truth implicitly restrains the development of more advanced methods. In this work, we address this issue by collecting and publishing W2E, a large dataset consisting of news articles from more than 50 prominent mass media channels worldwide. The articles cover a large set of popular events within a full year. W2E is more than 15 times larger than TREC's TDT2 dataset, which is widely used in prior work. We further conduct exploratory analysis to examine the dynamics and diversity of W2E and propose potential uses of the dataset in other research.

      Published In

      CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management
      October 2018
      2362 pages
      ISBN:9781450360142
      DOI:10.1145/3269206

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. benchmark dataset
      2. topic detection
      3. topic tracking

      Qualifiers

      • Short-paper

      Conference

      CIKM '18
      Acceptance Rates

      CIKM '18 Paper Acceptance Rate 147 of 826 submissions, 18%;
      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 11
      • Downloads (Last 6 weeks): 1
      Reflects downloads up to 16 Feb 2025

      Citations

      Cited By

      • (2023) PDSum: Prototype-driven Continuous Summarization of Evolving Multi-document Sets Stream. Proceedings of the ACM Web Conference 2023, 1650-1661. DOI: 10.1145/3543507.3583371. Online publication date: 30-Apr-2023
      • (2023) Event-Centric Opinion Mining via In-Context Learning with ChatGPT. Knowledge Graph and Semantic Computing: Knowledge Graph Empowers Artificial General Intelligence, 83-94. DOI: 10.1007/978-981-99-7224-1_7. Online publication date: 28-Oct-2023
      • (2023) Present Causal Relationship Retrieval for Historical Analogy. Culture and Computing, 536-547. DOI: 10.1007/978-3-031-34732-0_41. Online publication date: 9-Jul-2023
      • (2023) A Multi-stage Event Detection Method. Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery, 968-973. DOI: 10.1007/978-3-031-20738-9_106. Online publication date: 30-Jan-2023
      • (2022) Artificial intelligence for topic modelling in Hindu philosophy: Mapping themes between the Upanishads and the Bhagavad Gita. PLOS ONE 17:9, e0273476. DOI: 10.1371/journal.pone.0273476. Online publication date: 1-Sep-2022
      • (2022) Online Discussion Transition Analysis for Group Learning Support. 2022 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), 249-255. DOI: 10.1109/WI-IAT55865.2022.00043. Online publication date: Nov-2022
      • (2022) ETHOS: a multi-label hate speech detection dataset. Complex & Intelligent Systems 8:6, 4663-4678. DOI: 10.1007/s40747-021-00608-2. Online publication date: 4-Jan-2022
      • (2022) RevDet: Robust and Memory Efficient Event Detection and Tracking in Large News Feeds. Advanced Analytics and Learning on Temporal Data, 170-185. DOI: 10.1007/978-3-030-91445-5_11. Online publication date: 1-Jan-2022
      • (2021) Türkçe Metinlerde Otomatik Konu Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi 33:2, 599-606. DOI: 10.35234/fumbd.899917. Online publication date: 15-Sep-2021
      • (2021) Event Causal Relationship Retrieval. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 318-325. DOI: 10.1145/3486622.3493936. Online publication date: 14-Dec-2021
