DOI: 10.1145/1281192.1281276

Mining correlated bursty topic patterns from coordinated text streams

Published: 12 August 2007

Abstract

Previous work on text mining has almost exclusively focused on a single stream. However, we often have available multiple text streams indexed by the same set of time points (called coordinated text streams), which offer new opportunities for text mining. For example, when a major event happens, all the news articles published by different agencies in different languages tend to cover the same event for a certain period, exhibiting a correlated bursty topic pattern in all the news article streams. In general, mining correlated bursty topic patterns from coordinated text streams can reveal interesting latent associations or events behind these streams. In this paper, we define and study this novel text mining problem. We propose a general probabilistic algorithm which can effectively discover correlated bursty patterns and their bursty periods across text streams even if the streams have completely different vocabularies (e.g., English vs Chinese). Evaluation of the proposed method on a news data set and a literature data set shows that it can effectively discover quite meaningful topic patterns from both data sets: the patterns discovered from the news data set accurately reveal the major common events covered in the two streams of news articles (in English and Chinese, respectively), while the patterns discovered from two database publication streams match well with the major research paradigm shifts in database research. Since the proposed method is general and does not require the streams to share vocabulary, it can be applied to any coordinated text streams to discover correlated topic patterns that burst in multiple streams in the same period.
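
To make the notion of coordinated text streams concrete, the following is a minimal sketch and not the probabilistic algorithm proposed in the paper. It assumes two toy streams indexed by the same four time points, one with English tokens and one with Chinese tokens, and uses plain Pearson correlation of per-time-point word frequencies as a stand-in signal for "bursting in the same period". The data and the helper names `relative_frequency` and `pearson` are hypothetical and chosen only for illustration.

```python
from math import sqrt


def relative_frequency(stream, word):
    """Relative frequency of `word` at each time point.

    `stream` is a list indexed by time point; each entry is the list of
    tokens observed at that time point.
    """
    return [tokens.count(word) / len(tokens) if tokens else 0.0
            for tokens in stream]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical toy streams: same time index, disjoint vocabularies.
english = [
    ["market", "trade", "weather"],
    ["earthquake", "rescue", "earthquake", "market"],
    ["earthquake", "rescue", "aid"],
    ["market", "trade", "weather"],
]
chinese = [
    ["市场", "贸易", "天气"],
    ["地震", "救援", "地震", "市场"],
    ["地震", "救援", "援助"],
    ["市场", "贸易", "天气"],
]

eq_en = relative_frequency(english, "earthquake")
eq_zh = relative_frequency(chinese, "地震")
market_zh = relative_frequency(chinese, "市场")

# Words that burst at the same time points correlate strongly even though
# the two streams share no vocabulary; unrelated words do not.
same_event = pearson(eq_en, eq_zh)
different_topic = pearson(eq_en, market_zh)
print(same_event, different_topic)
```

The sketch relies only on the shared time index, which is the property that lets correlated patterns be found without any vocabulary overlap. It works on individual words, whereas the paper discovers topic patterns and their bursty periods.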

Published In

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN:9781595936097
DOI:10.1145/1281192
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. coordinated streams
  3. correlated bursty patterns
  4. reinforcement

Qualifiers

  • Article

Conference

KDD07

Acceptance Rates

KDD '07 Paper Acceptance Rate 111 of 573 submissions, 19%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
