DOI: 10.1145/2020408.2020476
research-article

Smoothing techniques for adaptive online language models: topic tracking in tweet streams

Published: 21 August 2011 Publication History

Abstract

We are interested in the problem of tracking broad topics such as "baseball" and "fashion" in continuous streams of short texts, exemplified by tweets from the microblogging service Twitter. The task is conceived as a language modeling problem where per-topic models are trained using hashtags in the tweet stream, which serve as proxies for topic labels. Simple perplexity-based classifiers are then applied to filter the tweet stream for topics of interest. Within this framework, we evaluate, both intrinsically and extrinsically, smoothing techniques for integrating "foreground" models (to capture recency) and "background" models (to combat sparsity), as well as different techniques for retaining history. Experiments show that unigram language models smoothed using a normalized extension of stupid backoff, combined with a simple queue for history retention, perform well on the task.
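
The pipeline described in the abstract can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the normalized extension of stupid backoff, so the sketch assumes the simplest reading, in which unigram stupid-backoff scores (foreground relative frequency if the word was seen recently, otherwise a constant `alpha` times the background probability) are divided by their sum over a closed vocabulary. The class and function names, the add-one background smoothing, the `<unk>` token, `alpha = 0.4`, and the perplexity threshold are all hypothetical choices made for the example.

```python
import math
from collections import Counter, deque

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words


class Background:
    """Add-one smoothed unigram model over a closed vocabulary (assumed)."""

    def __init__(self, tokens):
        self.counts = Counter(tokens)
        self.vocab = set(self.counts) | {UNK}
        self.total = sum(self.counts.values())

    def map(self, tokens):
        # Replace words outside the background vocabulary with UNK.
        return [t if t in self.vocab else UNK for t in tokens]

    def prob(self, word):
        return (self.counts[word] + 1) / (self.total + len(self.vocab))


class Foreground:
    """Unigram counts over the most recent `capacity` tweets (a simple queue)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.counts = Counter()
        self.total = 0

    def add(self, tokens):
        # `tokens` are expected to be mapped through Background.map first.
        self.queue.append(tokens)
        self.counts.update(tokens)
        self.total += len(tokens)
        if len(self.queue) > self.capacity:
            old = self.queue.popleft()  # evict the oldest tweet
            self.counts.subtract(old)
            self.total -= len(old)
            for w in set(old):
                if self.counts[w] <= 0:
                    del self.counts[w]


def prob(word, fg, bg, alpha=0.4):
    """Stupid-backoff score divided by its sum over the vocabulary (assumed)."""
    if fg.total == 0:
        return bg.prob(word)
    if fg.counts[word] > 0:
        score = fg.counts[word] / fg.total
    else:
        score = alpha * bg.prob(word)
    # Seen words contribute total mass 1; unseen words contribute alpha
    # times whatever background mass the seen words do not cover.
    seen_bg_mass = sum(bg.prob(w) for w in fg.counts)
    z = 1.0 + alpha * (1.0 - seen_bg_mass)
    return score / z


def perplexity(tokens, fg, bg, alpha=0.4):
    tokens = bg.map(tokens)
    if not tokens:
        return math.inf
    log_sum = sum(math.log(prob(t, fg, bg, alpha)) for t in tokens)
    return math.exp(-log_sum / len(tokens))


def is_on_topic(tokens, fg, bg, threshold, alpha=0.4):
    # Hypothetical filter rule: low perplexity under the topic model.
    return perplexity(tokens, fg, bg, alpha) < threshold
```

In this sketch a tweet passes the filter for a topic when its perplexity under that topic's model falls below a threshold, which would have to be tuned on held-out data. Recomputing the normalizer for every word is wasteful; a real system would presumably cache it and update it as tweets enter and leave the queue.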

      Published In

      KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2011
      1446 pages
      ISBN:9781450308137
      DOI:10.1145/2020408
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. stream processing
      2. tdt
      3. twitter

      Acceptance Rates

      Overall Acceptance Rate 857 of 6,873 submissions, 12%
