
Exploiting hybrid contexts for Tweet segmentation

Authors:
Chenliang Li, Nanyang Technological University, Singapore, Singapore
Aixin Sun, Nanyang Technological University, Singapore, Singapore
Jianshu Weng, Independent Researcher, Singapore, Singapore
Qi He, IBM Almaden Research Center, San Jose, USA
SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, July 2013, Pages 523–532
https://doi.org/10.1145/2484028.2484044

Published: 28 July 2013

ABSTRACT

Twitter has attracted hundreds of millions of users who share and disseminate the most up-to-date information. However, the noisy and short nature of tweets makes many applications in information retrieval (IR) and natural language processing (NLP) challenging. Recently, segment-based tweet representation has proven effective in named entity recognition (NER) and event detection from tweet streams. To split tweets into meaningful phrases or segments, previous work relies purely on external knowledge bases and ignores the rich local context information embedded in the tweets. In this paper, we propose HybridSeg, a novel framework for tweet segmentation in batch mode. HybridSeg combines local context knowledge with global knowledge bases for better tweet segmentation. It consists of two steps: learning from off-the-shelf weak NERs and learning from pseudo feedback. In the first step, existing NER tools are applied to a batch of tweets, and the named entities they recognize are used to guide the tweet segmentation process. In the second step, HybridSeg iteratively adjusts the segmentation results by exploiting all segments in the batch of tweets collectively. Experiments on two tweet datasets show that HybridSeg significantly improves tweet segmentation quality compared with the state-of-the-art algorithm. We also conduct a case study that uses tweet segments for named entity recognition from tweets. The experimental results demonstrate that HybridSeg significantly benefits downstream applications.
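
The two-step idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the scoring function, the mixing weight `lam`, the fixed single-word score, the toy `global_prior` table (standing in for an external knowledge base), and the `ner_entities` list (standing in for the output of off-the-shelf NER tools) are all assumptions made for illustration. The sketch segments each tweet by dynamic programming, seeds the local context from the NER output, and then re-estimates the local context from the segments found across the batch until the segmentation stops changing.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Segment = Tuple[str, ...]


def segment(tokens: Sequence[str],
            score: Callable[[Segment], float],
            max_len: int = 3) -> List[Segment]:
    """Split tokens into segments that maximize the sum of segment scores."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: best total score for tokens[:i]
    best[0] = 0.0
    back = [0] * (n + 1)              # back[i]: start index of the last segment
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cand = best[start] + score(tuple(tokens[start:end]))
            if cand > best[end]:
                best[end], back[end] = cand, start
    segments: List[Segment] = []
    end = n
    while end > 0:                    # walk the back-pointers to recover the split
        segments.append(tuple(tokens[back[end]:end]))
        end = back[end]
    return segments[::-1]


def make_score(global_prior: Dict[Segment, float],
               local_counts: Dict[Segment, int],
               lam: float) -> Callable[[Segment], float]:
    """Assumed hybrid score: linear mix of global and local evidence."""
    total = sum(local_counts.values()) or 1

    def score(seg: Segment) -> float:
        if len(seg) == 1:
            return 0.05               # assumed small fixed score for single words
        g = global_prior.get(seg, 0.0)         # global context (knowledge base)
        l = local_counts.get(seg, 0) / total   # local context (this batch)
        return len(seg) * ((1 - lam) * g + lam * l)

    return score


def hybrid_seg_sketch(batch: List[List[str]],
                      global_prior: Dict[Segment, float],
                      ner_entities: List[Segment],
                      lam: float = 0.5,
                      max_iter: int = 5) -> List[List[Segment]]:
    # Step 1: seed the local context with entities from the (assumed) weak NERs.
    local_counts: Dict[Segment, int] = Counter(ner_entities)
    result: List[List[Segment]] = []
    for _ in range(max_iter):
        score = make_score(global_prior, local_counts, lam)
        new_result = [segment(tweet, score) for tweet in batch]
        if new_result == result:      # segmentation is stable, stop iterating
            break
        result = new_result
        # Step 2: pseudo feedback from all multi-word segments in the batch.
        local_counts = Counter(s for tweet in result for s in tweet if len(s) > 1)
    return result


batch = [
    "train delay at jurong east".split(),
    "jurong east is crowded".split(),
    "visiting new york".split(),
]
global_prior = {("new", "york"): 0.6}   # toy knowledge-base entry
ner_entities = [("jurong", "east")]     # toy NER output
for tweet_segments in hybrid_seg_sketch(batch, global_prior, ner_entities):
    print(tweet_segments)
```

In this toy batch, "new york" is kept together by the global table alone, while "jurong east" is kept together only because the NER seed supplies local evidence; calling the function with an empty `ner_entities` list leaves "jurong" and "east" as separate segments.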

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection

Published in
SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
July 2013
1188 pages
ISBN: 9781450320344
DOI: 10.1145/2484028
General Chairs: Gareth J.F. Jones (Dublin City University, Ireland), Páraic Sheridan (Dublin City University, Ireland)
Program Chairs: Diane Kelly (University of North Carolina, Chapel Hill, USA), Maarten de Rijke (University of Amsterdam, The Netherlands), Tetsuya Sakai (Microsoft Research Asia, China)
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
named entity recognition
tweet
tweet segmentation
twitter
Qualifiers
- research-article
Acceptance Rates
SIGIR '13 Paper Acceptance Rate: 73 of 366 submissions, 20%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%