DOI: 10.1145/2484028.2484044
research-article

Exploiting hybrid contexts for Tweet segmentation

Published: 28 July 2013

ABSTRACT

Twitter has attracted hundreds of millions of users who share and disseminate the most up-to-date information. However, the noisy and short nature of tweets makes many applications in information retrieval (IR) and natural language processing (NLP) challenging. Recently, segment-based tweet representation has proven effective in named entity recognition (NER) and event detection from tweet streams. To split tweets into meaningful phrases or segments, previous work relies purely on external knowledge bases and ignores the rich local context information embedded in the tweets. In this paper, we propose HybridSeg, a novel framework for tweet segmentation in batch mode. HybridSeg combines local context knowledge with global knowledge bases for better tweet segmentation. It consists of two steps: learning from off-the-shelf weak NERs and learning from pseudo feedback. In the first step, existing NER tools are applied to a batch of tweets, and the named entities they recognize are used to guide the segmentation process. In the second step, HybridSeg iteratively adjusts the segmentation results by exploiting all segments in the batch of tweets collectively. Experiments on two tweet datasets show that HybridSeg significantly improves tweet segmentation quality compared with the state-of-the-art algorithm. We also conduct a case study that uses tweet segments for named entity recognition from tweets. The experimental results demonstrate that HybridSeg significantly benefits downstream applications.
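The two-step idea in the abstract can be illustrated with a minimal sketch. This is not the HybridSeg formulation from the paper, which the abstract does not specify. The scoring function, the `weight` mixing parameter, the toy knowledge base `global_prob`, the seed list `ner_entities`, and all function names are assumptions made for illustration. The sketch segments each tweet by dynamic programming over a score that mixes a global (knowledge-base) signal with a local (batch-level) signal, seeds the local signal from phrases a weak NER is assumed to have recognized, and then re-estimates it from the segments produced across the batch.

```python
from collections import Counter

MAX_LEN = 3    # longest segment considered, in words (assumption)
SINGLE = 0.1   # flat score for a one-word segment (assumption)

def make_score(global_prob, local_prob, weight):
    """Build a segment scorer mixing global and local evidence (illustrative)."""
    def score(seg):
        if len(seg) == 1:
            return SINGLE
        g = global_prob.get(seg, 0.0)   # evidence from an external knowledge base
        l = local_prob.get(seg, 0.0)    # evidence from the current batch of tweets
        return len(seg) * ((1 - weight) * g + weight * l)
    return score

def segment(words, score):
    """Split words into segments that maximize the summed segment score."""
    n = len(words)
    best = [0.0] + [float("-inf")] * n   # best[i]: best total score for words[:i]
    back = [0] * (n + 1)                 # back[i]: start index of the last segment
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_LEN), i):
            cand = best[j] + score(tuple(words[j:i]))
            if cand > best[i]:
                best[i], back[i] = cand, j
    segments, i = [], n
    while i > 0:
        segments.append(tuple(words[back[i]:i]))
        i = back[i]
    return segments[::-1]

def hybrid_segment(batch, global_prob, ner_entities, rounds=2, weight=0.5):
    """Seed local evidence from weak-NER output, then refine it from the batch."""
    # Step 1 (assumed form): trust phrases the weak NER tools recognized.
    local_prob = {tuple(e): 1.0 for e in ner_entities}
    segmented = []
    for _ in range(rounds):
        score = make_score(global_prob, local_prob, weight)
        segmented = [segment(tweet, score) for tweet in batch]
        # Step 2 (assumed form): pseudo feedback, the share of tweets in the
        # batch whose segmentation produced each multi-word segment.
        counts = Counter(s for segs in segmented for s in segs if len(s) > 1)
        local_prob = {s: c / len(batch) for s, c in counts.items()}
    return segmented

batch = [
    "take circle line home".split(),
    "circle line delay again".split(),
    "i love new york".split(),
]
global_prob = {("new", "york"): 0.9}   # toy stand-in for a knowledge base
with_ner = hybrid_segment(batch, global_prob, ner_entities=[("circle", "line")])
without_ner = hybrid_segment(batch, global_prob, ner_entities=[])
```

In this toy batch, "circle line" is unknown to the global knowledge base, so it is kept together only when the local signal supports it. "new york" is kept together in both runs because the global signal already covers it.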


Published in

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
July 2013
1188 pages
ISBN: 9781450320344
DOI: 10.1145/2484028

Copyright © 2013 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGIR '13 paper acceptance rate: 73 of 366 submissions, 20%
Overall acceptance rate: 792 of 3,983 submissions, 20%
