Abstract:
Tracking microblog discussion on a given topic is useful for a wide range of higher-level applications. Microblog services like Twitter provide a simple keyword-based tracking capability, where any tweet containing a keyword is returned. Because microblog posts are short, tracking with only a small number of topic-specific query words reduces recall. A larger number of keywords (compared to regular document retrieval) is generally required to obtain good recall, but this returns a large number of off-topic posts, resulting in low precision. In our work, we consider the scenario of using a large number of query terms to maintain high recall for automated tracking of microblog streams. The challenge we address is how to score each of the returned microblogs with respect to the query, on-line and in an unsupervised manner, so as to identify those that are on topic. To this end, we propose a new term-scoring expression, which we call Adjusted Information Gain (AIG), and compare it to other term-scoring expressions: inverse document frequency, Dice, Jaccard and keyword frequency. Our comparisons consider a selection of document-scoring functions applied to roughly 40 million tweets collected over a 20-day period for each of two topics. Our results show significant improvements (from 8% to 40% of the area under the ROC curve) over existing term-scoring expressions, depending on topic and specificity, and provide insight into further work in query expansion techniques.
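To make the baseline term-scoring expressions concrete, the following is a minimal sketch of inverse document frequency, Dice and Jaccard as they are commonly defined, together with one simple document-scoring function. Several things here are assumptions and not taken from the paper: the choice of a reference set of posts (for example, posts matching a seed keyword) against which Dice and Jaccard are computed, the additive `score_post` function, and all function names. AIG is omitted because the abstract does not give its definition.

```python
import math

def idf(doc_freq, n_docs):
    """Inverse document frequency of a term seen in doc_freq of n_docs posts."""
    return math.log(n_docs / doc_freq)

def dice(docs_with_term, reference_docs):
    """Dice coefficient between the posts containing a term and a reference set.

    Both arguments are sets of post ids. The reference set is an assumption
    of this sketch (e.g. posts that contain a seed keyword).
    """
    denom = len(docs_with_term) + len(reference_docs)
    return 2 * len(docs_with_term & reference_docs) / denom if denom else 0.0

def jaccard(docs_with_term, reference_docs):
    """Jaccard coefficient between the same two sets of post ids."""
    union = len(docs_with_term | reference_docs)
    return len(docs_with_term & reference_docs) / union if union else 0.0

def score_post(tokens, term_scores):
    """Hypothetical document score: sum of scores of distinct query terms in the post."""
    return sum(term_scores.get(t, 0.0) for t in set(tokens))
```

With per-term scores precomputed from any of these expressions, each incoming post can be scored in a single pass over its tokens, which fits the on-line, unsupervised setting the abstract describes.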
Date of Conference: 27-30 October 2014
Date Added to IEEE Xplore: 08 January 2015
Electronic ISBN: 978-1-4799-5666-1