Elsevier

Neurocomputing

Volume 171, 1 January 2016, Pages 30-38

Exploring multiple evidence to infer users’ location in Twitter

https://doi.org/10.1016/j.neucom.2015.05.066

Abstract

Online social networks are valuable sources of information for monitoring real-time events, such as earthquakes and epidemics. For this type of surveillance, users’ location is an essential piece of information, but a substantial number of users choose not to disclose their geographical location. However, characteristics of the users’ behavior, such as the friends they associate with and the types of messages they publish, may hint at their spatial location. In this paper, we propose a method to infer the spatial location of Twitter users. Unlike the approaches proposed so far, it incorporates two sources of information to learn geographical position: the text posted by users and their friendship network. We propose a probabilistic approach that jointly models the geographical labels and Twitter texts of users organized in the form of a graph representing the friendship network. We use the Markov random field probability model to represent the network, and learning is carried out through a Markov Chain Monte Carlo simulation technique to approximate the posterior probability distribution of the missing geographical labels. We show the accuracy of the algorithm on a large dataset of Twitter users, where the ground truth is the location given by GPS. The method presents promising results, with little sensitivity to parameters and high values of precision.

Introduction

Online social networks, such as Twitter and Facebook, were initially conceived as tools to encourage social interactions. However, with time they became powerful real-time sensors, gathering information about what people think and do. As a consequence, online social networks started being employed as monitoring tools, used to provide information about events as diverse as earthquakes [1], epidemics [2] or elections [3].

When monitoring events, knowing where the information comes from is highly valuable. Although users of social networks can fill in profiles with their personal information, this data is not always available or trustworthy. For example, Gundecha et al. [4] reported that, on average, only 35% of Facebook users declare a location, while Mislove et al. [5] showed that, among US Twitter users, 75% fill in the corresponding field. However, because the location field does not follow any pattern, a large volume of invalid (Mars) or low-precision (Brazil) locations is often reported.

Apart from the declared location, user location in Twitter can be retrieved in two other ways: from the computer's IP address or from the GPS coordinates of mobile devices. The location given by the IP address is not very reliable and needs to be continually updated. In Brazil, for example, this service correctly locates 72% of IPs to within a radius of 40 km [6]. In contrast, GPS-provided locations offer the best accuracy and reliability, but they are restricted to users who post from mobile devices and allow such information to be disclosed. Experimental results showed that, in countries such as Brazil, under 1% of tweets provide GPS data [7].

Given the importance of geolocation for event monitoring, studies on inferring user location from other publicly available data became popular. This is also the main goal of this paper: to present a new method for inferring users’ location in Twitter by combining two sources of evidence: the tweets of the user [8], [9] and their relationships in the network [7], [10]. Although tweet content and the friendship network have each been used previously to infer the location of a Twitter user, to the best of our knowledge, the first method to integrate these two sources of evidence was published in [11], and is extended in this paper.

The method proposed in [11] is based on a probabilistic graphical model, where the information on users’ friendships is represented as an undirected graph, with vertices representing users and edges representing relationships. A friendship is defined by a mutual Twitter relationship between two users. Each vertex in the graph is also associated with the text of the user's tweets, and labeled with the geographical location obtained from GPS.
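The graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the input format (a list of directed `(follower, followed)` pairs), the function name `build_friendship_graph`, and the user names are all assumptions made for the example.

```python
from collections import defaultdict

def build_friendship_graph(follows):
    """Build an undirected friendship graph from directed follow pairs.

    `follows` is an iterable of (follower, followed) pairs (hypothetical
    input format). An undirected edge u-v is kept only when both (u, v)
    and (v, u) are present, i.e. when the relationship is mutual.
    """
    follow_set = set(follows)
    graph = defaultdict(set)
    for u, v in follow_set:
        if u != v and (v, u) in follow_set:
            graph[u].add(v)
            graph[v].add(u)
    return dict(graph)

# Hypothetical follow relations: only ana-bia and caio-dani are mutual.
follows = [
    ("ana", "bia"), ("bia", "ana"),
    ("ana", "caio"),
    ("caio", "dani"), ("dani", "caio"),
]
graph = build_friendship_graph(follows)
```

In such a representation, the tweets and the GPS label of each vertex could be kept in separate dictionaries keyed by user, with the label missing for users whose location has to be inferred.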

Other location sources are ignored, as they represent less reliable information. Note that the spatial locations are only partially observed, with a substantial proportion of users lacking GPS geographical information. Based on a joint, integrated stochastic model relating different information sources, an algorithm is proposed to learn the missing spatial positions. The inference is based on the posterior distribution of the possible geographical locations given all the available information: the friendship network structure, the tweets’ contents for each user, and the available observed spatial locations of users.

We modeled users’ data as a Markov random field with neighborhood structure given by the social network links. A maximum a posteriori estimator is adopted as a point estimate. One additional advantage of our method is the possibility of assessing the uncertainty associated with the estimated labels as a by-product of the learning algorithm. We used the Gibbs sampler, a specific member of the Markov Chain Monte Carlo (MCMC) class of algorithms [12], to estimate the posterior probabilities.
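To make the sampling idea concrete, here is a minimal sketch of a Gibbs sampler over city labels on a friendship graph. It is not the paper's implementation: the full conditional used here (a Potts-style count of agreeing friends plus an optional per-user text log-likelihood), the parameter `beta`, the function name `gibbs_potts`, and the toy labels `city_A` and `city_B` are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter

def gibbs_potts(graph, known, cities, text_loglik, beta=1.0,
                n_iter=500, burn_in=100, seed=0):
    """Estimate posterior label probabilities for users without a label.

    graph:       dict user -> set of friends (undirected)
    known:       dict user -> city for labeled users (held fixed)
    cities:      list of candidate labels
    text_loglik: dict user -> dict city -> log-likelihood of the user's
                 text (hypothetical; missing entries count as 0.0)
    beta:        interaction strength of the Potts-style term (assumption)
    """
    rng = random.Random(seed)
    unknown = [u for u in graph if u not in known]
    state = dict(known)
    for u in unknown:
        state[u] = rng.choice(cities)
    counts = {u: Counter() for u in unknown}
    for it in range(n_iter):
        for u in unknown:
            # Full conditional of user u given friends' labels and text.
            logw = []
            for c in cities:
                agree = sum(1 for v in graph[u] if state.get(v) == c)
                logw.append(beta * agree + text_loglik.get(u, {}).get(c, 0.0))
            top = max(logw)
            weights = [math.exp(w - top) for w in logw]
            state[u] = rng.choices(cities, weights=weights)[0]
            if it >= burn_in:
                counts[u][state[u]] += 1
    kept = n_iter - burn_in
    return {u: {c: counts[u][c] / kept for c in cities} for u in unknown}

# Toy graph: users a and b are labeled, c and d are not.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
known = {"a": "city_A", "b": "city_A"}
posterior = gibbs_potts(graph, known, ["city_A", "city_B"],
                        text_loglik={}, beta=2.0)
# Point estimate: the most frequently sampled label for each user.
map_labels = {u: max(p, key=p.get) for u, p in posterior.items()}
```

The sampled frequencies in `posterior` play the role of the uncertainty estimate mentioned above, while `map_labels` picks the most probable label per user as a point estimate.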

Apart from giving a more detailed description of the previously proposed method, this paper evaluates the sensitivity of the method to the temperature parameter of a probabilistic Potts model [13], and also assesses how the learning algorithm deteriorates as the proportion of missing spatial locations increases. Finally, it presents new experiments with an updated dataset covering 3 and 10 cities, with 8,590 and 11,850 users, respectively, and more than one hundred thousand connections.
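For reference, a standard form of the Potts model on a graph with edge set $E$ is given below. Whether the paper uses exactly this parametrization is an assumption; the symbols $T$ (temperature), $Z(T)$ (normalizing constant) and $E$ are introduced here only for illustration.

```latex
P(\theta) = \frac{1}{Z(T)}
  \exp\!\left( \frac{1}{T} \sum_{(i,j) \in E} \mathbb{1}[\theta_i = \theta_j] \right)
```

Under this form, a large $T$ flattens the distribution toward uniform labels, while a small $T$ strongly favors configurations in which connected users share the same label.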

The results from the integrated approach were compared to the Naive Bayes classifier using only textual information and to a semi-supervised graph-based method called MultiRankWalk (MRW) [14]. The results show that the proposed algorithm and MRW have similar average performance, while MRW is more sensitive to variations in the initial labeled set. Both methods are also statistically better than Naive Bayes, which indicates that network information is a much stronger source of evidence than text.

The rest of the paper is organized as follows. Section 2 summarizes the current state of the art in learning the geographical location of social network agents. Section 3 describes our new probabilistic model, while Section 4 presents MRW, which was used as a baseline for our method. Section 5 presents the experimental results. Finally, Section 6 presents the main conclusions and ideas for future work.

Related work

This section reviews recent work in the literature that considers the text, user profile information or the relationship graph when predicting the geographical location of a user.

We start by discussing works that consider only the text associated with each user. Gelernter et al. [15] employed a Named Entity Recognition (NER) technique to identify the names of places in tweets. The results were compared with terms manually identified as locations by the authors, and showed that a main

A probabilistic model for inferring user location

This section describes the algorithm proposed to learn users’ location based on the content of their tweets and their friendship graph, named the Integrated-data approach (IDA, for short). While the friendship graph is more costly to obtain than users’ text, it is an extremely powerful predictor [26], [18].

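Let $N$ be the total number of users, $\theta_i$ the $i$th user location and $\theta = (\theta_L, \theta_U)$, where $\theta_L = (\theta_1, \theta_2, \ldots, \theta_k)$ is the set of $k$ users with known location (labeled nodes) and $\theta_U = (\theta_{k+1}, \theta_{k+2}, \ldots, \theta_N)$ the set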

The MultiRankWalk algorithm

This section describes the algorithm applied to learn users’ location based only on their friendship graph, called MRW. The method is based on a modified version of PageRank, called personalized PageRank, and was originally suggested as a general-purpose graph-based classifier by Lin and Cohen [14]. In PageRank [28], given a graph G=(V,E) with n vertices, an n-dimensional vector r is returned with non-negative values summing to 1. The order implied by these numerical scores is used to sort any subset of
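The general idea of classifying with personalized PageRank can be sketched as below: one random walk with restart is run per class, restarting at that class's labeled users, and each unlabeled user takes the class whose walk scores it highest. This is an illustrative sketch under assumptions, not the procedure evaluated in the paper: the damping value 0.85, the fixed iteration count, the function names and the toy labels `city_A` and `city_B` are all chosen for the example.

```python
def personalized_pagerank(graph, seeds, damping=0.85, n_iter=100):
    """Random walk with restart to `seeds` on an undirected graph.

    graph: dict node -> set of neighbours (every neighbour is also a key)
    seeds: non-empty collection of nodes that receive the restart mass
    Returns a dict of non-negative scores that sum to 1.
    """
    seeds = set(seeds)
    nodes = list(graph)
    restart = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    rank = dict(restart)
    for _ in range(n_iter):
        nxt = {u: (1.0 - damping) * restart[u] for u in nodes}
        for u in nodes:
            if graph[u]:
                share = damping * rank[u] / len(graph[u])
                for v in graph[u]:
                    nxt[v] += share
            else:
                # Isolated node: return its mass to the seeds.
                for s in seeds:
                    nxt[s] += damping * rank[u] / len(seeds)
        rank = nxt
    return rank

def multi_rank_walk(graph, known):
    """Assign each unlabeled node the class whose walk ranks it highest."""
    classes = sorted(set(known.values()))
    scores = {}
    for c in classes:
        seeds = {u for u, label in known.items() if label == c}
        scores[c] = personalized_pagerank(graph, seeds)
    return {u: max(classes, key=lambda c: scores[c][u])
            for u in graph if u not in known}

# Toy path graph a-b-c-d-e with one labeled user at each end.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
         "d": {"c", "e"}, "e": {"d"}}
known = {"a": "city_A", "e": "city_B"}
scores_a = personalized_pagerank(graph, {"a"})
predicted = multi_rank_walk(graph, known)
```

In this toy example, user b sits next to the `city_A` seed and user d next to the `city_B` seed, so each receives the label of its nearby seed; user c is equidistant from both, and the tie is broken arbitrarily by class order.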

Experimental results

In order to verify the performance of IDA, tests were performed using a Twitter dataset with 11,850 users from 10 different Brazilian cities, collected in the first semester of 2011 using a breadth-first search. The seeds for the search were users using the terms “dengue” and “Aedes Aegypti” (the mosquito that transmits dengue), as the objective was to improve the localization of users talking about dengue outbreaks in Brazil. As we expect the number of available labels (cities) as well as the number

Conclusions and future work

This work presented IDA, a new probabilistic model for location inference of Twitter users. It extends the original work presented in Rodrigues et al. (2013) [11], where we first proposed to integrate information from Twitter users’ texts and friendship network. The method was compared to two other algorithms using only one type of information: Naive Bayes and MRW.

The experimental results showed that there is no uniformly better algorithm when considering IDA and MRW. While MRW is essentially

Acknowledgments

The authors would like to thank CNPq, Capes, FAPEMIG, and InWeb – National Institute of Science and Technology for the Web for financial support.

References (29)

  • T. Sakaki, M. Okazaki, Y. Matsuo, Earthquake shakes twitter users: real-time event detection by social sensors, in:...
  • J. Gomide, A. Veloso, W. Meira, V. Almeida, F. Benevenuto, F. Ferraz, M. Teixeira, Dengue surveillance based on a...
  • A. Tumasjan et al., Predicting elections with twitter: what 140 characters reveal about political sentiment, ICWSM (2010)
  • P. Gundecha, G. Barbier, H. Liu, Exploiting vulnerability to secure user privacy on a social networking site, in:...
  • A. Mislove, S. Lehmann, Y.-Y. Ahn, J.-P. Onnela, J.N. Rosenquist, Understanding the Demographics of Twitter Users, in:...
  • Geoip City Accuracy for Selected Countries, 〈http://www.maxmind.com/en/city_accuracy〉 (accessed...
  • C.A. Davis et al., Inferring the location of twitter messages based on user relationships, Trans. GIS (2011)
  • J. Mahmud, J. Nichols, C. Drews, Where is this tweet from? Inferring home locations of twitter users, in: J.G. Breslin,...
  • Z. Cheng, J. Caverlee, K. Lee, You are where you tweet: a content-based approach to geo-locating twitter users, in:...
  • S.A. Macskassy et al., Classification in networked data: a toolkit and a univariate case study, J. Mach. Learn. Res. (2007)
  • E. Rodrigues, R. Assuncao, G.L. Pappa, R. Miranda, W. Meira, Uncovering the location of twitter users, in: 2013...
  • C. Andrieu et al., An Introduction to MCMC for Machine Learning, Mach. Learn. (2003)
  • S.Z. Li, Markov Random Field Modeling in Image Analysis (2009)
  • F. Lin, W. Cohen, Semi-supervised classification of network data using very few labels, in: Proceedings of the 2010...

Erica Rodrigues received her Ph.D. in Statistics from the Universidade Federal de Minas Gerais in 2012. She has been an associate professor in the Department of Statistics, Universidade Federal de Ouro Preto (UFOP), in Brazil, since 2012. Her research is focused on the development of algorithms and probabilistic models for the statistical analysis of spatial and graph-based data.

Renato Assunção received his Ph.D. in Statistics from the University of Washington in 1994. He is a full professor in the Department of Computer Science, Universidade Federal de Minas Gerais (UFMG), in Brazil, where he has been affiliated since 1988. His research is focused on the development of algorithms and probabilistic models for the statistical analysis of spatial and graph-based data.

Gisele L. Pappa received her Ph.D. from the University of Kent in 2007 and is an Associate Professor at the Computer Science Department at Universidade Federal de Minas Gerais, Brazil. Her main research interests are data mining algorithms, evolutionary computation, and applications of these areas in social networks and bioinformatics.

Diogo Renno received a B.Sc. degree in Computer Science from the Universidade Federal de Minas Gerais in 2012 and is currently pursuing an M.Sc. at the same university. His main research interests include data mining and machine learning.

Wagner Meira Jr. obtained his Ph.D. from the University of Rochester in 1997 and is a Full Professor at the Computer Science Department at Universidade Federal de Minas Gerais, Brazil. His research focuses on scalability and efficiency of large-scale parallel and distributed systems, from massively parallel to Internet-based platforms, and on data mining algorithms, their parallelization, and application to areas such as information retrieval, bioinformatics, and e-governance.