A machine learning based approach to identify geo-location of Twitter users

Published: 22 March 2017


Twitter, a popular microblogging platform, has attracted great attention. It enables people from all over the world to interact in an extremely personal way. The immense quantity of user-generated text messages available on Twitter could serve as an important source of information for researchers and practitioners. This information may be used for many purposes, such as event detection, public health and crisis management. To coordinate such activities effectively, identifying the geo-locations of Twitter users is extremely important. Although online social networks can provide some geo-location information based on GPS coordinates, Twitter suffers from a geo-location sparseness problem. Identifying users' geo-locations from the content of the messages they send therefore becomes important. This paper presents a machine learning based approach to the problem. In this study, the corpus is represented as a word vector. To obtain a classification scheme with high predictive performance, the performance of five classification algorithms, three ensemble methods and two feature selection methods is evaluated. Among the compared algorithms, the highest result (84.85%) is achieved by an AdaBoost ensemble of Random Forest when the feature set is selected with the consistency-based feature selection method in conjunction with best-first search.
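The two preprocessing ideas named in the abstract, a word-vector representation and consistency-based feature selection, can be illustrated with a minimal sketch. This is not the paper's implementation: the toy tweets, the city labels, the whitespace tokenizer and the function names `build_vocabulary`, `to_word_vector` and `inconsistency_rate` are all assumptions made for illustration. The score shown is the commonly used inconsistency rate, which a search strategy such as best-first search would try to minimize over candidate feature subsets.

```python
from collections import Counter, defaultdict


def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index (sorted for determinism)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def to_word_vector(text, vocab):
    """Represent one text as a fixed-length list of term counts."""
    vec = [0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec


def inconsistency_rate(vectors, labels, feature_subset):
    """Score a feature subset: lower means the subset separates the classes better.

    Instances that look identical on the chosen features are grouped together;
    every instance outside its group's majority class counts as inconsistent.
    """
    groups = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        pattern = tuple(vec[i] for i in feature_subset)
        groups[pattern][label] += 1
    inconsistent = sum(
        sum(counts.values()) - max(counts.values()) for counts in groups.values()
    )
    return inconsistent / len(labels)


# Hypothetical toy corpus; the real study uses tweets labeled with user locations.
tweets = [
    "go phillies tonight",
    "phillies game tonight",
    "rainy glasgow morning",
    "glasgow sandwich morning",
]
labels = ["Philadelphia", "Philadelphia", "Glasgow", "Glasgow"]

vocab = build_vocabulary(tweets)
vectors = [to_word_vector(t, vocab) for t in tweets]

# "phillies" alone separates the two cities; "game" alone does not.
rate_phillies = inconsistency_rate(vectors, labels, [vocab["phillies"]])
rate_game = inconsistency_rate(vectors, labels, [vocab["game"]])
print(rate_phillies, rate_game)
```

In this toy setting the subset containing only "phillies" is fully consistent with the labels, while "game" leaves one of the four instances on the wrong side of a majority vote. A full pipeline would then pass the selected columns to a classifier such as the boosted Random Forest mentioned above.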



