Elsevier

Knowledge-Based Systems

Volume 51, October 2013, Pages 35-47

Twitter user profiling based on text and community mining for market analysis

https://doi.org/10.1016/j.knosys.2013.06.020

Abstract

This paper proposes demographic estimation algorithms for profiling Twitter users based on their tweets and community relationships. Many people post their opinions via social media services such as Twitter. This huge volume of opinions, expressed in real time, has great appeal as a novel marketing application. When automatically extracting these opinions, it is desirable to be able to discriminate them by user demographics, because the ratio of positive to negative opinions differs depending on demographics such as age, gender, and residence area, all of which are essential for market analysis. In this paper, we propose a hybrid text-based and community-based method for the demographic estimation of Twitter users, where these demographics are estimated by tracking the tweet history and clustering the followers/followees. Our experimental results from 100,000 Twitter users show that the proposed hybrid method improves on the accuracy of the text-based method. The proposed method is applicable to various user demographics and is suitable even for users who tweet only infrequently.

Introduction

Recently, due to the widespread popularity of the Internet, many people state their opinions via social media services. In particular, Twitter [1] is a suitable platform for real-time, casual communication. Many Twitter users post opinions about products, services, and TV programs. It is essential for companies to improve their products and services based on their customers’ requirements. As a means of using user opinions for marketing, reputation analysis technologies have recently attracted a great deal of attention [2], [3]. Compared to previous marketing approaches based on questionnaire surveys, online opinion analysis has many advantages, including real-time feedback, low cost, and high volume. User demographics such as age, gender, and residence area are also essential for marketing analysis, since opinions vary with user demographics. For example, mobile phone functions that are popular among young people are often found awkward to use by the elderly. Since most Twitter users do not state their demographic information, it has been impossible to extract opinions for individual user demographic segments (such as teens, twenties, or thirties). Several text-based approaches have been proposed to extract user demographic information [4], [5], [6]. However, only a few proposals exist for applying demographic estimation to large-scale, practical marketing analysis, because it is difficult to raise effectiveness and accuracy to a level sufficient for practical use. Considering practical use, we realized that a general approach is required for estimating a wide variety of demographics such as age, gender, area, and other categories. An estimation method targeting users with few tweets, such as followers of corporate accounts, is also important.

To solve these problems, we propose a hybrid of a text-based method and a community-based method for the demographic estimation of Twitter users. The text-based method estimates the demographics of users whose tweets contain sufficient text features. For all other users, the community-based method analyzes the followers/followees whose tweets contain plentiful text features. The hybrid method covers almost all users by making the most of the Twitter platform, including both tweets as text information and followers/followees as community information. In the text-based method, characteristic terms used by each demographic segment are automatically detected based on linguistic and statistical analysis by tracking the content of users’ tweet histories. For example, users whose tweets often include terms such as “school,” “classroom,” and “examination” are presumed to be teens and students. In the community-based method, demographic information is estimated from the follower/followee relations of the target user. In the proposed method, characteristic biases in the demographic segments of users are detected from the community groups constructed by clustering their followers and followees. A user can have several community groups, such as local friends, co-workers and hobby groups, where the members of each group have something in common such as age, gender and regional area.
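The fallback logic described above can be illustrated with a minimal sketch. It is not the authors' implementation: simple keyword counting stands in for the linguistic and statistical term detection (and for the trained classifier), and a plain majority vote over neighbors stands in for the clustering of followers/followees into community groups. The segment term lists, the `MIN_HITS` threshold, and all function names are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical characteristic terms per demographic segment (illustration only).
SEGMENT_TERMS = {
    "teens": {"school", "classroom", "examination"},
    "thirties": {"office", "meeting", "overtime"},
}

# Hypothetical threshold: minimum number of characteristic-term hits
# for a tweet history to count as having "sufficient text features".
MIN_HITS = 3


def text_estimate(tweets):
    """Return the segment with the most term hits, or None if hits are too few."""
    hits = Counter()
    for tweet in tweets:
        words = set(re.findall(r"[a-z]+", tweet.lower()))
        for segment, terms in SEGMENT_TERMS.items():
            hits[segment] += len(words & terms)
    if sum(hits.values()) < MIN_HITS:
        return None
    return hits.most_common(1)[0][0]


def community_estimate(neighbor_histories):
    """Majority vote over followers/followees that the text step can label.

    Assumption: no clustering into community groups is performed here.
    """
    votes = Counter()
    for tweets in neighbor_histories:
        label = text_estimate(tweets)
        if label is not None:
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None


def hybrid_estimate(user_tweets, neighbor_histories):
    """Use the text-based estimate when possible, else fall back to the community."""
    label = text_estimate(user_tweets)
    if label is not None:
        return label, "text"
    return community_estimate(neighbor_histories), "community"


if __name__ == "__main__":
    active = ["Examination at school today", "The classroom was noisy"]
    worker = ["Overtime after the meeting at the office"]
    quiet = ["hello"]
    print(hybrid_estimate(active, []))
    print(hybrid_estimate(quiet, [active, active, worker]))
```

In this sketch, a user with few tweets is labeled from the neighbors whose own histories are rich enough to be labeled by the text step, which mirrors the division of labor between the two methods.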

Social opinions and demographic information are extremely attractive to businesses. For instance, product planners need to understand user requirements, customer support and service management departments need to monitor customer responses, advertising agencies want to deliver persuasive advertisements to target audiences, and broadcast TV directors need real-time feedback from the audience. In this paper, we focus on Japanese Twitter users. However, the algorithms of the proposed text-based and community-based methods are applicable to any language.

The rest of the paper is structured as follows. We outline related work in Section 2. We describe the proposed text-based, community-based, and hybrid methods for demographic estimation in Section 3 and report the results of the performance evaluations in Section 4. We conclude this paper in Section 5.

Section snippets

Related work

Extracting author information from the Web has been attempted for a long time. Table 1 summarizes the previous works. An extraction method for author information from Web sources is proposed for the purpose of judging whether the information is trustworthy [7]. Koppel et al. classify three author attribution problems [8]: (1) the profiling problem, where the challenge is to provide as much demographic or psychological information as possible about the author [4], [5], [6]; (2) the

Proposed method

The proposed hybrid method consists of a training phase and an estimation phase. Figs. 2 and 3 give an overview of the proposed method. In the training phase, the text-based method analyzes the tweet history of known users, extracts characteristic terms, and trains SVMs on features of the terms used. In the estimation phase, the text-based method estimates the demographics of unknown users from their tweet history. The community-based method analyzes the follower/followee relations of unknown

Performance evaluation of demographic estimation

We evaluated the performance of the proposed methods using the datasets, evaluation metrics, and experimental environments described in the following subsections.

Conclusions

In this paper, we proposed a hybrid demographic estimation method for Twitter users based on their tweet history and communities constructed from follower/followee relationships. There have been no previous proposals for large-scale and practical marketing analysis methods of such demographic estimation due to the difficulty of producing a method with sufficient effectiveness and accuracy for practical use.

The proposed hybrid method is applicable to multiple user demographics and to users who

References (28)

  • Twitter....
  • K. Dave, S. Lawrence, D.M. Pennock, Mining the peanut gallery: opinion extraction and semantic classification of...
  • J. Wiebe et al., Finding mutual benefit between subjectivity analysis and information extraction, IEEE Transactions on Affective Computing (2011)
  • S. Argamon et al., Automatically profiling the author of an anonymous text, Communications of the ACM (2009)
  • D. Estival, T. Gaustad, S.B. Pham, W. Radford, B. Hutchinson, Author profiling for English emails, in: Proceedings of...
  • D.D. Pham, G.B. Tran, S.B. Pham, Author profiling for Vietnamese blogs, in: Proceedings of the International Conference...
  • Y. Kato, D. Kawahara, K. Inui, S. Kurohashi, S. Shibata, Identifying the information sender configuration of web pages,...
  • M. Koppel et al., Computational methods in authorship attribution, Journal of the American Society for Information Science and Technology (2009)
  • A. Abbasi et al., Writeprints: a stylometric approach to identity-level identification and similarity detection, ACM Transactions on Information Systems (2008)
  • M. Koppel et al., Measuring differentiability: unmasking pseudonymous authors, Journal of Machine Learning Research (2007)
  • R. Layton, P. Watters, R. Dazeley, Authorship attribution for Twitter in 140 characters or less, in: Proceedings of the...
  • R.S. Silva, G. Laboreiro, L. Sarmento, T. Grant, E. Oliveira, B. Maia, ‘twazn me!!!; (’automatic authorship analysis of...
  • H.G. Small, Co-citation in the scientific literature: a new measure of relationship between two documents, Journal of the American Society for Information Science (1973)
  • N.I. On et al., A link-based cluster ensemble approach for categorical data clustering, IEEE Transactions on Knowledge and Data Engineering (2010)