User demographics prediction based on mobile data

https://doi.org/10.1016/j.pmcj.2013.07.009

Abstract

Demographics prediction is an important component of user profile modeling. Accurate prediction of users’ demographics can benefit many applications, ranging from web search and personalization to behavior targeting. In this paper, we focus on how to predict users’ demographics, including “gender”, “job type”, “marital status”, “age” and “number of family members”, based on mobile data, such as users’ usage logs, physical activities and environmental contexts. The core idea is to build a supervised learning framework, where each user is represented as a feature vector and users’ demographics are considered as prediction targets. The most important step is to construct features from the raw data so that supervised learning models can be applied. We propose a feature construction framework, CFC (contextual feature construction), where each feature is defined as the conditional probability of one user activity under the given contexts. Besides employing standard supervised learning models, we propose a regularized multi-task learning framework to model the different demographics predictions collectively. We also propose a cost-sensitive classification framework for the regression tasks, in order to benefit from existing dimension reduction methods. Finally, because training instances are limited, we employ ensemble methods to avoid overfitting. The experimental results show that the framework achieves classification accuracies on “gender”, “job” and “marital status” as high as 96%, 83% and 86%, respectively, and Root Mean Square Errors (RMSE) on “age” and “number of family members” as low as 0.69 and 0.66, respectively, under leave-one-out evaluation.

Introduction

Demographics have been shown to be important information in personalization [1], web search [2] and advertisement targeting [3]. They can help learning models infer users’ interests and hence improve an application’s performance. However, due to privacy issues or inconvenience, such demographic information is usually incomplete. Thus, predicting users’ demographics accurately is critical for many applications. Fortunately, smartphones have become increasingly popular, and their sensors and applications make it possible to monitor users’ activities, such as making phone calls, visiting places and playing games. In addition, these activity logs are closely related to users’ interests and backgrounds. For example, male users may play more games than female users; married people may spend more time at home; and business people may make more phone calls. Thus, one may ask: can we exploit such information to help predict users’ demographics?

The third task of the Nokia Mobile Data Challenge (MDC) 2012 provides such a dataset. Its goal is to predict users’ demographics, including “gender”, “job type”, “marital status”, “age” and “number of family members”, based on users’ mobile data, including call logs, application logs, media logs, activity logs, travel logs and environmental contexts. We formulate these five prediction tasks as supervised learning problems. Formally, we aim to construct a function whose input is an instance x and whose output is the prediction ŷ, i.e., ŷ = f(x). For classification tasks, i.e., “gender”, “job type” and “marital status”, ŷ is a discrete value, and for regression tasks, i.e., “age” and “number of family members”, ŷ is continuous. However, unlike standard supervised learning problems, there are several issues we need to solve. First, the log data are raw data, which are not well structured and can be noisy. Thus, we need to construct valuable features and represent each user as a feature vector in order to apply machine learning models. To cope with this problem, we propose a feature construction framework called CFC, Contextual Feature Construction, where each feature is defined as the probability of one user activity happening under the given contexts. Second, the predictions of different demographics are related to each other. For example, job type depends closely on age; a PhD student is unlikely to be younger than 16 or older than 35. Consequently, we treat each individual demographics prediction, for example, the prediction of gender, as a learning task and propose a novel regularized multi-task framework to model the relationships between different demographics predictions. In addition, since most machine learning research focuses on classification problems, we propose a cost-sensitive classification framework to solve the regression tasks so that they can benefit from this work.
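To make the idea of a contextual feature concrete, the following is a minimal sketch that estimates the probability of an activity given a context from raw counts. The event tuples, the activity names and the three time-of-day contexts are hypothetical illustrations, not the MDC schema or the authors' actual feature set.

```python
from collections import Counter

# Hypothetical event log: (user_id, activity, hour_of_day).
# These rows and field names are illustrative assumptions only.
EVENTS = [
    ("u1", "call", 8), ("u1", "call", 9), ("u1", "game", 10), ("u1", "game", 20),
    ("u2", "game", 21), ("u2", "game", 22), ("u2", "call", 23), ("u2", "call", 10),
]
ACTIVITIES = ("call", "game")
CONTEXTS = ("morning", "afternoon", "evening")


def time_context(hour):
    """Map an hour (0-23) to a coarse, assumed time-of-day context."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "evening"  # night hours are folded into "evening" for simplicity


def contextual_features(events):
    """Return {user: [P(activity | context), ...]} estimated from counts."""
    joint = Counter()          # (user, activity, context) -> count
    context_total = Counter()  # (user, context) -> count of all activities
    users = set()
    for user, activity, hour in events:
        ctx = time_context(hour)
        joint[(user, activity, ctx)] += 1
        context_total[(user, ctx)] += 1
        users.add(user)

    features = {}
    for user in sorted(users):
        vector = []
        for activity in ACTIVITIES:
            for ctx in CONTEXTS:
                total = context_total[(user, ctx)]
                # No events observed in this context: mark the value as missing.
                vector.append(joint[(user, activity, ctx)] / total if total else None)
        features[user] = vector
    return features


FEATURES = contextual_features(EVENTS)
```

Each user ends up with one value per (activity, context) pair. Contexts with no observations are left as `None`, so a later cleaning step can decide how to fill them.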

We briefly describe our solution, which is a supervised learning framework, as follows. The proposed framework has five components:

  • feature construction, which extracts features from the raw data, such as the calling probability in the morning, game playing probability in the evening, and the like, and represents each user as a feature vector;

  • data cleaning, which replaces missing data and performs normalization;

  • feature selection and extraction, which reduces the large number of dimensions introduced by the feature construction step;

  • model building, which builds classification or regression models based on the selected features;

  • prediction adjustment, which adjusts the prediction results based on the relationships between demographics and performs ensembling.
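The data cleaning, feature selection and model building components can be sketched as a single scikit-learn pipeline. This is only an illustration on synthetic data: the mean imputation, the univariate selector, the value of k and the logistic regression model are assumptions chosen for brevity, not the configuration used in the paper. Feature construction and prediction adjustment are not covered here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a user-by-feature matrix (assumption, not MDC data).
rng = np.random.default_rng(0)
n_users, n_features = 40, 20
X = rng.normal(size=(n_users, n_features))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic binary label
X[rng.random(X.shape) < 0.05] = np.nan         # simulate missing feature values

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),    # data cleaning: fill missing values
    ("scale", StandardScaler()),                   # data cleaning: normalization
    ("select", SelectKBest(f_classif, k=5)),       # feature selection
    ("model", LogisticRegression(max_iter=1000)),  # model building
])
pipeline.fit(X, y)
PREDICTIONS = pipeline.predict(X)
```

Wrapping the steps in one pipeline keeps imputation, scaling and selection inside each training fold when the pipeline is later cross-validated, which avoids leaking information from held-out users.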

We highlight the novelties of our solution as follows.
  1. A contextual feature construction framework, CFC, is given. In addition to handling huge amounts of data, we formulate the original dataset into an entity-relation model, where each kind of information is an entity (corresponding to one csv file in the provided dataset, such as call.csv) and the relations between entities are built according to the user ID. Then, to construct each feature, we can send a query to the constructed relational database to obtain sufficient statistics. Our previous work [4] shows that feature construction is an effective way to provide valuable features.

  2. Multi-task model building/adjustment is proposed. One interesting point of the MDC challenge is that the different demographics predictions, or tasks, are correlated. Thus, besides building individual models for each demographics prediction, we also propose a multi-task learning model and adjust predictions based on task relevance. In our experiments, the proposed multi-task logistic regression model outperforms the original logistic regression in all classification tasks, and the task-relevant prediction adjustment improves accuracy by 4% on classification tasks and reduces Root Mean Square Error (RMSE) by 0.02 on regression tasks.

  3. Cost-sensitive classification-based regression models are built. As most dimension reduction methods are proposed for classification problems, we convert the regression tasks (“number of family members” and “age”) into cost-sensitive classification problems in order to utilize these state-of-the-art approaches. Experimental results show that the cost-sensitive SVM outperforms the support vector regression and linear regression models by as much as 0.12 in RMSE.
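One way to read the conversion of a regression task into cost-sensitive classification is sketched below. It treats each distinct target value as a class, obtains class probabilities from a classifier, and predicts the value with the lowest expected squared-error cost. This is a swapped-in illustration under stated assumptions: it uses logistic regression rather than the cost-sensitive SVM reported in the paper, and the squared-difference cost matrix and the synthetic ordinal target are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: an ordinal target with values 1..4 (a stand-in for an
# age group; not the MDC labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.clip(np.round(2.5 + X[:, 0]), 1, 4).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

values = clf.classes_.astype(float)              # numeric value of each class
cost = (values[:, None] - values[None, :]) ** 2  # cost[k, j]: predict k, truth j
proba = clf.predict_proba(X)                     # shape (n_samples, n_classes)

# expected_cost[i, k] = sum_j proba[i, j] * cost[k, j]
expected_cost = proba @ cost.T
Y_PRED = values[expected_cost.argmin(axis=1)]    # minimum expected cost decision

RMSE = float(np.sqrt(np.mean((Y_PRED - y) ** 2)))
```

Because the cost grows with the distance between the predicted and true values, a confusion between adjacent classes is penalized less than a confusion between distant ones, which is what ties the classification decision back to RMSE.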

Given the limited amount of labeled data, we propose a simple model averaging framework to generate the final predictions. Specifically, we first manually select the models with the best leave-one-out performance and then group them into several levels. Finally, we average the results within the different levels to obtain different final submissions. Our analysis indicates that the leave-one-out evaluation is unbiased while model averaging reduces the variance, and hence the built models generalize well. According to the results, our solution ranked first in the challenge.
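Leave-one-out evaluation combined with model averaging can be sketched as follows. The two models, the synthetic data and the plain unweighted average are assumptions for illustration; the manual selection of models and their grouping into levels described above are not reproduced.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.naive_bayes import GaussianNB

# Synthetic binary task (assumption, not MDC data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

models = [LogisticRegression(max_iter=1000), GaussianNB()]


def loo_probabilities(model, X, y):
    """Held-out class probabilities, leaving out one user at a time."""
    proba = np.zeros((len(y), 2))
    for train_idx, test_idx in LeaveOneOut().split(X):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        proba[test_idx] = fitted.predict_proba(X[test_idx])
    return proba


# Average the held-out probabilities of the individual models.
per_model = [loo_probabilities(m, X, y) for m in models]
AVERAGED = np.mean(per_model, axis=0)

# Classes are 0 and 1, so the argmax column index equals the predicted label.
LOO_ACCURACY = float(np.mean(AVERAGED.argmax(axis=1) == y))
```

Every prediction here comes from a model that never saw the corresponding user, so the accuracy estimate uses all labeled users without training on the test instance.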

Section snippets

Related works

We summarize related works in this section. The proposed framework can be considered as a multi-task learning algorithm for user profile modeling based on mobile data.

User profile modeling represents each user as a characteristic vector based on the users’ log data, such as borrowing logs, search logs, mobile logs, purchasing logs, and the like. As accurate profile modeling can benefit a lot of applications, such as advertisement targeting, recommendation, personalization, search, and so on, it

Learning framework with contextual feature construction

The proposed framework is composed of five components, including feature construction, data cleaning, feature selection and extraction, model building and prediction adjustment, as shown in Fig. 1. One important point is that we formulate the whole process as an iterative framework, and hence the evaluation results of the built models in each iteration can help us construct more meaningful features in the next iteration.

Experiments

The main task in the experiment is to perform model selection for the submissions. It aims to detect the best combination of feature sets, learning models and model parameters. Most model implementations are based on Weka, Libsvm and Scikit-learn. We evaluate the effectiveness of each combination using the leave-one-out (LOO) method. The results of classification tasks (“gender”, “marital

Submission description in the competition

We describe the submitted results for the competition as follows. We first adjust all predictions using the strategies presented in Section 3.6. Then, the adjusted and unadjusted predictions are pooled together for model averaging to generate the final submissions. For classification tasks, the final submission is the model average of the single-model results. Specifically, each single model will produce a prediction of class probabilities, or confidences for

Conclusion

We presented a supervised learning framework to predict users’ demographics based on log data recorded by mobile phones. Importantly, we proposed a unified feature construction process, Contextual Feature Construction (CFC), to build features from raw data for each user. We formulate each feature as the conditional probability of one user activity under given contexts. Consequently, these conditional probabilities can be computed through activity counts in each csv file. We then preprocessed

Acknowledgments

We gratefully acknowledge the support of Hong Kong RGC GRF projects 621010 and 621211. We thank Nathan N. Liu and Yin Zhu for discussions, and Shauna Dalton for revisions.

References (37)

  • K. Kira et al., A practical approach to feature selection
  • J. Hu, H.-J. Zeng, H. Li, C. Niu, Z. Chen, Demographic prediction based on user’s browsing behavior, in: Proceedings of...
  • I. Weber, C. Castillo, The demographics of web search, in: Proceedings of the 33rd International ACM SIGIR Conference...
  • L. Li, T. Mei, X. Niu, C.-W. Ngo, Pagesense: style-wise web page advertising, in: Proceedings of the 19th International...
  • W. Fan et al., Generalized and heuristic-free feature construction for improved accuracy
  • A. Mislove, B. Viswanath, K.P. Gummadi, P. Druschel, You are who you know: inferring user profiles in online social...
  • A. Ulges, M. Koch, D. Borth, Linking visual concept detection with viewer demographics, in: Proceedings of the 2nd ACM...
  • J. Otterbacher, Inferring gender of movie reviewers: exploiting writing style, content and metadata, in: Proceedings of...
  • G. Chittaranjan, J. Blom, D. Gatica-Perez, Who’s who with big-five: analyzing and classifying personality traits with...
  • J. Staiano et al., Friends don’t lie: inferring personality traits from social network structure
  • R. LiKamWa, Y. Liu, N. Lane, L. Zhong, Can your smartphone infer your mood? in: Second International Workshop on...
  • Y.-A. de Montjoye et al., Predicting personality using novel mobile phone-based metrics
  • R. Caruana, Multitask learning, Machine Learning (1997)
  • O. Chapelle, P. Shivaswamy, S. Vadrevu, K. Weinberger, Y. Zhang, B. Tseng, Multi-task learning for boosting with...
  • R. Gupta, L. Ratinov, Text categorization with knowledge transfer from heterogeneous data sources, in: Proceedings of...
  • L. Jacob et al., Clustered multi-task learning: a convex formulation
  • J. Chen, L. Tang, J. Liu, J. Ye, A convex formulation for learning shared structures from multiple tasks, in:...
  • T.K. Pong et al., Trace norm regularization: reformulations, algorithms, and multi-task learning, SIAM Journal on Optimization (2010)