User demographics prediction based on mobile data

doi:10.1016/j.pmcj.2013.07.009

Pervasive and Mobile Computing

Volume 9, Issue 6, December 2013, Pages 823-837

https://doi.org/10.1016/j.pmcj.2013.07.009

Abstract

Demographics prediction is an important component of user profile modeling. The accurate prediction of users’ demographics can help promote many applications, ranging from web search and personalization to behavior targeting. In this paper, we focus on how to predict users’ demographics, including “gender”, “job type”, “marital status”, “age” and “number of family members”, based on mobile data, such as users’ usage logs, physical activities and environmental contexts. The core idea is to build a supervised learning framework, where each user is represented as a feature vector and users’ demographics are considered as prediction targets. The most important step is to construct features from raw data, after which supervised learning models can be applied. We propose a feature construction framework, CFC (contextual feature construction), where each feature is defined as the conditional probability of one user activity under the given contexts. In addition, besides employing standard supervised learning models, we propose a regularized multi-task learning framework to model different kinds of demographics predictions collectively. We also propose a cost-sensitive classification framework for regression tasks, in order to benefit from the existing dimension reduction methods. Finally, due to the limited number of training instances, we employ ensemble methods to avoid overfitting. The experimental results show that the framework achieves classification accuracies on “gender”, “job” and “marital status” as high as 96%, 83% and 86%, respectively, and achieves Root Mean Square Error (RMSE) on “age” and “number of family members” as low as 0.69 and 0.66, respectively, under the leave-one-out evaluation.

Introduction

Demographics have been shown to be important information in personalization [1], web search [2] and advertisement targeting [3]. They can help learning models infer users’ interests and hence improve an application’s performance. However, due to privacy concerns or inconvenience, such demographic information is usually incomplete. Thus, predicting users’ demographics accurately is critical for many applications. Fortunately, smartphones have become increasingly popular, and the sensors and applications in a phone make it possible to monitor users’ activities, such as phone calls, visiting, and game playing. In addition, these activity logs are closely related to users’ interests and backgrounds. For example, male users may play more games than female users; married people may spend more time at home; and business people may make more phone calls. Thus, one may ask, can we exploit such information to help predict users’ demographics?

The third task of the Nokia Mobile Data Challenge (MDC) 2012 provides such a dataset. Its target is to predict users’ demographics, including “gender”, “job type”, “marital status”, “age” and “number of family members”, based on users’ mobile data, including users’ call logs, application logs, media logs, activity logs, travel logs and users’ environmental contexts. We formulate these five prediction tasks as supervised learning problems. Formally, we aim to construct a function whose input is an instance $x$ and whose output is the prediction $\hat{y}$, i.e., $\hat{y} = f(x)$. For classification tasks, i.e., “gender”, “job type” and “marital status”, $\hat{y}$ is a discrete value, and for regression tasks, i.e., “age” and “number of family members”, $\hat{y}$ is continuous. However, unlike standard supervised learning problems, there are several issues we need to solve. Firstly, the log data are raw data, which are not well represented and can be noisy. Thus, we need to construct valuable features and represent each user as a feature vector in order to apply machine learning models. To cope with this problem, we propose a feature construction framework called CFC, Contextual Feature Construction, where each feature is defined as the probability of one user activity happening under the given contexts. Secondly, the predictions of different demographics are related to each other. For example, job type is closely dependent on age; a PhD student is unlikely to be younger than 16 or older than 35. Consequently, we consider each individual demographics prediction, for example, the prediction of gender, as a learning task and propose a novel regularized multi-task framework to model the relationship between different demographics predictions. In addition, since most machine learning research focuses on classification problems, we propose a cost-sensitive classification framework to solve the regression tasks so that they can benefit from that work.
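The CFC idea of a feature as the probability of an activity under a given context can be illustrated with a minimal sketch. It assumes each log record has already been reduced to a (context, activity) pair; the function name `contextual_features`, the context and activity labels, and the choice of 0.0 for contexts with no events are all hypothetical and not taken from the paper.

```python
from collections import Counter

def contextual_features(events, activities, contexts):
    """Estimate P(activity | context) for one user from (context, activity) events.

    Returns a dict mapping (activity, context) -> conditional probability.
    Contexts with no events get 0.0 (an illustrative assumption).
    """
    events = list(events)
    pair_counts = Counter(events)                       # count of each (context, activity)
    context_counts = Counter(ctx for ctx, _ in events)  # count of each context
    features = {}
    for context in contexts:
        total = context_counts[context]
        for activity in activities:
            count = pair_counts[(context, activity)]
            features[(activity, context)] = count / total if total else 0.0
    return features

# Hypothetical log for one user: (time-of-day context, activity)
log = [
    ("morning", "call"), ("morning", "call"), ("morning", "game"),
    ("evening", "game"), ("evening", "game"), ("evening", "game"),
    ("evening", "call"),
]
feats = contextual_features(
    log, activities=["call", "game"], contexts=["morning", "evening", "night"]
)
```

Each entry of `feats` would become one coordinate of the user's feature vector; for example, `feats[("game", "evening")]` plays the role of a "game playing probability in the evening" feature.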

We briefly describe our solution, which is a supervised learning framework, as follows. The proposed framework has five components:

• feature construction, which extracts features from the raw data, such as the calling probability in the morning, game playing probability in the evening, and the like, and represents each user as a feature vector;
• data cleaning, which replaces missing data and performs normalization;
• feature selection and extraction, which reduces the huge number of dimensions, brought in the feature construction step;
• model building, which builds classification or regression models based on the selected features;
• prediction adjustment, which adjusts the prediction results based on demographics relationship and performs ensemble.
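As an illustration of the data cleaning component only, the following sketch replaces missing values and normalizes each feature column. Mean imputation and z-score normalization are assumptions made for the example; this excerpt does not state which replacement or normalization scheme the authors used, and the name `clean_features` is hypothetical.

```python
from statistics import mean, pstdev

def clean_features(rows):
    """Impute missing values (None) with the column mean, then z-score each column.

    rows: list of equal-length feature vectors, one per user.
    Both choices are illustrative, not confirmed by the paper.
    """
    n_cols = len(rows[0])
    cleaned = [list(row) for row in rows]
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_mean = mean(observed) if observed else 0.0
        for row in cleaned:
            if row[j] is None:
                row[j] = col_mean
        column = [row[j] for row in cleaned]
        mu, sigma = mean(column), pstdev(column)
        for row in cleaned:
            # A constant column carries no information; map it to 0.0.
            row[j] = (row[j] - mu) / sigma if sigma > 0 else 0.0
    return cleaned

# Hypothetical feature vectors for three users; None marks a missing value.
users = [[0.2, None], [0.4, 10.0], [0.6, 30.0]]
cleaned_users = clean_features(users)
```

After this step every column has zero mean, so features measured on very different scales become comparable before feature selection and model building.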

We highlight the novelties of our solution as follows.

1. A contextual feature construction framework, CFC, is given. In addition to handling huge amounts of data, we formulate the original dataset as an entity-relation model, where each kind of information is an entity (corresponding to one csv file in the provided dataset, such as call.csv) and the relations between entities are built according to the user ID. Then, to construct each feature, we can send a query to the constructed relational database to obtain sufficient statistics. Our previous work [4] shows that feature construction is an effective way to provide valuable features.
2. Multi-task model building/adjustment is proposed. One interesting point of the MDC challenge is that different demographics predictions, or different tasks, are correlated. Thus, besides building individual models for each individual demographics prediction, we also propose a multi-task learning model and adjust predictions based on task relevance. In our experiments, the proposed multi-task logistic regression model outperforms the original logistic regression in all classification tasks, and the task-relevant prediction adjustment improves accuracy by 4% on classification tasks and reduces Root Mean Square Error (RMSE) by 0.02 on regression tasks.
3. Cost-sensitive classification based regression models are built. As most dimension reduction methods are proposed for classification problems, to utilize these state-of-the-art approaches, we convert the regression tasks (“number of family members” and “age”) into cost-sensitive classification problems. Experimental results show that the cost-sensitive SVM outperforms the support vector regression and linear regression models by as much as 0.12 in RMSE.
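One way to picture the conversion of a regression task into cost-sensitive classification is the minimum-expected-cost decision rule sketched below. This is not the cost-sensitive SVM used in the paper: it assumes some classifier has already produced class probabilities over discretized target values, and it assumes a squared-error cost matrix. All function names and numbers are hypothetical.

```python
def squared_error_costs(values):
    """Cost matrix C[k][j]: cost of predicting values[k] when the truth is values[j]."""
    return [[(vk - vj) ** 2 for vj in values] for vk in values]

def min_expected_cost_prediction(class_probs, values):
    """Return the class value with the lowest expected cost under class_probs."""
    costs = squared_error_costs(values)
    expected = [sum(p * c for p, c in zip(class_probs, row)) for row in costs]
    best = min(range(len(values)), key=expected.__getitem__)
    return values[best]

def rmse(predictions, targets):
    """Root Mean Square Error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return (sum(squared) / len(targets)) ** 0.5

# Hypothetical discretized target ("number of family members") and class probabilities.
family_sizes = [1, 2, 3, 4, 5]
probs = [0.40, 0.05, 0.05, 0.10, 0.40]
prediction = min_expected_cost_prediction(probs, family_sizes)
```

With these probabilities a plain most-probable-class rule would pick an extreme value (1 or 5), whereas the cost-aware rule picks the middle value 3, because misclassifying into a distant class is penalized more heavily. That distance-aware penalty is what lets a classifier be evaluated sensibly with RMSE.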

Considering the limited amount of labeled data, we propose a simple model averaging framework to generate the final predictions. Specifically, we first manually select the models with the best leave-one-out performance and then group them into several levels. Finally, we average the results with respect to the different levels to obtain different final submissions. We argue that the leave-one-out evaluation is not biased while model averaging reduces the variance, and hence the built models generalize well. According to the results, our solution ranked first in the challenge.¹
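The two ingredients named above, leave-one-out evaluation and model averaging, can be sketched as follows. The nearest-class-mean classifier, the one-dimensional toy data and the probability vectors are hypothetical stand-ins used only to exercise the loop; the manual selection of models and their grouping into levels are not reproduced here.

```python
def leave_one_out_accuracy(fit, X, y):
    """Hold out each instance once; fit(X_train, y_train) must return predict(x)."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        predict = fit(X_train, y_train)
        hits += predict(X[i]) == y[i]
    return hits / len(X)

def average_probabilities(prob_vectors):
    """Average per-class probabilities produced by several models for one user."""
    n_models = len(prob_vectors)
    return [sum(column) / n_models for column in zip(*prob_vectors)]

def fit_nearest_mean(X_train, y_train):
    """Toy 1-D nearest-class-mean classifier (illustrative only)."""
    means = {}
    for label in set(y_train):
        members = [x for x, t in zip(X_train, y_train) if t == label]
        means[label] = sum(members) / len(members)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

# Hypothetical single-feature users with binary labels.
X = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
y = ["female", "female", "female", "male", "male", "male"]
loo_acc = leave_one_out_accuracy(fit_nearest_mean, X, y)

# Three hypothetical models' class probabilities for one user.
avg = average_probabilities([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
```

Because every instance is held out exactly once, leave-one-out uses almost all of the scarce labeled data for training in each round, while averaging several models' outputs smooths out the idiosyncrasies of any single model.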

Section snippets

Related works

We summarize related works in this section. The proposed framework can be considered as a multi-task learning algorithm for user profile modeling based on mobile data.

User profile modeling represents each user as a characteristic vector based on the users’ log data, such as borrowing logs, search logs, mobile logs, purchasing logs, and the like. As accurate profile modeling can benefit a lot of applications, such as advertisement targeting, recommendation, personalization, search, and so on, it

Learning framework with contextual feature construction

The proposed framework is composed of five components, including feature construction, data cleaning, feature selection and extraction, model building and prediction adjustment, as shown in Fig. 1. One important point is that we formulate the whole process as an iterative framework, and hence the evaluation results of the built models in each iteration can help us construct more meaningful features in the next iteration.

Experiments

The main task in the experiment is to perform model selection for the submissions. It aims to detect the best combination of feature sets, learning models and model parameters. Most model implementations are based on Weka,² Libsvm³ and Scikit-learn.⁴ We evaluate the effectiveness of each combination using the leave-one-out (LOO) method. The results of classification tasks (“gender”, “marital

Submission description in the competition

We describe the submitted results for the competition as follows. We first adjust all predictions using the strategies presented in Section 3.6. Subsequently, the adjusted and unadjusted predictions are pooled together to perform model averaging in order to generate the final submissions. For classification tasks, the final submission is the model averaging of single results. Specifically, each single model will produce a prediction of class probabilities, or confidences for

Conclusion

We presented a supervised learning framework to predict users’ demographics based on log data recorded by mobile phones. Importantly, we proposed a unified feature construction process, Contextual Feature Construction (CFC), to build features from raw data for each user. We formulate each feature as the conditional probability of one user activity under given contexts. Consequently, these conditional probabilities can be computed through activity counts in each csv file. We then preprocessed

Acknowledgments

We acknowledge the support of Hong Kong RGC GRF projects 621010 and 621211. We thank Nathan N. Liu and Yin Zhu for discussions, and Shauna Dalton for revisions.

References (37)

K. Kira et al., A practical approach to feature selection
J. Hu, H.-J. Zeng, H. Li, C. Niu, Z. Chen, Demographic prediction based on user’s browsing behavior, in: Proceedings of...
I. Weber, C. Castillo, The demographics of web search, in: Proceedings of the 33rd International ACM SIGIR Conference...
L. Li, T. Mei, X. Niu, C.-W. Ngo, Pagesense: style-wise web page advertising, in: Proceedings of the 19th International...
W. Fan et al., Generalized and heuristic-free feature construction for improved accuracy
A. Mislove, B. Viswanath, K.P. Gummadi, P. Druschel, You are who you know: inferring user profiles in online social...
A. Ulges, M. Koch, D. Borth, Linking visual concept detection with viewer demographics, in: Proceedings of the 2nd ACM...
J. Otterbacher, Inferring gender of movie reviewers: exploiting writing style, content and metadata, in: Proceedings of...
G. Chittaranjan, J. Blom, D. Gatica-Perez, Who’s who with big-five: analyzing and classifying personality traits with...
J. Staiano et al., Friends don’t lie: inferring personality traits from social network structure
R. LiKamWa, Y. Liu, N. Lane, L. Zhong, Can your smartphone infer your mood? in: Second International Workshop on...
Y.-A. de Montjoye et al., Predicting personality using novel mobile phone-based metrics
R. Caruana, Multitask learning, Machine Learning (1997)
O. Chapelle, P. Shivaswamy, S. Vadrevu, K. Weinberger, Y. Zhang, B. Tseng, Multi-task learning for boosting with...
R. Gupta, L. Ratinov, Text categorization with knowledge transfer from heterogeneous data sources, in: Proceedings of...
L. Jacob et al., Clustered multi-task learning: a convex formulation
J. Chen, L. Tang, J. Liu, J. Ye, A convex formulation for learning shared structures from multiple tasks, in:...
T.K. Pong et al., Trace norm regularization: reformulations, algorithms, and multi-task learning, SIAM Journal on Optimization (2010)

