User demographics prediction based on mobile data
Introduction
Demographics have been shown to be important information in personalization [1], web search [2] and advertisement targeting [3]. They can help learning models infer users’ interests and hence improve an application’s performance. However, due to privacy concerns or inconvenience, such demographic information is usually incomplete. Thus, predicting users’ demographics accurately is critical for many applications. Fortunately, smartphones have become increasingly popular, and the different sensors and applications in a phone make it possible to monitor users’ activities, such as phone calls, visiting and game playing. In addition, these activity logs are closely related to users’ interests and backgrounds. For example, male users may play more games than female users; married people may spend more time at home; and business people may make more phone calls. Thus, one may ask: can we exploit such information to help predict users’ demographics?
The third task of the Nokia Mobile Data Challenge (MDC) 2012 provides such a dataset. Its target is to predict users’ demographics, including “gender”, “job type”, “marital status”, “age” and “number of family members”, based on users’ mobile data, including call logs, application logs, media logs, activity logs, travel logs and environmental contexts. We formulate these five prediction tasks as supervised learning problems. Formally, we aim to construct a function $f$ whose input is an instance $x$ and whose output is the prediction $y$, i.e., $y = f(x)$. For the classification tasks, i.e., “gender”, “job type” and “marital status”, $y$ is a discrete value, and for the regression tasks, i.e., “age” and “number of family members”, $y$ is continuous. However, unlike standard supervised learning problems, there are several issues we need to solve. Firstly, the log data are raw data, which are not well represented and can be noisy. Thus, we need to construct valuable features and represent each user as a feature vector in order to apply machine learning models. To cope with this problem, we propose a feature construction framework called CFC, Contextual Feature Construction, where each feature is defined as the probability of one user activity happening under the given contexts. Secondly, the predictions of different demographics are related to each other. For example, job type is closely dependent on age; a PhD student is unlikely to be younger than 16 or older than 35. Consequently, we treat each individual demographics prediction, for example, the prediction of gender, as a learning task and propose a novel regularized multi-task framework to model the relationship between different demographics predictions. In addition, since most machine learning research focuses on classification problems, we propose a cost-sensitive classification framework to solve the regression tasks so that they can benefit from this research.
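As a rough illustration of the CFC idea described above, the following minimal sketch estimates each feature as the conditional probability of an activity given a context, using plain event counts. The log layout, the context values ("morning", "evening") and the function name `contextual_features` are assumptions made for this example and are not taken from the MDC dataset or the authors' implementation.

```python
from collections import Counter

# Hypothetical activity log: (user_id, activity, context) tuples.
# The fields and values are illustrative, not the MDC schema.
LOG = [
    ("u1", "call", "morning"),
    ("u1", "game", "evening"),
    ("u1", "call", "morning"),
    ("u1", "game", "morning"),
    ("u2", "call", "evening"),
    ("u2", "call", "evening"),
]


def contextual_features(log):
    """Estimate P(activity | context) per user from raw event counts."""
    joint = Counter()    # events per (user, context, activity)
    context = Counter()  # events per (user, context)
    for user, activity, ctx in log:
        joint[(user, ctx, activity)] += 1
        context[(user, ctx)] += 1

    features = {}
    for (user, ctx, activity), n in joint.items():
        name = f"P({activity}|{ctx})"
        features.setdefault(user, {})[name] = n / context[(user, ctx)]
    return features


features = contextual_features(LOG)
```

Each user ends up with a dictionary of named probabilities that can be flattened into a feature vector; activities never observed under a context are simply absent here and would need a default value (for instance 0) before model building.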
We briefly describe our solution, which is a supervised learning framework, as follows. The proposed framework has five components:
- feature construction, which extracts features from the raw data, such as the calling probability in the morning, game playing probability in the evening, and the like, and represents each user as a feature vector;
- data cleaning, which replaces missing data and performs normalization;
- feature selection and extraction, which reduces the huge number of dimensions brought in by the feature construction step;
- model building, which builds classification or regression models based on the selected features;
- prediction adjustment, which adjusts the prediction results based on the relationships between demographics and performs ensembling.
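To make the data cleaning component more concrete, here is a minimal sketch of one possible version of it. The text above does not specify which imputation or normalization scheme is used, so mean imputation and min-max scaling are assumptions chosen for illustration, as is the function name `impute_and_normalize`.

```python
def impute_and_normalize(rows):
    """Fill missing values (None) with the column mean, then min-max scale."""
    n_cols = len(rows[0])
    cleaned = [list(r) for r in rows]
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        lo, hi = (min(observed), max(observed)) if observed else (0.0, 0.0)
        span = hi - lo
        for r in cleaned:
            value = mean if r[j] is None else r[j]
            # A constant column carries no information; map it to 0.
            r[j] = (value - lo) / span if span > 0 else 0.0
    return cleaned


# Three users, two features; the first user has a missing second feature.
X = [
    [0.2, None],
    [0.6, 10.0],
    [1.0, 30.0],
]
X_clean = impute_and_normalize(X)
```

After this step every feature lies in [0, 1], so features with large raw ranges do not dominate the later selection and model building steps.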
1. A contextual feature construction framework, CFC, is given. In addition to handling huge amounts of data, we formulate the original dataset into an entity-relation model, where each kind of information is an entity (corresponding to one csv file in the provided dataset, such as call.csv) and the relations between entities are built according to the user ID. Then, to construct each feature, we can send a query to the constructed relational database to obtain sufficient statistics. Our previous work [4] shows that feature construction is an effective way to provide valuable features.
2. Multi-task model building/adjustment is proposed. One interesting point of the MDC challenge is that the different demographics predictions, i.e., the different tasks, are correlated. Thus, besides building individual models for each demographics prediction, we also propose a multi-task learning model and adjust predictions based on task relevance. In our experiments, the proposed multi-task logistic regression model outperforms the original logistic regression in all classification tasks, and the task-relevant prediction adjustment improves accuracy by 4% on classification tasks and reduces Root Mean Square Error (RMSE) by 0.02 on regression tasks.
3. Cost-sensitive classification based regression models are built. As most dimension reduction methods are proposed for classification problems, we convert the regression tasks (“number of family members” and “age”) into cost-sensitive classification problems in order to utilize these state-of-the-art approaches. Experimental results show that the cost-sensitive SVM outperforms the support vector regression and linear regression models by as much as 0.12 in RMSE.
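The idea of treating a regression target as a cost-sensitive classification problem can be sketched as follows. This is not the cost-sensitive SVM used in the experiments; it is a simpler stand-in that assumes some probabilistic classifier has already produced class probabilities over discretized target values, and then picks the value with the lowest expected cost. The squared-error cost, the example probabilities and the function names are assumptions for illustration.

```python
def squared_cost(true_value, predicted_value):
    """Cost of predicting `predicted_value` when the truth is `true_value`."""
    return (true_value - predicted_value) ** 2


def min_expected_cost_prediction(class_values, class_probs, cost=squared_cost):
    """Return the class value with the lowest expected misclassification cost."""
    best_value, best_cost = None, float("inf")
    for candidate in class_values:
        expected = sum(
            p * cost(v, candidate) for v, p in zip(class_values, class_probs)
        )
        if expected < best_cost:
            best_value, best_cost = candidate, expected
    return best_value


# Hypothetical class probabilities for "number of family members".
family_sizes = [1, 2, 3, 4, 5]
probs = [0.35, 0.05, 0.10, 0.20, 0.30]
prediction = min_expected_cost_prediction(family_sizes, probs)
```

In this example the most probable single class is 1, but the minimum expected cost decision is 3, because predicting an extreme value is heavily penalized when substantial probability mass sits at the other end of the range. This is the kind of behavior that makes a cost-sensitive formulation suitable when the evaluation metric is RMSE.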
Related works
We summarize related works in this section. The proposed framework can be considered as a multi-task learning algorithm for user profile modeling based on mobile data.
User profile modeling represents each user as a characteristic vector based on the users’ log data, such as borrowing logs, search logs, mobile logs, purchasing logs, and the like. As accurate profile modeling can benefit a lot of applications, such as advertisement targeting, recommendation, personalization, search, and so on, it
Learning framework with contextual feature construction
The proposed framework is composed of five components, including feature construction, data cleaning, feature selection and extraction, model building and prediction adjustment, as shown in Fig. 1. One important point is that we formulate the whole process as an iterative framework, and hence the evaluation results of the built models in each iteration can help us construct more meaningful features in the next iteration.
Experiments
The main task in the experiment is to perform model selection for the submissions. It aims to identify the best combination of feature sets, learning models and model parameters. Most model implementations are based on Weka, LIBSVM and scikit-learn. We evaluate the effectiveness of each combination using the leave-one-out (LOO) method. The results of classification tasks (“gender”, “marital
Submission description in the competition
We describe the submitted results for the competition as follows. We first adjust all predictions using the strategies presented in Section 3.6. Then, the adjusted predictions and the unadjusted predictions are pooled together to perform model averaging in order to generate the final submissions. For classification tasks, the final submission is the model averaging of single results. Specifically, each single model will produce a prediction of class probabilities, or confidences for
Conclusion
We presented a supervised learning framework to predict users’ demographics based on log data recorded by mobile phones. Importantly, we proposed a unified feature construction process, Contextual Feature Construction (CFC), to build features from raw data for each user. We formulate each feature as the conditional probability of one user activity under given contexts. Consequently, these conditional probabilities can be computed through activity counts in each csv file. We then preprocessed
Acknowledgments
We acknowledge the support of Hong Kong RGC GRF projects 621010 and 621211. We thank Nathan N. Liu and Yin Zhu for discussions, and Shauna Dalton for revisions.
References (37)
- et al., A practical approach to feature selection
- J. Hu, H.-J. Zeng, H. Li, C. Niu, Z. Chen, Demographic prediction based on user’s browsing behavior, in: Proceedings of...
- I. Weber, C. Castillo, The demographics of web search, in: Proceedings of the 33rd International ACM SIGIR Conference...
- L. Li, T. Mei, X. Niu, C.-W. Ngo, Pagesense: style-wise web page advertising, in: Proceedings of the 19th International...
- et al., Generalized and heuristic-free feature construction for improved accuracy
- A. Mislove, B. Viswanath, K.P. Gummadi, P. Druschel, You are who you know: inferring user profiles in online social...
- A. Ulges, M. Koch, D. Borth, Linking visual concept detection with viewer demographics, in: Proceedings of the 2nd ACM...
- J. Otterbacher, Inferring gender of movie reviewers: exploiting writing style, content and metadata, in: Proceedings of...
- G. Chittaranjan, J. Blom, D. Gatica-Perez, Who’s who with big-five: analyzing and classifying personality traits with...
- et al., Friends don’t lie: inferring personality traits from social network structure
- Predicting personality using novel mobile phone-based metrics
- Multitask learning, Machine Learning
- Clustered multi-task learning: a convex formulation
- Trace norm regularization: reformulations, algorithms, and multi-task learning, SIAM Journal on Optimization