Abstract
Historically, policymakers and practitioners relied exclusively on survey and census data to design and plan assistive interventions; now, social media offer a timely and cost-effective way to reach populations that would otherwise remain unobserved. This study was designed to address the needs of a not-for-profit organisation seeking to reach young unemployed individuals in Italy with educational and job opportunities, via communication channels that are more likely to appeal to younger generations. To this end, we developed an ad-hoc Facebook application which administers questionnaires while gathering data about Likes on Facebook Pages. We then developed a machine learning framework that successfully predicts the unemployment status of an unseen individual (.74 AUC). However, blindly delegating the communication intervention to the machine learning model may lead to digital discrimination on the basis of socio-demographic characteristics. Here, we propose a framework that aims to optimise both the prediction performance and the most appropriate fairness metric. Our framework is based on an adaptive threshold for gender, and we show that it can be extended to other socio-demographic attributes and generalised to other interventions of an assistive character. We present a doubly cross-validated setting that achieves out-of-sample stability and generalisability of results. We compare the behaviour of models trained on different sets of data and provide an in-depth discussion of the most predictive features, demonstrating that the “fairness through unawareness” approach does not suffice to achieve a fair classification, since sensitive demographic information can be inferred not only from other socio-demographic attributes but also from behavioural digital patterns.
Finally, we thoroughly assess the behaviour of the adaptive threshold approach and discuss in depth both the advantages and the implications of such models, offering actionable insights. Our results show that fairness metrics should be carefully assessed, particularly when AI models are employed for policymaking.
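To make the idea of an adaptive, group-specific threshold concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the fairness criterion is an equal true positive rate across gender groups; the abstract does not state which metric the framework actually uses. The function names, the `target_tpr` parameter and the toy data are all hypothetical.

```python
from collections import defaultdict
import math


def group_thresholds(scores, labels, groups, target_tpr=0.8):
    """Pick one decision threshold per group so that each group's true
    positive rate is at least `target_tpr` (an assumed fairness criterion)."""
    positives = defaultdict(list)
    for s, y, g in zip(scores, labels, groups):
        if y == 1:
            positives[g].append(s)
    thresholds = {}
    for g, pos_scores in positives.items():
        pos_scores.sort(reverse=True)
        # Number of positives that must score at or above the threshold.
        k = max(math.ceil(target_tpr * len(pos_scores)), 1)
        thresholds[g] = pos_scores[k - 1]
    return thresholds


def predict(scores, groups, thresholds):
    """Classify each individual with the threshold of their own group."""
    return [int(s >= thresholds[g]) for s, g in zip(scores, groups)]


def tpr_by_group(preds, labels, groups):
    """True positive rate computed separately for each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        if y == 1:
            totals[g] += 1
            hits[g] += p
    return {g: hits[g] / totals[g] for g in totals}


# Toy data (invented for illustration): model scores, true unemployment
# labels (1 = unemployed) and a binary gender attribute.
scores = [0.9, 0.7, 0.6, 0.4, 0.8, 0.5, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 1, 1]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]

# A single global threshold of 0.5 applied to both groups.
global_preds = predict(scores, groups, {"F": 0.5, "M": 0.5})
global_tpr = tpr_by_group(global_preds, labels, groups)

# Group-specific thresholds chosen to reach the same target rate.
adaptive = group_thresholds(scores, labels, groups, target_tpr=0.6)
adaptive_preds = predict(scores, groups, adaptive)
adaptive_tpr = tpr_by_group(adaptive_preds, labels, groups)
```

On this toy sample, the global threshold reaches two of the three unemployed individuals in one group but only one of three in the other, whereas the group-specific thresholds reach two of three in both. In practice the thresholds would be selected on held-out data, in keeping with the cross-validated setting described above.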
Notes
We conventionally refer to the AUROC values as “accuracy” throughout this paper.
The gender attribute is considered to be a binary variable since very few participants opted for the “Other” option.
A comparison between the geographical distribution of our sample per region and the expected values from the official Census is shown in the Supplementary Materials.
This choice is based on the fact that neither group actively searches for a job.
Link to the list of categories: https://developers.facebook.com/docs/commerce-platform/catalog/categories/google-product-category-to-facebook-product-category
The full ranges for each hyperparameter are reported in the Supplementary Materials.
The baseline AUC for our tasks is .50.
Acknowledgements
K.K. acknowledges support from the “Lagrange Project” of the ISI Foundation funded by the Fondazione CRT.
Additional information
Responsible editor: Toon Calders.
About this article
Cite this article
Beiró, M.G., Kalimeri, K. Fairness in vulnerable attribute prediction on social media. Data Min Knowl Disc 36, 2194–2213 (2022). https://doi.org/10.1007/s10618-022-00855-y
DOI: https://doi.org/10.1007/s10618-022-00855-y