Skip to main content
Log in

Fairness in vulnerable attribute prediction on social media

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Historically, policymakers and practitioners relied exclusively on survey and census data to design and plan for assistive interventions; now, social media offer a timely and cost-effective way to reach out to populations otherwise unobserved. This study was designed to address the needs of a non-for-profit organisation to reach out to the young unemployed individuals in Italy with educational and job opportunities via communication channels that are more likely to appeal to younger generations. To this extend, we developed an ad-hoc Facebook application which administers questionnaires while gathering data about the Likes on Facebook Pages. Then, we developed a machine learning framework that successfully predicts the unemployment status of an unseen individual (.74 AUC). However, blindly delegating to the machine learning model the communication intervention may lead to digital discrimination on the basis of socio-demographic characteristics. Here, we propose a framework that aims to optimising both for the prediction performance as well as the most adequate fairness metric. Our framework is based on an adaptive threshold for gender, while we show that it can be expanded for other socio-demographic attributes and generalised for other interventions of assistive character. We present a doubly cross-validated setting that achieves out-of-sample stability and generalisability of results. We compare the behaviour of models that infer on different sets of data and provide an indepth discussion on the most predictive features, demonstrating that the “fairness through unawareness” approach does not suffice to achieve a fair classification since sensitive demographic information can be inferred not only via other sociodemographic attributes but also from behavioural digital patterns. Finally, we thoroughly assess the behaviour of the adaptive threshold approach and provide an in-depth discussion on the advantages but also the implications of such models offering actionable insights. Our results show that careful assessment of fairness metrics should be considered, primarily when AI models are employed for policymaking.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

Notes

  1. https://ec.europa.eu/social/main.jsp?catId=1036

  2. We conventionally refer to the AUROC values as “accuracy” throughout this paper.

  3. The gender attribute is considered to be a binary variable since very few participants opted for the “Other” option.

  4. A comparison between the geographical distribution of our sample per region and the expected values from the official Census is shown in the Supplementary Materials.

  5. This choice is based on the fact that both groups do not actively search for a job.

  6. Link to the list of categories: https://developers.facebook.com/docs/commerce-platform/catalog/categories/google-product-category-to-facebook-product-category

  7. The full ranges for each hyperparameter are reported in the Supplementary Materials.

  8. All experiments are performed in Python (Van Rossum and Drake 2009) with scikit-learn (Pedregosa et al. 2011).

  9. The baseline AUC for our tasks is .50.

References

  • Agarwal A, Beygelzimer A, Dudík M, Langford J, Wallach H (2018) A reductions approach to fair classification. In: International Conference on Machine Learning, pp 60–69. PMLR

  • Aiken E, Bellue S, Karlan D, Udry C, Blumenstock JE (2022) Machine learning and phone data can improve targeting of humanitarian aid. Nature 1–7

  • Akintande OJ (2021) Algorithm fairness through data inclusion, participation, and reciprocity. In: International Conference on Database Systems for Advanced Applications, Springer, pp 633–637

  • Baeza-Yates R, Ribeiro-Neto B et al (1999) Modern Information Retrieval, vol 463. ACM Press, New York

    Google Scholar 

  • Barocas S, Selbst AD (2016) Big data’s disparate impact. Calif L Rev 104:671

    Google Scholar 

  • Becker GS (2010) The Economics of Discrimination. University of Chicago Press, Chicago

    Google Scholar 

  • Bento M, Martinez LM, Martinez LF (2018) Brand engagement and search for brands on social media: Comparing generations x and y in portugal. J of Retailing and Consum Serv 43:234–241

    Article  Google Scholar 

  • Beutel A, Chen J, Doshi T, Qian H, Woodruff A, Luu C, Kreitmann P, Bischof J, Chi EH (2019) Putting fairness principles into practice: Challenges, metrics, and improvements. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp 453–459

  • Bi B, Shokouhi M, Kosinski M, Graepel T (2013) Inferring the demographics of search users: Social data meets search queries. In: Proceedings of the 22Nd International Conference on World Wide Web. WWW ’13, ACM, New York, NY, USA, pp 131–140. https://doi.org/10.1145/2488388.2488401

  • Bokányi E, Lábszki Z, Vattay G (2017) Prediction of employment and unemployment rates from twitter daily rhythms in the us. EPJ Data Sci 6(1):14

    Article  Google Scholar 

  • Bonanomi A, Rosina A, Cattuto C, Kalimeri K (2017) Understanding youth unemployment in italy via social media data. In: 28th IUSSP International Population Conference, Cape Town, South Africa

  • Calders T, Verwer S (2010) Three naive bayes approaches for discrimination-free classification. Data mining and knowl discov 21(2):277–292

    Article  MathSciNet  Google Scholar 

  • Chhabra A, Masalkovaitė K, Mohapatra P (2021) An overview of fairness in clustering. IEEE Access

  • Chouldechova A (2017) Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data 5(2):153–163

    Article  Google Scholar 

  • Corbett-Davies S, Pierson E, Feller A, Goel S, Huq A (2017) Algorithmic decision making and the cost of fairness. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’17, Association for Computing Machinery, New York, NY, USA pp 797–806. https://doi.org/10.1145/3097983.3098095

  • Desiere S, Langenbucher K, et al. (2018) Profiling tools for early identification of jobseekers who need extra support. OECD Policy Brief on Activation Policies (dec) 1–4

  • Desiere S, Struyven L (2020) Using artificial intelligence to classify jobseekers: The accuracy-equity trade-off. Journal Of Social Policy

  • Dong Y, Yang Y, Tang J, Yang Y, Chawla NV (2014) Inferring user demographics and social strategies in mobile social networks. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, USA, pp 15–24. https://doi.org/10.1145/2623330.2623703

  • Dutta S, Wei D, Yueksel H, Chen P-Y, Liu S, Varshney K (2020) Is there a trade-off between fairness and accuracy? a perspective using mismatched hypothesis testing. In: International Conference on Machine Learning, pp 2803–2813. PMLR

  • Eslami, M., Krishna Kumaran, S.R., Sandvig, C., Karahalios, K.: Communicating algorithmic process in online behavioral advertising. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–13 (2018)

  • Fatehkia M, Kashyap R, Weber I (2018) Using facebook ad data to track the global digital gender gap. World Dev 107:189–209

    Article  Google Scholar 

  • Fatehkia M, Coles B, Ofli F, Weber I (2020) The relative value of facebook advertising data for poverty mapping. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 14, pp 934–938

  • Felbo B, Sundsøy P, Lehmann S, de Montjoye Y-A et al. (2017) Modeling the temporal nature of human behavior for demographics prediction. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 140–152

  • Gao J, Zhang Y-C, Zhou T (2019) Computational socioeconomics. Physics Reports

  • Goel S, Hofman J, Sirer MI (2012) Who does what on the web: Studying web browsing behavior at scale. In: International Conference on Weblogs and Social Media, pp 130–137

  • Goyat S (2011) The basis of market segmentation: A critical review of literature. Eur J of Bus and Management 3(9):45–54

    Google Scholar 

  • Hardt M, Price E, Srebro N (2016) Equality of opportunity in supervised learning. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16, Red Hook, NY, USA, pp 3323–3331

  • ISTAT (2020) ISTAT Database. Data on unemployed rate. http://dati.istat.it

  • Kalimeri K, Beiró MG, Delfino M, Raleigh R, Cattuto C (2019) Predicting demographics, moral foundations, and human values from digital behaviours. Comput in Human Behav 92:428–445

    Article  Google Scholar 

  • Kalimeri K, Beiró MG, Bonanomi A, Rosina A, Cattuto C (2020) Traditional versus facebook-based surveys: Evaluation of biases in self-reported demographic and psychometric information. Demogr Res 42(5):133–148

    Article  Google Scholar 

  • Kamiran F, Calders T (2012) Data preprocessing techniques for classification without discrimination. Knowl and Inf Syst 33(1):1–33

    Article  Google Scholar 

  • Kamishima T, Akaho S, Asoh H, Sakuma J (2012) Fairness-aware classifier with prejudice remover regularizer. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 35–50

  • Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y (2017) Lightgbm: A highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems, pp 3146–3154

  • Kilbertus N, Rojas Carulla M, Parascandolo G, Hardt M, Janzing D, Schölkopf B (2017) Avoiding discrimination through causal reasoning. Advances in neural information processing systems 30

  • Kleinberg J, Mullainathan S, Raghavan M (2016) Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807

  • Kosinski M, Stillwell D, Graepel T (2013) Private traits and attributes are predictable from digital records of human behavior. Proc of the National Acad of Sci 110(15):5802–5805

    Article  Google Scholar 

  • Kuhn P (1987) Sex discrimination in labor markets: The role of statistical evidence. The American Economic Review 567–583

  • Leonelli S, Lovell R, Wheeler BW, Fleming L, Williams H (2021) From fair data to fair data use: Methodological data fairness in health-related social media research. Big Data & Soc 8(1):20539517211010310

    Article  Google Scholar 

  • Llorente A, Garcia-Herranz M, Cebrian M, Moro E (2015) Social media fingerprints of unemployment. PLOS ONE 10(5):1–13

    Article  Google Scholar 

  • Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, Katz R, Himmelfarb J, Bansal N, Lee S-I (2019) Explainable AI for Trees: From Local Explanations to Global Understanding

  • Lundberg SM, Lee S-I (2017a) A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in Neural Information Processing Systems 30, pp 4765–4774

  • Lundberg S, Lee S-I (2017b) A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874

  • Malmi E, Weber I (2016) You are what apps you use: Demographic prediction based on user’s apps. ICWSM, 635–638

  • Mason SJ, Graham NE (2002) Areas beneath the relative operating characteristics (roc) and relative operating levels (rol) curves: Statistical significance and interpretation. Quarterly J of the Royal Meteorol Soc 128(584):2145–2166

    Article  Google Scholar 

  • Matz SC, Menges JI, Stillwell DJ, Schwartz HA (2019) Predicting individual-level income from facebook profiles. PloS one 14(3):0214369

    Article  Google Scholar 

  • Ntoutsi E, Fafalios P, Gadiraju U, Iosifidis V, Nejdl W, Vidal M-E, Ruggieri S, Turini F, Papadopoulos S, Krasanakis E et al (2020) Bias in data-driven artificial intelligence systems-an introductory survey. Wiley Int Rev: Data Mining and Knowl Discov 10(3):1356

    Google Scholar 

  • Olteanu A, Castillo C, Diaz F, Kıcıman E (2019) Social data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data 2:13

    Article  Google Scholar 

  • Olteanu A, Castillo C, Diaz F, Kiciman E (2016) Social data: Biases, methodological pitfalls, and ethical boundaries. https://doi.org/10.2139/ssrn.2886526

  • O’Neil C (2016) Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown, New York

    MATH  Google Scholar 

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J of Mach Learning Res 12:2825–2830

    MathSciNet  MATH  Google Scholar 

  • Pedreshi D, Ruggieri S, Turini F (2008) Discrimination-aware data mining. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 560–568

  • Pessach D, Shmueli E (2022) A review on fairness in machine learning. ACM Comput Surveys (CSUR) 55(3):1–44

    Article  Google Scholar 

  • Rama D, Mejova Y, Tizzoni M, Kalimeri K, Weber I (2020) Facebook ads as a demographic tool to measure the urban-rural divide. In: Proceedings of The Web Conference 2020, pp 327–338

  • Saleiro P, Kuester B, Stevens A, Anisfeld A, Hinkson L, London J, Ghani R (2018) Aequitas: A bias and fairness audit toolkit. arXiv preprint arXiv:1811.05577

  • Seneviratne S, Seneviratne A, Mohapatra P, Mahanti A (2015) Your installed apps reveal your gender and more! ACM SIGMOBILE Mobile Comput and Commun Rev 18(3):55–61

    Article  Google Scholar 

  • Stoll MA, Raphael S, Holzer HJ (2004) Black job applicants and the hiring officer’s race. ILR Rev 57(2):267–287

    Article  Google Scholar 

  • Sundsøy P, Bjelland J, Reme B-A, Jahani E, Wetter E, Bengtsson L (2016) Estimating individual employment status using mobile phone network data. arXiv preprint arXiv:1612.03870

  • Toole JL, Lin Y-R, Muehlegger E, Shoag D, González MC, Lazer D (2015) Tracking employment shocks using mobile phone data. J of The Royal Soc Int 12(107):20150185

    Article  Google Scholar 

  • Urbinati A, Kalimeri K, Bonanomi A, Rosina A, Cattuto C, Paolotti D (2020) Young adult unemployment through the lens of social media: Italy as a case study. In: International Conference on Social Informatics, Springer, Cham, pp 380–396

  • van Landeghem B, Desiere S, Struyven L (2021) Statistical profiling of unemployed jobseekers. IZA World of Labor, Germany

    Book  Google Scholar 

  • Van Rossum G, Drake FL (2009) Python 3 Reference Manual. CreateSpace, Scotts Valley, CA

    Google Scholar 

  • Verma S, Rubin J (2018) Fairness definitions explained. In: 2018 IEEE/ACM International Workshop on Software Fairness (fairware), pp 1–7. IEEE

  • Wood R, Murch B, Betteridge R (2019) A comparison of population segmentation methods. Oper Res for Health Care 22:100192

    Article  Google Scholar 

  • Yeung K, Lodge M (2019) The Possibilities of Digital Discrimination: Research on E-commerce, Algorithms and Big Data. Oxford University Press, UK

    Google Scholar 

  • Ying JJ-C, Chang Y-J, Huang C-M, Tseng VS (2012) Demographic prediction based on users mobile behaviors. Mobile Data Challenge

  • Zafar MB, Valera I, Gomez Rodriguez M, Gummadi KP (2017) Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In: Proceedings of the 26th International Conference on World Wide Web, pp 1171–1180

  • Zemel R, Wu Y, Swersky K, Pitassi T, Dwork C (2013) Learning fair representations. In: International Conference on Machine Learning, pp 325–333. PMLR

  • Zhang BH, Lemoine B, Mitchell M (2018) Mitigating unwanted biases with adversarial learning. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp 335–340

  • Zhong Y, Yuan NJ, Zhong W, Zhang F, Xie X (2015) You are where you go: Inferring demographic attributes from location check-ins. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. WSDM ’15, ACM, New York, NY, USA, pp 295–304

Download references

Acknowledgements

K.K acknowledges support from the “Lagrange Project” of the ISI Foundation funded by the Fondazione CRT.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Mariano G. Beiró or Kyriaki Kalimeri.

Additional information

Responsible editor: Toon Calders.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 1261 KB)

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Beiró, M.G., Kalimeri, K. Fairness in vulnerable attribute prediction on social media. Data Min Knowl Disc 36, 2194–2213 (2022). https://doi.org/10.1007/s10618-022-00855-y

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-022-00855-y

Keywords

Navigation