ABSTRACT
There are many cases where one may wish to retrieve non-random, diverse, and fair samples from an imbalanced dataset. Working with over 90K tech job descriptions and the resumes of the applicants to those jobs, we describe an approach that uses evolutionary algorithms to derive a diverse, gender-fair subset for validating ML algorithms. The dataset was imbalanced: three quarters of the applicants were male. We describe how evolutionary algorithms allowed us to discover characteristics that differ between genders and to recognize issues with sparse representations. We then constructed additional optimization objectives to rectify these issues and ultimately obtain the desired sample.
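The general idea of evolving a diverse, gender-balanced subset can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single scalarized fitness (mean pairwise Euclidean distance minus a penalty on deviation from a 50/50 split), a swap mutation, and elitist survival. The names `diversity`, `imbalance`, `fitness`, `mutate`, and `evolve`, the `penalty` weight, and the group labels `"F"`/`"M"` are all hypothetical choices made for illustration.

```python
import random
from itertools import combinations


def diversity(subset, vectors):
    """Mean pairwise Euclidean distance among the selected items (needs k >= 2)."""
    pairs = list(combinations(sorted(subset), 2))
    total = 0.0
    for i, j in pairs:
        total += sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j])) ** 0.5
    return total / len(pairs)


def imbalance(subset, groups):
    """Absolute gap between the subset's share of group "F" and 0.5."""
    share = sum(groups[i] == "F" for i in subset) / len(subset)
    return abs(share - 0.5)


def fitness(subset, vectors, groups, penalty=10.0):
    """Assumed scalarization: reward diversity, penalize group imbalance."""
    return diversity(subset, vectors) - penalty * imbalance(subset, groups)


def mutate(subset, n_items, rng):
    """Swap one selected index for an index that is not currently selected."""
    child = set(subset)
    child.remove(rng.choice(sorted(child)))
    child.add(rng.choice([i for i in range(n_items) if i not in child]))
    return frozenset(child)


def evolve(vectors, groups, k, pop_size=30, generations=60, seed=0):
    """Evolve size-k index subsets and return the fittest one found."""
    rng = random.Random(seed)
    n = len(vectors)
    population = [frozenset(rng.sample(range(n), k)) for _ in range(pop_size)]

    def score(subset):
        return fitness(subset, vectors, groups)

    for _ in range(generations):
        children = [mutate(s, n, rng) for s in population]
        # Elitist survival: keep the best distinct subsets among parents
        # and children, so the best score never decreases.
        pool = set(population + children)
        population = sorted(pool, key=score, reverse=True)[:pop_size]
    return max(population, key=score)


if __name__ == "__main__":
    # Synthetic stand-in data: 40 random 2-D points, 75% labeled "M".
    data_rng = random.Random(42)
    vectors = [(data_rng.random(), data_rng.random()) for _ in range(40)]
    groups = ["M"] * 30 + ["F"] * 10
    best = evolve(vectors, groups, k=8)
    print(sorted(best), diversity(best, vectors), imbalance(best, groups))
```

A multi-objective formulation that keeps diversity and fairness as separate objectives would avoid having to choose the `penalty` weight; the scalarized version is used here only to keep the sketch short.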
Index Terms
- Optimizing sample diversity with fairness constraints on imbalanced, sparse, hiring data