Abstract
Machine learning can be applied in applications that make decisions that impact people's lives. Such techniques have the potential to make decision making more objective, but there is also a risk that the decisions discriminate against certain groups as a result of bias in the underlying data. Reducing bias, or promoting fairness, has been the focus of significant investigation in machine learning, for example, based on pre-processing the training data, changing the learning algorithm, or post-processing the results of the learning. However, prior to these activities, data integration discovers and integrates the data that is used for training, and data integration processes have the potential to produce data that leads to biased conclusions. In this article, we propose an approach that generates schema mappings in ways that take into account: (i) properties that are intrinsic to mapping results that may give rise to bias in analyses; and (ii) bias observed in classifiers trained on the results of different sets of mappings. The approach explores a space of different ways of integrating the data, using a Tabu search algorithm, guided by bias-aware objective functions that represent different types of bias. The resulting approach is evaluated using the Adult Census and German Credit datasets to explore the extent to which, and the circumstances in which, the approach can increase the fairness of the results of the data integration process.
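To make the search strategy described above concrete, the following is a minimal sketch of a Tabu search over subsets of candidate schema mappings, guided by a bias-aware objective. It is not the paper's implementation: the `disparate_impact` metric, the subset-flip neighbourhood, and all function names are illustrative assumptions; in practice the objective would score data produced by applying the selected mappings (e.g., via a classifier trained on the integrated result).

```python
def disparate_impact(rows, protected, outcome):
    """Ratio of favourable-outcome rates between protected (==1) and
    unprotected (==0) groups; 1.0 means no disparity. Hypothetical
    metric implementation, included only to make the objective concrete."""
    prot = [r for r in rows if r[protected] == 1]
    unprot = [r for r in rows if r[protected] == 0]
    if not prot or not unprot:
        return 0.0
    p_rate = sum(r[outcome] for r in prot) / len(prot)
    u_rate = sum(r[outcome] for r in unprot) / len(unprot)
    return p_rate / u_rate if u_rate else 0.0


def tabu_search(n_mappings, evaluate, iters=50, tenure=5):
    """Explore subsets of n_mappings candidate mappings. `evaluate`
    scores the data produced by a subset (higher = fairer). Neighbours
    differ by including/excluding one mapping; a recently flipped
    mapping is tabu for `tenure` iterations."""
    current = frozenset()
    best, best_score = current, evaluate(current)
    tabu = {}  # mapping index -> iteration until which it is tabu
    for it in range(iters):
        neighbours = []
        for m in range(n_mappings):
            if tabu.get(m, -1) >= it:
                continue  # skip moves still on the tabu list
            cand = current ^ {m}  # flip membership of mapping m
            neighbours.append((evaluate(cand), m, cand))
        if not neighbours:
            break
        score, moved, current = max(neighbours, key=lambda t: t[0])
        tabu[moved] = it + tenure
        if score > best_score:  # aspiration: remember the global best
            best, best_score = current, score
    return best, best_score
```

A bias-aware `evaluate` might, for instance, integrate the data selected by the subset, train a classifier on it, and return `min(di, 1/di)` for the classifier's disparate impact `di`, so that values nearer 1.0 (fairer) score higher. Because Tabu search accepts the best non-tabu neighbour even when it worsens the score, it can escape local optima in the space of mapping combinations.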
Index Terms
- Fairness-aware Data Integration