Abstract
Machine learning can be applied in applications that make decisions that impact people's lives. Such techniques have the potential to make decision making more objective, but there is also a risk that the decisions discriminate against certain groups as a result of bias in the underlying data. Reducing bias, or promoting fairness, has been the focus of significant investigation in machine learning, for example, based on pre-processing the training data, changing the learning algorithm, or post-processing the results of the learning. However, prior to these activities, data integration discovers and integrates the data that is used for training, and data integration processes have the potential to produce data that leads to biased conclusions. In this article, we propose an approach that generates schema mappings in ways that take into account: (i) properties that are intrinsic to mapping results that may give rise to bias in analyses; and (ii) bias observed in classifiers trained on the results of different sets of mappings. The approach explores a space of different ways of integrating the data, using a Tabu search algorithm, guided by bias-aware objective functions that represent different types of bias. The resulting approach is evaluated using the Adult Census and German Credit datasets to explore the extent to which, and the circumstances in which, the approach can increase the fairness of the results of the data integration process.
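To make the search strategy described above concrete, the following is a minimal sketch of a Tabu search over subsets of candidate schema mappings, guided by a bias-aware objective. It is not the paper's implementation: the `disparate_impact` metric, the subset-flip neighbourhood, and all function names are illustrative assumptions; in practice the objective would score data produced by applying the selected mappings (e.g., via a classifier trained on the integrated result).

```python
def disparate_impact(rows, protected, outcome):
    """Ratio of favourable-outcome rates between protected (==1) and
    unprotected (==0) groups; 1.0 means no disparity. Hypothetical
    metric implementation, included only to make the objective concrete."""
    prot = [r for r in rows if r[protected] == 1]
    unprot = [r for r in rows if r[protected] == 0]
    if not prot or not unprot:
        return 0.0
    p_rate = sum(r[outcome] for r in prot) / len(prot)
    u_rate = sum(r[outcome] for r in unprot) / len(unprot)
    return p_rate / u_rate if u_rate else 0.0


def tabu_search(n_mappings, evaluate, iters=50, tenure=5):
    """Explore subsets of n_mappings candidate mappings. `evaluate`
    scores the data produced by a subset (higher = fairer). Neighbours
    differ by including/excluding one mapping; a recently flipped
    mapping is tabu for `tenure` iterations."""
    current = frozenset()
    best, best_score = current, evaluate(current)
    tabu = {}  # mapping index -> iteration until which it is tabu
    for it in range(iters):
        neighbours = []
        for m in range(n_mappings):
            if tabu.get(m, -1) >= it:
                continue  # skip moves still on the tabu list
            cand = current ^ {m}  # flip membership of mapping m
            neighbours.append((evaluate(cand), m, cand))
        if not neighbours:
            break
        score, moved, current = max(neighbours, key=lambda t: t[0])
        tabu[moved] = it + tenure
        if score > best_score:  # aspiration: remember the global best
            best, best_score = current, score
    return best, best_score
```

A bias-aware `evaluate` might, for instance, integrate the data selected by the subset, train a classifier on it, and return `min(di, 1/di)` for the classifier's disparate impact `di`, so that values nearer 1.0 (fairer) score higher. Because Tabu search accepts the best non-tabu neighbour even when it worsens the score, it can escape local optima in the space of mapping combinations.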
Index Terms
- Fairness-aware Data Integration