research-article

Fairness-aware Data Integration

Published: 23 November 2022

Abstract

Machine learning can be applied in applications that make decisions that impact people’s lives. Such techniques have the potential to make decision making more objective, but there is also a risk that the decisions discriminate against certain groups as a result of bias in the underlying data. Reducing bias, or promoting fairness, has been a focus of significant investigation in machine learning, for example, by pre-processing the training data, changing the learning algorithm, or post-processing the results of the learning. However, prior to these activities, data integration discovers and integrates the data that is used for training, and data integration processes have the potential to produce data that leads to biased conclusions. In this article, we propose an approach that generates schema mappings in ways that take into account: (i) properties that are intrinsic to mapping results that may give rise to bias in analyses; and (ii) bias observed in classifiers trained on the results of different sets of mappings. The approach explores a space of different ways of integrating the data, using a Tabu search algorithm, guided by bias-aware objective functions that represent different types of bias. The resulting approach is evaluated using the Adult Census and German Credit datasets to explore the extent to which, and the circumstances in which, it can increase the fairness of the results of the data integration process.
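The search described in the abstract can be illustrated with a minimal sketch. This is not the article's implementation: the representation of a mapping result as a list of `(is_protected, is_positive)` rows, the `label_rate_gap` objective, the swap neighbourhood, and the `tenure` parameter are all assumptions made for illustration. The sketch covers only an objective computed directly on the integrated rows, in the spirit of item (i); an objective based on a trained classifier, as in item (ii), could be passed in as `objective` instead.

```python
def label_rate_gap(rows):
    """Absolute gap in positive-label rate between protected and other rows.

    Hypothetical bias measure: each row is an (is_protected, is_positive) pair.
    """
    def rate(group):
        labels = [pos for prot, pos in rows if prot == group]
        return sum(labels) / len(labels) if labels else 0.0
    return abs(rate(True) - rate(False))


def tabu_search(candidates, k, objective, iterations=50, tenure=3):
    """Select k candidate mappings whose combined rows minimise `objective`."""
    def score(selection):
        rows = [row for i in selection for row in candidates[i]]
        return objective(rows)

    current = frozenset(range(k))        # deterministic starting selection
    best, best_score = current, score(current)
    tabu = {}                            # mapping index -> last step at which it is tabu

    for step in range(iterations):
        neighbours = []
        for out in current:
            for inn in range(len(candidates)):
                if inn in current:
                    continue
                selection = (current - {out}) | {inn}   # swap one mapping
                s = score(selection)
                is_tabu = tabu.get(inn, -1) >= step
                # Aspiration: accept a tabu move only if it beats the best so far.
                if not is_tabu or s < best_score:
                    neighbours.append((s, sorted(selection), out, selection))
        if not neighbours:
            break
        # Move to the best admissible neighbour, even if it is worse than current.
        s, _, out, current = min(neighbours, key=lambda n: (n[0], n[1]))
        tabu[out] = step + tenure        # forbid re-adding the removed mapping for a while
        if s < best_score:
            best, best_score = current, s
    return sorted(best), best_score


# Toy data (invented for illustration): four candidate mapping results.
candidates = [
    [(True, False), (True, False), (False, True), (False, True)],
    [(True, True), (False, True)],
    [(True, True), (True, False), (False, False), (False, True)],
    [(True, False), (False, False)],
]
start_gap = label_rate_gap(candidates[0] + candidates[1])
selection, gap = tabu_search(candidates, k=2, objective=label_rate_gap)
```

On this toy input the starting selection (mappings 0 and 1) has a positive gap, and the search moves to a pair of mappings whose combined rows have equal positive-label rates in both groups. Keeping the objective as a plain callable means different notions of bias can be swapped in without changing the search loop.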



Published in

Journal of Data and Information Quality, Volume 14, Issue 4
December 2022
173 pages
ISSN: 1936-1955
EISSN: 1936-1963
DOI: 10.1145/3563905

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 23 November 2022
• Online AM: 5 July 2022
• Accepted: 17 February 2022
• Revised: 13 January 2022
• Received: 6 July 2021

Qualifiers

• research-article
• Refereed