E-FAIR-DB: Functional Dependencies to Discover Data Bias and Enhance Data Equity

Published: 23 November 2022


Abstract

Decisions based on algorithms and systems generated from data have become essential tools that pervade all aspects of our daily lives; for these advances to be reliable, the results should be accurate but should also respect all the facets of data equity [11]. In this context, the concepts of Fairness and Diversity have become relevant topics of discussion within the field of Data Science Ethics and, in general, in Data Science. Although data equity is desirable, reconciling this property with accurate decision-making is a critical tradeoff, because applying a repair procedure to restore equity might modify the original data in such a way that the final decision is inaccurate w.r.t. the ultimate objective of the analysis. In this work, we propose E-FAIR-DB, a novel solution that, exploiting the notion of Functional Dependency—a type of data constraint—aims at restoring data equity by discovering and solving discrimination in datasets. The proposed solution is implemented as a pipeline that, first, mines functional dependencies to detect and evaluate fairness and diversity in the input dataset, and then, based on these understandings and on the objective of the data analysis, mitigates data bias, minimizing the number of modifications. Our tool can identify, through the mined dependencies, the attributes of the database that encompass discrimination (e.g., gender, ethnicity, or religion); then, based on these dependencies, it determines the smallest amount of data that must be added and/or removed to mitigate such bias. We evaluate our proposal both through theoretical considerations and experiments on two real-world datasets.


1 INTRODUCTION

In the past decades, computer scientists have attempted to make decision-making systems fair; however, after some research on the topic of ethics, the scope of fairness is nowadays often considered too narrow. This is confirmed by Chouldechova’s Impossibility Theorem [5], stating that, given a dataset, it is not possible to guarantee that more than two definitions of fairness are satisfied simultaneously.

As a result, a discussion about broader concepts such as data equity is taking place [11]: Whereas equality, focused on equality of opportunity, aims to achieve fairness through equal treatment regardless of need [11], equity as a social concept promotes fairness by treating people differently depending on their endowments and needs, focusing on equality of outcome. Although data equity is desirable, reconciling this property with accurate decision-making is a critical tradeoff; indeed, how can we perform an accurate prediction task if our dataset is altered by a repair procedure?

Data equity has four different facets: representation, feature, access, and outcome [11]. Here, we deal with the first two. Representation issues arise when there is a significant difference between the information contained in the dataset and the world that the data should represent, for example, when more negative information is collected about people belonging to certain groups w.r.t. that collected for more advantaged groups; therefore, to guarantee representation equity in a dataset, it may be useful to ensure its diversity. Feature equity refers to the availability of all the features needed to represent the members of every group and to perform the desired analyses, e.g., information on the race of each group member; feature equity is thus connected with the possibility to control fairness. Let us define the two main data ethics concepts we face in this article:

  • Fairness: the absence of any prejudice or favoritism toward an individual or a group based on their inherent or acquired characteristics [14, p. 100]. There are already many formal definitions that try to express this concept [17].

  • Diversity: a general term used to capture the quality of a collection of items, or of a composite item, with regards to the variety of its constituent elements [7, p. 1]. Diversity ensures that different kinds of objects are represented in the data.

In this work, we present E-FAIR-DB (enhance data Equity by FunctionAl dependencIes to discoveR Data Bias), a novel solution that exploits the notion of Approximate Conditional Functional Dependency (ACFD), a type of constraint based on the relationship between two sets of attributes of a data table, formally defined in the next section. We use ACFDs to detect biases and discover discrimination in the datasets subject to analysis, by recognizing cases where the value of a protected attribute (e.g., sex, ethnicity, or religion) frequently determines the value of another one (such as the range of the proposed salary or social status).

Note that we foresee two different applications for our framework. Indeed, given a dataset, our system may be used to discover and repair bias, or to verify whether the data science model learned from that dataset would be biased with respect to the specific application at hand. Indeed, a discovered bias does not necessarily constitute a problem for the specific application: For example, the system might highlight that women are discriminated against with respect to salary, while the object of the investigation is related to age and education level, and therefore no repair action is needed with respect to gender bias.

In this article, we expand our previous work [6] by making the following contributions:

  • we deepen the study of the procedure to detect historical data bias by discovering ACFDs that show unfairness;

  • we present a method to visualize diversity, showing various kinds of distribution of discriminated and privileged groups;

  • we show with an example the evaluation of a model extracted from a biased dataset, reasoning on the fairness issues previously discovered in the dataset from which the model had been generated;

  • we propose an ACFD-Repair method to mitigate data bias and improve data equity;

  • we make an experimental comparison between E-FAIR-DB and other state-of-the-art systems.

The article is organized as follows: Section 2 contains the related work, while Sections 3, 4, 5, 6, and 7 detail the methodology, Section 8 presents the comparison with other works, and Section 9 concludes the article.


2 RELATED WORK

As said in the Introduction, most of the research carried out so far in the area of Data Science Ethics addresses the concept of fairness rather than the concept of data equity.

A recent work on data equity is [2], in which the system assesses diversity (more specifically coverage) of a given dataset over multiple categorical attributes. Based on new efficient techniques to identify the regions of the attribute space not adequately covered by data, the system can determine the least amount of data that must be added to solve this lack of diversity.

Instead, to enforce fairness in a data analysis application there are three possible approaches: (i) preprocessing techniques, i.e., procedures that, before the application of a prediction algorithm, make sure that the learning data are fair; (ii) inprocessing techniques, i.e., procedures that ensure that, during the learning phase, the algorithm does not pick up the bias present in the data; and (iii) postprocessing techniques, i.e., procedures that correct the algorithm’s decisions with the scope of making them fair.

In the machine learning context, the majority of works that try to enforce fairness are related to a prediction task, and more specifically to classification algorithms [1, 3, 16]. One of these works is AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias [3], an open-source framework whose aim is to reach algorithmic fairness for classifiers. It tries to mitigate data bias, quantified using different statistical measures, by exploiting pre-processing, in-processing, and post-processing techniques. The user can choose among four pre-processing techniques and five statistical measures to solve bias in the dataset.

Another interesting work on fairness that also focuses on diversity is Nutritional Labels for Data and Models by Stoyanovich and Howe [15]. The authors developed an interpretability and transparency tool based on the concept of Nutritional Labels, drawing an analogy to the food industry, where simple and standardized labels convey information about the ingredients and the production processes. Nutritional labels are derived semi-automatically on the basis of Ranking Facts, a collection of visual widgets. Specifically, the Fairness widget quantifies whether the ranked output exhibits statistical parity with respect to one or more protected attributes, and the Diversity one displays the distribution for each category.

Last, but not least, a preprocessing technique to discover and solve discrimination is presented by Pedreschi et al. [10, 13]. The process, on the basis of discrimination measures used in the legal literature, can identify potentially discriminatory itemsets by discovering association rules. Furthermore, the authors propose a set of sanitization methods: Given a discriminatory rule, it is sanitized by modifying the itemset distribution to prevent discrimination. For each discrimination measure, they propose a method to achieve a fair dataset by introducing a reasonable (controlled) pattern distortion. In Section 8, we compare E-FAIR-DB with its main competitors, showing the differences and similarities.


3 METHODOLOGY AND PRELIMINARIES

3.1 A Bird’s Eye View of the Methodology

The aim of the E-FAIR-DB system is to detect bias in datasets with the help of Approximate Conditional Functional Dependencies and mitigate this bias by enhancing diversity: Figure 1 gives a general overview of the framework. Here, we present the main phases, while the next sections contain in-depth descriptions of each specific step.

Fig. 1. E-FAIR-DB flow.

Starting from the input dataset, the system first performs an investigation phase: It applies the Data Bias discovery procedure [6] (Section 4), which produces a list of dependencies that show (if present) the discriminated and privileged groups of the dataset and also highlights its level of diversity (Section 5).

However, it is also possible that the bias discovered up to this point is not relevant with respect to the analysis that the user wants to carry out. Therefore, the user has to make a decision. We envisage the possibility for the user to train a model on the original dataset and then, if possible, evaluate it on the basis of the type of analysis to be carried out and of the fairness issues detected during the Data Bias discovery step (an example is given in Section 6). At this point, whether or not the model evaluation phase has been applied, the user can decide to use the ACFD-Repair procedure (Section 7), aimed at mitigating the initial bias on the basis of the previously found ACFDs. This step can be repeated until the user is satisfied and decides to save the cleaned dataset.

3.2 Preliminary Notions

We now introduce some fundamental notions that will accompany us along our discussion.

A Functional Dependency \( FD: X \rightarrow Y \) is a type of database integrity constraint that holds between two sets X and Y of attributes in a relation of a database. It specifies that the values of the attributes of X uniquely (or functionally) determine the values of the attributes of Y. In an FD, X is called the antecedent, or left-hand side (LHS), while Y is called the consequent, or right-hand side (RHS).

The constraints that Functional Dependencies impose are often too strict for real-world datasets, since they must hold for all the values of the attribute sets X and Y. For this reason, researchers have begun to study generalizations of FDs, called Relaxed Functional Dependencies [4], which relax one or more constraints of canonical FDs.

In particular, a Conditional Functional Dependency (CFD) is a pair \( \left(X \rightarrow Y, t_p \right) \), where X and Y are sets of attributes, \( X \rightarrow Y \) is a standard functional dependency, and \( t_p \) is a pattern tuple over the attributes in X and Y; for each A in \( X \cup Y \), \( t_p[A] \) is a constant “a” in dom(A), or an unnamed variable “_”. With this type of dependency, we can spot specific concrete patterns in the dataset, and thus, we are able to analyze behaviors in correspondence to precise values. Approximate Conditional Functional Dependencies (ACFDs) are uncertain CFDs that hold only on a subset of the tuples. Given a dataset D, the support of a CFD \( \left(X \rightarrow Y, t_p \right) \) is defined as the proportion of tuples t in the dataset D that contain \( t_p \), that is: \( \begin{equation*} {\it Support}(X \rightarrow Y, t_p) = \frac{|\lbrace t \in D : t_p \subseteq t\rbrace |}{|D|}. \end{equation*} \) Confidence instead indicates how often the CFD has been found to be true. Let \( t_p = (x \cup y) \), where x is a tuple over X and y is a tuple over Y. The confidence value of a CFD \( \left(X \rightarrow Y, t_p \right) \) is the proportion of the tuples t containing x that also contain y: \( \begin{equation*} {\it Confidence}(X \rightarrow Y, t_p) = \frac{|\lbrace t \in D : t_p \subseteq t\rbrace |}{|\lbrace t \in D : x \subseteq t\rbrace |}. \end{equation*} \) A protected attribute is a characteristic for which non-discrimination should be established, such as religion, race, sex, and so on [17]. Finally, the target variable is the feature of a dataset about which the user wants to gain a deeper understanding, for example, the income, or a boolean label that indicates whether a loan is authorized or not, and so on.
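To make these definitions concrete, the following Python sketch computes support and confidence on a relational table; it assumes the dataset is loaded in a pandas DataFrame and that a (constant) pattern tuple is represented as a dictionary mapping attribute names to values (the adult_df variable in the final comment is hypothetical).

```python
import pandas as pd

def matches(df: pd.DataFrame, pattern: dict) -> pd.Series:
    """Boolean mask of the tuples that contain the constant pattern tuple."""
    mask = pd.Series(True, index=df.index)
    for attr, value in pattern.items():
        mask &= df[attr] == value
    return mask

def support(df: pd.DataFrame, lhs: dict, rhs: dict) -> float:
    """Proportion of tuples matching the whole pattern t_p = lhs U rhs."""
    return matches(df, {**lhs, **rhs}).sum() / len(df)

def confidence(df: pd.DataFrame, lhs: dict, rhs: dict) -> float:
    """Among the tuples matching the LHS pattern, the proportion that also match the RHS."""
    lhs_count = matches(df, lhs).sum()
    return matches(df, {**lhs, **rhs}).sum() / lhs_count if lhs_count else 0.0

# Hypothetical usage, e.g., for (Sex = "Female", Workclass = "Private") -> Income = "<=50K":
# confidence(adult_df, {"Sex": "Female", "Workclass": "Private"}, {"Income": "<=50K"})
```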


4 DATA BIAS DISCOVERY

This section presents the details of the Data Bias discovery procedure; Figure 2 shows its workflow.

Fig. 2. Steps of the Data Bias Discovery procedure.

We give a detailed description of these phases with the help of two running examples based on the U.S. Census Adult dataset and the Titanic dataset.

Example 1

(Dataset Description).

For a part of our experiments, we have used the U.S. Census Adult Dataset1 [8], containing information about many social factors of U.S. adults such as “Income,” “Age,” “Workclass,” “Education,” “Education-Num” (i.e., the number of years already attended at school), “Marital-Status,” “Race,” “Sex,” (work) “Hours-Per-Week,” “Native-Country,” and some more. The total number of tuples in the dataset is 30,169.

4.1 Data Preparation and Exploration

In this primary phase, we import the data, perform (if needed) data integration, and apply the typical data preprocessing steps (e.g., handling missing values, applying discretization) needed to clean and prepare the data. During this phase, we might also visualize the attribute features using different Data Visualization techniques.

Example 2

(Data Preparation and Exploration).

Before being able to gather useful insights from the running example, we have to perform some preprocessing operations such as Data Cleaning, Feature Selection, and Discretization. First, we noticed that the majority of the missing values belong to attributes that are not relevant for our analysis (e.g., “Marital-Status”), and therefore decided to perform feature selection first and then remove the few tuples that still contain missing values. Regarding feature selection, we also removed those columns that, though possibly interesting, were related to another one expressing the same meaning; e.g., we noticed that two of the columns, “Education” and “Education-Num,” actually represent the same information, since the latter can be obtained through a numerical encoding of the former. For the attributes “Race” and “Native-Country,” although only apparently correlated, we decided to keep both; indeed, for example, the offspring of migrants have the same race as their parents but different native countries, because they may be born in the U.S. To extract Functional Dependencies that do not depend on specific values of a continuous or rational attribute, it is often useful to group the values into appropriately defined bins. In particular, we created five bins for the attribute “Hours-Per-Week”: “0–20,” “21–40,” “41–60,” “61–80,” “81–100,” and five bins for “Age”: “15–30,” “31–45,” “46–60,” “61–75,” “76–100.” We concluded with the choice of the protected attributes, “Sex,” “Race,” and “Native-Country” (for brevity “NC”), and with the selection of “Income” as the target variable. To have readable and more effective Functional Dependencies, we keep the attribute “Race” as it is in the original dataset and group the values of the attribute “NC” into four different values: “NC-US,” “NC-Hispanic,” “NC-Non-US-Hispanic,” and “NC-Asian-Pacific.” Table 1 contains examples of the tuples after this phase.

Table 1.
Age-Range | Workclass | Education | Race | Sex | Hours-Per-Week | NC | Income
75–100 | Private | HS-college | White | Female | 21–40 | NC-US | \( \le \)50K
75–100 | Private | HS-college | White | Female | 0–20 | NC-US | \( \le \)50K
45–60 | Private | HS-college | Black | Female | 21–40 | NC-US | \( \le \)50K

Table 1. First Three Tuples of the U.S. Census Adult Dataset after Data Preparation Phase
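For illustration, the discretization of Example 2 can be reproduced with pandas; this is a minimal sketch in which adult_df is a hypothetical DataFrame holding the U.S. Census Adult data with the column names used in Table 1, and the bin edges are chosen to match the ranges listed above.

```python
import pandas as pd

# Bin edges chosen to reproduce the ranges of Example 2 (right edges inclusive).
hours_bins  = [0, 20, 40, 60, 80, 100]
hours_names = ["0-20", "21-40", "41-60", "61-80", "81-100"]
age_bins    = [14, 30, 45, 60, 75, 100]
age_names   = ["15-30", "31-45", "46-60", "61-75", "76-100"]

adult_df["Hours-Per-Week"] = pd.cut(adult_df["Hours-Per-Week"],
                                    bins=hours_bins, labels=hours_names)
adult_df["Age-Range"] = pd.cut(adult_df["Age"], bins=age_bins, labels=age_names)
```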

4.2 ACFD Discovery and Filtering

In this phase, we apply an ACFD Discovery algorithm [12] to extract Approximate Conditional Functional Dependencies from the dataset. Given an instance D of a schema R, support threshold \( \gamma \), confidence threshold \( \epsilon \), and maximum antecedent size \( \tau \), the approximate CFDs discovery problem is to find all ACFDs \( \phi \) : \( (X \rightarrow Y, t_p) \) over R with:

  • support\( \left(\phi ,D \right) \ge \gamma \)

  • confidence\( \left(\phi ,D \right) \ge \epsilon \)

  • \( |X| \le \tau \).

The ACFDs obtained are in this form: \( \begin{equation*} (lhsAttr_1 = value_1, \ldots , lhsAttr_N = value_N) \rightarrow (rhsAttr = value). \end{equation*} \)

The algorithm returns all the dependencies that satisfy the aforementioned criteria. We suggest tuning these parameters according to the research scope under consideration: If the user wants to discover discriminations that affect minorities, then she should lower the values of confidence and support; if, instead, she is interested in larger groups, then these values should be increased.

Example 3

(ACFD Discovery).

Since we are interested in studying groups that represent at least 3% of the people contained in the dataset, we set the minimum support = 0.03, while a minimum confidence = 0.8 finds rules that are valid in at least 80% of the cases. Finally, considering the number of protected attributes and the length of the dataset, we set the maximum antecedent size = 2. The algorithm, applied to the dataset resulting from the previous phase, finds 118 ACFDs. Table 2 reports a few of them. We can easily detect the ACFDs that contain variables; for example, dependency \( \phi _{3} \): (Education, Income = “\( \le \)50K”) \( \rightarrow \) Native-Country does not specify the values of the attributes “Education” and “Native-Country.” As the user can notice, there are also dependencies that should be removed from the list, because they do not contain any protected attribute.

Table 2.
(Education = “Middle-school”) \( \rightarrow \) Income = “\( \le \)50K”
(Age-Range = “15–30”) \( \rightarrow \) Income = “\( \le \)50K”
(Education, Income = “\( \le \)50K”) \( \rightarrow \) Native-Country
(Native-Country = “NC-Hispanic”) \( \rightarrow \) Income = “\( \le \)50K”
(Income = “\( \le \)50K”) \( \rightarrow \) Native-Country = “NC-US”

Table 2. ACFD Discovery Output

Now, we filter the dependencies (a minimal filtering sketch follows the list), discarding the ones that do not satisfy the following constraints:

  • all the attributes of the dependency must be assigned a value, i.e., we are only interested in Constant ACFDs [9];

  • at least one protected attribute and the target variable must be present inside the dependency, so the ACFDs might indicate some bias.
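A minimal sketch of this filtering step, under the assumption that each discovered ACFD is represented as a pair of dictionaries (LHS pattern, RHS pattern) with the placeholder "_" for unbound attributes, and that discovered_acfds is the list returned by the discovery algorithm:

```python
PROTECTED = {"Sex", "Race", "Native-Country"}
TARGET = "Income"

def is_constant(acfd) -> bool:
    """Constant ACFD: every attribute of the pattern tuple is bound to a value."""
    lhs, rhs = acfd
    return all(v != "_" for v in {**lhs, **rhs}.values())

def is_candidate(acfd) -> bool:
    """Keep ACFDs that are constant, mention at least one protected attribute,
    and involve the target variable (on either side of the dependency)."""
    lhs, rhs = acfd
    attrs = set(lhs) | set(rhs)
    return is_constant(acfd) and len(attrs & PROTECTED) > 0 and TARGET in attrs

filtered = [phi for phi in discovered_acfds if is_candidate(phi)]
```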

Example 4

(ACFD Filtering).

From Table 2, we discarded the third dependency for not being a Constant ACFD and the first two dependencies for not containing any protected attribute. After this phase, we are left with 84 of the original 118 dependencies.

4.3 ACFD Selection

This phase is responsible for finding the dependencies that actually reveal unfairness in the dataset: In fact, even if the ACFDs identified in the previous step contain protected attributes, not all of them necessarily show some unethical behavior. To do so, we have devised two unfairness measures:

  • Difference: It indicates how “unethical” a dependency is. The higher this metric, the stronger the alert for an unfair behavior.

  • Protected Attribute Difference (P-Difference): It indicates how much the dependency shows bias with respect to a specific protected attribute.

To assess the unfair behavior of a dependency, we also take into consideration its support, which indicates the pervasiveness of the ACFD; unethical dependencies with high support will impact many tuples and thus will be more important.

For each dependency \( \phi \), we define the Difference metric of \( \phi \) as the difference between the dependency confidence and the confidence of the dependency computed without the protected attributes of the LHS of the ACFD. Given a dependency in the form \( \phi : (X \rightarrow Y, t_p) \), let \( Z = (X - \lbrace ProtectedAttributes\rbrace) \), that is, the LHS of the dependency without its protected attributes, and \( \phi ^{\prime }: (Z \rightarrow Y, t_p) \). We define the Difference as: \( \begin{equation*} {\it Difference}(\phi) = {\it Confidence}(\phi) - {\it Confidence}(\phi ^{\prime }). \end{equation*} \) The Difference metric gives us an idea of how much the values of the protected attributes influence the value of Y.

Example 5

(Difference Score of a Dependency).

Let us analyze the following dependency: \( \begin{equation*} \phi _1: {\it (Sex = ``Female,\!\hbox{''} Workclass = ``Private\hbox{''})} \rightarrow {\it Income} = ``\!\le \!50K\hbox{''} \end{equation*} \) Considering “Sex” as the protected attribute in \( \phi _1 \), we can compute the Difference as: \( {\it Diff}(\phi _1)= Conf(\phi _1) - {\it Conf}(\phi _1^{\prime }) \), where \( \begin{equation*} \phi _1^{\prime }: {\it (Workclass = ``Private\hbox{''})} \rightarrow {\it Income = ``\!\le \!50K\hbox{''}.} \end{equation*} \) Three different behaviors can emerge:

  • If the Difference is close to zero, then we can presume that fairness is respected, since it means that females are treated equally to all the elements of the population that have the same characteristics (without specifying the protected attribute).

  • If the Difference is positive, then it means that women who work in the private sector are more likely to earn less than 50K dollars/year than the generality of people who work in the private sector, i.e., they are overall treated worse.

  • If the Difference is negative, then the opposite situation with respect to the previous point is detected.

A dependency could contain more than one protected attribute in its LHS at the same time. For this reason, we introduce the last metric: the P-Difference, which focuses on one protected attribute P at a time. It is computed as the Difference where \( Z = X - P \), i.e., excluding only the protected attribute P from the LHS of the dependency. The P-Difference, \( {\it P-Diff}(\phi , P) \) for short, gives precise information about the level of unfairness associated with the value of P. In fact, when an ACFD has more than one protected attribute, the user could be interested in understanding which attribute values are most correlated with the discriminatory behavior.
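Both metrics can be computed by reusing the confidence helper sketched in Section 3.2; the fragment below is one possible implementation (the example call in the comment uses the hypothetical adult_df DataFrame).

```python
def difference(df, lhs, rhs, protected_attrs):
    """Difference(phi) = Confidence(phi) - Confidence(phi'),
    where phi' drops all protected attributes from the LHS."""
    reduced_lhs = {a: v for a, v in lhs.items() if a not in protected_attrs}
    return confidence(df, lhs, rhs) - confidence(df, reduced_lhs, rhs)

def p_difference(df, lhs, rhs, protected_attr):
    """P-Difference: like Difference, but removing only the single protected attribute P."""
    return difference(df, lhs, rhs, {protected_attr})

# phi_1: (Sex = "Female", Race = "White") -> Income = "<=50K"
# p_difference(adult_df, {"Sex": "Female", "Race": "White"}, {"Income": "<=50K"}, "Sex")
```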

Finally, in this phase, we choose the ACFDs whose Difference is above the minThreshold \( \delta \), which means that there is a significant inequality between the groups involved in the ACFD and the general behavior of the population.

Example 6

(P-Difference Analysis).

Table 3 reports three possible dependencies with their corresponding P-Difference metrics. All the dependencies have a positive Difference, so they are all potentially discriminatory. However, from a closer examination of \( \phi _1 \), we see that the positive value of the Sex-Difference indicates that people that are “Female” and “White” are discriminated on the “Sex” aspect and not on “Race,” since the Race-Difference is negative. Instead, \( \phi _2 \) has both metrics positive, thus people that are “Female” and “Black” are discriminated both on the “Sex” and the “Race” aspects. Finally, \( \phi _3 \) has a negative Sex-Difference, while its Race-Difference is positive, indicating that this group is discriminated on the “Race” aspect. To conclude, all the rules present a discriminatory behavior, but with the P-Difference the user can understand which attributes influence the bias the most.

Table 3.
ACFD | Diff | Sex-Diff | Race-Diff
\( \phi _1 \): (Sex = “Female,” Race = “White”) \( \rightarrow \) Income = “\( \le \)50K” | 0.126 | 0.141 | \( - \)0.009
\( \phi _2 \): (Sex = “Female,” Race = “Black”) \( \rightarrow \) Income = “\( \le \)50K” | 0.188 | 0.069 | 0.053
\( \phi _3 \): (Sex = “Male,” Race = “Black”) \( \rightarrow \) Income = “\( \le \)50K” | 0.051 | \( - \)0.068 | 0.116

Table 3. A Few of the Selected ACFDs with the Corresponding P-Difference Metric

4.4 ACFD Ranking

In a real-world dataset, the number of ACFDs selected in the previous step could be very large, even in the order of thousands; therefore, for the user to look at all these dependencies would be a very demanding task. Thus, it is necessary to order the dependencies according to some criterion, enabling the user to analyze the most important and interesting ones first, speeding up the process and reducing the cost.

In our framework, the user can sort the dependencies according to one of the following criteria (a small sorting sketch follows the list):

  • Support-based: The support indicates the proportion of tuples impacted by the dependency: the higher the support, the more tuples are involved in the ACFD. Ordering dependencies by support highlights the pervasiveness of the dependency.

  • Difference-based: Since this criterion highlights the dependencies where the values of the protected attributes influence most the value of their RHS, this ordering privileges the unethical aspect of the dependencies.

  • Mean-based: This method tries to combine both aspects: the unethical perspective and its pervasiveness. Sorting the ACFDs using this criterion results in positioning first the dependencies that have the best tradeoff between difference and support.
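A small sorting sketch follows; it assumes that each selected ACFD is a dictionary carrying its support and Difference, and that the mean-based criterion is the arithmetic mean of the two values (an assumption, since the exact combination is not spelled out above).

```python
def rank(acfds, criterion="difference"):
    """Order the selected ACFDs so that the most relevant ones come first."""
    keys = {
        "support":    lambda phi: phi["support"],
        "difference": lambda phi: phi["difference"],
        # assumed mean-based tradeoff: arithmetic mean of support and Difference
        "mean":       lambda phi: (phi["support"] + phi["difference"]) / 2,
    }
    return sorted(acfds, key=keys[criterion], reverse=True)
```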

Example 7

(Selected and Ordered ACFDs).

Table 4 reports 3 of the 17 dependencies that satisfied the selection criteria (minimum threshold \( \delta = 0.07 \)) along with their relevant metrics, sorted by their Difference. From the example, “Hispanic,” “Female,” and “Black” groups suffer from discrimination with respect to the rest of the population: People that belong to one or more of these groups have an income that is below 50,000 dollars/year because of their nationality, sex, or race.

Table 4.
ACFD | Support | Difference
(Native-Country = “NC-Hispanic”) \( \rightarrow \) Income = “\( \le \)50K” | 0.044 | 0.158
(Sex = “Female”) \( \rightarrow \) Income = “\( \le \)50K” | 0.287 | 0.135
(Race = “Black”) \( \rightarrow \) Income = “\( \le \)50K” | 0.081 | 0.119

Table 4. A Few of the ACFDs Selected Using the Difference Metric and Ordered with the Difference-based Criterion

4.5 ACFD User Selection and Scoring

In this last phase, the user selects, from the ranked list, N dependencies that are interesting for her research needs. In fact, some dependencies might not be relevant to the user's research scope or might not be discriminatory.

After that, using only the N selected ACFDs, the system computes a set of scores that summarize the properties of the entire dataset:

  • Cumulative Support: the percentage of tuples in the dataset involved in some selected ACFDs. The closer this value is to 1, the more tuples are impacted by unfair dependencies.

  • Difference Mean: the mean of all the “Difference” scores of the selected ACFDs. It indicates how unethical the dataset is according to the selected dependencies. The greater the value, the higher the bias in the dataset.

  • Protected Attribute Difference Mean: for each protected attribute P, we report the mean of its P-Difference over all the selected ACFDs. It indicates how unethical the dataset is over P according to the selected dependencies, underlining the bias w.r.t. P.

These summarizing metrics entirely depend on the specific ACFDs selected by the user: By selecting different sets of dependencies, the user can highlight different aspects of the dataset.
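As an illustration, the three scores can be derived from the user-selected ACFDs with the helpers sketched earlier; in this hypothetical fragment each selected dependency is a (LHS, RHS, Difference) triple, and the P-Difference Mean is averaged over all selected ACFDs.

```python
import pandas as pd

def dataset_scores(df, selected, protected_attrs):
    """Summarize the dataset w.r.t. the user-selected ACFDs.
    `selected` is a list of (lhs, rhs, difference) triples; `matches` and
    `p_difference` are the helpers sketched in the previous sections."""
    involved = pd.Series(False, index=df.index)
    for lhs, rhs, _ in selected:
        involved |= matches(df, {**lhs, **rhs})
    cumulative_support = involved.sum() / len(df)

    difference_mean = sum(diff for _, _, diff in selected) / len(selected)

    p_difference_means = {
        p: sum(p_difference(df, lhs, rhs, p) for lhs, rhs, _ in selected) / len(selected)
        for p in protected_attrs
    }
    return cumulative_support, difference_mean, p_difference_means
```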

Note also that the framework allows the user to see some exemplar tuples impacted by the selected ACFDs.

Example 8

(ACFD User Selection and Scoring).

The user chooses 8 ACFDs that are interesting according to her needs among the dependencies obtained after the ranking step. The number of tuples involved in these ACFDs is 10,755, while the total number of tuples in the dataset is 30,169; this results in a Cumulative Support of 0.36. The Difference Mean is 0.13. These two scores indicate that a considerable number of tuples, 35%, show a behavior that is rather different, on average 13%, from the fair one. Finally, the P-Difference Mean metrics confirm that the dataset is unfair with respect to all the protected attributes; the most discriminated groups are: “Female,” “Black,” “NC-Hispanic,” and “Amer-Indian-Eskimo.” Table 5 reports a few possible user-selected ACFDs.

Table 5.
ACFD | Support | Difference
(Sex = “Female”) \( \rightarrow \) Income = “\( \le \)50K” | 0.287 | 0.135
(Race = “Black”) \( \rightarrow \) Income = “\( \le \)50K” | 0.081 | 0.119
(Race = “Amer-Indian-Eskimo”) \( \rightarrow \) Income = “\( \le \)50K” | 0.008 | 0.129
(Native-Country = “NC-Hispanic”) \( \rightarrow \) Income = “\( \le \)50K” | 0.044 | 0.158

Table 5. A Few User-selected Dependencies from the U.S. Census Adult Dataset

4.6 Data Bias Discovery in the Titanic Dataset

In this section, we present the Data Bias discovery procedure applied to the Titanic dataset.2 This dataset contains information about the passengers on the Titanic, a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of 15 April, 1912, after striking an iceberg during her maiden voyage from Southampton to New York City.

The version that we considered contains 891 samples and 12 attributes, 8 of which are categorical and 4 are numerical. The target variable “Survived” is a binary categorical variable. The other attributes are: “Class”: a categorical variable representing whether the passenger embarked in first class (1), second class (2), or third class (3); “Sex”; “SibSp”: a numerical variable representing the number of siblings and, possibly, the spouse who embarked with the passenger; “Parch”: a numerical variable representing the number of parents and children who embarked with the passenger; “Fare”: a numerical variable representing the price of the ticket owned by the passenger; “Port”: a categorical variable representing the port of embarkation of the passenger.

Data Preparation and Exploration: After Data Acquisition, since there is only one source, we do not need to perform Data Integration; we instead apply basic Data Cleaning techniques. We then perform Data Discretization on the numerical attribute “Fare,” grouping its values into 5 bins: “0–8,” “9–20,” “21–40,” “41–80,” “81–500.” Table 6 reports the first three tuples of the prepared dataset.

Table 6.
Survived | Class | Sex | SibSp | Parch | Fare | Port
0 | 3 | Male | 1 | 0 | 0–8 | S
1 | 1 | Female | 1 | 0 | 41–80 | C
1 | 3 | Female | 0 | 0 | 0–8 | S

Table 6. First Three Tuples of the Titanic Dataset after Data Preparation Phase

ACFD Discovery and Filtering: In this stage, we establish the protected attributes, “Class” and “Sex,” and apply the ACFD Discovery algorithm with minimum support = 0.03, minimum confidence = 0.8, and maximum antecedent size = 2, which returns 46 ACFDs. Then, we discard all the dependencies that are not constant or that do not contain any protected attribute, obtaining 36 ACFDs.

ACFD Selection and Ordering: In this phase, the algorithm selects the dependencies that actually reveal unfairness in the dataset. We set the minimum threshold \( \delta \) equal to 0.1; compared to the previous example, we noticed that the ACFDs identify larger groups, so we decided to slightly increase this parameter. Table 7 presents the first three ACFDs in an ordered list of dependencies that satisfy the aforementioned criteria.

Table 7.
ACFD | Support | Difference
(Sex = “Female,” Class = 1) \( \rightarrow \) Survived = 1 | 0.103 | 0.579
(Sex = “Female,” Class = 2) \( \rightarrow \) Survived = 1 | 0.089 | 0.532
(Fare = “0–8,” Sex = “Female”) \( \rightarrow \) Survived = 1 | 0.038 | 0.461

Table 7. A Few Dependencies from the Titanic Dataset Ordered Using the Difference-based Criterion

User Selection and Scoring: The user, among the dependencies obtained after the ranking step, chooses 7 ACFDs that are interesting according to her needs. As a result, the system returns the values of the summarizing metrics: Cumulative Support equal to 0.86 and Difference Mean equal to 0.35. These two scores indicate that a considerable number of tuples, precisely 86%, show discrimination problems. Finally, the P-Difference Mean metrics (Sex-Difference = 0.29, Class-Difference = 0.12) confirm that the dataset is unfair with respect to all the protected attributes; specifically, the most discriminated groups are “Male” and “Third-Class” passengers. Table 8 reports a few user-selected ACFDs.

Table 8.
ACFD | Support | Difference
(Sex = “Female,” Class = 1) \( \rightarrow \) Survived = 1 | 0.103 | 0.579
(Survived = 0, Sex = “Female”) \( \rightarrow \) Class = 3 | 0.082 | 0.332
(Sex = “Male”) \( \rightarrow \) Survived = 0 | 0.518 | 0.197

Table 8. A Few User-selected Dependencies from the Titanic Dataset


5 DIVERSITY

Diversity is a fundamental concept in Data Science Ethics and thus closely related to fairness. Recently, researchers have recognized that in data science projects it is not enough for the training data to be representative: They also have to include enough items from less popular “categories” to ensure correct learning of the model and enhance data equity [11]. As a result, Diversity is a critical aspect of systems based on data, not only from an ethical perspective (identifying a possible risk of group exclusion), but also from a data quality perspective: Ensuring diversity makes algorithms more accurate and complete.

Determining the conditions under which fairness leads to diversity, and those under which diversity leads to fairness, is an open research question that needs further investigation. Besides fairness, E-FAIR-DB studies the diversity of a dataset; specifically, starting from the previously discovered dependencies, it studies the variety of the elements in the privileged and discriminated groups (Figure 3). Different applications may pose different diversity objectives: Some of them study diversity in selection tasks (e.g., ranking) [15], while others consider it in recommendation tasks. In this section, we analyze the diversity aspect as an additional facet of data equity with respect to fairness. The list of dependencies extracted by E-FAIR-DB in the Data Bias discovery phase also helps us control diversity. Indeed, the set of dependencies can be divided into two subsets based on the value of the target variable: the set P, corresponding to the positive outcome (e.g., income \( \gt \)50K), and the set N, corresponding to the negative outcome (e.g., income \( \le \)50K). Each of these two sets can be studied with respect to diversity, observing how diverse each subset is with respect to the protected attributes. Specifically, for each protected attribute a, we plot three pie charts (a sketch of their computation follows the list below), presenting:

Fig. 3. Diversity flow.

  • the distribution of the values of a with respect to the global dataset;

  • for each value of a, the percentage of tuples affected by the ACFDs in P;

  • for each value of a, the percentage of tuples affected by the ACFDs in N.
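One possible implementation of these three charts, reusing the matches helper of Section 3.2 and assuming that each ACFD is a (LHS, RHS) pair of pattern dictionaries:

```python
import pandas as pd
import matplotlib.pyplot as plt

def diversity_charts(df, attr, acfds_P, acfds_N):
    """Pie charts for a protected attribute: its distribution in the whole
    dataset and among the tuples affected by the ACFDs in P and in N."""
    def affected(acfds):
        mask = pd.Series(False, index=df.index)
        for lhs, rhs in acfds:
            mask |= matches(df, {**lhs, **rhs})
        return df.loc[mask, attr].value_counts()

    counts = [df[attr].value_counts(), affected(acfds_P), affected(acfds_N)]
    titles = ["(a) total dataset", "(b) P", "(c) N"]
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, c, title in zip(axes, counts, titles):
        ax.pie(c.values, labels=c.index, autopct="%1.1f%%")
        ax.set_title(title)
    plt.show()
```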

We now apply the diversity study to the U.S. Census and Titanic datasets.

5.1 Diversity in the U.S. Census Adult Dataset

Considering the user-selected ACFDs computed for the U.S. Census dataset found in Table 5, we report in Table 9 some examples of ACFDs belonging to the set P, i.e., the ones that regard only people that earn more than 50K dollars/year; similarly, Table 10 contains some ACFDs belonging to N. Figures 4, 5, and 6 present the diversity evaluation for each of the three protected attributes: “Sex,” “Native-Country,” and “Race.” In Figure 4, the first pie chart reports the distribution of “Sex” in the original dataset, showing that two-thirds of the people in the dataset are men. The second pie chart shows the percentage of tuples affected by the ACFDs that interest privileged groups (the P set); this chart is entirely composed of males, because no dependencies in P involve females. The third pie chart shows the composition of the group of people affected by the ACFDs in N, i.e., people that earn less than 50K dollars/year; the majority of them (88.9%) are females and only the remaining 11.1% are males. Ideally, to ensure diversity, the distribution of the values in the three pie charts should be very similar (i.e., charts (b) and (c) should have about 70% men and 30% women). In this example, as can be seen in Figure 4, the protected attributes do not conform to this property.

Fig. 4. Protected attribute “Sex” distribution: (a) in the total dataset, (b) in \( P, \) and (c) in N.

Fig. 5. Protected attribute “Race” distribution: (a) in the total dataset, (b) in \( P, \) and (c) in N.

Fig. 6. Protected attribute “Native-Country” distribution: (a) in the total dataset, (b) in \( P, \) and (c) in N.

Table 9.
\( \phi _1: \) (Sex = “Male”) \( \rightarrow \) Income = “\( \gt \)50K”
\( \phi _2: \) (Sex = “Male,” Race = “White”) \( \rightarrow \) Income = “\( \gt \)50K”
\( \phi _3: \) (Sex = “Male,” Race = “Asian-Pac-Islander”) \( \rightarrow \) Income = “\( \gt \)50K”

Table 9. Examples of ACFDs in P from the U.S. Adult Census Dataset

Table 10.
\( \phi _4: \) (Sex = “Female”) \( \rightarrow \) Income = “\( \le \)50K”
\( \phi _5: \) (Sex = “Female,” Race = “Black”) \( \rightarrow \) Income = “\( \le \)50K”
\( \phi _6: \) (Native-Country = “NC-Hispanic”) \( \rightarrow \) Income = “\( \le \)50K”

Table 10. Examples of ACFDs in N from the U.S. Adult Census Dataset

In Figure 5, the pie chart (a) reports the distribution of the “Race” attribute: The majority of people are “White” and there are four minorities represented by “Asian-Pac-Islander,” “Black,” “Amer-Indian-Eskimo,” and “Other.”3 The pie chart (b) shows us that the privileged groups are mainly “White” and “Asian” people: No dependencies in P involve “Black” or “Amer-Indian-Eskimo” people, which instead compose the majority of the plot showing discrimination (c).

Figure 6 presents the diversity evaluation, analyzing the “Native-Country” protected attribute. Since the “NC-US,” “NC-Asian-Pacific,” and “NC-Non-US-Hispanic” groups are composed of a variety of people of different origins, coming both from rich and poor countries, they are represented both in P and in N; however, the “NC-Hispanic” group, composed of people coming from developing countries, is present only in pie chart (c), meaning that, in general, the “NC-Hispanic” group has a low income. Finally, even though the groups are well represented, this is not a guarantee of diversity; in fact, even if people coming from the U.S. are well distributed, jointly inspecting Figure 6 and Figure 4 makes it evident that plot (b) contains only men while plot (c) contains mainly women; thus, overall, we can conclude that the dataset does not respect diversity.

These pie charts not only illustrate and help the user better understand the previous results, but also highlight the different representation of the protected categories by showing the low diversity of the favored and discriminated groups.

5.2 Diversity in the Titanic Dataset

Given the list of ACFDs previously computed from the Titanic dataset, we divide the output of the Data Bias discovery procedure into two sets: Table 11, which contains some examples of ACFDs in P (which in this case study regards the people that survived the wreckage), and Table 12, which contains a few ACFDs in N (i.e., people that did not survive the wreckage).

Table 11.
\( \phi _1: \) (Survived = 1, Class = 1) \( \rightarrow \) Sex = “Male”
\( \phi _2: \) (Sex = “Female,” Class = 1) \( \rightarrow \) Survived = 1
\( \phi _3: \) (Sex = “Female,” Class = 2) \( \rightarrow \) Survived = 1

Table 11. Examples of ACFDs in P from the Titanic Dataset

Table 12.
\( \phi _4: \) (Sex = “Male”) \( \rightarrow \) Survived = 0
\( \phi _5: \) (Survived = 0, Sex = “Female”) \( \rightarrow \) Class = 3
\( \phi _6: \) (Sex = “Male,” Class = 3) \( \rightarrow \) Survived = 0

Table 12. Examples of ACFDs in N from the Titanic Dataset

As in the previous example, Figures 7 and 8 present the study for each protected attribute: “Sex” and “Class.” Figure 7 compares the distribution of females and males in the Titanic dataset (pie chart (a)) with the favored groups (pie chart (b)) and the discriminated groups (pie chart (c)). The variety of the tuples affected by the dependencies in P and N is completely unbalanced: in proportion, there is a large majority of females in pie chart (b) and, vice versa, 86.3% of pie chart (c) are males. This is an evident example of a majority group (men) that is discriminated and underrepresented with respect to a minority one (women).

Fig. 7. Protected attribute “Sex” distribution: (a) in the total dataset, (b) in \( P, \) and (c) in N.

Fig. 8. Protected attribute “Class” distribution: (a) in the total dataset, (b) in \( P, \) and (c) in N.

Figure 8 presents the diversity analysis of the “Class” protected attribute. First-class passengers represent the largest group in pie chart (b), while they are the smallest group in (c), confirming the results obtained in the previous phase; however, in pie chart (c), the third-class passengers compose almost 70% of the tuples, while originally they represented only 56% of the passengers. Also in this case the pie charts show different distributions; this is particularly evident in Figure 7 and stems from the fact that “Sex” is the most important factor in determining whether a person survived or not; as a result, the dataset seems more diverse on the “Class” attribute than on the “Sex” attribute.

To conclude, the Titanic dataset also lacks an adequate representation of individuals: the distribution is skewed and it is evident that the groups are treated differently; thus, diversity and data equity should be enhanced here as well.


6 AN EXAMPLE OF MODEL EVALUATION: THE TITANIC DATASET

At this point, a possible work pattern consists in cleaning the dataset with the ACFD-Repair system, so the algorithmic decision procedure works on cleaned data: We deal with this in Section 7.

However, it may happen that the data unfairness need not be corrected because it does not affect the learned model; therefore, it might be useful, if possible, to evaluate the model before deciding to clean the dataset.

To see an example of evaluation of the learned model, we decided to use the Titanic dataset,4 because, if compared with the U.S. Census dataset, it has a smaller number of tuples and the protected attributes have smaller domains, and thus the results are more easily interpretable. Specifically, the most relevant features for the outcome of the learned model will be compared with the unfair attributes previously discovered, to understand whether the model complies, or not, with the ethical standards relevant for the application.

Consider a user that employs the Titanic dataset to train a model that should support the decision on whether or not, during a shipwreck, a passenger should be saved. In this specific case, we opt for a classifier based on Logistic Regression and focus on Binary Classification, where the model should predict the value of the target variable as one of two possible classes (e.g., 1 or 0): in our case, whether a passenger should survive or not.

In Logistic Regression, the importance of each feature is given by the coefficient that the model associates with each feature during the learning phase: These coefficients indicate how much each attribute value impacts the final output. To evaluate the model w.r.t. bias, it is therefore sufficient to compare the feature coefficients with the understandings derived from the Data Bias discovery phase, to check whether the important attributes of the model are also the ones that present unfair behaviors, and therefore to establish whether the learned model is fair or not.
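The sketch below shows one way to obtain such coefficients with scikit-learn; titanic_df is a hypothetical DataFrame containing the prepared Titanic data, and one-hot encoding is used so that every attribute value receives its own coefficient, comparable with the plot in Figure 9.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One-hot encode the categorical features so that each attribute value gets
# its own coefficient (columns such as "Sex_Male" or "Class_3").
features = titanic_df.drop(columns=["Survived"]).copy()
features["Class"] = features["Class"].astype(str)  # treat the class as categorical
X = pd.get_dummies(features)
y = titanic_df["Survived"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Negative coefficients push the prediction toward "not survived", so they can
# be compared with the unfair ACFDs found in the Data Bias discovery phase.
coefficients = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefficients)
```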

Figure 9 displays the coefficients assigned to each value of the protected attributes during the learning process of the logistic regression trained on the Titanic dataset.

Fig. 9. Titanic feature importance.

As we can see from the plot, the discriminated groups discovered in the previous phase (males and people who embarked with a third-class ticket) present negative coefficients. At this point, the user, looking at the plot and comparing it with the coefficients assigned to the other features, can decide whether this model is coherent with her specific goal or whether a repair phase is in order. In this regard, a negative coefficient assigned by the logistic regression to a value of an attribute means a higher probability that a passenger sharing that characteristic (value) does not survive: As a result, the model will discriminate against third-class passengers and males.

We already discovered that the Titanic dataset is discriminatory against males and people embarked in the third class, and the above control confirms that the model has learned to be unfair. As a result, to obtain a fair decision procedure, the dataset should be repaired.


7 ACFD-REPAIR: SOLVE BIAS AND ENHANCE EQUITY

We are by now rather convinced that algorithmic decision systems should work on cleaned data to obtain fair results, since learning from historical data might also mean learning traditional prejudices that are endemic in society, producing unethical decisions. Data equity problems may bring significant social and business impacts, and for this reason a data-bias discovery phase should, most of the time, be followed by a data repair step to solve unfairness and improve diversity.

However, as already mentioned, ensuring data equity is strictly related to the final goal of the data science application at hand: For example, in a study aiming at predicting the onset of a tumor, eliminating a gender bias should not be an admissible choice, since the gender information, rather than denoting a bias, might carry insights that are useful for the research, and in this case modifying the dataset would be a mistake.

The final objective of the repair phase is to create a fair dataset where the unethical behaviors identified in the discovery phase are greatly mitigated or completely removed. The procedure has to conform to a fundamental requirement, i.e., since the repaired dataset has to be used later on in a data science pipeline, the number of modifications, consisting of deletion or addition of tuples, has to be minimized to protect the original distributions of values.

Note also that, given the variety of decision algorithms, the repair procedure must work also for attributes with non-binary values, as in the cases of “Native-Country” and “Race.”

Figure 10 presents the ACFD-Repair methodology: We now give a detailed description of each phase using the U.S. Census Adult dataset as running example.

Fig. 10. ACFD-Repair methodology.

7.1 Unfair Tuple Count Computation

We propose to try to reduce, or eliminate, unfairness, starting from the list of ACFDs computed in the Data Bias discovery step. The repair is performed by bringing the value of the Difference of each user-selected dependency below the minimum threshold \( \delta \); this can be achieved in two ways:

  • Remove the tuples matching the dependency so, after removal, the Difference value of the dependency will become lower than \( \delta \);

  • Add the tuples that, combined with the elements matching the dependency, will lower its initial Difference value below \( \delta \).

The Unfair Tuple Count computation step is therefore responsible for finding, for each dependency, the number of tuples that have to be added or removed. For each ACFD \( \phi \), we define its Unfair Tuple Count as F(\( \phi \)) = (a,b), where a represents the number of tuples that should be added, and b the number of tuples that should be removed, to repair the discriminatory behavior shown by the dependency.

As already seen in the diversity study, the selected ACFDs could belong to the set P (positive) or to the set N (negative); in the repair phase, we only focus on the negative ones, because solving them also implies solving the positive ones. For example, if we have two ACFDs \( \phi _1: \) (Sex = “Female”) \( \rightarrow \) Income = “\( \le \)50K” and \( \phi _2: \) (Sex = “Male”) \( \rightarrow \) Income = “\( \gt \)50K,” then we do not need to solve both \( \phi _1 \) and \( \phi _2 \): Solving the discrimination expressed by \( \phi _1 \) automatically solves \( \phi _2 \), because males are no longer privileged. Therefore, we will analyze only negative ACFDs in the repair procedure.

To explain how the Unfair Tuple Count value of each dependency is computed, we make use of the dependency: \( \phi _1: \) (Sex = “Female,” Workclass = “Private”) \( \rightarrow \) Income = “\( \le \)50K.” The Difference of \( \phi _1 \) can be computed as (Section 4.3): \( {\it Diff}(\phi _1)= Conf(\phi _1) - {\it Conf}(\phi _1^{\prime }) \), where \( \phi _1^{\prime } \): (Workclass = “Private”) \( \rightarrow \) (Income = “\( \le \)50K”). Rewriting the confidence as a ratio of supports, we get the following formula: \( \begin{equation*} {\it Diff}(\phi _1)= \frac{Sup(\phi _1)}{Sup(LHS(\phi _1))} - \frac{Sup(\phi _1^{\prime })}{Sup(LHS(\phi _1^{\prime }))} = \frac{a}{b} - \frac{c}{d}. \end{equation*} \)

As already mentioned, to repair \( \phi _1 \), we can either add or remove tuples: Let us first focus on the former. To remove the unfair behavior highlighted by \( \phi _1: \) (Sex = “Female,” Workclass = “Private”) \( \rightarrow \) Income = “\( \le \)50K,” intuitively, we can either add males that earn less than 50K dollars/year and work in the private sector or add females that earn more than 50K dollars/year and work in the private sector. Therefore, we have to add to the dataset tuples that satisfy the following two dependencies:

  • \( \phi _2: \) (Sex = “Male,” Workclass = “Private”) \( \rightarrow \) Income = “\( \le \)50K”

  • \( \phi _3: \) (Sex = “Female,” Workclass = “Private”) \( \rightarrow \) Income = “\( \gt \)50K”

The two dependencies above define the tuples that, if added to the initial dataset, will lower the initial Difference value below the threshold \( \delta \); \( \phi _2 \) and \( \phi _3 \), being the “antagonists” of \( \phi _1 \), form the Opposite Set (OS) of \( \phi _1 \). Specifically, \( \phi _2 \) has been obtained by changing, one at a time, the values of the protected attributes: We dub the set of such ACFDs Protected attribute Opposite Set (POS). In the example, we have only one, binary protected attribute, therefore the POS will contain only the dependency \( \phi _2 \); in case the protected attribute is non-binary, or there is more than one protected attribute, the POS would include more than one dependency. Instead, \( \phi _3 \) has been obtained by reversing the value of the target class. This ACFD is in the Target attribute Opposite Set (TOS), which, since the target attribute is required to be binary, will always contain only one dependency.
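A minimal sketch of the construction of the Opposite Set, assuming that an ACFD is a pair of pattern dictionaries, that the domains of the protected attributes are known, and that the target attribute is binary:

```python
def opposite_sets(lhs, rhs, protected_attrs, domains, target_values):
    """POS: for each protected attribute in the LHS, replace its value with every
    other value of its domain (one new ACFD per alternative value).
    TOS: flip the (binary) target value in the RHS; always a single ACFD."""
    target, value = next(iter(rhs.items()))

    pos = []
    for p in set(lhs) & set(protected_attrs):
        for other in domains[p]:
            if other != lhs[p]:
                pos.append(({**lhs, p: other}, dict(rhs)))

    other_value = [v for v in target_values if v != value][0]
    tos = [(dict(lhs), {target: other_value})]
    return pos, tos

# phi_1: (Sex = "Female", Workclass = "Private") -> Income = "<=50K"
# opposite_sets({"Sex": "Female", "Workclass": "Private"}, {"Income": "<=50K"},
#               {"Sex"}, {"Sex": ["Female", "Male"]}, ["<=50K", ">50K"])
```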

We now show the computation of the number of tuples to be added for both types of dependencies belonging to the OS. We start by analyzing \( \phi _2 \), i.e., the dependency in the POS, obtained by changing the value of the protected attribute. Adding k tuples that satisfy \( \phi _2 \) will impact the Difference of \( \phi _1 \) in the following way: \( \begin{equation*} {\it Diff}(\phi _1)= \frac{a}{b} - \frac{c+k}{d+k}. \end{equation*} \) Now, to repair \( \phi _1 \), we have to impose that \( {\it Diff}(\phi _1) \) be lower than \( \delta \) and find the number of tuples that need to be added by solving for k, that is: \( \begin{equation*} k \gt \frac{ad - \delta bd -cb}{b - a + \delta b}. \end{equation*} \) Since the POS might be composed of many dependencies, and k being the total number of tuples to be added, the number of tuples that should be added for each dependency in the POS is determined by a partition of k; this number might simply be equal to \( \frac{k}{|POS|} \) or might be proportional to the original distribution of the attribute values.

We continue by analyzing \( \phi _3 \), i.e., the dependency in the TOS, obtained by reversing the value of the target attribute. Adding k tuples that satisfy \( \phi _3 \) will impact the Difference of \( \phi _1 \) in the following way: \( \begin{equation*} {\it Diff}(\phi _1)= \frac{a}{b+k} - \frac{c}{d+k}. \end{equation*} \) Now, to repair \( \phi _1 \), we have to impose that \( {\it Diff}(\phi _1) \) be lower than \( \delta \) and, similarly to what we have done for the POS, find the number of tuples that need be added by solving for k.

In case we have multiple discriminated minorities, adding tuples that satisfy the dependencies contained in the POS could result in enhancing discrimination. If we consider \( \phi : \) (Race = “Black”) \( \rightarrow \) Income = “\( \le \)50K,” the POS contains the ACFDs that involve all the other values of the protected attribute “Race”: “White,” “Asian-Pac-Islander,” “Other,” and “Amer-Indian-Eskimo”; therefore, we would add tuples referring to “Amer-Indian-Eskimo” people that earn less than 50K dollars/year to the dataset, thus worsening the discrimination they already suffer.

Finally, given what we have just presented and the fact that we want to repair the dataset by improving the condition of discriminated groups, we decide to add, for each negative ACFD, only the tuples that satisfy the dependency contained in its TOS.

Now that we have shown how to find the number of tuples to be added to repair a dependency, we show how to find the number of tuples that need to be removed to accomplish the same task. To repair an ACFD by removing tuples, we simply need to remove tuples that satisfy the dependency. To compute the number of tuples to be deleted from the dataset, we start by noticing that removing from the dataset k tuples that satisfy \( \phi _1 \) will modify the Difference of \( \phi _1 \) in the following way: \( \begin{equation*} {\it Diff}(\phi _1)= \frac{a-k}{b-k} - \frac{c-k}{d-k}. \end{equation*} \) Now, to repair \( \phi _1 \), we have to impose that \( {\it Diff}(\phi _1) \) be lower than \( \delta \) and, similarly to what we have done in the previous two cases, find the number of tuples that should be removed by solving for k.
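Since all three repair actions only change the four counts a, b, c, and d, the Unfair Tuple Count can be obtained with one search routine; the sketch below simply increases k until the Difference drops below \( \delta \), a numeric alternative to the closed-form solution given above for the POS case (here a, b, c, d are absolute counts, since the |D| factor cancels in the ratios).

```python
def tuples_needed(a, b, c, d, delta, update, max_k=10**6):
    """Smallest k bringing the Difference below delta after a repair action.
    a/b = Confidence(phi), c/d = Confidence(phi'); `update` describes how the
    four counts change when k tuples are added or removed."""
    for k in range(max_k):
        na, nb, nc, nd = update(a, b, c, d, k)
        if min(na, nb, nc, nd) >= 0 and nb > 0 and nd > 0 and (na / nb - nc / nd) < delta:
            return k
    raise ValueError("no repair found within max_k tuples")

# The three repair actions of Section 7.1:
add_pos_tuples = lambda a, b, c, d, k: (a, b, c + k, d + k)          # add tuples matching the POS
add_tos_tuples = lambda a, b, c, d, k: (a, b + k, c, d + k)          # add tuples matching the TOS
remove_tuples  = lambda a, b, c, d, k: (a - k, b - k, c - k, d - k)  # remove tuples matching phi

# e.g., tuples to add for the TOS of phi_1:
# tuples_needed(a, b, c, d, delta=0.07, update=add_tos_tuples)
```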

Example 9

(Unfair Tuple Count Computation).

For the repair phase, we continue using the U.S. Census Dataset. Table 5 reports some examples of ACFDs in N that can be selected to repair the dataset. The user should select the ACFDs according to her research aims and should also include the ones that regard the discriminated minorities identified in the diversity study. In fact, to enhance data equity, we need representation equity [11]. Table 13 reports, for each selected ACFD, the number of tuples that should be added according to the TOS computation, or removed to compensate the Difference value.

Table 13.
ACFD | Tuples to add | Tuples to remove
(Sex = “Female”) \( \rightarrow \) Income = “\( \le \)50K” | 1,141 | 4,883
(Race = “Black”) \( \rightarrow \) Income = “\( \le \)50K” | 185 | 852
(Race = “Amer-Indian-Eskimo”) \( \rightarrow \) Income = “\( \le \)50K” | 21 | 96
(Native-Country = “NC-Hispanic”) \( \rightarrow \) Income = “\( \le \)50K” | 162 | 734

Table 13. A Few Examples of the Selected ACFDs with the Corresponding Numbers of Tuples to Add or Remove in the U.S. Census Adult Dataset

7.2 Greedy Hit-count Algorithm

A basic solution to repair the dataset could simply consist in adding, for each negative ACFD, the tuples that satisfy its TOS, using, for each tuple, its Unfair Tuple Count as multiplicity. Unfortunately, this is not a valid option, because it would result in adding too many tuples to the dataset, violating the key requirement of limiting the modifications to the original dataset.

To add tuples to the initial dataset in an optimized way, we use a modified version of the Greedy Hit-count algorithm presented in Reference [2]. From each dependency in a TOS, we generate a pattern, which is an array whose dimension is equal to the number of attributes in the dataset and where each cell represents the value of a specific attribute. Given a dataset D with d categorical attributes and an ACFD \( \phi \), a pattern P is a vector of size d generated from \( \phi \), where \( P[k] \) is either X (meaning that its value is unspecified) or the value of the corresponding attribute in \( \phi \).
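As an illustration of this definition, a minimal sketch of pattern generation follows; the dict representation of an ACFD is an assumption, while the attribute order matches Table 14:

```python
UNSPECIFIED = "X"

def acfd_to_pattern(acfd, attributes):
    """Build a pattern from an ACFD given as a dict of attribute -> value pairs
    (LHS and RHS together); attributes not mentioned stay unspecified ('X')."""
    return [acfd.get(attr, UNSPECIFIED) for attr in attributes]

# TOS of (Sex = "Female") -> Income = "<=50K" in the Census example.
attributes = ["Workclass", "Race", "Sex", "Hours-Per-Week", "NC", "Age-Range", "Education", "Income"]
print(acfd_to_pattern({"Sex": "Female", "Income": ">50K"}, attributes))
# ['X', 'X', 'Female', 'X', 'X', 'X', 'X', '>50K']
```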

The pattern computation step is useful for two reasons: (i) to transform an ACFD into a tuple and decide the value of the free attributes (i.e., the non-instantiated attributes in the dependency); and (ii) to add tuples to the dataset in an optimized way, exploiting the Greedy Hit-count algorithm.

The Greedy Hit-count algorithm takes as input the set of patterns computed at the previous step and returns the set of tuples needed to repair the dataset. The algorithm operates by covering each pattern received as input: Covering a pattern means finding a combination of attribute values that matches it. The idea behind this algorithm is that a value combination (i.e., a generated tuple) can cover multiple patterns simultaneously, allowing us to add fewer tuples than the basic solution, and thus minimizing the changes in the final, repaired dataset. Using a data tree structure that contains all the possible combinations of attribute values and a table of indexes recording whether the ith pattern has been covered or not, at every iteration the process selects the value combination that hits the maximum number of uncovered patterns. The algorithm stops when all the patterns are covered, returning the selected value combinations. To summarize, given the set of uncovered patterns as input, this step finds a minimal set of tuples to repair the dataset, along with the indication of which patterns each tuple covers.

Finally, to decide how many copies of each tuple we insert in the dataset, we take the average of the Unfair Tuple Count of the patterns covered by that tuple.
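For illustration, here is a compact Python sketch of the greedy covering idea and of the multiplicity computation; it is a simplification, not the authors' implementation (the tree structure of Reference [2] is replaced by an exhaustive enumeration, and the domains, patterns, and counts are hypothetical):

```python
from itertools import product

def matches(combo, pattern):
    """A value combination covers a pattern if it agrees with every specified cell."""
    return all(p == "X" or v == p for v, p in zip(combo, pattern))

def greedy_hit_count(domains, patterns, counts):
    """Greedily cover the patterns: at each step pick the value combination that
    hits the most uncovered patterns; its multiplicity is the (rounded) average
    Unfair Tuple Count of the patterns it covers."""
    uncovered = set(range(len(patterns)))
    repair_tuples = []
    while uncovered:
        best, hit = None, set()
        for combo in product(*domains):          # exhaustive enumeration, for clarity
            h = {i for i in uncovered if matches(combo, patterns[i])}
            if len(h) > len(hit):
                best, hit = combo, h
        if not hit:                              # remaining patterns cannot be covered
            break
        multiplicity = round(sum(counts[i] for i in hit) / len(hit))
        repair_tuples.append((best, multiplicity))
        uncovered -= hit
    return repair_tuples

# Toy usage: one combination covers both patterns, with averaged multiplicity.
domains = [["Private", "Self-emp"], ["Black", "White"], ["Female", "Male"], [">50K", "<=50K"]]
patterns = [["X", "X", "Female", ">50K"], ["X", "Black", "X", ">50K"]]
print(greedy_hit_count(domains, patterns, counts=[1141, 185]))
# [(('Private', 'Black', 'Female', '>50K'), 663)]
```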

Note that, while the Greedy Hit-count algorithm is used in Reference [2] to enhance coverage, we use it not only to solve discrimination but also to increase diversity: In fact, the identification of the smallest set of tuples to resolve unfairness also covers the regions of data that are under-represented and, by adding such tuples, we increase the quality of representation, mitigating also the lack of diversity.

The current implementation of the algorithm does not prevent the generation of unrealistic tuples (e.g., combining “Sex”=“Male” with “Pregnant”=“Yes”), since there is no constraint on the value combinations. We plan, in a future work, to improve the system with the addition of a validation oracle [2] to identify and prevent the insertion of inconsistent tuples.

Example 10

(Greedy Hit-count Algorithm).

For each ACFD in Table 13, the system determines the TOS; then, from these new ACFDs, it generates the corresponding patterns (reported in Table 14). Note that, in the patterns, the target attribute value is changed to compensate the Difference of the original ACFDs. The computed set of patterns is the input to the Greedy Hit-count algorithm. Table 15 reports the output, composed of 2 tuples, with their cardinalities, that should be added to the dataset. The first tuple derives from 6 uncovered patterns and the second tuple from 2 uncovered patterns. The 2 tuples generated by the algorithm have no inconsistencies: They involve the discriminated minorities, improving their status, and, at the same time, enhance the representation of the dataset. The final length of the dataset is 30,780 tuples.

Workclass | Race | Sex | Hours-Per-Week | NC | Age-Range | Education | Income
X | X | Female | X | X | X | X | >50K
X | Black | X | X | X | X | X | >50K
X | A-I-E | X | X | X | X | X | >50K
X | X | X | X | NC-Hisp | X | X | >50K

Table 14. Pattern Generation in the U.S. Census Adult Dataset

Tuple | Add
“Private,” “Black,” “Female,” “0–20,” “NC-Hispanic,” “75–100,” “HS-College,” “>50K” | 427
“Private,” “Amer-Indian-Eskimo,” “Female,” “0–20,” “NC-US,” “75–100,” “Assoc,” “>50K” | 184

Table 15. The Tuples with Their Cardinality Mined from the Algorithm to Repair the U.S. Census Adult Dataset

7.3 Correction Algorithm

The aim of the Greedy Hit-count algorithm phase was to enhance the fairness of the initial dataset by adding a very low number of external elements. Using a greedy approach, and setting a limit on the maximum number of tuples that can be added/removed, we prevent the overloading of the final dataset, but this might not be enough to repair the dataset completely. During this phase, we apply the Data Bias discovery procedure to the dataset obtained from the previous step to check whether any of the initial ACFDs is still present; if so, then we remove from the dataset the tuples matching the negative dependencies. To find the number of tuples to remove, for each negative ACFD still present in the dataset, we use b, the second value of the Unfair Tuple Count.
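A minimal pandas sketch of this step might look as follows, assuming each negative ACFD is represented as a dict of attribute-value pairs and b is its precomputed Unfair Tuple Count component; this is an illustration, not the system's actual code:

```python
import pandas as pd

def correction_step(df, negative_acfds, counts_b):
    """For each negative ACFD still mined from the halfway dataset, drop up to b
    tuples that match all of its attribute-value pairs (b is the second value
    of the Unfair Tuple Count)."""
    for acfd, b in zip(negative_acfds, counts_b):
        mask = pd.Series(True, index=df.index)
        for attr, value in acfd.items():
            mask &= df[attr] == value
        to_drop = df.loc[mask].index[:b]
        df = df.drop(index=to_drop)
    return df

# e.g., correction_step(halfway_df,
#                       [{"Sex": "Female", "Education": "Assoc", "Income": "<=50K"}],
#                       [1733])
```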

Example 11

(Correction Algorithm).

We apply the Data Bias Discovery procedure to the dataset obtained by adding tuples in the previous step (for brevity, the halfway dataset); only one previously selected ACFD, \( \phi : \) (Sex = “Female,” Education = “Assoc”) \( \rightarrow \) Income = “\( \le \)50K,” is still mined from the halfway dataset; however, its Difference metric is now slightly lower: from a value of 0.126, it has become 0.094. The user can decide, according to the seriousness of the discrimination expressed by that ACFD and to the final metrics of the repaired dataset, whether it is necessary to perform the Correction algorithm step to solve the ACFDs still present. If the user chooses to apply the Correction algorithm, then the system computes again, according to \( \delta \), the number of tuples to remove for each ACFD and generates the corresponding patterns that, converted into tuples, are removed from the dataset. In this case, to solve \( \phi \), we need to remove 1,733 tuples, obtaining a final dataset with 29,088 tuples.

7.4 Statistics Computation

We now present a set of metrics to analyze the quality of the repair procedure. Specifically, for each ACFD given as input and still present in the repaired dataset, we compute:

  • Cumulative Support;

  • Mean Difference;

  • Mean P-Difference computed for all the protected attributes;

  • Inequity score, defined as: \( \begin{equation*} \sum _{i=1}^{n} Sup(i) * {\it Diff}(i), \end{equation*} \) where n is the number of initial dependencies still present in the final dataset.
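As an illustration of the last metric, a one-line computation with hypothetical support/Difference pairs could be:

```python
def inequity(surviving_acfds):
    """Support-weighted sum of the Differences of the initial dependencies
    still present in the final dataset, given as (support, difference) pairs."""
    return sum(sup * diff for sup, diff in surviving_acfds)

print(inequity([(0.12, 0.094), (0.05, 0.060)]))   # ~0.01428
```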

Moreover, to make sure that the dataset can still be used for data science tasks, we compare the distribution of values of each attribute in the initial dataset with its counterpart in the repaired one.

Finally, we apply the Data Bias discovery procedure to the repaired dataset to inspect the ghost ACFDs, i.e., new ACFDs that were not mined from the original dataset and might appear after this phase because of the tuples added to compensate the Difference of the negative dependencies; for each ghost ACFD we compute difference, support, and confidence, to check whether the repair phase introduced any new unfair behavior.

Example 12

(Statistics Computation).

Using the final repaired version of the dataset, we perform once again the Data Bias discovery procedure; no mined ACFD is in common with the ones selected from the input dataset, so the dataset can be considered repaired. Figure 11 presents the main statistics for the original, halfway, and final datasets. All the metrics decrease: The cumulative support, which originally involved around 35% of the tuples, is 0 after the repair procedure; the Mean Difference goes from 0.13 to 0; and all the Mean P-Difference measures computed for each protected attribute are 0. We compare the value distribution of each attribute of the initial dataset with its counterpart in the repaired one (we do not report the plots for brevity): the distributions remain almost unchanged, apart from a slight increase in the number of females and private workers. Moreover, no unfair ghost ACFDs are discovered.

Fig. 11. Census statistics summary.

7.5 Titanic Dataset: ACFD-Repair

In this section, we illustrate the methodology of ACFD-Repair applied to the Titanic dataset.

Unfair Tuple Count computation: For each of the six dependencies selected by the user for the repair procedure, the algorithm computes the number of tuples that should be added to or removed from the dataset to compensate their Difference value. Table 16 contains three of these ACFDs with their respective Unfair Tuple Count computation.

ACFD | Tuples to add | Tuples to remove
\( \phi _1: \) (Sex = “Male”) \( \rightarrow \) Survived = 0 | 307 | 391
\( \phi _2: \) (Sex = “Male,” Class = 3) \( \rightarrow \) Survived = 0 | 166 | 245
\( \phi _3: \) (Sex = “Male,” Fare = “0–8”) \( \rightarrow \) Survived = 0 | 47 | 146

Table 16. A Few Examples of the Selected ACFDs with the Corresponding Values of Tuples in the Titanic Dataset

Greedy Hit-count algorithm: For each ACFD, the algorithm computes the opposite set over the target class (the TOS), and each new dependency is transformed into a pattern (Table 17).

Survived | Class | Sex | SibSp | Parch | Fare | Port
1 | X | Male | X | X | X | X
1 | 3 | Male | X | X | X | X
1 | 0 | Male | X | X | 0–8 | X

Table 17. Pattern Generation in the Titanic Dataset

In this phase, the procedure selects the minimum number of tuples that have to be added to the final dataset to cover the entire set of patterns. Specifically, the algorithm selects two tuples to cover the negative ACFDs, with multiplicities equal to 174 and 49, overall adding 223 tuples to the dataset (Table 18).

Tuple | Add
“1,” “3,” “Male,” “1,” “0,” “0–8,” “S” | 174
“1,” “2,” “Male,” “1,” “0,” “9–20,” “S” | 49

Table 18. The Tuples with Their Cardinality Mined from the Algorithm to Repair the Titanic Dataset

In a dataset containing 890 tuples, adding 223 tuples seems a big variation; however, as already observed, this dataset does not have a discriminated minority, because the discriminated group is actually the majority group (men). Consequently, to mitigate such bias, a large number of tuples needs to be added.

Correction algorithm: Since no initial dependency is still present in the repaired dataset, the correction procedure is skipped and we consider the dataset as the final one.

Statistics computation: Figure 12 presents the main statistics for the original and final datasets. The metrics have lower values: Originally, almost half of the tuples were involved in some unfair ACFD, while after the repair procedure none of the original ACFDs is mined from the final dataset. As a result, the Cumulative Support, the Mean Difference, the Mean P-Difference for each protected attribute P, and the Inequity score are equal to 0. Adding a considerable number of tuples to the initial dataset changed the distribution of the attribute values: In particular, the number of surviving males and, more generally, of passengers is much higher, showing that solving unfairness and enhancing diversity in such an unbalanced and biased dataset requires a substantial modification.

Fig. 12. Titanic statistics summary.


8 COMPARISON

We now compare the results obtained by E-FAIR-DB on the U.S. Census Adult Dataset with the ones obtained from the state-of-the-art systems presented in Section 2. We compare our framework on all the phases of the methodology: Data Bias discovery, Diversity study, and ACFD-repair. The competitor frameworks do not include all the phases in their pipeline (each system has only one or two of them); therefore, we do not perform a general comparison but focus on the specific phases, one at a time.

8.1 Ranking Facts

As already said in Section 2, Ranking Facts [15] is a collection of visual widgets that perform different tasks. In this section, we compare our Data Bias discovery phase and Diversity study with the Fairness and Diversity widget included in Ranking Facts. Overall, the results obtained by our framework are in complete accordance with the ones obtained by Ranking Facts.

The comparison proceeds as follows: We choose the numerical attributes “Age,” “Education-num,” and “Hours-per-week” to specify the ranking function; then, to perform the fairness check, the algorithm needs at least one binary attribute, so we choose the “Sex” attribute. The fairness measures analyze one binary attribute at a time, using a statistical test to verify group fairness under three different definitions: FA*IR, pairwise comparison, and proportion. A ranking is considered unfair with respect to the values of a binary attribute when the p-value of the corresponding statistical test falls below 0.05. To test Ranking Facts, we compute the fairness measure using the binary attribute “Sex”: The three measures indicate that group fairness is verified for males and not for females. This result on group fairness over the attribute “Sex” is also confirmed by our framework analysis. For the diversity check, we choose three categorical attributes: “Sex,” “Race,” and “Native-Country.” Ranking Facts confirms our results: Analyzing the diversity of “Sex,” males are predominant in the ranking; for “Race,” the diversity widget presents “White” and “Asian-Pac-Islander” as the main groups, while “Black” and “Amer-Indian-Eskimo” are under-represented; finally, for the “Native-Country” attribute, the “NC-White” group is predominant and the “NC-Hispanic” group is under-represented.

Our tool discovers ACFDs that can involve more than one group at a time and can report information about subgroups. For example, one rule can consider both women and the Black minority, yielding more insights about the bias present in the dataset. The Fairness widget of Ranking Facts can check fairness only on one binary attribute at a time; therefore, the results do not contain information about categorical, non-binary attributes. Moreover, to compute fairness, E-FAIR-DB, differently from Ranking Facts, does not rely on statistical tests or on a classifier previously computed on the dataset. Finally, Ranking Facts is not equipped with a repair procedure.

8.2 AI Fairness 360

The main difference between E-FAIR-DB and AI Fairness 360 [3] is that the latter framework needs a classifier at its core. We believe that this is a limitation of the system: Building a classifier to solve a fairness problem requires the policy to be application-oriented, greatly limiting the applicability of the system to scenarios where other tasks are needed.

The results obtained with AI Fairness 360 on the U.S. Census dataset are in complete accordance with the ones obtained by our framework. For the attribute “Sex,” four of the five statistical metrics detect bias: Males are identified as the privileged group and females as the discriminated one. Similarly, for the “Race” attribute, two of the five measures indicate “White” people as the privileged group.

The system checks the fairness property only for one binary attribute at a time (therefore, the attribute “Native-Country,” if not converted into binary format, cannot be analyzed), while our framework can deal with multiple categorical attributes of any cardinality. Moreover, AI Fairness 360 is not equipped with a Diversity step.

For the repair phase of AI Fairness 360, we applied the Reweighing pre-processing technique to the attributes “Sex” and “Race.” On the final dataset, for the “Sex” attribute, two of the five metrics still indicate bias for the previously discriminated groups, and the same happens for the “Race” attribute.

To summarize, we can observe that, in this example, the Reweighing technique can mitigate the discrimination present in the dataset but cannot completely remove the bias. Furthermore, solving fairness with this framework does not imply enforcing data equity, since it does not improve diversity.

8.3 Pedreschi et al.

The last comparison we present is with the system by Pedreschi et al. [10, 13], a preprocessing technique to discover and solve discrimination in datasets.

In the discovery phase, the system, exploiting the concept of classification rules and discrimination measures used in the legal literature, can identify potentially discriminatory itemsets.

The process uses classification rules, a specific type of association rules that constrain the target class to appear only in the RHS of the rule. This forces the potentially discriminatory itemset to appear only in the LHS of the discovered rules, while our approach does not impose this constraint on the ACFDs found. In fact, we can find interesting rules that have the target class in the LHS; one such example is the following dependency, mined from the Titanic dataset: \( \phi \) : (Survived = 0, Sex = “Female”) \( \rightarrow \) Class = 3.

To select the discriminatory itemsets, the authors propose measures that are similar to our Difference metric; however, we also provide the P-Difference, which works on a single protected attribute at a time, providing more insight into each ACFD. Moreover, this system does not involve user interaction; therefore, the user cannot discard the rules that are not interesting for the specific investigation or that involve attributes that make the rule fair. For example, in the U.S. Census dataset, not all the ACFDs that involve the attributes “Hours-Per-Week” or “Age-range” are discriminatory, such as \( \phi : \) (Hours-per-week = “0–20,” Sex = “Female”) \( \rightarrow \) Income = “\( \le \)50K.”

Finally, E-FAIR-DB provides as a last step a set of summary metrics that describe the overall degree of unfairness of the dataset, together with the diversity study. Other differences regard the repair procedure: The authors try to guarantee anti-discrimination and privacy, while we focus on solving unfairness and enhancing diversity. To solve discrimination and enforce privacy, Pedreschi et al. compensate the discrimination measures of the mined association rules: This technique is similar to ours, but they add itemsets rather than tuples to the dataset. To validate their repair procedure, they compare the accuracy of two models trained on the original and final datasets, whereas we check that the originally selected ACFDs can no longer be mined from the final dataset, compute statistics, and compare the attribute distributions to analyze the quality and the effectiveness of the repair procedure.


9 CONCLUSIONS AND FUTURE WORK

We presented E-FAIR-DB, a novel framework that, through the extraction of Approximate Conditional Functional Dependencies (ACFDs), makes it possible to discover and solve data bias, enhancing data equity while minimizing the modifications to the original dataset. We illustrated our methodology, tested the system on two real-world datasets, and compared our framework with its main competitors both theoretically and experimentally. The main benefit of using functional dependencies is that our system does not need a classifier to detect data bias and identify unfairness; in fact, we do not use statistical measures based on a classification model but a novel metric: the Difference. Furthermore, since our goal is data equity, we exploit the computed functional dependencies also to study other facets, such as diversity. Finally, the ACFDs are crucial for the repair phase, because they are used as a guide to mitigate the bias present in the dataset. The comparison with other similar systems highlighted that E-FAIR-DB provides very precise information about the groups treated unequally, and also that, unlike other existing tools, it studies two fundamental aspects of data equity, fairness and diversity, allowing users to mitigate data bias according to the resulting insights.

Future work will include: (i) the study of the other facets of equity, namely access and outcome equity [11]; (ii) the development of a validation oracle to identify and solve possible inconsistencies in the tuples added during the repair phase; (iii) the improvement and expansion of the diversity step; (iv) the development of a graphical user interface to facilitate the interaction of the user with the system; and (v) the study of other classes of functional dependencies that are possibly interesting for ethical purposes [4].

Footnotes

1. https://archive.ics.uci.edu/ml/datasets/Adult.
2. https://www.kaggle.com/c/titanic.
3. Note that the Census Bureau collects racial data in accordance with guidelines provided by the U.S. Office of Management and Budget, and these data are based on self-identification: People belonging to the “Other” minority see themselves as being of mixed origin, most commonly European and American Indian or European and African, so do not identify themselves based on ethnic or national origin (https://www.census.gov/topics/population/race/about.html).
4. One-hot-encoding was used to handle categorical attributes.

REFERENCES

[1] Adebayo Julius A. et al. 2016. FairML: ToolBox for Diagnosing Bias in Predictive Modeling. Ph.D. Dissertation. Massachusetts Institute of Technology.
[2] Asudeh A., Jin Z., and Jagadish H. V. 2019. Assessing and Remedying Coverage for a Given Dataset. arXiv:1810.06742.
[3] Bellamy Rachel K. E. et al. 2019. AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development. IBM.
[4] Caruccio Loredana, Deufemia Vincenzo, and Polese Giuseppe. 2015. Relaxed functional dependencies: A survey of approaches. IEEE Trans. Knowl. Data Eng. 28, 1 (2015), 147–165.
[5] Chouldechova Alexandra. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5, 2 (2017), 153–163.
[6] Azzalini Fabio, Criscuolo Chiara, and Tanca Letizia. 2021. FAIR-DB: FunctionAl DependencIes to discoveR data bias. In Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference.
[7] Drosou Marina, Jagadish H. V., Pitoura E., and Stoyanovich Julia. 2017. Diversity in big data: A review. Big Data 5, 2 (2017), 73–84.
[8] Dua Dheeru and Graff Casey. 2017. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml.
[9] Fan Wenfei, Geerts Floris, Li Jianzhong, and Xiong Ming. 2010. Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng. 23, 5 (2010), 683–698.
[10] Hajian Sara, Domingo-Ferrer Josep, Monreale Anna, Pedreschi Dino, and Giannotti Fosca. 2015. Discrimination- and privacy-aware patterns. Data Mining Knowl. Discov. 29, 6 (2015), 1733–1782.
[11] Jagadish H. V., Stoyanovich Julia, and Howe Bill. 2021. The many facets of data equity. In Proceedings of the Workshops of the EDBT/ICDT 2021 Joint Conference, Nicosia, Cyprus, March 23, 2021 (CEUR Workshop Proceedings, Vol. 2841), Costa Constantinos and Pitoura Evaggelia (Eds.). CEUR-WS.org. Retrieved from http://ceur-ws.org/Vol-2841/PIE+Q_6.pdf.
[12] Rammelaere Joeri and Geerts Floris. 2018. Revisiting conditional functional dependency discovery: Splitting the “C” from the “FD.” In Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
[13] Ruggieri Salvatore, Pedreschi Dino, and Turini Franco. 2010. Data mining for discrimination discovery. ACM Trans. Knowl. Discov. Data 4, 2 (2010), 1–40.
[14] Saxena Nripsuta Ani et al. 2020. How do fairness definitions fare? Testing public attitudes towards three algorithmic definitions of fairness in loan allocations. Artif. Intell. 283 (2020), 103238.
[15] Stoyanovich Julia and Howe Bill. 2019. Nutritional labels for data and models. IEEE Data Eng. Bull. 42, 3 (2019), 13–23. Retrieved from http://sites.computer.org/debull/A19sept/p13.pdf.
[16] Tramer Florian et al. 2017. FairTest: Discovering unwarranted associations in data-driven applications. In IEEE European Symposium on Security and Privacy.
[17] Verma Sahil and Rubin Julia. 2018. Fairness definitions explained. In Proceedings of the International Workshop on Software Fairness (FairWare@ICSE 2018), Brun Y., Johnson B., and Meliou A. (Eds.). ACM, 1–7.
