Making clusterings fairer by post-processing: algorithms, complexity results and experiments

Published in Data Mining and Knowledge Discovery

Abstract

While existing fairness work typically focuses on fair-by-design algorithms, here we consider making a fairness-unaware algorithm’s output fairer. Specifically, we explore the area of fairness in clustering by modifying clusterings produced by existing algorithms to make them fairer whilst retaining their quality. We formulate the minimal cluster modification for fairness (MCMF) problem, where the input is a given partitional clustering and the goal is to minimally change it so that the clustering is still of good quality but fairer. We show that for a single binary protected status variable, the problem is efficiently solvable (i.e., in the class P) by proving that the constraint matrix for an integer linear programming formulation is totally unimodular. Interestingly, we show that even for a single protected variable, the addition of simple pairwise guidance for clustering (to say ensure individual-level fairness) makes the MCMF problem computationally intractable (i.e., NP-hard). Experimental results using Twitter, Census and NYT data sets show that our methods can modify existing clusterings for data sets in excess of 100,000 instances within minutes on laptops and find clusterings that are as fair but are of higher quality than those produced by fair-by-design clustering algorithms. Finally, we explore a challenging practical problem of making a historical clustering (i.e., zipcodes clustered into California’s congressional districts) fairer using a new multi-faceted benchmark data set.

Notes

  1. For the convenience of readers, a short introduction to ILP formulations is provided in Sect. A.1 of the appendix.

  2. https://www.cs.cmu.edu/~mccallum/bow/.

  3. The theory and applied literatures use different terms for the same algorithm. We use MATLAB's k-medoid algorithm, which is referred to as the k-medians algorithm in the theory literature.

  4. This figure appears in “Appendix C”.

References

  • Abbasi M, Bhaskara A, Venkatasubramanian S (2021) Fair clustering via equitable group representations. In: Proceedings of FAccT, p 11

  • Ahmadi S, Galhotra S, Saha B, Schwartz R (2020) Fair correlation clustering. CoRR https://arxiv.org/abs/2002.03508

  • Ahmadian S, Epasto A, Kumar R, Mahdian M (2020) Fair correlation clustering. In: The 23rd international conference on artificial intelligence and statistics, AISTATS 2020, 26–28 August 2020, Online [Palermo, Sicily, Italy], pp 4195–4205

  • Backurs A, Indyk P, Onak K, Schieber B, Vakilian A, Wagner, T (2019) Scalable fair clustering. In: Proceedings of 36th ICML, pp 405–413

  • Ballotpedia (2020) Redistricting in California. Retrieved 2 September 2020, from https://ballotpedia.org/Redistricting_in_California

  • Barocas S, Selbst AD (2016) Big data’s disparate impact. California Law Rev 104:671–732

  • Barocas S, Hardt M, Narayanan A (2017) Fairness in machine learning. NeurIPS tutorial

  • Basu S, Davidson I, Wagstaff K (2008) Constrained clustering: advances in algorithms, theory and applications. CRC Press, Cambridge

  • Bera SK, Chakrabarty D, Negahbani M (2019) Fair algorithms for clustering. ArXiv preprint arXiv:1901.02393

  • Berge C (1972) Balanced matrices. Math Program 2:19–31

  • Bureau C (2020a) American Community Survey (ACS). US Census Bureau. Retrieved 2 September 2020, from https://www.census.gov/programs-surveys/acs

  • Bureau C (2020b) ZIP Code Tabulation Areas (ZCTAs). US Census Bureau. Retrieved 2 September 2020, from https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html

  • Chen X, Fain B, Lyu L, Munagala K (2019) Proportionally fair clustering. In: Proceedings of the 36th international conference on machine learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, Proceedings of Machine Learning Research, pp 1032–1041

  • Chhabra A, Mohapatra P (2020) Fair algorithms for hierarchical agglomerative clustering. CoRR https://arxiv.org/abs/2005.03197

  • Chhabra A, Masalkovaite K, Mohapatra P (2021) An overview of fairness in clustering. IEEE Access 9:130698–130720

  • Chierichetti F, Kumar R, Lattanzi S, Vassilvitskii S (2017) Fair clustering through fairlets. In: Proceedings of NeurIPS, pp 5036–5044

  • CNMP (2020) My Congressional District. Center for New Media & Promotion, US Census Bureau. Retrieved 2 September 2020, from https://www.census.gov/mycd/?st=06

  • US Equal Employment Opportunity Commission (2007) Employment tests and selection procedures. Retrieved 2 September 2020, from https://www.ncsl.org/research/redistricting/election-dates-for-legislators-governors-who-will-do-redistricting.aspx

  • Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press and McGraw-Hill, Cambridge

  • Davidson I, Ravi SS (2007) The complexity of non-hierarchical clustering with instance and cluster level constraints. Data Min Knowl Discov 14(1):25–61

  • Davidson I, Ravi SS (2020) Making existing clusterings fairer: algorithms, complexity results and insights. In: The thirty-fourth AAAI conference on artificial intelligence, AAAI 2020, New York, NY, USA. AAAI Press, pp 3733–3740

  • Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Dwork C, Hardt M, Pitassi T, Reingold O, Zemel RS (2012) Fairness through awareness. In: Innovations in theoretical computer science 2012, Cambridge, MA, USA, January 8–10, 2012, pp 214–226

  • Feldman M, Friedler SA, Moeller J, Scheidegger C, Venkatasubramanian S (2015) Certifying and removing disparate impact. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, Sydney, NSW, Australia, August 10–13, 2015, pp 259–268

  • Flores NJ (2019) Fair algorithms for clustering. Dartmouth Computer Science Technical Report TR2019-867

  • Friedler SA, Scheidegger C, Venkatasubramanian S (2016) On the (im)possibility of fairness. CoRR http://arxiv.org/abs/1609.07236

  • Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman & Co., San Francisco

  • Gurobi (2020) Gurobi optimizer reference manual. Available from https://www.gurobi.com/documentation/9.1/refman/index.html

  • Kleindessner M, Awasthi P, Morgenstern J (2019) Fair \(k\)-center clustering for data summarization. In: Proceedings of ICML, pp 3448–3457

  • Kleindessner M, Samadi S, Awasthi P, Morgenstern J (2019) Guarantees for spectral clustering with fairness constraints. In: Proceedings of ICML, pp 3458–3467

  • Kuo CT, Ravi SS, Dao TBH, Vrain C, Davidson I (2017) A framework for minimal clustering modification via constraint programming. In: AAAI, pp 1389–1395

  • Mahabadi S, Vakilian A (2020) Individual fairness for \(k\)-clustering. In: Proceedings of the 37th international conference on machine learning, ICML 2020, 13–18 July 2020, Virtual Event, pp 6586–6596

  • NCSL (2019) Election dates for legislators and governors who will do redistricting. Retrieved 2 September 2020, from https://www.ncsl.org/research/redistricting/election-dates-for-legislators-governors-who-will-do-redistricting.aspx

  • Rösner C, Schmidt M (2018) Privacy preserving clustering with constraints. ArXiv preprint arXiv:1802.02497

  • Schaffer C (1994) A conservation law for generalization performance. In: Proceedings of ICML, pp 259–265. Elsevier, New York

  • Schrijver A (1998) Theory of linear and integer programming. Wiley, New York

  • Thanh BL, Ruggieri S, Turini F (2011) \(k\)-NN as an implementation of situation testing for discrimination discovery and prevention. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, CA, USA, August 21–24, 2011, pp 502–510

  • Vaidya PM (1989) Speeding-up linear programming using fast matrix multiplication (extended abstract). In: 30th annual symposium on foundations of computer science, Research Triangle Park, North Carolina, USA, 30 October–1 November 1989, pp 332–337

  • von Luxburg U (2006) A Tutorial on spectral clustering. Tech. Rep. TR-149, Max Planck Institute for Biological Cybernetics, Germany

  • Vazirani VV (2001) Approximation algorithms. Springer, New York

  • Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of the seventeenth national conference on artificial intelligence and twelfth conference on innovative applications of artificial intelligence, July 30–August 3, 2000, Austin, Texas, USA, pp 1097–1192

  • Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evolut Comput 1(1):67–82

  • Xu R, Wunsch DC (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678

  • Ziko IM, Yuan J, Granger E, Ayed IB (2021) Variational fair clustering. In: Proceedings of thirty-fifth conference on artificial intelligence (AAAI). AAAI Press, pp 11202–11209

Download references

Acknowledgements

We thank the referees for carefully reading the paper and providing very helpful feedback. We also thank Professor Seshadhri Comandur (University of California, Santa Cruz) for pointing out that the sufficient condition that we use for establishing the TU property of the constraint matrix in Sect. 3 is called the Ghouila-Houri characterization in the literature. This work was supported in part by NSF Grants IIS-1908530 and IIS-1910306 titled: “Explaining Unsupervised Learning: Combinatorial Optimization Formulations, Methods and Applications”.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ian Davidson.

Additional information

Responsible editor: Toon Calders, Salvatore Ruggieri, Bodo Rosenhahn, Mykola Pechenizkiy and Eirini Ntoutsi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this paper appeared in AAAI-2020 (Davidson and Ravi 2020).

Appendices

Additional material for Sect. 3

1.1 A short introduction to integer linear programs

Many combinatorial optimization problems can be expressed as integer linear programs (ILPs) (Garey and Johnson 1979; Schrijver 1998). An ILP is specified by a set of variables that are constrained to take on integer values, a linear objective function of these variables and a collection of linear constraints that each solution must satisfy. In general, unless P = NP, there is no efficient algorithm for solving ILPs (Garey and Johnson 1979). However, the availability of well-known software tools (e.g., Gurobi 2020), which incorporate many heuristic search methods, makes it possible to use ILPs to solve problems of moderate size arising in practice. We now provide an example of a combinatorial problem that can be formulated as an ILP.

Example:  We consider the Knapsack problem, where we are given a collection of n objects. Each object \(o_i\) has a weight \(w_i\) (pounds) and a value \(d_i\) (dollars), \(1 \le i \le n\). We are also given a knapsack whose total capacity is W (pounds). The goal is to choose a subset of the objects whose total weight is at most W and whose total value is maximized subject to this constraint. This problem is known to be NP-complete (Garey and Johnson 1979). An ILP for this problem can be developed as follows.

Our ILP formulation uses n variables denoted by \(x_1\), \(x_2\), \(\ldots \), \(x_n\). Each of these variables takes on a value from \(\{0,1\}\) with the following interpretation: object \(o_i\) is added to the knapsack iff \(x_i = 1\). With these variables, the optimization goal can be expressed as:

Maximize   \(\sum _{i=1}^{n} d_i x_i\)

Since each \(d_i\) is a (given) constant, this objective function is linear.

We now discuss the constraints. First, the total weight of the chosen items must be at most the capacity of the knapsack. This constraint can be expressed as follows:

\(\sum _{i=1}^{n} w_i x_i ~\le ~ W\)

Note that each \(w_i\) is a given constant. Thus, this constraint is linear. The other constraint, which restricts the value of each \(x_i\), is as follows:  \(x_i \in \{0,1\}\),   \(1 \le i \le n\).

This completes the ILP formulation of the Knapsack problem. Many examples of such formulations are discussed in standard texts on algorithms and related topics (e.g., Garey and Johnson 1979; Cormen et al. 2009; Vazirani 2001). Many methods for solving ILPs are discussed in Schrijver (1998).
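To make the formulation concrete, the sketch below (Python; the weights, values and capacity are illustrative and not taken from the paper) enumerates every 0/1 assignment of the variables \(x_i\) and returns the best assignment satisfying the capacity constraint. This brute-force search defines exactly the feasible set of the ILP above; a real solver such as Gurobi would explore it far more efficiently.

```python
from itertools import product

def knapsack_ilp_bruteforce(weights, values, capacity):
    """Enumerate all 0/1 assignments of the ILP variables x_1..x_n and
    return (best total value, best assignment) among those satisfying
    the capacity constraint sum(w_i * x_i) <= W."""
    n = len(weights)
    best_val, best_x = 0, (0,) * n
    for x in product((0, 1), repeat=n):
        weight = sum(w * xi for w, xi in zip(weights, x))  # left side of the constraint
        value = sum(d * xi for d, xi in zip(values, x))    # the linear objective
        if weight <= capacity and value > best_val:
            best_val, best_x = value, x
    return best_val, best_x

# Three objects with hypothetical weights 3, 4, 5 lbs and values $4, $5, $6;
# with W = 7 the best choice is objects 1 and 2 (weight 7, value 9).
print(knapsack_ilp_bruteforce([3, 4, 5], [4, 5, 6], 7))  # → (9, (1, 1, 0))
```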

1.2 An example to illustrate the ILP formulation for MCMF

We present an example to show the constraint matrix that arises in the ILP formulation of MCMF presented in Sect. 3. For simplicity, we will construct this example assuming that we need strict fairness.

In this example, we have a set \(S = \{s_1, s_2, s_3, s_4, s_5\}\) consisting of 5 instances. Of these, instances \(s_1\), \(s_2\) and \(s_3\) are special (i.e., their PSV values, denoted by \(p_1\), \(p_2\) and \(p_3\) respectively, are all 1) while \(s_4\) and \(s_5\) are not special (i.e., \(p_4 = p_5 = 0\)). The initial clustering of S has two clusters \(C_1\) and \(C_2\), where \(C_1 = \{s_1, s_2, s_3\}\) and \(C_2 = \{s_4, s_5\}\).

Since the number of special items \(N_x = 3\) and the number of clusters \(K = 2\), under strict fairness, each cluster must have either \(\lceil N_x/K \rceil \) = \(\lceil 3/2 \rceil = 2\) special items or \(\lfloor N_x/K \rfloor \) = \(\lfloor 3/2 \rfloor \) = 1 special item. Thus, the given clustering (which has all the three special items in \(C_1\)) is not strictly fair, and we need to modify it to achieve fairness.

From the discussion in Sect. 3, let \(z_{i,j}\) denote the {0,1}-valued variable that is set to 1 if in the modified clustering, instance \(s_j\) is assigned to cluster \(C_i\), \(1 \le j \le 5\) and \(i = 1,2\). (Otherwise, \(z_{i,j}\) is set to 0.) As discussed above, the upper and lower bounds on the number of special items in each cluster are 2 and 1 respectively. Let \(u_i\) and \(l_i\) denote the slack variables for Cluster \(C_i\), \(i = 1, 2\). Using Eqs. (2) through (4) for the two clusters, and noting that only \(p_1\), \(p_2\) and \(p_3\) are 1, we get the following constraints:

$$\begin{aligned} z_{1,1} + z_{1,2} + z_{1,3} + u_1= & {} 2 \\ -z_{1,1} - z_{1,2} - z_{1,3} + l_1= & {} -1 \\ z_{2,1} + z_{2,2} + z_{2,3} + u_2= & {} 2 \\ -z_{2,1} - z_{2,2} - z_{2,3} + l_2= & {} -1 \\ z_{1,1} + z_{2,1}= & {} 1 \\ z_{1,2} + z_{2,2}= & {} 1 \\ z_{1,3} + z_{2,3}= & {} 1 \\ z_{1,4} + z_{2,4}= & {} 1 \\ z_{1,5} + z_{2,5}= & {} 1 \\ \end{aligned}$$

Note that the constraint matrix uses only the coefficients on the left side of each constraint above. As indicated in Sect. 3, we use the following ordering of the variables so that each constraint (which becomes a row of the constraint matrix) can be expressed as a linear combination of the variables in this order:

$$\begin{aligned} \langle z_{1,1}, z_{1,2}, z_{1,3}, z_{1,4}, z_{1,5}, z_{2,1}, z_{2,2}, z_{2,3}, z_{2,4}, z_{2,5}, u_1, u_2, l_1, l_2\rangle \end{aligned}$$

Thus, the resulting constraint matrix C has 9 rows (one corresponding to each constraint) and 14 columns (one corresponding to each variable); variables that do not appear in a constraint have a coefficient of 0 in the corresponding row. From the above constraints, we get the following constraint matrix C.

$$\begin{aligned} C = \begin{pmatrix} 1 &{} 1 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 \\ -1 &{} -1 &{} -1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 1 &{} 1 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} -1 &{} -1 &{} -1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 \\ 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 \end{pmatrix} \end{aligned}$$

An alternative algorithm for modifying a given clustering to achieve strong fairness

In Sect. 3, we mentioned that it is possible to develop a faster algorithm for achieving strong fairness. In that section, we also presented the basic ideas behind the algorithm. Here, we provide a description of the algorithm.

Notation used in the description of the algorithm: In specifying this algorithm, we assume that we need to deal only with special data items. (Data items that are not special play no role in determining strong fairness.) Thus, the input to the algorithm is an arbitrary partition \(\Pi \) of \(D_x\) having \(k \ge 1\) clusters denoted by \(C_1\), \(C_2\), \(\ldots \), \(C_k\), with cluster \(C_j\) containing \(\beta _j\) special items, \(1 \le j \le k\). We also assume that the clusters are numbered 1 through k so that \(\beta _1 \ge \beta _2 \ge \cdots \ge \beta _k\). (This can be ensured in \(O(k\log {k})\) time by sorting the clusters.) The output of the algorithm is a partition \(\Pi '\) of \(D_x\) into k clusters such that \(\Pi '\) is strongly fair with respect to the protected attribute x. The algorithm constructs \(\Pi '\) by moving the minimum number of special items between clusters: it first moves excess special items into a temporary container T and then redistributes those items to clusters that need additional special items to satisfy the fairness condition. The steps of our algorithm (which we call OPT-Modification) for the minimal modification problem are described below.

Steps of Algorithm OPT-Modification:

  1. If every cluster \(C_j\) in \(\Pi \) satisfies \(\beta _j = \lceil N_x/k \rceil \) or \(\beta _j = \lfloor N_x/k \rfloor \), then output “\(\Pi \) is strongly fair” and stop.

  2. Let \(N_x = qk + r\), where \(q \ge 0\) and \(0 \le r \le k-1\). Use Case 1 or Case 2 depending upon the value of r.

Case 1: \(r = 0\). Here, \(N_x = qk\). (In this case, the algorithm must ensure that each cluster has exactly \(N_x/k\) special items.)

  (a) Let clusters \(C_1\), \(\ldots \), \(C_t\) have \(> N_x/k\) special items. (Other clusters have \(\le N_x/k\) special items.)

  (b) From each cluster \(C_j\), \(1 \le j \le t\), move \(\beta _j - N_x/k\) special items into a temporary container T.

  (c) For each cluster \(C_p\) such that \(\beta _p < N_x/k\), move \(N_x/k -\beta _p\) special items from T into \(C_p\).

Case 2: \(r > 0\). Here, \(N_x = qk+r\). (In this case, as required by Lemma 2, the algorithm must ensure that exactly r clusters have \(\lceil N_x/k \rceil \) special items and \(k-r\) clusters have \(\lfloor N_x/k \rfloor \) special items.)

  (a) Partition the clusters into 4 groups \(\Gamma _1\), \(\Gamma _2\), \(\Gamma _3\) and \(\Gamma _4\) as follows. (Some of the groups may be empty.)

  • Let \(\Gamma _1\) consist of clusters \(C_1\), \(\ldots \), \(C_t\) with \(> \lceil N_x/k \rceil \) special items.

  • Let \(\Gamma _2\) consist of clusters \(C_{t+1}\), \(\ldots \), \(C_p\) with exactly \(\lceil N_x/k \rceil \) special items. (Thus, groups \(\Gamma _1\) and \(\Gamma _2\) together have p clusters.)

  • Let \(\Gamma _3\) consist of clusters \(C_{p+1}\), \(\ldots \), \(C_m\) with exactly \(\lfloor N_x/k \rfloor \) special items.

  • Let \(\Gamma _4\) consist of the remaining clusters, that is, \(C_{m+1}\), \(\ldots \), \(C_k\) with \(< \lfloor N_x/k \rfloor \) special items.

  (b) Use one of Cases 2.1, 2.2 or 2.3 depending upon the comparison between p and r.

Case 2.1:  \(p < r\) (i.e., Groups \(\Gamma _1\) and \(\Gamma _2\) together have \(< r\) clusters).

  (i) From each cluster \(C_j\) in \(\Gamma _1\), move \(\beta _j - \lceil N_x/k \rceil \) special items into a temporary container T.

  (ii) For each of the first \(r-p\) clusters \(C_j\) in \(\Gamma _3 \cup \Gamma _4\), move \(\lceil N_x/k \rceil - \beta _j\) special items from T into \(C_j\).

  (iii) For each of the other clusters \(C_j\) in \(\Gamma _3 \cup \Gamma _4\), move \(\lfloor N_x/k \rfloor - \beta _j\) special items from T into \(C_j\).

Case 2.2:  \(p = r\) (i.e., Groups \(\Gamma _1\) and \(\Gamma _2\) together have exactly r clusters).

  (i) From each cluster \(C_j\) in \(\Gamma _1\), move \(\beta _j - \lceil N_x/k \rceil \) special items into a temporary container T.

  (ii) For each of the clusters \(C_j\) in \(\Gamma _4\), move \(\lfloor N_x/k \rfloor - \beta _j\) special items from T into \(C_j\).

Case 2.3:  \(p > r\) (i.e., Groups \(\Gamma _1\) and \(\Gamma _2\) together have \(> r\) clusters). Use one of the subcases 2.3.1, 2.3.2 or 2.3.3 depending on how t compares with r. Case 2.3.1:  \(t > r\) (i.e., group \(\Gamma _1\) has more than r clusters).

  (i) From each cluster \(C_j\) in \(\Gamma _1\), move \(\beta _j - \lceil N_x/k \rceil \) special items into a temporary container T.

  (ii) From each cluster \(C_j\), \(r+1 \le j \le p\) (each of which holds \(\lceil N_x/k \rceil \) special items after step (i)), move one special item into T, so that only the first r clusters retain \(\lceil N_x/k \rceil \) special items.

  (iii) For each cluster \(C_j \in \Gamma _4\), move \(\lfloor N_x/k \rfloor - \beta _j\) special items from T into \(C_j\).

Case 2.3.2:  \(t = r\) (i.e., group \(\Gamma _1\) has exactly r clusters).

  (i) From each cluster \(C_j \in \Gamma _1\), move \(\beta _j - \lceil N_x/k \rceil \) special items into the temporary container T.

  (ii) From each cluster \(C_j \in \Gamma _2\), move \(\beta _j - \lfloor N_x/k \rfloor = 1\) special item into T.

  (iii) For each cluster \(C_j \in \Gamma _4\), move \(\lfloor N_x/k \rfloor - \beta _j\) special items from T into \(C_j\).

Case 2.3.3:  \(t < r\) (i.e., group \(\Gamma _1\) has \(< r\) clusters).

  (i) From each cluster \(C_j \in \Gamma _1\), move \(\beta _j - \lceil N_x/k \rceil \) special items into the temporary container T.

  (ii) From each cluster \(C_j\), \(r+1 \le j \le p\), move \(\beta _j - \lfloor N_x/k \rfloor = 1\) special item into T, so that the first r clusters (all of \(\Gamma _1\) and the first \(r-t\) clusters of \(\Gamma _2\)) retain \(\lceil N_x/k \rceil \) special items.

  (iii) For each cluster \(C_j \in \Gamma _4\), move \(\lfloor N_x/k \rfloor - \beta _j\) special items from T into \(C_j\).

  3. Output the modified partition \(\Pi ' = \langle C_1, C_2, \ldots , C_k\rangle \).

Running Time of Algorithm OPT-Modification: As mentioned earlier, sorting the list of clusters so that their sizes are in non-increasing order can be done in \(O(k\log {k})\) time. The remaining part of the algorithm consists of several cases, exactly one of which is executed (depending on the values of \(N_x\) and k). In each case, the algorithm moves the excess special data items from some of the clusters into a temporary container and redistributes the items from that container to clusters that are deficient with respect to special data items. Since there are at most n special items, the time needed for the redistribution step is O(n). Thus, the running time of this algorithm is \(O(n + k\log {k})\).
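When only the number of special items per cluster is tracked, the case analysis above condenses into a short program. The Python sketch below is a simplified re-implementation (not the paper's code): it assigns the ceiling target to the \(r = N_x \bmod k\) clusters that currently hold the most special items, so the fewest items pass through the container T, and reports the resulting targets together with the number of items moved.

```python
def strongly_fair_counts(beta):
    """Simplified sketch of OPT-Modification on special-item counts.

    beta[j] = number of special items currently in cluster j. Returns
    (targets, moves), where every target is floor(N_x/k) or ceil(N_x/k),
    exactly N_x mod k clusters get the ceiling, and `moves` counts the
    special items routed through the temporary container T."""
    k, n_x = len(beta), sum(beta)
    lo, r = divmod(n_x, k)
    hi = lo + 1
    # Give the ceiling to the r largest clusters: this minimizes the
    # total surplus moved out, and hence the number of moves.
    order = sorted(range(k), key=lambda j: -beta[j])
    target = [0] * k
    for rank, j in enumerate(order):
        target[j] = hi if rank < r else lo
    moves = sum(b - t for b, t in zip(beta, target) if b > t)
    return target, moves
```

On the example of Sect. A.2 (\(\beta = (3, 0)\), \(k = 2\)), this yields targets (2, 1) with a single move, matching the strict-fairness bounds derived there.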

Fig. 7 California’s 53 congressional districts and 1769 ZCTAs

Limitations of the Above Algorithm: Algorithm OPT-Modification has two limitations. First, the algorithm handles only strong fairness. Second, it minimizes the number of special data items moved from one cluster to another; it cannot handle more general minimization objectives mentioned in Table 1. The LP-based algorithm discussed in Sect. 3 overcomes both of these limitations but has an asymptotically larger running time.

Other details about California’s congressional districts dataset and our experimental settings

1.1 Background on congressional districts

One of the most important aspects of any election is how the voters are represented. How lines are drawn to separate congressional districts can heavily impact an election and how well everyone is represented. As an example, consider the 2016 US presidential election, in which the structure of the electoral college led to the election of a candidate who did not win the popular vote. California is divided into 53 congressional districts (CDs) (CNMP 2020), and these congressional districts are redrawn every decade (NCSL 2019). The most recent congressional districts are shown in Fig. 7a. The selection committee that draws these lines focuses on ensuring relatively equal population sizes across the 53 CDs (Ballotpedia 2020). However, under this strategy, the congressional district lines are unfair with respect to some protected status variables (PSVs), as shown in Table 9.

In this paper, we explore the fairness of California’s congressional districts with respect to a number of PSVs collected from the Census Bureau’s 2018 American Community Survey (Bureau 2020a). After identifying the PSVs with respect to which the CDs are unfair, we attempt to redistribute the population to correct this unfairness.

As previously mentioned, California is divided into 53 CDs. Each CD consists of a subset of the 1769 ZCTAs (Zip Code Tabulation Areas), as shown in Fig. 7b (Bureau 2020b). Each ZCTA consists of multiple census blocks. ZCTAs are similar but not identical to the zip codes used by the United States Postal Service (USPS): the latter are managed by the USPS and can be changed arbitrarily, while ZCTAs change only every decade. Additionally, ZCTAs are used only by the Census Bureau. Typically, each of the 1769 distinct ZCTAs in California is assigned exclusively to one CD, but some ZCTAs on the boundaries of CDs can be shared by multiple CDs.

1.2 Protected status variable creation

The data utilized in the case study comes from the Census Bureau’s ACS (American Community Survey) 2018 data (Bureau 2020a), which provides highly detailed socioeconomic, housing and demographic information about a given population. This data forms the basis of our protected status variables (PSVs). To protect people’s privacy, the PSVs are presented in an aggregated manner via population counts (e.g., how many females are in a ZCTA). This information was difficult to obtain at the census block level, so information at the ZCTA level was used instead. Additionally, a relationship file is used to map each ZCTA to its congressional district(s). If a ZCTA is shared by multiple congressional districts, its population is distributed equally amongst those CDs to create multiple ZCTA-Instances.
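The equal split of a shared ZCTA's population can be sketched as follows (Python; the field names and the example figures are purely illustrative, not the benchmark's actual schema):

```python
def make_zcta_instances(zcta_pop, zcta_to_cds):
    """Create one ZCTA-Instance per (ZCTA, CD) pair, splitting each
    ZCTA's population equally among the congressional districts (CDs)
    that share it."""
    instances = []
    for zcta, cds in zcta_to_cds.items():
        share = zcta_pop[zcta] / len(cds)  # equal split across sharing CDs
        for cd in cds:
            instances.append({"zcta": zcta, "cd": cd, "population": share})
    return instances

# A ZCTA wholly inside CD 3, and one shared by CDs 30 and 33 (made-up numbers).
instances = make_zcta_instances(
    {"95616": 30000, "90210": 21000},
    {"95616": [3], "90210": [30, 33]},
)
```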

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Davidson, I., Bai, Z., Tran, C.M. et al. Making clusterings fairer by post-processing: algorithms, complexity results and experiments. Data Min Knowl Disc 37, 1404–1440 (2023). https://doi.org/10.1007/s10618-022-00893-6
