Hierarchical codes: A flexible trade-off for erasure codes in peer-to-peer storage systems

Peer-to-Peer Networking and Applications

Abstract

Redundancy is the basic technique for providing reliability in storage systems consisting of multiple components. A redundancy scheme defines how the redundant data are produced and maintained. The simplest redundancy scheme is replication, which, however, suffers from storage inefficiency. Another approach is erasure coding, which provides the same level of reliability as replication using a significantly smaller amount of storage. When redundant data are lost, they need to be replaced. While replacing replicated data amounts to a simple copy, it becomes a complex operation with erasure codes: new data are produced by coding over other available data. The amount of data to be read and coded is d times larger than the amount of data produced, where d, called the repair degree, is larger than 1 and depends on the structure of the code. This implies that coding has a larger computational and I/O cost, which, for distributed storage systems, translates into increased network traffic. Participants in peer-to-peer (P2P) systems often have ample storage and CPU power, but their network bandwidth may be limited. For these reasons, existing coding techniques are not suitable for P2P storage. This work explores the design space between replication and the existing erasure codes. We propose and evaluate a new class of erasure codes, called Hierarchical Codes, which reduces the network traffic due to maintenance without losing the benefits of traditional erasure codes.

Notes

  1. Several codes are derived from LDPC codes, such as Tornado codes and LT codes; see [10] for a brief survey.

  2. A Galois Field, or Finite Field, is an algebraic structure with a finite number of elements. Its main property is that every operation applied to its elements yields an element of the field itself.

  3. Note that replacing the identity matrix with random coefficients transforms the code from a systematic one to an unsystematic one.

  4. This last constraint is unnecessary, since \(G_{4,1}\) represents in this case the whole code.

  5. In the example of Fig. 3b, if \(b_1\) needs to be repaired and either \(b_2\) or \(b_3\) is not available, the repair degree must be d = 4.

  6. Here β = 2, which means that every object consumes twice its size in storage space.

  7. For Hierarchical Codes, reintegration of this block is in some cases possible and would significantly increase efficiency. However, identifying such cases is not trivial and is left as future work.

  8. This instance corresponds to the instance ‘A’ of Fig. 4a, adapted to have k = 32 and h = 32.

  9. To be matched with a node \(b_o\) outside \(G_{d_s,1}\), the repair degree of \(b_o\) must be bigger than \(d_s\), which would violate the condition of the lemma.
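The closure property described in Note 2 can be checked on a small field. The following is a minimal sketch, assuming the field GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B); the text does not say which field the codes use, so both the field size and the polynomial are assumptions made for illustration, as are the function names.

```python
# Assumption: GF(2^8) with reduction polynomial x^8 + x^4 + x^3 + x + 1.
# The source does not specify a field; this choice is illustrative only.
REDUCTION_POLY = 0x11B
FIELD_SIZE = 256


def gf_add(a: int, b: int) -> int:
    """Addition in GF(2^8) is a bitwise XOR of the two elements."""
    return a ^ b


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication, reduced modulo REDUCTION_POLY."""
    result = 0
    while b:
        if b & 1:              # add the current multiple of a
            result ^= a
        a <<= 1                # multiply a by x
        if a & 0x100:          # degree 8 reached: reduce back into the field
            a ^= REDUCTION_POLY
        b >>= 1
    return result


def is_closed() -> bool:
    """Check that sums and products of field elements stay inside the field."""
    elements = range(FIELD_SIZE)
    return all(
        0 <= gf_add(a, b) < FIELD_SIZE and 0 <= gf_mul(a, b) < FIELD_SIZE
        for a in elements
        for b in elements
    )
```

Calling `is_closed()` exhaustively tests all 65,536 pairs, which is feasible only because the field is small.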

References

  1. Acedański S, Deb S, Medard M, Koetter R (2005) How good is random linear coding based distributed networked storage? In: NETCOD

  2. Adya A, Bolosky W, Castro M, Cermak G, Chaiken R, Douceur J, Howell J, Lorch J, Theimer M, Wattenhofer R (2002) Farsite: federated, available and reliable storage for an incompletely trusted environment. In: 5th symposium on OSDI. Boston, 9–11 December 2002

  3. Ahlswede R, Cai N, Li S-YR, Yeung RW (2000) Network information flow. IEEE Trans Inf Theory 46(4):1204–1216

  4. Dabek F et al (2001) Wide-area cooperative storage with CFS. In: Proc. SOSP, Banff, 21–24 October 2001

  5. Dimakis AG, Godfrey B, Wu Y, Wainwright MJ, Ramchandran K (2008) Network coding for distributed storage systems. Computer research repository (CoRR). arXiv:0803.0632v1

  6. Duminuco A, Biersack E (2009) A practical study of regenerating codes for peer-to-peer backup systems. In: 29th intl conference on distributed computing systems (ICDCS), Montreal, 22–26 June 2009

  7. Godfrey B (2006) Repository of availability traces. http://www.cs.berkeley.edu/~pbg/availability/

  8. Haeberlen A, Mislove A, Druschel P (2005) Glacier: highly durable, decentralized storage despite massive correlated failures. In: NSDI05

  9. Li S-YR, Yeung RW, Cai N (2003) Linear network coding. IEEE Trans Inf Theory 49(2):371–381

  10. Mitzenmacher M (2004) Digital fountains: A survey and look forward. In: IEEE information theory workshop

  11. Plank JS (1997) A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Softw Pract Exp 27(9):995–1012

  12. Rodrigues R, Liskov B (2005) High availability in DHTs: erasure coding vs. replication. In: IPTPS05

  13. Steiner M (2007) Kad traces. http://www.eurecom.fr/~btroup/kadtraces/

  14. Weatherspoon H (2006) Design and evaluation of distributed wide-area on-line archival storage systems. PhD thesis, University of California, Berkeley

  15. Weatherspoon H, Kubiatowicz JD (2002) Erasure coding vs. replication: a quantitative comparison. In: Proceedings of IPTPS’02, Cambridge

Acknowledgement

The first author was partially supported by a PhD Scholarship from Microsoft Research.

Author information

Corresponding author

Correspondence to Alessandro Duminuco.

Appendix

A Proofs

A.1 Preliminary Proofs

Lemma 2

Consider an Information Flow Graph for a generic (k,h)-code at time step T. Consider a selection of k blocks \(B_{1}^{k}\). Assume that there exists a condition C on this selection that guarantees that the original fragments can be reconstructed.

If for any time step t ≤ T, any selection of \(B_{t}^{k}\) that fulfills the condition C can be perfectly matched with a selection of k blocks \(B_{t-1}^{k}\) in time step t − 1 that in turn fulfills the condition C,

Then any selection \(B_{T}^{k}\) that fulfills the condition C allows the reconstruction of the original fragments.

Proof

We proceed by steps:

  1. step 1

    Consider a selection \(B_{1}^{k}\) that fulfills the condition C. By assumption we know that the selection allows the reconstruction of the original fragments. This means, thanks to Lemma 1, that nodes in \(B_{1}^{k}\) have k distinct paths towards the original fragments F.

  2. step 2

    Consider a selection \(B_{2}^{k}\) that fulfills the condition C. By assumption we know that the nodes in this selection can be perfectly matched with a selection \(B_{1}^{k}\) that in turn fulfills the condition C. Thanks to previous step, we know that nodes in \(B_{1}^{k}\) have k distinct paths towards the original fragments F. This means that we can concatenate the perfect matching between \(B_{2}^{k}\) and \(B_{1}^{k}\) and the k distinct paths between \(B_{1}^{k}\) and F, obtaining k distinct paths between \(B_{2}^{k}\) and F.

The last step can be repeated until the time step T, where thanks to Lemma 1, the lemma is proved. □

Lemma 3

Consider a code graph of a Hierarchical Code. Consider a group \(G_{d_s,i}\) and denote as \(F_{d_s,i}\) the subset of original fragments that are connected with nodes in this group \(G_{d_s,i}\). Consider a selection of nodes \(B_{1}^{k}\) and consider the subset of this selection that belongs to the group considered: \(A_{d_s,i} = B_{1}^{k} \cap G_{d_s,i}\).

If \(|A_{d_s,i}| \leq d_s\) and \(\forall j: G_{d_{s-1},j} \subseteq G_{d_s,i}\), the nodes in \(A_{d_{s-1},j}\) have already been perfectly matched with \(|A_{d_{s-1},j}|\) nodes in \(F_{d_{s-1},j}\),

Then it is possible to find a perfect matching between the nodes in \(A_{d_s,i}\) and the nodes in \(F_{d_s,i}\).

Proof

Consider the nodes in \(A_{d_s,i}\) that do not belong to the subgroups \(G_{d_{s-1},j}\subseteq G_{d_s,i}\) and denote them as \(\hat{A}\). Consider the fragments in \(F_{d_s,i}\) that have not been matched with the nodes in the subgroups \(G_{d_{s-1},j}\subseteq G_{d_s,i}\) and denote them as \(\hat{F}\). The nodes in \(\hat{A}\) are connected with all the nodes in \(F_{d_{s},i}\) and can be thus all matched with nodes in the subset \(\hat{F}\), as long as \(|\hat{A}|\leq|\hat{F}|\). Since nodes in the subgroups have already been matched, then \(|A_{d_s,i}|-|\hat{A}| = |F_{d_s,i}|-|\hat{F}|\), where \(|F_{d_s,i}|=d_s\). This implies that whenever \(|A_{d_s,i}|\leq d_s\), \(|\hat{A}|\leq|\hat{F}|\) and the perfect matching is possible. □
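The matching argument of Lemma 3 can be illustrated on a toy example. The sketch below is not taken from the paper: it assumes a hypothetical group with \(d_s = 4\) fragments, split into two subgroups of \(d_{s-1} = 2\) fragments each, in which every subgroup node is connected to the fragments of its own subgroup and every node outside the subgroups is connected to all four fragments. A standard augmenting-path algorithm (Kuhn's algorithm) then checks whether a selection of nodes can be perfectly matched with the fragments. All names are illustrative.

```python
from typing import Dict, List, Set

# Hypothetical toy code graph (an assumption, not the paper's example):
# one group with d_s = 4 fragments, two subgroups of d_{s-1} = 2 fragments.
SUBGROUP_FRAGMENTS = {1: [0, 1], 2: [2, 3]}
GROUP_FRAGMENTS = [0, 1, 2, 3]


def build_selection(n_sub1: int, n_sub2: int, n_top: int) -> Dict[str, List[int]]:
    """Map each selected node to the fragments it is connected to."""
    adj: Dict[str, List[int]] = {}
    for i in range(n_sub1):
        adj[f"sub1_b{i}"] = SUBGROUP_FRAGMENTS[1]
    for i in range(n_sub2):
        adj[f"sub2_b{i}"] = SUBGROUP_FRAGMENTS[2]
    for i in range(n_top):
        adj[f"top_b{i}"] = GROUP_FRAGMENTS  # nodes outside the subgroups
    return adj


def max_matching(adj: Dict[str, List[int]]) -> Dict[str, int]:
    """Kuhn's augmenting-path algorithm; returns a node -> fragment matching."""
    owner: Dict[int, str] = {}  # fragment -> node currently matched with it

    def try_assign(node: str, seen: Set[int]) -> bool:
        for frag in adj[node]:
            if frag in seen:
                continue
            seen.add(frag)
            # Take a free fragment, or re-route the node that holds it.
            if frag not in owner or try_assign(owner[frag], seen):
                owner[frag] = node
                return True
        return False

    for node in adj:
        try_assign(node, set())
    return {node: frag for frag, node in owner.items()}


def has_perfect_matching(adj: Dict[str, List[int]]) -> bool:
    """True if every selected node is matched with a distinct fragment."""
    return len(max_matching(adj)) == len(adj)
```

In this toy graph, a selection with two nodes from the first subgroup, one from the second and one outside the subgroups (`build_selection(2, 1, 1)`) can be perfectly matched, whereas selecting three nodes from a single subgroup, or five nodes in total, cannot, in line with the size conditions in the lemma.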

Lemma 4

Consider an Information Flow Graph of a hierarchical code at time step t. Consider a selection \(B_{t}^k\) that fulfills the condition (C2). Assume that a subset of α nodes \(B_{t}^{\alpha} \subset B_{t}^k\) has already been perfectly matched with nodes in the previous step \(B_{t-1}^{\alpha}\) that in turn fulfill the condition (C2). Consider a node \(b \in B_{t}^{k} \setminus B_{t}^{\alpha}\), i.e. that belongs to the selection but has not yet been matched.

If all the repairs in the graph are done fulfilling condition (C3) and condition (C4), and all the blocks \(b_i \in B_{t}^{\alpha}\) are such that \(|R(b_i)| \leq |R(b)|\),

Then it is possible to augment \(B_{t-1}^{\alpha}\) with another node that is matched with b, without violating the condition (C2) on the augmented set \(B_{t-1}^{\alpha+1}\).

Proof

Let us use the following notation: \(A_{d,i}=B_{t-1}^{\alpha} \cap G_{d,i}\) and \(R_{d,i} = R(b) \cap G_{d,i}\). Assume that \(G_{d_s,1}\) is the group in which condition (C3) is fulfilled. This condition requires that \(|R_{d_s,1}| = |d_s|\). Note that all the nodes in \(B_{t}^{\alpha}\) have a repair degree \(d \leq d_s\), which implies that all the nodes in \(A_{d_s,i}\) are necessarily matched with nodes in \(B_{t}^{\alpha} \cap G_{d_s,1}\) (see Note 9). Since \(b \in G_{d_s,1}\), thanks to condition (C2), \(|B_{t}^{\alpha} \cap G_{d_s,1}|<d_s\), which in turn implies \(|A_{d_s,i}|<|d_s|\).

Consider two alternative cases:

  1. case 1:

    \(\exists j: 1\leq j \leq g_s, |R_{d_{s-1},j}|>|A_{d_{s-1},j}|\): This means that there is a subgroup of the group \(G_{d_s,1}\) (that belongs to G(b)) that has at least one free node that can be matched with the block b. Since \(|R_{d_{s-1},j}|\leq |d_{s-1}|\), this node can be added without violating condition (C2) and the lemma is proved.

  2. case 2:

    \(\forall j: 1\leq j \leq g_s, |R_{d_{s-1},j}|\leq|A_{d_{s-1},j}|\): This means that there are no free nodes in the subgroups. This implies that: \(\sum_{j=1}^{g_s} |R_{d_{s-1},j}|\leq \sum_{j=1}^{g_s} |A_{d_{s-1},j}|\). Consider the nodes in \(A_{d_s,1}\) that do not belong to the subgroups and denote them as \(\hat{A}\) (they are among the \(h_s\) additional nodes), then consider the nodes in \(R_{d_s,1}\) that do not belong to the subgroups and denote them as \(\hat{R}\). We can write \(\sum_{j=1}^{g_s} |A_{d_{s-1},j}|=|A_{d_s,i}|-|\hat{A}|\) and \(\sum_{j=1}^{g_s} |R_{d_{s-1},j}|=|R_{d_s,i}|-|\hat{R}|\). Since \(|R_{d_s,1}| = |d_s|\) and \(|A_{d_s,i}|<|d_s|\), we have that \(|\hat{R}|>|\hat{A}|\). This means that there is at least one free node in \(\hat{R}\) that can be matched with the block b without violating condition (C2) and the lemma is proved. □

A.2 Proof of Proposition 1

Proof

Thanks to Lemma 2, proving Proposition 1 amounts to proving that, in a generic time step t, only if repairs are done with a repair degree d ≥ k can any selection of nodes \(B_{t}^{k}\) be perfectly matched with a selection \(B_{t-1}^{k}\).

Consider a repaired node \(b \in B_{t}^{k}\). All the other k − 1 nodes in \(B_{t}^{k}\) can be matched with at most k − 1 nodes in \(B_{t-1}\). If b has been repaired with a degree d < k, it is possible that all the nodes in R(b) have already been matched with the k − 1 nodes in \(B_{t}^{k}\), preventing the matching of b. If d ≥ k, there is at least one free node that can be matched with b. This can be repeated for all the repaired blocks, proving, thanks to Lemma 2, the proposition. □

A.3 Proof of Proposition 2

Proof

Thanks to Lemma 2, proving Proposition 2 amounts to proving that if a selection \(B_{1}^{k}\) is done fulfilling condition (C2), then it is possible to find a perfect matching between the nodes in \(B_{1}^{k}\) and the original fragments in F. This can be proved by applying Lemma 3 iteratively, from the innermost group that nodes in \(B_{1}^{k}\) belong to, to the outermost one. □

A.4 Proof of Proposition 3

Proof

Thanks to Lemma 2, proving Proposition 3 amounts to proving that in a generic time step t, where repairs are done fulfilling condition (C3) and condition (C4), any selection of nodes \(B_{t}^{k}\) that fulfills condition (C2) can be perfectly matched with a selection \(B_{t-1}^{k}\) that in turn fulfills condition (C2).

Thanks to Lemma 4, \(B_{t-1}^{k}\) can be found matching one by one the nodes in \(B_{t}^{k}\) proceeding from the nodes with the lowest repair degree to the nodes with the highest one. □

B Computation of failure probability

Let us consider a Hierarchical (k,h)-code and assume that l losses occurred in this code, where 0 ≤ l ≤ (k + h). We first define P(k′|l) as the probability that, given that l losses occurred, k′ is the maximum number of alive fragments in the code that fulfill condition (C2). Note that the definition implies that P(k′|l) exists only for 0 ≤ k′ ≤ k. Given these probabilities, computing the failure probabilities is straightforward:

$$ P(\emph{failure}|l)=1-P(k'=k|l) $$

To compute the failure probability we proceed as follows:

  1. We compute the probabilities \(P_0(k'|l)\) for the Hierarchical \((k_0,h_0)\)-code, represented by level 0 (the innermost) in the hierarchy, as explained in Appendix B.1.

  2. We compute the probabilities \(P_s(k'|l)\) for the Hierarchical \((d_s,H_s)\)-code, represented by the generic level s, using the probabilities \(P_{s-1}(k'|l)\) computed for the Hierarchical \((d_{s-1},H_{s-1})\)-code, represented by level s − 1, as explained in Appendix B.2.

B.1 Probabilities for level 0

At the level 0 the probability computation is straightforward:

$$ P_0(k'|l) = \left\{ \begin{array}{l l} 1 & k'=k_0+h_0-l , \; k'<k_0 \\ 1 & k' \leq k_0+h_0-l , \; k'=k_0 \\ 0 & \emph{otherwise} \end{array} \right. $$
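The level-0 rule can be sketched in a few lines. This is a minimal illustration, assuming the following reading of the case distinction: with \(k_0+h_0-l\) fragments still alive, the maximum number of usable fragments is that count when it is below \(k_0\), and \(k_0\) otherwise. The function names are hypothetical.

```python
def p_level0(k_prime: int, losses: int, k0: int, h0: int) -> float:
    """P_0(k'|l) for a (k0, h0)-code at level 0 of the hierarchy.

    Assumed reading: the outcome is deterministic, and the maximum number
    of usable alive fragments is min(alive, k0).
    """
    if not 0 <= losses <= k0 + h0:
        raise ValueError("losses must lie between 0 and k0 + h0")
    alive = k0 + h0 - losses
    return 1.0 if k_prime == min(alive, k0) else 0.0


def failure_probability_level0(losses: int, k0: int, h0: int) -> float:
    """P(failure|l) = 1 - P(k' = k | l), applied to the level-0 code."""
    return 1.0 - p_level0(k0, losses, k0, h0)
```

For a (4, 2)-code, for example, up to two losses leave at least four fragments alive, so `failure_probability_level0(2, 4, 2)` is 0, while a third loss makes reconstruction impossible and the failure probability becomes 1.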

B.2 Probabilities for level s

If we have l losses in a generic hierarchical \((d_s,H_s)\)-code associated with the s-th level of the hierarchy, there are many different ways in which these losses can be distributed among the \(g_s\) groups \(G_{d_{s-1},i}\) that this code is made of and the \(h_s\) fragments associated with level s. Let us define the Loss Configuration of l losses, denoted as \(\emph{LC}_{l}\), as a vector of \(g_s + 1\) elements \(\emph{LC}_l=\{l_0,l_1,\dots,l_{g_s}\}\), where each element \(l_i\) indicates how many losses occur in the group \(G_{d_{s-1},i}\), except for \(l_0\), which indicates how many losses occur among the \(h_s\) fragments of level s. The constraints on \(\emph{LC}_l\) are:

$$ \left\{ \begin{array}{l l} l_0\leq h_s \\[1pt] l_i\leq d_{s-1}+H_{s-1}, & \forall i=1,\dots,g_s \end{array} \right. $$

We denote as \(P(\emph{LC}_l)\) the probability that this configuration takes place given that l losses occurred, and we compute it as explained in Appendix B.3.
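The set of loss configurations can be enumerated directly from the constraints. The sketch below assumes, in addition to the two bounds stated above, that the elements of a configuration sum to l, which follows from the definition of a configuration of l losses. The function name and the parameter `subgroup_size` (standing for \(d_{s-1}+H_{s-1}\)) are hypothetical.

```python
from itertools import product
from typing import Iterator, Tuple


def loss_configurations(
    l: int, h_s: int, g_s: int, subgroup_size: int
) -> Iterator[Tuple[int, ...]]:
    """Yield every vector (l_0, l_1, ..., l_{g_s}) of l losses.

    l_0 is bounded by h_s, each l_i (i >= 1) by subgroup_size, and the
    elements must sum to l (an assumption implied by the definition).
    """
    caps = [h_s] + [subgroup_size] * g_s
    ranges = (range(min(cap, l) + 1) for cap in caps)
    for config in product(*ranges):
        if sum(config) == l:
            yield config
```

As a small usage example, `list(loss_configurations(2, 1, 2, 3))` lists the ways two losses can be spread over one level-s fragment and two subgroups of three blocks each. The brute-force product is exponential in \(g_s\), which is acceptable only for small hierarchies.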

Each of the last \(g_s\) values in the configuration (all of them except \(l_0\)) indicates a number of losses \(l_i\) in a given subgroup and thus denotes a set of probabilities \(P_{s-1}(k'|l_i)\), with \(0 \leq k' \leq d_{s-1}\), which express the probability that k′ is the maximum number of alive fragments from subgroup i that fulfill condition (C2) for level s − 1.

If, for each of these subgroups, we select a specific value \(k'_i\), we define a Fragment Configuration of \(K'=\sum_{i=1}^{g_s} k'_i\) alive fragments, denoted as \(\emph{FC}_{K'}\), whose probability is denoted as \(P(\emph{FC}_{K'})\) and given by:

$$ P(\emph{FC}_{K'})=\prod\limits_{i=1}^{g_s} P_{s-1}(k'_i|l_i) $$

The probability \(P(\emph{FC}_{K'})\) represents one of the components of the probability that, given the configuration analyzed \(\emph{LC}_l\), K′ is the maximum number of alive fragments taken from the subgroups such that condition (C2) is fulfilled for level s. To obtain the maximum number of alive fragments from the whole group that fulfill condition (C2), K′ must be augmented with \(h_s - l_0\) alive fragments of level s, with the constraint \(K' + h_s - l_0 \leq d_s\).

Putting the pieces together we can finally define the probability P s (k′|l):

$$ P_s(k'|l) = \sum\limits_{\forall \emph{LC}_l} P(\emph{LC}_l) f_s(k',\emph{LC}_l) $$

where the auxiliary function \(f_s(k',\emph{LC}_l)\) is defined as follows

$$ \begin{array}{l} f_s(k',\emph{LC}_l) = \\ \left\{ \begin{array}{l l} 1 & k'<h_s-l_0 \\ \sum_{\forall \emph{FC}_{k'-(h_s-l_0)}} P\left(\emph{FC}_{k'-(h_s-l_0)}\right) & k'<k , k'\geq h_s - l_0 \\[4pt] \sum_{j=0}^{k_s-l_0} \sum_{\forall \emph{FC}_{k'-j}} P\left(\emph{FC}_{k'-j}\right) & k'=k , k'\geq h_s - l_0 \end{array} \right. \end{array} $$

B.3 Loss configuration probability

We can map the loss configuration problem to the following balls-and-bin problem. Consider a set of \(g_s + 1\) colors; for each color i there are \(n_i\) balls, which are inserted in a bin. We extract from the bin a total of l balls, which form a color configuration described by a vector of \(g_s + 1\) elements, where each element \(l_i\) indicates how many balls of color i have been extracted. Considering the original loss configuration problem, our objective is to compute the probability of a given color extraction, where \(n_0 = h_s\) and \(n_i = d_{s-1} + H_{s-1}\) for \(1 \leq i \leq g_s\). This probability can be computed by dividing the number of possible configurations corresponding to the extraction by the total number of possible configurations, which gives:

$$ P(\emph{LC}_l)=\frac{\prod_{i=0}^{g_s}{n_i\choose l_i}}{{\sum_{i=0}^{g_s}n_i\choose l}} $$

where \(\emph{LC}_l=\{l_0,\dots,l_{g_s}\}\) and:

$$ {n\choose k} = \frac{n!}{k!(n-k)!} $$
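The formula for \(P(\emph{LC}_l)\) is a multivariate hypergeometric probability and can be evaluated exactly with integer binomial coefficients. The following is a minimal sketch; the function name and the argument layout (a list `sizes` holding \(n_0, n_1, \dots, n_{g_s}\)) are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb, prod
from typing import Sequence


def loss_configuration_probability(
    config: Sequence[int], sizes: Sequence[int]
) -> Fraction:
    """Exact P(LC_l) for config = (l_0, ..., l_g) and sizes = (n_0, ..., n_g).

    Numerator: number of ways to lose l_i of the n_i blocks of each color.
    Denominator: number of ways to lose l blocks out of all of them.
    """
    if len(config) != len(sizes):
        raise ValueError("config and sizes must have the same length")
    l = sum(config)
    numerator = prod(comb(n, li) for n, li in zip(sizes, config))
    return Fraction(numerator, comb(sum(sizes), l))
```

With `sizes = [1, 3, 3]` (one level-s fragment and two subgroups of three blocks), the configuration `(0, 1, 1)` has probability 9/21 = 3/7, and the probabilities of all five configurations of two losses add up to 1. Using `Fraction` avoids floating-point rounding when many small terms are summed.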

About this article

Cite this article

Duminuco, A., Biersack, E.W. Hierarchical codes: A flexible trade-off for erasure codes in peer-to-peer storage systems. Peer-to-Peer Netw. Appl. 3, 52–66 (2010). https://doi.org/10.1007/s12083-009-0044-8

  • DOI: https://doi.org/10.1007/s12083-009-0044-8
