AVCOL: Availability-aware information aggregation in large distributed systems under uncollaborative behavior

doi:10.1016/j.comnet.2009.03.006

Computer Networks

Volume 53, Issue 13, 28 August 2009, Pages 2360-2372

https://doi.org/10.1016/j.comnet.2009.03.006

Abstract

Aggregation of system-wide information in large-scale distributed systems, such as p2p systems and Grids, can be unfairly influenced by nodes that are selfish, that collude with each other, or that are offline most of the time. We present AVCOL, which uses probabilistic and gossip-style techniques to provide availability-aware aggregation. Concretely, AVCOL is the first aggregation system that: (1) implements any (arbitrary) global predicate that explicitly specifies any node’s probability of inclusion in the global aggregate, as a mathematical function of that node’s availability (i.e., percentage of time online); (2) probabilistically tolerates large numbers of selfish nodes and large groups of colluders; and (3) scales well with hundreds to thousands of nodes. AVCOL uses several unique design decisions: per-aggregation tree construction where nodes are allowed a limited but flexible probabilistic choice of parents or children, probabilistic aggregation along trees, and auditing of nodes both during aggregation and in gossip style (i.e., periodically). We have implemented AVCOL, and we experimentally evaluated it using real-life churn traces. Our evaluation and our mathematical analysis show that AVCOL satisfies arbitrary predicates, scales well, and withstands a variety of selfish and colluding attacks.

Introduction

Distributed applications aggregate various kinds of data from large populations of nodes. Resource utilization information is collected about nodes in order to enable resource discovery for Grid applications [22], [31]. Statistics of system performance are collected [20], [33], e.g., max, min, top-k, or bottom-k of CPU utilization. Votes may be collected from nodes, and the majority of answers used to make a go-no-go decision, e.g., for leader election or replication [21].

The above systems use aggregation within the network in order to scalably and efficiently compute the aggregate, and deliver it to a sink node. However, in environments where nodes have varying degrees of contribution to the system, one often desires to collect biased aggregates so that nodes that have contributed more to the system have a bigger say in the final aggregate. In such a biased aggregation, for each node, the probability that a global aggregate will include that node’s own value (henceforth we call this inclusion probability), increases with that node’s contribution to the system. Notice that the inclusion probability for a node is calculated only across global aggregates initiated while that node is online. Such biased aggregation can be useful to mitigate freeloading [1] by ensuring that nodes that contribute comparatively less to the system, influence the aggregate less.

In this paper, we consider a specific type of contribution, namely node availability, defined as the fraction of time that the node is online. We allow the inclusion probability of a node’s value to be specified as a mathematical function of that node’s availability. This relation is a global predicate that the application deployer would desire applied to all nodes in the system. We denote this global predicate as f, and we focus only on monotonically non-decreasing predicates. For instance, the deployer might specify a linear predicate, i.e., the inclusion probability of each node x is $f(x) = av(x)$, where $av(x)$ is that node’s availability. As another example, a quadratic predicate may be desired, e.g., $f(x) = (av(x))^{2}$, or a bimodal predicate, e.g., $f(x) = 1.0$ if $av(x) > 0.5$ and $f(x) = 0.0$ otherwise.
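The three example predicates can be written down directly. The following is a minimal sketch, assuming availability is passed as a float in [0, 1]; the function names and the grid-based monotonicity check are illustrative choices, not part of AVCOL.

```python
from typing import Callable

Predicate = Callable[[float], float]


def linear(av: float) -> float:
    """Linear predicate: inclusion probability equals availability."""
    return av


def quadratic(av: float) -> float:
    """Quadratic predicate: penalizes low availability more strongly."""
    return av ** 2


def bimodal(av: float, threshold: float = 0.5) -> float:
    """Bimodal predicate: always include above the threshold, never below."""
    return 1.0 if av > threshold else 0.0


def is_monotone_non_decreasing(f: Predicate, steps: int = 100) -> bool:
    """Check f on an evenly spaced grid over [0, 1] (a spot check, not a proof)."""
    values = [f(i / steps) for i in range(steps + 1)]
    return all(a <= b for a, b in zip(values, values[1:]))
```

A deployer could run `is_monotone_non_decreasing` as a quick sanity check before adopting a custom predicate, since only monotonically non-decreasing predicates are considered here.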

Unlike other approaches that implicitly scale node benefit approximately according to its contribution [13], [30], our approach allows us to explicitly specify this relation as a mathematical function, thus giving better control over the quality of aggregation. This control enables the use of the aggregation protocol for various purposes. For instance, one can calculate the average available disk space throughout the system, by using the linear predicate along with the “average” aggregation function over the disk space attribute at nodes. If concerns over data durability increase, then the previous aggregation could use the quadratic predicate, thus resulting in a disk space measurement more biased towards what is available at higher-availability nodes. Another example is using the bimodal (or quadratic) predicate along with the “min” aggregation function over ids of nodes. This produces a leader election protocol where only high-availability nodes can become leaders. In general the bimodal or quadratic predicate can be used to penalize low-availability nodes – compared to the use of the linear predicate – and thus provide incentive for them to improve their availability.
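To make the two usage examples concrete, the sketch below models the desired outcome of one aggregation round in a centralized way: each online node is kept independently with probability f(av), and then "average" or "min" is applied to the kept nodes. This is only a model of the goal, not AVCOL's decentralized tree protocol; the dictionary keys (`id`, `av`) and all function names are hypothetical.

```python
import random


def linear(av):
    return av


def bimodal(av, threshold=0.5):
    return 1.0 if av > threshold else 0.0


def biased_sample(nodes, predicate, rng):
    """Keep each online node independently with probability predicate(av)."""
    return [n for n in nodes if rng.random() < predicate(n["av"])]


def biased_average(nodes, predicate, attr, rng):
    """Average of `attr` over the nodes that were included; None if none were."""
    included = biased_sample(nodes, predicate, rng)
    if not included:
        return None
    return sum(n[attr] for n in included) / len(included)


def elect_leader(nodes, predicate, rng):
    """'min' over node ids, restricted to the nodes that were included."""
    included = biased_sample(nodes, predicate, rng)
    return min((n["id"] for n in included), default=None)
```

With the bimodal predicate the sampling step becomes deterministic, so a node with availability at or below the threshold can never be elected, regardless of how small its id is.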

The problem of using local and distributed actions at nodes to achieve an arbitrary and emergent global predicate is a challenging one. There is a need to scale to systems with hundreds or thousands of nodes, as well as to withstand churn, i.e., arrival, departure and failure of nodes. Further challenges come from the fact that nodes may be uncollaborative. This means that: (1) many nodes may be selfish – a selfish node takes actions that increase its own inclusion probability, independent of its availability, (2) groups of nodes may be colluding – they aim to increase their own inclusion probabilities, independent of their availabilities. Our uncollaborative model is restricted to increasing one’s inclusion probability only, and excludes Byzantine behaviors such as arbitrarily modifying the value of an aggregate or influencing the inclusion probabilities of other, non-colluding nodes.

Selfish and colluding behavior can arise from node gaming [27] or multiple administrative domains (MADs) [2]. This behavior can adversely influence the predicate satisfaction in any solution to our aggregation problem. For instance, in the above examples, selfish nodes may end up unfairly biasing the measured average available disk space. A group of colluders may end up influencing the leader election to always elect one of them as leader (regardless of their availability).

We present AVCOL, an availability-aware aggregation service that implements arbitrary global predicates for biased aggregation. AVCOL works in environments where nodes may be selfish or colluding. AVCOL uses a novel combination of four techniques: (1) building aggregation trees on-demand, where nodes select parents (or children) based on availability, (2) restricting the choice of valid parents (or children) based on a consistent condition, (3) probabilistic forwarding of child data up to parents at each internal tree node, and (4) periodic (i.e., gossip-style) and per-aggregation auditing to verify correct node behavior and prevent collusion. AVCOL can be seen as incorporating availability dependence with a probabilistic aggregation approach.

AVCOL leverages an availability monitoring service, e.g., [24], a distributed partial membership protocol, e.g., [11], [32], [16], [24], and knowledge about the probability distribution of node availability in the system, e.g., [4]. We analyze the latency and reliability of AVCOL’s aggregation trees, as well as predicate satisfaction at each node. We also present experimental results from a simulation driven by traces of availability variation from real deployed systems, e.g., the Overnet p2p system.

In our previous work, we have built decentralized protocols that implement global predicates for multicast [26] and membership [23], by leveraging our availability monitoring service [24]. The availability-aware aggregation problem addressed in this current paper is a natural follow-up, and extends the idea of global predicates to the aggregation problem. This problem requires an entirely new set of design techniques.

We present our assumptions and problem statement in Section 1.1 and related work in Section 1.2. The probabilistic aggregation in AVCOL trees is described in Section 2, while Section 3 discusses how trees are constructed in spite of selfish and colluding nodes. Section 4 presents our auditing scheme, and we present experimental results in Section 5. We conclude in Section 6.

We make the following assumptions:

(1) Aggregations occur in rounds, called epochs. Each epoch is uniquely identified by using the sink node’s id and a signed epoch number. Epochs can be initiated either (a) asynchronously by the sink, or (b) periodically, at synchronized times across all nodes (e.g., helped by NTP). We support both these options. Epochs do not overlap, and inter-epoch time intervals are larger than the typical time to finish an aggregation.
(2) Each aggregation epoch is associated with a sink node which desires to calculate the aggregate. We assume henceforth for simplicity that the same sink node is used in all epochs; our algorithms work even when the sink differs across epochs. We will also assume that the sink (i) is always online, i.e., has an availability of 1.0, and (ii) is not selfish and does not collude with any other node. These assumptions are reasonable because we want aggregation anytime, and at a trusted sink node.
(3) The aggregation statistic desired by the sink is partially aggregatable within the network, i.e., the tree is used for in-network aggregation. In other words, akin to [22], [31], [33], we assume that combining two partial aggregates into another aggregate does not increase the size of the message. Some partially aggregatable functions are top-k, bottom-k, max, min, count, sum, and average (aggregated as sum and count).
(4) Each node has a unique id, and can send messages to any other node. In order to bound latency of aggregation, we assume that a message to a correct (alive) node is received within a time bound. We assume each node can sign messages, and signatures can be verified – without this assumption nodes may masquerade as multiple other nodes and launch a Sybil Attack [9].
(5) The number of online nodes, N (a system parameter), is stable, changing only within a small constant factor over a timeframe of weeks. This assumption is justified, even under system churn. For instance, in the Overnet system [4] the online node population size varies by a factor of 2 over a week and by a factor of 3 over a month. Furthermore, [6], [28] show that the Gnutella system size varies within a factor of 2 per day and per month, and [14] shows that in p2p streaming systems the size varies within a factor of 9 per day and per week. Thus, we can set N to an approximate value in this range, and it can be updated infrequently (e.g., once a month) without hurting scalability. The size estimate can be determined in a distributed manner by existing protocols such as [19].
(6) The node availability PDF remains fairly stable across time. Just like N, this has been shown to be stable in several deployed p2p systems [28]. Thus, it can be measured and used as a system-wide parameter that would be updated infrequently (e.g., once a month), without affecting scalability. This measurement can be done by the availability monitoring service.
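Assumption (3) can be illustrated with a fixed-size partial aggregate. The sketch below, with hypothetical names, carries sum, count, min, and max in one record; merging two records yields a record of the same size, and the average is recovered from sum and count at the sink.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Partial:
    """Fixed-size partial aggregate: merging never grows the message."""
    total: float
    count: int
    minimum: float
    maximum: float

    @staticmethod
    def of(value: float) -> "Partial":
        """Partial aggregate for a single node's value."""
        return Partial(value, 1, value, value)

    def merge(self, other: "Partial") -> "Partial":
        """Combine two partial aggregates into one of the same size."""
        return Partial(
            self.total + other.total,
            self.count + other.count,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def average(self) -> float:
        """Average is derived from sum and count, as noted in assumption (3)."""
        return self.total / self.count


def combine(partials):
    """Merge a non-empty sequence of partial aggregates, e.g. at an internal tree node."""
    return reduce(Partial.merge, partials)
```

Because `merge` is associative and commutative for these statistics, an internal tree node can combine its children's records in any order before passing one record to its parent.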

The desired global predicate, that relates a node x’s availability $av(x)$ to the inclusion probability for its data in an aggregate, is denoted by the function $f : [0, 1] \to [0, 1]$. We make two assumptions about f: (1) f is monotonically non-decreasing, i.e., if $av(x) > av(y)$ for two nodes $x, y$, then it is true that $f(av(x)) \geq f(av(y))$; (2) $f(1.0) = 1.0$, i.e., a node that is always online will desire to have its data appear in all aggregates. For instance, this is true at the sink node.

The problem we address is then, informally, as follows: given an arbitrary desired global predicate f, design an aggregation protocol so that for each node x, x’s contributed value(s) appears in a fraction $f (av (x))$ of the global aggregates at the sink, calculated only across epochs during which x is online.
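The phrase "calculated only across epochs during which x is online" determines how predicate satisfaction would be measured. The following sketch assumes a hypothetical log format, one `(online_ids, included_ids)` pair per epoch, and compares each node's observed inclusion fraction against $f(av(x))$; it is a measurement aid, not part of the protocol.

```python
from collections import defaultdict


def inclusion_fractions(epochs):
    """Per-node fraction of epochs in which its value was included.

    `epochs` is an iterable of (online_ids, included_ids) pairs. Epochs in
    which a node was offline do not count towards its denominator.
    """
    online_count = defaultdict(int)
    included_count = defaultdict(int)
    for online, included in epochs:
        for node in online:
            online_count[node] += 1
            if node in included:
                included_count[node] += 1
    return {n: included_count[n] / online_count[n] for n in online_count}


def max_predicate_error(fractions, availability, predicate):
    """Largest gap between observed inclusion fraction and f(av(x))."""
    return max(abs(fractions[n] - predicate(availability[n])) for n in fractions)
```

For example, a node that was online in three epochs and included in one of them has an inclusion fraction of 1/3, however many epochs took place while it was offline.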

We would like to achieve this in an uncollaborative setting where nodes may be selfish and colluding, and may also join, leave, rejoin, and silently fail. A selfish or colluding node attempts to maximize the inclusion probability of its own value in a global aggregate, but without affecting other nodes’ inclusion probabilities, i.e., selfish or colluding nodes are not malicious or Byzantine. In other words, a node deviates from the specified protocol behavior only when the deviation improves the inclusion probability of either itself, or some of its colluders. Thus, selfish nodes may execute local actions, while colluding nodes may use friends, all to increase their own inclusion probabilities, e.g., by double forwarding of their own values. However, nodes never maliciously modify their own values or partial aggregates. We assume an arbitrary number of selfish nodes in the system. We also assume that nodes collude in groups, where all pairs of nodes in the same group collude, with the size of the groups being large.

In order to solve this problem, we leverage two services in a black-box manner: (1) a distributed availability monitoring service [24], and (2) a decentralized probabilistically-shuffled membership protocol [11], [16], [24]. The distributed availability monitoring service keeps track of the availability of nodes, and allows any node’s availability to be queried. The reported availability could be either raw, aged, or window-based (recent). We assume a consistent availability monitoring service, i.e., simultaneous queries (e.g., from multiple nodes) for availability of a given node all return the same value. Our implementation uses the AVMON decentralized monitoring system [24], and our experiments measure the effect of any inconsistencies arising from this use. AVMON’s overhead is fully distributed and our experiments show it is small [24]. We will elaborate on the decentralized shuffled membership protocol in Section 4.2, where it is first used by our design.
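The three reporting modes can be illustrated on a sequence of periodic online/offline observations. This is a sketch under stated assumptions: it is not AVMON's interface, the sample-based trace format is hypothetical, and "aged" is interpreted here as an exponentially weighted average, which the text does not specify.

```python
def raw_availability(samples):
    """Fraction of all observations in which the node was online."""
    return sum(samples) / len(samples) if samples else 0.0


def windowed_availability(samples, window):
    """Availability over the most recent `window` observations only."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = samples[-window:]
    return sum(recent) / len(recent) if recent else 0.0


def aged_availability(samples, alpha=0.5):
    """Exponentially weighted estimate: newer observations count more.

    ASSUMPTION: this weighting scheme is one plausible reading of 'aged'.
    """
    estimate = 0.0
    for online in samples:
        estimate = (1.0 - alpha) * estimate + alpha * (1.0 if online else 0.0)
    return estimate
```

Whichever mode is used, the consistency assumption means that every node querying the service at the same time would obtain the same value for a given node.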

These two leveraged services are themselves resilient to uncollaborative nodes. AVMON reports accurate availabilities in spite of large numbers of uncollaborative nodes. As Section 4.2 shows, we use the membership protocol only for selecting children and parents in the aggregation tree – thus an uncollaborative node cannot increase its own inclusion probability by tampering with the membership.

Centralized aggregation schemes based on user scripts or CoMon-like tools [34] periodically collect large amounts of information (e.g., once every 10 min) from all nodes, maintaining these in a queryable database. Decentralized aggregation schemes scale better by using in-network aggregation. Many of these build aggregation trees either based on domain layout (e.g., Astrolabe [31] or Ganglia [22]), or by using a structured overlay (e.g., SDIMS [33], PIER [15], or [3]), or randomly on demand (e.g., MON [20]), or based on other techniques. Robust aggregation can be done either via gossip [17], [18] or via multiple paths in sensor networks [25].

However, none of the systems above has addressed the effect of selfish or colluding nodes. Similar to many decentralized approaches, AVCOL builds per-aggregation trees. Yet, unlike them, AVCOL innovates in being the first to satisfy explicit availability-based predicates.

Game theoretic techniques have been applied for systems with arbitrary rational nodes [27], yet they are often too complex and bandwidth-consuming for large distributed systems. The BAR model [2] considers Byzantine, altruistic and rational nodes, but has not been applied to the aggregation problem. In addition, BAR allows rational nodes only to be selfish, but not colluding. While AVCOL does not consider Byzantine (malicious) attacks, it does address aggregation under selfish and colluding behavior.

While traditional protocols typically provide a deterministic bound on the number of attackers, e.g., [5], AVCOL tolerates an arbitrary number of selfish nodes. In addition, it provides a probabilistic tolerance to large numbers of colluders in the system. Finally, auditing mechanisms have been used to ensure replica correctness in spite of attacks and bit-rot in LOCKSS [21], as well as for detection of Byzantine behavior in PeerReview [12]. Similar to PeerReview, AVCOL reports selfish and colluding nodes via signed non-repudiable proofs.


Probabilistic aggregation in AVCOL trees

We first describe how AVCOL trees aggregate data in order to satisfy a given global predicate f. While Section 3 will describe how these trees are constructed in order to combat selfish and colluding nodes, the tree aggregation itself is agnostic to such uncollaborative nodes. In other words, the current section assumes no nodes are selfish or colluding.

AVCOL trees are built so that each node x’s tree parent has an availability $\geq av(x)$. In other words, if a node y is a tree parent of a node x,

AVCOL tree construction

In a realistic setting with node churn, and with selfish and colluding nodes, the static AVCOL trees of Section 2 may be ineffective. This is due to many reasons. Firstly, if parent–child relationships are static, then for each node x that is offline during an epoch, all of x’s tree descendants will have their values not included in the global aggregate. This will happen for all epochs when x is offline. Secondly, during any epoch, a node x may send its data to more than one parent, thus

Auditing and discovery

This section describes how nodes carry out per-epoch audit operations (Section 4.1), how nodes discover children or parents (Section 4.2), and periodic audit operations (Section 4.3). The audit operations detect selfish nodes and small collusion groups eventually, while probabilistically preventing large collusion groups from having an impact.

Experimental results

We implemented AVCOL in C, and evaluate it using trace-driven discrete-event simulations. AVCOL is built atop AVMON [24], which provides both the availability monitoring service – itself resistant to selfish and colluding nodes – and the probabilistically-shuffled membership protocol that we require (Sections 3 and 4). In order to measure AVCOL’s performance under real availability traces, we use the churn traces collected by Bhagwan et al. [4] from

Conclusions

We have presented AVCOL, the first probabilistic aggregation system to support availability-based global predicates that explicitly relate each node’s inclusion probability in an aggregate to the node’s availability. AVCOL’s decentralized mechanisms allow arbitrary predicates, and address both selfish nodes and colluding groups of nodes that attempt to increase their inclusion probability. Our analysis of AVCOL shows that it tolerates large numbers of selfish nodes, and large groups of

Ramsés Morales is currently a PhD student in the Computer Science department of the University of Illinois at Urbana-Champaign. He received his MS in Computer Science from the same university in 2005 supported by a Fulbright Fellowship. Research interests include P2P systems, Distributed Protocols with Self-* Behavior, and Grid Computing.

References (34)

D. Kostoulas et al., Active and passive techniques for group size estimation in large-scale and dynamic distributed systems, Elsevier Journal of Systems and Software (2007)
M.L. Massie et al., The ganglia distributed monitoring system: design, implementation, and experience, Parallel Computing (2004)
E. Adar et al., Free riding on Gnutella, First Monday (2000)
A.S. Aiyer, L. Alvisi, A. Clement, M. Dahlin, J.-P. Martin, C. Porth, BAR fault tolerance for cooperative services, in:...
M. Bawa, H. Garcia-Molina, A. Gionis, R. Motwani, Estimating Aggregates on a Peer-to-Peer Network, Technical Report,...
R. Bhagwan, S. Savage, G. Voelker, Understanding availability, in: Proceedings of the IPTPS, February 2003, pp....
M. Castro et al., Practical Byzantine fault tolerance and proactive recovery, ACM Transactions on Computer Systems (2002)
J. Chu, K. Labonte, B. Levine, Availability and locality measurements of peer-to-peer file systems, in: Proceedings of...
C. Cooper et al., The size of the largest strongly connected component of a random digraph with a given degree sequence, Combinatorics, Probability and Computing (2004)
A.J. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, Epidemic algorithms for replicated database maintenance, in...
J.R. Douceur, The sybil attack
P.T. Eugster, R. Guerraoui, S. Handurukande, A.-M. Kermarrec, P. Kouznetsov, Lightweight probabilistic broadcast, in:...
A.J. Ganesh, A.-M. Kermarrec, L. Massoulie, SCAMP: peer-to-peer lightweight membership service for large-scale group...
A. Haeberlen, P. Kouznetsov, P. Druschel, Peerreview: practical accountability for distributed systems, in: Proceedings...
M. Haridasan, I. Jansch-Porto, R. van Renesse, Enforcing fairness in a live-streaming system, in: Proceedings of the...
X. Hei et al., A measurement study of a large-scale p2p iptv system, IEEE Transactions on Multimedia (2007)
R. Huebsch, J.M. Hellerstein, N. Lanham, B.T. Loo, S. Shenker, I. Stoica, Querying the internet with PIER, in:...


Indranil Gupta completed his PhD in Computer Science from Cornell University in 2004. Indranil received the NSF CAREER award in 2005 and the Xerox Award in 2008. He has previously worked in IBM Research and Microsoft Research. He obtained his B.Tech (Computer Science) from the IIT-Madras, in 1998. He is a member of ACM and IEEE.

^☆: This work was supported in part by NSF CAREER Grant CNS-0448246 and in part by NSF ITR Grant CMS-0427089.

^☆☆: This work is previously unpublished.
