Abstract
Applying Distributed Ledger Technologies (DLTs) to securely manage data exchanged between IoT applications has recently been adopted on an enormous scale. DLTs provide data integrity, privacy, and robustness for public, open, permissionless P2P networks. Voting-based consensus algorithms have proven highly efficient even on IoT devices with limited computing and power. Moreover, they can identify legitimate information and isolate malicious attackers through repeated voting queries that ask adjacent peers for their opinions about the validity of each transaction. Several lightweight validation models have been introduced to give IoT networks better performance and higher security. Nevertheless, current algorithms struggle to find parameters that balance network security against operability, and to maintain fairness in distributed environments. This paper introduces an Autonomous Lightweight Ledger Constructor (ALLC) to resolve common defects and threats. Based on Reinforcement Learning, it dynamically constructs a valid distributed ledger in limited-computing systems under several adversarial conditions. The validity of transactions in this approach is calculated from their cumulative weights and the issuer's reputation, which are inferred subjectively by a lightweight Bayesian-like function. A new simulator is developed to evaluate ALLC performance and security. The experimental results demonstrate reasonable performance and high resistance against known compromises targeting Distributed Ledger Technologies.
1 Introduction
IoT networks typically comprise smart sensors and actuators deployed on top of systems to collect and process massive amounts of raw data, supporting intelligent and autonomous decision-making. IoT networks may include constrained devices featuring low computing resources, low processing power, and short battery life. Adversarial manipulation of information or data records may severely affect operations, causing tremendous risks such as system disruption, production shutdown, environmental disaster, and human injury or even death. Therefore, Distributed Ledger Technology (DLT) is used nowadays to manage data securely in various applications. It stores a consistent ledger distributed among all participating peers in different IoT domains, allowing transparency, fault tolerance, and trust. The distributed ledger is locally constructed, validated, and stored by intelligent constructors and shared with all nodes by P2P network protocols. The set of methods and techniques that ensure all participants consistently share the same ledger are generally called consensus algorithms. Consensus algorithms are classified into proof-based and voting-based [1, 2]. Proof-based algorithms allow one participant who exerts some effort to win a competition, or prove certain superiority over the others, to append new transactions to the ledger and receive a reward. The data ledger is normally stored as a linearly linked list of ordered blocks of encapsulated transactions, validated by hash chains between blocks. Voting-based algorithms, on the other hand, allow any participant to issue and broadcast transactions to the network. The data ledger in this paradigm is typically stored in a Directed Acyclic Graph (DAG) format, allowing parallel generation and validation of transactions. Replacing the popular linearly linked blockchain structure with the DAG significantly enhances network throughput and reduces transaction approval time. Yet, the concurrency of issuance exposes the ledger to conflicting transactions and double-spend attacks.
Voting-based consensus algorithms are commonly used in IoT to protect the integrity of the ledger, where the honest majority safeguards environments against malicious minorities based on repeated iterative queries to adjacent neighbors asking their opinions about the validity of each transaction [3, 4]. The lightweight nature of these algorithms allows tiny IoT devices with low computing to interact with the DLT as full nodes that can participate in the consensus independently and store a partial ledger.
Most existing works allow IoT systems to interact with DLT through intermediate clients with sufficient computing and storage capabilities that act as gateways or edge systems, interpreting IoT devices' events and logs into DLT transactions. They participate in the consensus process and data-storing activities on behalf of the IoT devices [5,6,7], as shown in Fig. 1. Another promising setup is shown in Fig. 2, where all IoT devices act as DLT nodes and participate in both consensus and data storing. This setup deploys lightweight software to IoT devices, allowing them to participate in the consensus and store partial ledgers according to their hardware capabilities.
The Directed Acyclic Graph (DAG) [8] is a common data structure employed in voting-based DLT instead of the traditional linear blockchain. According to the Tangle [9], transactions are constructed chronologically and represented as vertices linked to each other by directed edges from posterity to ancestors, as seen in Fig. 3. Each new transaction has to approve (refer to) k previous ones (typically two). The DAG starts with a Genesis, followed by ordered transactions. Unreferenced transactions that have no children or followers are called tips. A Tip Selection Algorithm (TSA) is a typical ledger constructor that decides which tips a new transaction refers to. Once a transaction is attached to the ledger, its validity increases the more it is selected (directly or indirectly) by other transactions. Typically, transaction validity is denoted by its cumulative weight, calculated according to its location in the DAG. Transactions get confirmed as they sink deeper into the data graph while the ledger expands. Referencing previous parts of the DAG enables data certainty, trust, and confidence.
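To make these notions concrete, the following minimal Python sketch (with illustrative names, not any production implementation) models transactions, tips, and cumulative weight as just defined:

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    tx_id: int
    parents: list        # ids of the k (typically two) transactions it approves
    own_weight: int = 1

class Ledger:
    def __init__(self):
        self.txs = {0: Transaction(0, [])}   # id 0 is the Genesis
        self.children = {0: []}              # tx_id -> ids of direct approvers

    def attach(self, tx):
        """Attach a new transaction that approves existing tips."""
        self.txs[tx.tx_id] = tx
        self.children[tx.tx_id] = []
        for p in tx.parents:
            self.children[p].append(tx.tx_id)

    def tips(self):
        """Tips are transactions not yet approved by any other transaction."""
        return [t for t, kids in self.children.items() if not kids]

    def cumulative_weight(self, tx_id):
        """Own weight plus the own weights of all direct and indirect approvers."""
        seen, stack, total = set(), [tx_id], 0
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                total += self.txs[t].own_weight
                stack.extend(self.children[t])
        return total
```

Attaching a transaction that approves two tips removes them from tips() and raises the cumulative weight of every transaction they directly or indirectly approve, which is exactly how confirmations deepen as the ledger grows.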
The concept of parallel validation through P2P votes was first introduced by IOTA to overcome validation issues of traditional hashing-based blockchain such as expensive requirements and resource greediness, which contradicts the constrained and limited capabilities of most IoT devices. It also introduced fee-less transactions and eliminated previously known problems such as miner races, centralizations, miner extractable value, and negative externalities [10]. Moreover, it provides more Transactions per Second (TPS) to cope with industrial IoT delay-sensitive and mission-critical applications, such as power grids, manufacturing assembly lines, and smart cities.
To append a new transaction to the ledger, a tip selector moves across the data tree (DAG) from Genesis until it reaches a tip for the newly issued transaction to refer to. It takes actions by moving left or right in a random-walk style until it selects the best tip. The probability distribution of the random walk depends on the implemented TSA. Early algorithms, such as Uniform Random Tip Selection (URTS), chose a tip entirely at random, but they were highly susceptible to various adversaries [11]. More secure algorithms were consequently developed, such as Random Walk Monte Carlo (RWMC) and Markov Chain Monte Carlo (MCMC) [12]. Both provide a certain tendency towards transactions with higher cumulative weights. The cumulative weight of each transaction in the DAG is calculated based on its location in the tree: it is the sum of the transaction's own weight and the weights of all transactions that directly or indirectly approve it. The degree of tendency towards higher cumulative weight increases the probability of selecting honest tips and therefore increases the cost of manipulating the network. However, deciding the tendency degree for tip selectors is a very rich research topic. Table 1 summarizes the progressive updates of the common TSAs, along with their challenges, contributions, and limitations.
Applying constant weight biasing parameters is an insufficient approach due to the changeability of environmental conditions in terms of the number of participants, transactions’ propagation rates, and the percentage of malicious nodes. This necessitates the need for an intelligent learning algorithm that dynamically adapts its weight biasing parameters according to the varying network conditions.
Reinforcement Learning (RL) has recently been used in voting-based consensus to control such parameters dynamically, resolving efficiency and security issues in validating the distributed ledger [18, 19]. RL algorithms efficiently model uncertain decision-making for complex, dynamic, real-life tasks, where the agent explores the uncertain environment to learn new actions while exploiting its previous knowledge to maximize the cumulative gain; it must balance exploration against exploitation. RL turns tip selection from an objective calculation into a subjective paradigm that depends on the agent's perception of the environment according to historical activities and incidents. To construct a valid ledger in open-access networks, the algorithm iteratively refines its strategy to improve sequential tip selection and avoid adversaries. Like an RL agent interacting with an environment and receiving feedback, the tip selector component of the ledger constructor moves through the DAG to append honest transactions to the ledger while it interacts with network participants and perceives their votes. It may gain a reward or receive a punishment according to the selections it makes. It aims to learn an optimal policy that decides the best sequential tip selections in various environmental states, maximizing the cumulative reward over time.
Q-learning is a basic RL method adopting the Markov Decision Process (MDP) [20], which is recommended for IoT environments because of its simplicity and memory efficiency. Several Q-learning algorithms are now available to achieve the best results by selecting optimal state-action pairs and minimizing future regret [21]. The Multi-arm bandits (MAB) [22] is one of the best Q-Learning models for studying exploration/exploitation trade-offs in sequential decision problems. It is a stochastic problem solver and a subset of MDPs with no transition states, able to resolve many DLT consensus challenges.
Bayesian Inference [23] is a classical probabilistic framework, often used within RL, that dynamically updates the parameters of learning models instead of keeping them constant. It is a way of acquiring common-sense knowledge about situations and making better guesses. Bayesian Data Analysis [24] is potentially the most information-efficient method for fitting a statistical model when probability is used to represent uncertainty in all parts of the model. In classical DLT models, the parameter values are known, and studies focus on how the data varies given those parameters; it was therefore common to plug in the parameter values, run the model many times, and observe how much the data jumps around, in a classical Monte Carlo simulation manner. In the Bayesian model, by contrast, the data is known, and the study focuses on finding reasonable parameter values that could enhance data integrity: it learns the unknown parameter values backward from the known data. Although Bayesian inference is a flexible extension of maximum likelihood, it is the most computationally intensive method. Therefore, lighter Bayesian-like functions are developed instead of the classical function to fit low-resource systems. Figure 4 highlights the difference between traditional inference and Bayesian-like methods in determining the unknowns.
Adopting a constant high tendency parameter towards transactions with high cumulative weights lacks fairness in selecting honest transactions with low cumulative weights. On the other hand, adopting a constant low tendency towards transactions with high cumulative weight increases the susceptibility to various DLT attacks. Therefore, deciding the tendency degree for tip selection under dynamic environmental conditions (in terms of the number of participants, transactions' propagation rates, and the percentage of malicious nodes) is still an open research topic. Reinforcement learning is one of the promising techniques that dynamically adapts its weight tendency parameters according to the varying network conditions.
This paper presents a new Autonomous Lightweight Ledger Constructor (ALLC) based on reinforcement learning that combines liveness, fault tolerance, and protection. It facilitates DAG creation, assigns a validity weight to each transaction, and ensures the integrity and consistency of the distributed ledger through binary voting. It adopts Posterior Sampling (Thompson Sampling) [25] for solving the stochastic multi-armed bandit problem so that it can dynamically decide scaling parameters that balance efficiency and security under environmental changes. It uses a Bayesian-like function that keeps optimizing its parameters until it reaches the optimal tip selection policy through probabilistic processes. Therefore, the transaction-approval speed can scale with the network density, and IoT environments can also resist the Byzantine Generals Problem [26] in decentralized, public, open-access DLT platforms. Moreover, the parallel data structure and machine learning algorithms point DLT towards a rich new area of study, enabling a new generation of AI-based blockchains for various IoT applications. The main contributions of the paper are as follows:
-
An autonomous lightweight distributed ledger constructor is proposed to dynamically construct a valid distributed ledger for limited-computing IoT networks under several adversarial conditions.
-
A lightweight transaction selection algorithm based on the Q-Learning MAB approach is developed. It validates issued transactions using a binary voting mechanism.
-
A new simulator is developed to evaluate ALLC performance and security. It can visualize the resulting DAG tree according to custom input settings.
-
A comparative study against the most popular transaction selection algorithms is conducted in terms of transaction confirmation time and resilience under various attacks.
After the introduction, Sect. 2 introduces adaptive machine learning consensus for IoT environments, surveying the most popular machine learning algorithms. Section 3 discusses the proposed Ledger constructor, while Sect. 4 evaluates the performance and security of ALLC. Finally, Sect. 5 presents the conclusion and future work.
2 Reinforcement learning in DLT
In RL consensus algorithms, the DAG-based ledger is locally constructed on each IoT node, but all nodes consistently build the same ledger due to the intersection between voting node sets [12]. The tip selection and other ledger-construction algorithms should resist malicious and Sybil attacks [27] while appending new transactions. Some previous DLT consensus algorithms favored operability and liveness over protection, while others favored protection and threat defense over operability and liveness. Accepting two or more conflicting transactions violates network protection, while disruption and errors in responding to updates violate operability and liveness. Thus, balancing liveness and protection used to be a trade-off decision [28]. Meanwhile, RL can dynamically adapt the tip selection parameters to balance operability and protection. In RL, a tip selector moves across the DAG iteratively, choosing at each vertex the best transaction according to an inferred distribution. The selection probability is updated each round whenever the network conditions change. After selecting a transaction in a certain round, the selector asks a group of neighbors for their opinions about the selected transaction, producing a reward computed from the empirical observations of the neighbors' votes, which helps decide the selection in the following round. Neighbors' opinions about transactions are always sent to the ledger constructor algorithm, and the process repeats until some terminal state (tip) is reached, marking the end of an episode. The RL tip selector chooses the next tip from a finite set of possible tips \((a_1, a_2, \ldots, a_N)\) at a certain time step \(t\), where \(a_i\) denotes selecting the tip with id \(i\) at time \(t = 1, 2, \ldots, N\). Every action \(A_t\) at time step \(t\) produces a reward \(R_t\). The expected (mean) reward of the selection action is referred to as \(v_i\) and is the value of the selection action \(a_i\), defined by:

$$v_i = \mathbb{E}\left[ R_t \mid A_t = a_i \right]$$
The tip selector chooses transactions while it moves across the DAG by sampling from that distribution, with the goal of determining a policy \(\pi(a \mid s)\), a state-dependent distribution over actions. The trajectory of states, actions, and rewards through the MDP is represented as \(S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, \ldots, R_T, S_T\), where \(T\) marks the selection of a tip (a leaf of the tree) at which the iteration terminates. The selected tip is subsequently referred to by a newly issued transaction. The episode ends upon finding the sequence of transactions that maximizes the sum of future rewards, averaged over some time or over many trajectories. Hence, the main objective is to maximize \(R_{\Sigma N} = R_1 + R_2 + R_3 + \ldots + R_N\) or, equivalently, its average \(R_{\Sigma N}/N\). Generally, the reward is a calculated expected value for selecting transactions from the probability distribution, where the action value for selecting transaction \(i\) at time step \(k\) is:

$$Q_{i,k} = \frac{1}{n_{i,k}} \sum_{\substack{j < k \\ A_j = a_i}} R_{i,j} \qquad (1)$$
\(R_{i,j}\) is the reward of action \(a_i\) at time step \(j\), while \(n_{i,k}\) is the number of times action \(a_i\) is selected before step \(k\). The prior distribution is then updated by learning from observed votes through Generalized Policy Iteration (GPI) [29]. For the first round, where the model has not been learned yet, GPI starts with an arbitrarily initialized policy \(\pi_0\) and applies policy evaluation to determine its value function \(V_{\pi_0}\). Next, the policy is slightly improved and evaluated to get the new value function \(V_{\pi_1}\). The policy improvement process repeats to yield an optimal policy \(\pi^*\) and optimal value function \(v^*\) for an entire class of algorithms or an optimal class of parameters.
Several DLT models adopted the classical MAB for selecting tips while building the distributed ledger. MAB is generally illustrated by a player trying to select an arm of a gambling slot machine to gain an associated unknown reward. The reward is declared only after the action is taken, so the player has to take the sequence of actions that achieves the highest rewards. Greedy, Epsilon-Greedy, UCB, and Thompson Sampling are popular MAB algorithms that dynamically adapt their parameters during transaction approval. The greedy approach is a pure exploitation scheme that selects the actions with the highest reward from a sampled space. Epsilon-Greedy tackles the exploration-exploitation trade-off by applying some degree of tendency during the selection. The Upper Confidence Bound (UCB) is an easy-to-implement algorithm with rich theoretical analysis of its regret upper bound. Aligning with the philosophy of optimism in the face of uncertainty, its learning mechanism maintains tight overestimates of the expected rewards for individual actions and picks the action with the highest optimistic estimate at any given step. Thompson Sampling (posterior sampling) samples the probability of the reward for the validity of each transaction by iteratively constructing posterior distributions based on the collected opinions. It follows the Bayesian posterior distribution for the expected reward of every transaction. At any given step, an independent sample from each posterior is generated, and the transaction with the highest value of sampled positive votes is taken. The major MAB algorithms that autonomously construct the ledger by adapting the constructor parameters are highlighted below:
2.1 Greedy approach
Greedy algorithms [30] choose transactions with the highest expected reward value. Calculating the total reward for a certain action \(a_i\) at time step \(k\) requires storing all rewards before time \(k\) to compute the expected value, which becomes computationally unfeasible in IoT applications since the number of stored rewards grows too large as the time steps progress. Therefore, a recursive update is applied for the averages instead of equation (1). Writing \(Q_k\) for the running mean of an action's rewards, the new equation can be derived as follows:

$$Q_{k+1} = \frac{1}{k}\sum_{j=1}^{k} R_j = \frac{1}{k}\Big(R_k + (k-1)\,Q_k\Big)$$

Consequently, we obtain the following equation for the update of the average:

$$Q_{k+1} = Q_k + \frac{1}{k}\big(R_k - Q_k\big) \qquad (2)$$

Function (2) memorizes only the previous mean to obtain the mean estimate at time step \(k\). Thus, the decision for selecting a particular action \(a_i\) at time step \(k\) depends on the following criterion:

$$A_k = \underset{i}{\arg\max}\; Q_{i,k} \qquad (3)$$
The greedy algorithm is the best lightweight approach for resolving DLT performance issues, such as the speed of confirming transactions and allowing IoT devices with low computing power to participate in the consensus. Nevertheless, it cannot defend the network against known attacks because adversaries can predict its behavior.
2.2 Epsilon greedy
To mitigate the security risks of the greedy exploitation method, the Epsilon-Greedy action-value method applies some probability \(\epsilon\) of deliberately diverging from the optimized value in function (3), as per the following conditions (a minimal sketch follows the list):
-
Select a real number \(\epsilon\) where \(0< \epsilon < 1\).
-
Draw a random value \(p\) from the uniform distribution on the interval [0, 1].
-
If \(p > \epsilon\), select the action that maximizes function (3).
-
If \(p \le \epsilon\), select an action \(a_i\) uniformly at random from the set of all possible actions \((a_1, a_2, \ldots, a_N)\).
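The following short Python sketch illustrates the \(\epsilon\)-greedy rule above together with the incremental mean update of function (2); the function names and the default \(\epsilon\) are illustrative assumptions, not part of any existing implementation:

```python
import random

def epsilon_greedy_select(q_values, epsilon=0.05):
    """Pick a tip index: exploit the highest mean reward with
    probability 1 - epsilon, otherwise explore uniformly."""
    p = random.random()                        # uniform draw on [0, 1)
    if p > epsilon:                            # exploit: criterion (3)
        return max(range(len(q_values)), key=lambda i: q_values[i])
    return random.randrange(len(q_values))     # explore: random action

def update_mean(q_i, n_i, reward):
    """Incremental mean of function (2): memorizes only the previous mean."""
    n_i += 1
    q_i += (reward - q_i) / n_i
    return q_i, n_i
```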
MCMC is the most popular \(\epsilon\)-greedy TSA variant. It moves across the DAG ledger from the Genesis downward towards the leaves of the DAG tree in order to select two or more tips for each new transaction. Tip selection in MCMC depends on a weight-biasing random parameter \(\alpha\), bounded by a constant \(\Delta\) that limits the upper and lower biasing values. The probability of selecting a child \(u\) of transaction \(v\), where \(u \in C(v)\), \(H\) denotes cumulative weight, and \(\alpha\) is the weight-biasing parameter, equals [12]:

$$P(v \rightarrow u) = \frac{e^{-\alpha (H_v - H_u)}}{\sum_{w \in C(v)} e^{-\alpha (H_v - H_w)}}$$
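Reusing the illustrative Ledger class sketched in the introduction, a minimal version of the biased walk might look as follows (a sketch under those assumptions, not the reference MCMC implementation):

```python
import math
import random

def mcmc_walk(ledger, alpha=0.5, start=0):
    """Biased random walk from Genesis toward a tip: at each vertex v,
    step to child u with probability proportional to
    exp(-alpha * (H_v - H_u)), where H is the cumulative weight."""
    v = start
    while ledger.children[v]:                  # stop once v is a tip
        kids = ledger.children[v]
        h_v = ledger.cumulative_weight(v)
        weights = [math.exp(-alpha * (h_v - ledger.cumulative_weight(u)))
                   for u in kids]
        v = random.choices(kids, weights=weights)[0]
    return v
```

With \(\alpha = 0\) the walk degenerates to a uniform random walk, while larger \(\alpha\) biases it ever more strongly toward heavy subgraphs, which is exactly the security/fairness dial discussed above.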
Although Epsilon-Greedy methods enabled a significant protection level, they lacked fairness in selecting some honest transactions, causing the "permanent tips" problem [14]. The selection probability of tips with low-weight parents becomes very small, since such tips have low cumulative weights.
2.3 Upper confidence bound
UCB assigns a discrete weight value to a transaction that equals the mean value of all previous votes plus a high-probability estimate of the error [31, 32]. The empirical mean for transaction \(i\) at time \(t\) is \(\hat{\mu}_{i,t} = \frac{\sum_{s=1}^{t} \mathbb{1}\{I_s = i\}\, r_s}{n_{i,t}}\), i.e., the mean of all previous votes about that transaction. The upper confidence bound \(UCB_{i,t} = \hat{\mu}_{i,t} + \sqrt{\frac{4 \ln t}{n_{i,t}}}\) is constructed by adding a high-probability bound on the error to the empirical mean. The added error term means that the UCB lies above the true mean with high probability, and the algorithm always selects the transaction with the highest upper confidence bound. Therefore, if a transaction has not been chosen sufficiently often (small \(n_{i,t}\)), the algorithm boosts its probability of selection even if its empirical mean is small. This benefit of the doubt, implemented in UCB methods by increasing the selection probability of transactions about which less is known, helped resolve the "permanent tips" problem and allowed higher fairness and more decentralization. However, UCB methods are not recommended in delayed-feedback environments, since they may get stuck in an early bad decision while results are delayed.
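A minimal sketch of this selection rule, using the \(\sqrt{4\ln t / n_{i,t}}\) bonus quoted above (function names are illustrative):

```python
import math

def ucb_score(mean_i, n_i, t):
    """Empirical mean plus a high-probability error bound; rarely
    selected transactions (small n_i) receive a large bonus."""
    return mean_i + math.sqrt(4 * math.log(t) / n_i)

def ucb_select(means, counts, t):
    """Select the transaction with the highest upper confidence bound.
    Transactions never selected (count 0) are tried first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(range(len(means)),
               key=lambda i: ucb_score(means[i], counts[i], t))
```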
2.4 Thompson sampling algorithm
TS is a classical heuristic model [25] that incentivizes exploration to solve stochastic bandit problems. It can tractably handle complex real-life problems through prior distributions, sampling current actions according to the previously estimated optimal probability. It has proven computationally efficient in IoT applications [22, 33], as well as performant in MDP environments [34], where an agent derives the posterior distribution of unknown information from an assumed prior distribution and observed information [35]. It achieves better results in delayed-feedback environments than other methods and is unlikely to get stuck in an early bad decision. It also allows a cumulative weight design based on the Bernoulli distribution [36], enabling reputation-based DLT techniques. O-TS and O-TS+ are variants of Thompson Sampling: the O-TS algorithm is more optimistic and was first empirically evaluated in [37], while O-TS+ is a more optimistic randomized version of UCB. TS has two common forms: Gaussian and Beta. Gaussian samples over \((-\infty, \infty)\), while Beta works on the interval (0, 1). In the Gaussian prior, the mean \(\mu\) and the standard deviation \(\sigma\) play a crucial role in determining the shape and spread of the distribution; the higher the standard deviation, the flatter the prior in Bayesian statistics. The Beta prior, on the other hand, depends on a function of two variables, \(\alpha\) and \(\beta\), that control the shape of the distribution. The probability density function (PDF) of a beta random variable can take many different forms depending on the values of the two parameters.
-
Gaussian: \(f(x)= \frac{1}{\sigma \sqrt{2\pi }}e^{-\frac{1}{2}(\frac{x-\mu }{\sigma })^{2}}\)
-
Beta: \(f(x;\alpha ,\beta )=\frac{1}{B(\alpha ,\beta )}x^{\alpha -1}(1-x)^{\beta -1}\)
In the classical bandit (Gaussian distribution), the distribution is parameterized by the unknown quantity, which could be the mean \(\mu\) of the reward distribution of every arm. In the Beta distribution model, the Beta function returns a random number in the range (0, 1) drawn from a beta distribution. The PDF of a beta random variable can take many different forms depending on the values of the two parameters \(\alpha\) and \(\beta\) (some authors use p and q); these shape parameters appear as exponents of the variable \(x\) and its reflection \((1-x)\). The biasing parameters \(\alpha\) and \(\beta\) converge to the best estimated values as long as the network density is fixed; otherwise, they keep updating as the network conditions vary. The term \(\frac{1}{B(\alpha ,\beta )}\) is a normalization constant that ensures the total probability is 1.
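As an illustration, the Beta-form posterior update and selection can be sketched in a few lines of Python using the standard library's random.betavariate; the vote encoding (1 = "like", 0 = "dislike") follows the paper's convention, while the function names are illustrative:

```python
import random

def thompson_select(likes, dislikes):
    """Sample theta_i ~ Beta(likes_i + 1, dislikes_i + 1) for every
    transaction and select the one with the highest sampled value."""
    samples = [random.betavariate(l + 1, d + 1)
               for l, d in zip(likes, dislikes)]
    return max(range(len(samples)), key=lambda i: samples[i])

def update_counts(likes, dislikes, tx, votes):
    """Fold a quorum's binary votes into the posterior counts."""
    likes[tx] += sum(votes)                    # votes of 1 = "like"
    dislikes[tx] += len(votes) - sum(votes)    # votes of 0 = "dislike"
```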
3 Autonomous lightweight ledger constructor
ALLC adopts Thompson Sampling with a Beta prior for solving the stochastic multi-armed bandit problem, so it can dynamically decide the scaling parameters that balance operability and security under environmental changes. It uses a Bayesian-like function that keeps optimizing its parameters until it reaches the optimal tip selection policy through probabilistic processes. ALLC chooses a sequence of honest transactions from the set of all issued transactions to maximize network safety and minimize potential risk. The Beta distribution function determines an optimal selection policy that dynamically updates the weight-biasing parameters, featuring strong empirical performance and high computational efficiency. The selection policy improves as the function learns from participants' votes and environmental conditions.
3.1 ALLC model
The introduced model is suitable for representing the random behavior of the percentage of "likes" and the proportion of voting nodes needed to build a DAG-based ledger. Figure 5 portrays a high-level flowchart for constructing the DAG ledger. It starts by selecting a sequence of transactions from a set of new transactions (left block). Through algorithmic iterations, a new list of N transactions is generated (the approved transactions list). Transactions are then ordered by their weights (top-right block); a transaction's weight is calculated based on how often it is selected. The DAG ledger is created by ordering transactions according to their cumulative weights (bottom-right block). ALLC decouples the process of assigning transaction weights from the process of creating the ledger. After several approval rounds, the group of confirmed transactions with the highest cumulative weights is appended to the DAG. Hence, the probability of permanent and lazy tips decreases significantly.
3.2 ALLC algorithm
Algorithm 1 depicts a complete algorithmic ALLC episode. An episode comprises c rounds. Confirmation cycles occur at the final round of each episode, where the highest-weighted transactions are confirmed and marked as immutable. During non-confirmation rounds, new transactions are appended to the "input transactions list" to be processed by the selector. The selector iteratively moves across the whole list of input transactions and appends honest transactions to the "approved transactions list". A transaction may be appended multiple times, once for each voting round in which it is selected. During each voting round t, the ledger constructor asks N neighbors for their opinion about each transaction d. Different groups of neighbors are queried in each round to ensure more trust, higher validation, and data consistency. Transaction selection takes place in each round based on the posterior probability distribution. Posterior probabilities \(\theta^*\) are updated given the past observations of empirical neighbors' votes about each transaction. In the first round, the constructor forecasts a prior distribution since it has not received any observations yet. Commonly, it starts with equal probability for all transactions, generating an almost flat line, or assigning as small a bias as possible. There might be some inaccuracy initially, but after more rounds the selection parameters are updated and the selection accuracy increases. Hence, the distribution density curve gets tighter around the most likely value, and the standard deviation gets smaller. The likelihood measures the satisfaction distribution for selecting a particular transaction. Transaction weights are counted based on how often each transaction is selected. During confirmation rounds, the algorithm moves the confirmed transactions that satisfy the required confirmation conditions from the "approved list" to the "confirmed list".
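The following condensed Python sketch mirrors the episode logic just described. It abstracts quorum querying behind a callback and selects one transaction per round rather than iterating the full input list; all names and defaults are illustrative assumptions, not the actual Algorithm 1:

```python
import random

def allc_episode(transactions, query_quorum, rounds=10, quorum=50, confirm=50):
    """One ALLC episode: Thompson-sampled selection over `rounds`
    voting rounds, then confirmation of the heaviest transactions."""
    likes = {tx: 1 for tx in transactions}      # Beta prior: alpha = 1
    dislikes = {tx: 1 for tx in transactions}   # Beta prior: beta = 1
    weight = {tx: 0 for tx in transactions}     # selection counter
    approved = []
    for _ in range(rounds):
        for tx in transactions:
            votes = query_quorum(tx, quorum)    # binary votes from neighbors
            likes[tx] += sum(votes)
            dislikes[tx] += len(votes) - sum(votes)
        # sample the posterior and append the most likely honest transaction
        sampled = {tx: random.betavariate(likes[tx], dislikes[tx])
                   for tx in transactions}
        best = max(sampled, key=sampled.get)
        weight[best] += 1
        approved.append(best)
    # confirmation round: the highest-weighted transactions become immutable
    confirmed = sorted(weight, key=weight.get, reverse=True)[:confirm]
    return approved, confirmed
```

In practice, query_quorum would draw a fresh random subgroup of neighbors per round, which is what lets different nodes converge to the same ledger through quorum intersections.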
For better illustration, Fig. 6 depicts one episode of consecutive rounds representing the voting replies of N neighbors validating d transactions. The rows denote the list of voting peers, and the columns denote the list of transactions. Each cell represents a specific peer's opinion about a certain transaction, where 1 denotes "like" and 0 denotes "dislike". During every round, each validating peer receives a new table presenting the neighbors' votes about transactions. The local algorithm iterates over all nodes to check the validity of each transaction. The voting replies help the constructor sample a probability distribution and a mean acceptance value for each transaction, based on the total votes received from different voting groups over several preceding rounds. The more a transaction is selected, the higher the cumulative weight it acquires. The algorithm confirms the transactions holding the highest weights at the last round of an episode (after c rounds).
The episodic validation is processed locally on each node by querying a set of 50 randomly selected neighbors during each round. Dividing the network peers into random subgroups (quorums) reduces the network complexity and enables the ledger's consistency through the intersections between quorums.
The validation process proceeds in five main steps:
-
Initial Opinion: In the first round of every episode, each node generates a local Bayesian prior with random values (zero or one) denoting an initial opinion about each transaction, effectively forecasting N neighbors' opinions. Each transaction is assigned two counters, \(N_i^{1}(n)\) and \(N_i^{0}(n)\): the first counts the number of "likes" and the second the number of "dislikes" a transaction receives up to round K.
-
Iterative queries: in each consecutive round, the autonomous constructor queries K neighbors (the quorum size) for their opinions about each transaction. Then, it applies the Thompson Sampling function to update the posterior:
$$\theta_i(n) \sim \mathrm{Beta}\big(N_i^{1}(n) + 1,\; N_i^{0}(n) + 1\big)$$
The hyper-parameters \(\alpha\) and \(\beta\) are determined by the number of observed "likes" and "dislikes" during each round. Hence, the selection probability of a transaction changes each round according to the empirical observations of \(\alpha\) and \(\beta\). Figure 7 displays selection probability distributions for varying values of \(\alpha\) and \(\beta\). Each node selects a different set of K random neighbors for the same transactions across rounds until some number of episodes has passed.
-
Transaction Selection: The tip selector selects the best transaction, the one with the highest \(\theta_i(n)\) (most likely honest), from the sampled Bayesian posterior.
-
Weight update: A counter tracks the number of selections for each transaction. The more a transaction is selected, the higher its cumulative weight. Older honest transactions in the DAG typically have higher cumulative weights.
-
Confirmation: A transaction is confirmed at the end of each episode, after c rounds, once it has been awarded the required cumulative weight. At this point, a node stops asking neighbors' opinions about that transaction.
The flowchart in Fig. 8 summarizes ALLC's overall validation process for new transactions. When a group of transactions X is received by a group of network peers, the transactions undergo voting iterations I over time steps T, applying parameters \(\Theta\). Neighbors' votes are queried at the end of each round and treated as the observed reward, which is subsequently used to update the posterior distribution. In the first round (i = 1) the forecasted prior is used by the algorithm, while the posterior distribution is used for all consecutive rounds. The posterior and the biasing parameter \(\theta_i\) are updated at the end of every round based on the incoming votes. Once a transaction x is selected during round i, it is appended to the approved transactions list; the more a transaction is selected, the higher the cumulative weight it gets. The approved transactions list extends as transactions are added. After c rounds, marking the end of the current episode, approved transactions are counted based on how often each transaction \(x_i\) was selected, and the highest-weighted transactions are confirmed. The number of confirmed transactions should equal the rate at which new transactions enter the system.
4 ALLC evaluation
A newly developed simulator is introduced to measure the constructor's performance and security. It configures a P2P network of N nodes that validate d transactions in different adversarial environments. When a new transaction \(d_i\) is broadcast by a node to the P2P network, it is automatically added to the "new transactions list" of all participants. Each participating node considers a new non-conflicting transaction "honest", tagging it with "like", and considers a transaction that conflicts with a previous one "malicious", tagging it with "dislike". The algorithm runs consecutive episodes to confirm propagated transactions. During each episode, a group of peers contributes to validating transactions. Each transaction is selected according to the inferred probability of how many "likes" it would get during each round. The more often the same transaction is selected in consecutive rounds, the higher its cumulative weight and the more validity it gets. Each episode runs c rounds, while each round runs d\(\times \)N iterations: the construction algorithm queries N peers for their opinions to approve d messages. The simulator tracks the weight of each message to decide whether it is confirmed, approved, or just a tip. A message is confirmed after multiple episodes, and confirmed transactions can hardly be updated or deleted afterward.

Thompson Sampling's total earnings can measure the network's overall health: the more iterations, the more accurate the selection and the higher the earnings. The algorithm continues validating new transactions and moves confirmed transactions from the validation list to the confirmed list. The number of unapproved transactions should remain stationary, with transactions confirmed at a rate equal to the propagation rate. The ALLC simulator provides a web visualization tool displaying the ledger graph for better analysis of how transactions are linked; it also colors each message according to its validity state. Each episode generates d\(\times \)N/k levels of the directed ledger graph, where k is the number of transactions selected during each round. A transaction's cumulative weight is counted by the height of its vertex in the tree, not by how many edges the vertex has (its degree). Four types of transactions are represented in the network graph: Genesis, confirmed, approved, and tips. The algorithm links all newly approved tips to the current ledger and updates all weights after each episode. Several testing scenarios are simulated to evaluate the main features of ALLC.
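As an illustration only, a scenario of this kind could be parameterized as below; the field names are hypothetical and do not reflect the simulator's actual interface:

```python
# Hypothetical configuration mirroring the simulator parameters described
# above; the keys are illustrative, not the simulator's real API.
scenario = {
    "nodes": 10_000,           # N participating peers
    "transactions": 1_000,     # d messages to validate
    "quorum_size": 50,         # neighbors queried per round
    "rounds_per_episode": 10,  # c: confirmation occurs at the final round
    "propagation_rate": 5,     # new transactions entering per round
    "malicious_ratio": 0.0,    # raised to 0.2 in the adversarial scenarios
    "prior": "uniform",        # all-likes | all-dislikes | uniform
}
```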
4.1 Scenario#1
To determine the best priors, a P2P network of 10,000 nodes is configured to approve 1000 transactions in an honest environment. The simulator propagates the transactions and then starts the ALLC algorithm to build the data tree. The scenario is launched three times with three different forecasted first-round priors: all "likes", all "dislikes", and uniformly random priors (likes and dislikes). The mean approval time is measured for the three tests and compared to Greedy and \(\epsilon\)-Greedy.
Figures 9A, 9B, and 9C display transaction weight histograms of the all-ones, all-zeros, and uniformly random priors, respectively. They show that ALLC efficiently assigns appropriate weights to all transactions. The three runs ended with a minimum of \(80\%\) cumulative weight for all transactions in a safe environment. The overall results indicate satisfactory algorithmic validation and termination outcomes irrespective of the applied prior. A separate test evaluated which \(\epsilon\) value achieves the highest system health (total cumulative reward) using the \(\epsilon\)-Greedy algorithm. Figure 10 shows the impact of different \(\epsilon\) values on the system's overall health. Apart from \(\epsilon = 0\) (the greedy method), the lower the \(\epsilon\) value, the better the system's health, especially in large-scale networks with more participants and more validation rounds. The weight-biasing parameter \(\epsilon = 0.05\) achieved the best system health in \(\epsilon\)-Greedy. Figure 11 shows the mean approval time for ALLC with the three priors, MCMC, \(\epsilon\)-Greedy, and Greedy. The greedy algorithm had the fastest mean approval time (3.274 s), and \(\epsilon\)-Greedy with the 0.05 weight-biasing parameter completed in 5.533 s, while MCMC completed in 17.342 s. ALLC approved all transactions in 14.6879 s with the all-zeros prior, 14.49256 s with the all-ones prior, and 13.9108 s with the uniformly random prior. The average ALLC message approval time in honest environments is slower than the simple forms of Greedy and \(\epsilon\)-Greedy at different propagation rates, but faster than MCMC. No tips are left behind, while malicious and lazy transactions acquire the lowest weights. The higher the transaction propagation rate, the shorter the time a transaction remains a tip.
4.2 Scenario#2
To evaluate ALLC's resilience and its adaptation to adversarial conditions, five popular attacks [38] were independently launched. The first two are cautious attacks, which maintain the same opinion across all consecutive rounds; the other three are berserk attacks, which may respond with different opinions to different queries. A P2P network of 1000 nodes approving 1000 transactions is used. The prior distribution P0 is set to 0.95 for "likes" and 0.05 for "dislikes" during all simulation sets. \(20\%\) of peers are configured to act maliciously, performing the attack strategies listed in Table 2. The agreement rate and the average approval time are measured during all attacks. The agreement rate denotes the percentage of participating peers who reach the same ledger state at the end of the algorithm during one episode.
Figure 12 displays the agreement rate under the five attack strategies. ALLC approved all honest transactions, achieving the best agreement rate for all five attacks. The environmentally adaptive parameters allowed ALLC to exceed a \(98\%\) agreement rate in almost all testing scenarios. Figure 13 shows the mean approval time during the attacks. The approval process completed in about 21.139 s during attack 1 with an overall algorithm health of 6503; approximately 20.73 s during attack 2 with a total health of 5022; about 20 s during attack 3 with a health of 8054; about 20.7 s during attack 4 with a health of 7340; and about 20.97 s during attack 5 with a resulting health of 7291. The overall results indicate adaptive resistance against all types of attacks with reasonable performance.
4.3 Scenario#3
Continuous approval of all transactions is measured through several episodes. In this scenario, the overall performance of ALLC is evaluated by repeating several episodes. Whenever transactions fulfill the required validation requirements, they are ordered with respect to their gained weights.
Figure 14 displays a histogram of transaction confirmations over time in a simple illustrative simulation. Launching multiple episodes shows that transactions' cumulative weights increase as episodes repeat. Older transactions' cumulative weights increase as the ledger grows; weights rise as transaction issuance increases and new transactions are appended to the ledger. Figure 15 displays the input transaction rate. It shows the count of processed transactions in a simulated example with a propagation rate of 5 transactions per round and an episode length of 10 rounds; thus, the episodic algorithm confirms the best transactions every 10 rounds. To initialize the DLT, the first confirmation takes place after double the episodic rounds, so the first confirming round occurs at round 20, where 50 transactions are confirmed. Fifty transactions are then confirmed at the final round of each consecutive episode.
Figure 16 shows the transaction confirmation rate for the episodic scenario. Confirmed transactions increase in step with the issuance rate as the algorithmic rounds proceed. Keeping the input transactions stationary within each episode enables ALLC to achieve a high degree of scalability and reliability.
4.4 Comparing ALLC to recent RL-based DLT Approaches
Table 3 compares recent RL approaches applied in voting-based DLT. Adopting RL allows DLTs to dynamically update various parameters to optimize scalability, performance, active algorithms, security, and/or QoS. Voting-based consensus applies election to either a transaction or a leader node: in transaction election, participating nodes vote on the validity of each transaction, while in leader election, participants vote to elect a leader that is granted permission to update the ledger by appending new transactions. The distributed ledger is stored in either blockchain or DAG structures. Although most existing RL studies for voting-based DLT argue that their models can resist the Byzantine Generals Problem and Sybil attacks, none of them measured ledger consistency or network resilience under different attack scenarios. In this paper, the agreement rate denotes the percentage of nodes that achieve consensus and construct the same ledger. Resilience denotes the protection of the ledger against attacks over a period; it can thus be measured by the mean confirmation time under different attack scenarios. The mean confirmation time estimates how long the selection algorithm takes to reach consensus while adversarial data manipulation is underway.
5 Conclusion
Adopting voting-based consensus algorithms to secure industrial IoT applications is growing rapidly. An efficient DLT consensus has to confirm as many transactions as fast as possible while suppressing the formation of malicious transactions, with low consumption of power and computing resources. This paper proposes the Autonomous Lightweight Ledger Constructor (ALLC), which resolves performance issues while maintaining network security. It approves transactions based on a reinforcement learning model that adopts the Multi-Armed Bandit approach to produce an optimal policy for assigning validity weights to IoT-transmitted messages. Transactions in this model are approved subjectively according to a probability distribution inferred from several rounds of empirical votes about each transaction. The level of uncertainty in delayed-feedback models plays a vital role in preserving the integrity of information and protecting the network from behavioral spikes and observed anomalies initiated by malicious actors. ALLC keeps the balance between security and operability by adjusting the weight-biasing parameters as the ledger grows. In addition, RL enables the ALLC model to decouple the process of assigning transaction weights from building the ledger, which eliminates the permanent-tips and lazy-tips problems.
A new simulator that visualizes the ledger in a web format for further analysis is introduced to test the performance of the autonomous constructor. The comparative study shows that ALLC achieves fairness with satisfactory approval time, and the transaction selection algorithm achieves faster confirmation times under several attack scenarios.
In the future, more reinforcement-learning techniques should be investigated to enhance DLT dynamics for optimizing data storage. A new method should also be introduced to autonomously optimize the appropriate rounds per episode. More adversarial acts and malicious strategies may be simulated to evaluate the robustness of the episodic confirmation model.
Data availability
No datasets were generated or analysed during the current study.
References
Tiwari PK, Agarwal N, Ansari S, Asif M (2024) Comprehensive analysis of blockchain algorithms. EAI Endor Trans Int Things 10:102
Abo-Soliman MA, Al-Qutt MM, Emara K et al (2023) Selection criteria for industrial distributed ledger technology. Int J Intell Comput Inf Sci 23(4):1–18
Rawlins C, Jagannathan S, Nadendla VSS (2023) A reputation system for provably-robust decision-making in iot blockchain networks. IEEE Int Things J 8:23
Verma R, Chandra S (2023) Repute: A soft voting ensemble learning framework for reputation-based attack detection in fog-iot milieu. Eng Appl Artif Intell 118:105670
Kumar R, Kumar P, Aloqaily M, Aljuhani A (2022) Deep-learning-based blockchain for secure zero touch networks. IEEE Commun Magaz 61(2):96–102
Ren Y, Leng Y, Qi J, Sharma PK, Wang J, Almakhadmeh Z, Tolba A (2021) Multiple cloud storage mechanism based on blockchain in smart homes. Future Generat Comput Syst 115:304–313
Abo-Soliman MA, Shabaan E, Al-Qutt M, Emara K (2023) Edge computing and distributed ledger technology for the industrial iot. In: 2023 eleventh international conference on intelligent computing and information systems (ICICIS), pp. 247–252
Kotilevets I, Ivanova I, Romanov I, Magomedov S, Nikonov V, Pavelev S (2018) Implementation of directed acyclic graph in blockchain network to improve security and speed of transactions. IFAC-PapersOnLine 51(30):693–696
Müller S, Penzkofer A, Polyanskii N, Theis J, Sanders W, Moog H (2022) Tangle 2.0 leaderless nakamoto consensus on the heaviest dag. arXiv preprint arXiv:2205.02177
Popov S, Moog H, Camargo D, Capossele A, Dimitrov V, Gal A, Greve A, Kusmierz B, Mueller S, Penzkofer A (2020) The Coordicide, pp 1–30
Bu G, Hana W, Potop-Butucaru M (2020) E-iota: an efficient and fast metamorphism for iota. In: 2020 2nd conference on blockchain research and applications for innovative networks and services (BRAINS), pp. 9–16. IEEE
Ferraro P, King C, Shorten R (2019) On the stability of unverified transactions in a dag-based distributed ledger. IEEE Trans Autom Control 65(9):3772–3783
Kusmierz B, Sanders W, Penzkofer A, Capossele A, Gal A (2019) Properties of the tangle for uniform random and random walk tip selection. In: 2019 IEEE International Conference on Blockchain (Blockchain), pp. 228–236. IEEE
Kusmierz B, Gal A (2020) Probability of being left behind and probability of becoming permanent tip in the Tangle v0.2
Bu G, Gürcan Ö, Potop-Butucaru M (2019) G-iota: Fair and confidence aware tangle. In: IEEE INFOCOM 2019-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 644–649. IEEE
Rochman S, Istiyanto JE, Dharmawan A, Handika V, Purnama SR (2023) Optimization of tips selection on the iota tangle for securing blockchain-based iot transactions. Proc Comput Sci 216:230–236
Chen Y, Wang Y, Sun B, Liu J (2023) Addressing the transaction validation issue in iota tangle: a tip selection algorithm based on time division. Mathematics 11(19):4116
Liu M, Yu FR, Teng Y, Leung VC, Song M (2019) Performance optimization for blockchain-enabled industrial internet of things (iiot) systems: a deep reinforcement learning approach. IEEE Trans Ind Inf 15(6):3559–3570
Jameel F, Javaid U, Khan WU, Aman MN, Pervaiz H, Jäntti R (2020) Reinforcement learning in blockchain-enabled iiot networks: a survey of recent advances and open challenges. Sustainability 12(12):5161
Privault N (2013) Understanding markov chains. Examp Appl Publ 357:358
Agrawal S, Goyal N (2017) Near-optimal regret bounds for thompson sampling. J ACM (JACM) 64(5):1–24
Mahajan A, Teneketzis D (2008) Multi-armed bandit problems, pp. 121–151. Springer
Dempster AP (1968) A generalization of Bayesian inference. J Royal Stat Soc Ser B (Methodolog) 30(2):205–232
Kruschke JK (2010) Bayesian data analysis. Wiley Interdisc Rev Cognit Sci 1(5):658–676
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3–4):285–294
Popov S, Buchanan WJ (2021) Fpc-bi: Fast probabilistic consensus within byzantine infrastructures. J Parallel Distrib Comput 147:77–86
Sengupta J, Ruj S, Bit SD (2020) A comprehensive survey on attacks, security issues and blockchain solutions for iot and iiot. J Netw Comput Appl 149:102481
Borowsky E, Gafni E (1993) Generalized flp impossibility result for t-resilient asynchronous computations. In: proceedings of the twenty-fifth annual ACM symposium on theory of computing, pp. 91–100
Zanette A, Lazaric A, Kochenderfer M, Brunskill E (2020) Learning near optimal policies with low inherent bellman error. In: International Conference on Machine Learning, pp. 10978–10989. PMLR
Jungnickel D (1999) The greedy algorithm. In: Graphs, networks and algorithms, pp 129–153
Gittins JC (1979) Bandit processes and dynamic allocation indices. J Royal Statist Soc Ser B Stat Methodol 41(2):148–164
Osband I, Van Roy B (2016) On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732
Lu T, Pál D, Pál M (2010) Contextual multi-armed bandits. In: proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 485–492. JMLR Workshop and Conference Proceedings
Abbasi-Yadkori Y, Szepesvári C (2015) Bayesian optimal control of smoothly parameterized systems. In: UAI, pp. 1–11
Osband I, Russo D, Van Roy B (2013) (More) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems 26
Dai B, Ding S, Wahba G (2013) Multivariate bernoulli distribution. Cornell University Library
Chapelle O, Li L (2011) An empirical evaluation of thompson sampling. Advances in neural information processing systems 24
Capossele A, Müller S, Penzkofer A (2021) Robustness and efficiency of voting consensus protocols within byzantine infrastructures. Blockchain: Res Appl 2(1):100007
Qiu C, Ren X, Cao Y, Mai T (2020) Deep reinforcement learning empowered adaptivity for future blockchain networks. IEEE Open J Comput Soc 2:99–105
Rawlins CC, Jagannathan S (2022) An intelligent distributed ledger construction algorithm for iot. IEEE Access 10:10838–10851
Alam T, Gupta R, Ullah A, Qamar S (2024) Blockchain-enabled federated reinforcement learning (b-frl) model for privacy preservation service in iot systems. Wirel Pers Commun 136(4):2545–2571
Huang C, Liu E, Wang R, Liu Y, Zhang H, Geng Y, Wang J, Han S (2024) Personalized federated learning via directed acyclic graph based blockchain. IET Blockchain 4(1):73–82
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB). No funding was received to assist with the preparation of this manuscript.
Author information
Contributions
Mohamed reviewed the topic literature and promoted applicable models. Eman performed the analytical study and applied the feasibility study. Mohamed developed a simulator for evaluation. Mohamed and Karim carried out the experiments. Mohamed and Eman wrote the manuscript with the assistance of Karim.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Abo-Soliman, M., Shaaban, E. & Emara, K. ALLC: autonomous lightweight distributed ledger constructor for securing IoT information. Computing 107, 79 (2025). https://doi.org/10.1007/s00607-025-01424-z