1 Introduction

Blockchain serves as a public ledger, and transactions stored in a blockchain are nearly impossible to tamper with [1, 2]. Its purpose is to solve the trust problem between the two parties of a transaction in a decentralized environment, which can greatly improve transaction efficiency and reduce costs [3, 4]. As a result, blockchain has become a widely used technique for enabling decentralized financial and business transactions [5].

As one of the most revolutionary and representative blockchain platforms, Ethereum [6] has attracted a large number of participants, including developers and users, and has become one of the most active communities in the cryptocurrency world [7]. On Ethereum, developers can write their own smart contracts in high-level programming languages such as Solidity for various domains [5, 8,9,10], e.g., finance, games, and healthcare.

A smart contract is a program that is triggered to execute a task when specific predefined conditions are satisfied [11, 12]. The conditions defined in a smart contract, and the execution of the contract, are supposed to be trackable and irreversible, in a way that minimizes the need for trusted intermediaries [13, 14]. Owing to the credibility of smart contracts, millions of them had been deployed on Ethereum by July 6th, 2019.

Since Ethereum is an open platform, anyone can access the smart contracts without constraints. Consequently, the source code of existing smart contracts on Ethereum can be reused by other developers. Meanwhile, Ethereum applications are highly domain-specific, and applications within the same domain often share similar functionalities [8], e.g., ERC20 applications implement the same interface for money transfer and balance inquiry [15]. As a result, the nature of Ethereum makes it convenient to create contract clones, i.e., to copy code from other available contracts.

The impact of contract cloning is profound. Since many smart contracts suffer from serious vulnerabilities, these vulnerabilities are inherited by the contracts cloned from them [15]. In this paper, we present a large-scale study that characterizes code clones in Ethereum smart contracts. First, we collect a dataset from Ethereum that contains more than 700,000 open source smart contracts, deployed from July 30th, 2015 to July 6th, 2019. Then, we employ Locality-Sensitive Hashing (LSH) [16] to quickly identify similar smart contracts in this large-scale dataset. Specifically, we extract the syntactic tokens from the smart contracts in the dataset and transform each contract into a vector representation according to its syntactic tokens. LSH is then employed to cluster similar smart contracts based on the distances between the vectors.

We conduct both quantitative and qualitative analyses to characterize the clone practice in smart contracts. First, our quantitative analysis reveals that over 96% of the smart contracts have similar contracts on Ethereum, which suggests that the smart contracts on Ethereum are highly homogeneous. Second, we analyze the reasons why smart contracts are similar. Our qualitative analysis uncovers several interesting reasons, such as implementing the same “interface”.

The rest of the paper is organized as follows. The background on blockchain and smart contracts is introduced in Sect. 2. The data collection is presented in Sect. 3. Section 4 describes the LSH methodology we use to cluster similar smart contracts. The experimental setup and results are discussed in Sect. 5. We discuss related work in Sect. 6. Section 7 summarizes our approach and outlines directions for future work.

2 Background

2.1 Blockchain and Smart Contract

Blockchain was first introduced by Satoshi Nakamoto in 2008 as the underlying data structure of Bitcoin [1]. As its name suggests, a blockchain is a chain of blocks, in which each block contains a number of transactions that are hashed in a Merkle tree [17]. By storing the hash value of the previous block, each block refers to its predecessor, forming a chain structure. Together with peer-to-peer communication, consensus between miners such as Proof of Work (PoW), asymmetric encryption, and digital signatures, a blockchain system can provide a tamper-proof and immutable value-transfer network without relying on a trusted third party [17]. Hence, many people regard blockchain as another technological revolution of the Internet, due to its unique security, trustworthiness, and reliability [18].

To make blockchain suitable for scenarios other than cryptocurrency, Ethereum, a blockchain platform, introduced smart contracts, which can be written in Turing-complete programming languages such as Solidity (Solidity is a contract-oriented, high-level language whose syntax is similar to that of JavaScript). Smart contracts are self-executing contracts in which the terms of the agreement between multiple parties are written directly into lines of code [19]. The code and the agreements contained therein exist across a blockchain network. By supporting different types of smart contracts, Ethereum facilitates the construction and execution of complex applications on the blockchain, such as financial exchanges, games, social applications, and insurance contracts.

Any user can create a smart contract by publishing a transaction to a blockchain. Once the program code of a smart contract has been deployed on the blockchain, it cannot be changed [20, 21]. Therefore, when contract creators want to evolve the contract code and publish new versions of a smart contract, the older versions remain visible on the blockchain. As a result, a smart contract is similar to its later versions, and code clones arise on Ethereum [22, 23].

2.2 Locality-Sensitive Hashing

The Locality-Sensitive Hashing (LSH) algorithm was proposed by Gionis et al. in 1999 [16]. The basic idea behind LSH is that if two instances are similar in the original data space, then they are hashed to the same value with high probability. Conversely, if they are not similar, they should be hashed to the same value only with low probability. If a hash function h(.) satisfies these two conditions, it is called a locality-sensitive hashing function. Mathematically, h(.) should satisfy formulas (1) and (2):

$$\begin{aligned} \text{if}~d(x,y) \le d_1,~\text{then}~P(h(x)=h(y)) \ge p_1 \end{aligned} \qquad (1)$$
$$\begin{aligned} \text{if}~d(x,y) \ge d_2,~\text{then}~P(h(x)=h(y)) \le p_2 \end{aligned} \qquad (2)$$

where x and y are two instances in the data space, d(x, y) represents the distance between x and y, h(x) represents the hash value of x, P(.) represents the probability of an event, and \((d_1, d_2, p_1, p_2)\) is a set of thresholds, typically with \(d_1 < d_2\) and \(p_1 > p_2\). If both formulas (1) and (2) are satisfied, the hash function h(.) is said to be locality-sensitive for the thresholds \((d_1, d_2, p_1, p_2)\).
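To make formulas (1) and (2) concrete, the following minimal sketch uses bit sampling over binary vectors with Hamming distance, a standard textbook LSH family. It is an illustrative assumption and not the hash function used later in this paper; the vectors `x`, `near`, and `far` are hypothetical. For this family, h picks one coordinate uniformly at random, so P(h(x)=h(y)) equals the fraction of coordinates on which x and y agree.

```python
from fractions import Fraction


def hamming(x, y):
    """Hamming distance d(x, y) between two equal-length binary vectors."""
    return sum(a != b for a, b in zip(x, y))


def collision_probability(x, y):
    """Exact P(h(x) == h(y)) when h returns one uniformly chosen coordinate."""
    agree = sum(a == b for a, b in zip(x, y))
    return Fraction(agree, len(x))


# Hypothetical 8-dimensional binary vectors.
x = (1, 0, 1, 1, 0, 0, 1, 0)
near = (1, 0, 1, 1, 0, 0, 1, 1)  # differs from x in one coordinate
far = (0, 1, 0, 0, 1, 1, 1, 0)   # differs from x in six coordinates

d_near, p_near = hamming(x, near), collision_probability(x, near)
d_far, p_far = hamming(x, far), collision_probability(x, far)
print(d_near, p_near)  # a small distance gives a high collision probability
print(d_far, p_far)    # a large distance gives a low collision probability
```

With n coordinates, this family satisfies (1) and (2) for the thresholds \((d_1, d_2, 1 - d_1/n, 1 - d_2/n)\).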

3 Data Collection

Smart contracts can be divided into open source and closed source categories. Open source contracts allow any user to download their source code from Ethereum, while closed source contracts only provide bytecode. To study why smart contracts are similar, we need the source code of the smart contracts for further analysis. Therefore, we only collect open source smart contracts as our dataset. We download the smart contracts from Etherscan, a blockchain browser for Ethereum that provides real-time transaction queries.

Table 1 shows the statistical characteristics of the collected dataset. We collected 146,402 Solidity files from Etherscan, containing a total of 703,565 smart contracts, which are stored in a local repository. On average, each Solidity file involves around 4.8 individual contracts (ranging from 0 to 36), 20 functions, and 202 lines of code. These smart contracts were deployed on the Ethereum mainnet from July 30th, 2015 to July 6th, 2019.

Table 1. Collected data

An Ethereum smart contract can be created either by a user or by another existing contract [6, 7]. We call these user-created contracts and contract-created contracts, respectively. Since we study the code clone practice in both types of contracts, we distinguish them according to the address of the contract creator: if the creator address points to another contract, the contract is contract-created; otherwise, it is user-created. Table 2 shows the statistical characteristics of the user-created and contract-created contracts.
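The classification rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the input mapping `creator_of` and the addresses in it are hypothetical, and a creator is treated as a contract only if its address appears among the collected contract addresses. In practice one would instead check whether the creator address holds contract code.

```python
def classify_contracts(creator_of):
    """Label each contract by the kind of account that created it.

    creator_of maps a contract address to the address of its creator
    (hypothetical input format, not taken from the paper).
    """
    contract_addresses = set(creator_of)
    labels = {}
    for address, creator in creator_of.items():
        # A creator that is itself a known contract makes this contract-created.
        if creator in contract_addresses:
            labels[address] = "contract-created"
        else:
            labels[address] = "user-created"
    return labels


# Hypothetical addresses: 0xU1 is a user account, 0xC1 and 0xC2 are contracts.
creator_of = {"0xC1": "0xU1", "0xC2": "0xC1"}
labels = classify_contracts(creator_of)
print(labels)
```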

Table 2. User-created and contract-created contracts

4 Clustering Similar Contracts

In this section, we employ the LSH method to cluster similar smart contracts. A direct way to measure the similarity of smart contracts is to compare the code syntactic similarity [24,25,26,27] between them [2]. Therefore, we first extract the code syntax from the smart contracts. Then, each smart contract is transformed into a high-dimensional vector representation based on its syntactic tokens. Finally, LSH is employed to map the high-dimensional vectors to clusters in a low-dimensional space. The smart contracts in the same cluster are considered similar.

4.1 Code Syntactic Tokenizing

To obtain the code syntax of a smart contract, we identify the syntax of each code line contained in the smart contract. We employ the algorithm proposed in our previous study [2] to identify the main syntax tokens of smart contracts, such as MappingExpression, ModifierDeclaration, IfStatement, AssignmentExpression, ReturnStatement, payable, Money. Our algorithm parses the abstract syntax tree to obtain the syntactic tokens of each code line. It is worth noting that a single code line may contain multiple types of syntax tokens. For example, the if statement “if(_to == address(this))” contains three types of syntax tokens: IfStatement, BinaryExpression, and CallExpression.

For all the user-created and contract-created contracts in our dataset, we extract the syntax tokens at the code line level. The syntax tokens contained in a code line form a token set, which we regard as a token unit. For example, the token unit of the code line “if(_to == address(this))” is \({<}IfStatement, BinaryExpression, CallExpression{>}\). The token units contained in a contract are the features used to measure the similarity between contracts.
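A minimal sketch of how token units could be represented, assuming the per-line syntax tokens have already been extracted (the AST-based extraction of [2] is not reproduced here, and the input `line_tokens` is a hypothetical example). Each token unit is modeled as a frozenset, so that two lines with the same token types map to the same unit regardless of token order.

```python
def token_units(line_tokens):
    """Return the set of distinct token units of one contract.

    line_tokens is a list with one entry per code line, each entry being the
    list of syntax token types found on that line (hypothetical input format).
    """
    return {frozenset(tokens) for tokens in line_tokens if tokens}


# Hypothetical per-line tokens of a tiny contract.
line_tokens = [
    ["IfStatement", "BinaryExpression", "CallExpression"],
    ["ReturnStatement"],
    ["CallExpression", "IfStatement", "BinaryExpression"],  # same unit as line 1
    [],  # a line without syntax tokens, e.g. a blank line
]
units = token_units(line_tokens)
print(len(units))
```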

Similar to the bag-of-words model [28], we build a vector for each smart contract according to the token units it contains. Stacking the vectors of all contracts yields a feature matrix. We build two such matrices, one for the user-created and one for the contract-created smart contracts. As Figure 1 shows, there are z user-created smart contracts, and we identify the token units contained in each contract. We then use a matrix M to represent the token units that each contract contains: if a contract contains a certain token unit, the corresponding entry of the matrix is 1, and otherwise it is 0. The matrix M has size \(z \times m\), where m is the number of distinct token units.

Fig. 1. Transforming token units into vectors
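The construction of the binary feature matrix M can be sketched as below. This is an illustrative sketch, not the authors' implementation: the three example contracts and their token units are hypothetical, and the column order (sorted by token names) is an arbitrary choice made only to keep the output deterministic.

```python
def build_feature_matrix(contracts):
    """Build the z x m zero-one matrix M from per-contract token units.

    contracts is a list of sets of token units (each unit a frozenset of
    token type names). Returns (M, columns), where columns[j] is the token
    unit represented by column j.
    """
    all_units = {unit for contract in contracts for unit in contract}
    columns = sorted(all_units, key=lambda unit: sorted(unit))
    M = [[1 if unit in contract else 0 for unit in columns]
         for contract in contracts]
    return M, columns


# Hypothetical token units.
A = frozenset({"IfStatement", "BinaryExpression"})
B = frozenset({"ReturnStatement"})
C = frozenset({"AssignmentExpression"})

M, columns = build_feature_matrix([{A, B}, {B, C}, {A, B, C}])
for row in M:
    print(row)
```

Here z = 3 and m = 3; the columns are ordered as C, A, B because of the sort key.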

4.2 LSH Clustering

We use the LSH method to cluster similar contracts based on the feature matrix M. Specifically, we first randomly generate a zero-one matrix V with \(m \times r\) dimensions. Then, we multiply the matrices M and V to obtain a third matrix H. Each element H(i, j) of H is the inner product between the feature vector of smart contract \(c_i\) and the j-th random zero-one vector, i.e., the j-th column of V. If H(i, j) is greater than a threshold t, the j-th locality-sensitive hashing value \(h^j(c_i)\) of the smart contract is 1; otherwise, it is 0. Doing so for all r columns gives r locality-sensitive hashing values. Concatenating these values yields a hashing sequence of zeros and ones with length r for smart contract \(c_i\), i.e., \(H(c_i)=(h^1(c_i),...,h^r(c_i))\). Figure 2 shows the process of applying LSH to the feature matrix.

Fig. 2. Applying LSH to the feature matrix
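The hashing step can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function name `lsh_signatures`, the fixed random seed, and the small example matrix are hypothetical, while r = 13 and t = 3 are the parameter values reported in Sect. 5.

```python
import random


def lsh_signatures(M, r, t, seed=0):
    """Return one r-bit signature per row of the zero-one feature matrix M."""
    m = len(M[0])
    rng = random.Random(seed)
    # V is an m x r random zero-one matrix; column j is the j-th random vector.
    V = [[rng.randint(0, 1) for _ in range(r)] for _ in range(m)]
    signatures = []
    for row in M:
        # H(i, j): inner product of the feature vector with column j of V.
        products = [sum(row[k] * V[k][j] for k in range(m)) for j in range(r)]
        # Threshold each product at t to obtain the j-th hash bit.
        signatures.append(tuple(1 if h > t else 0 for h in products))
    return signatures


# Hypothetical feature matrix: rows 0 and 1 are identical contracts,
# row 2 contains none of the token units.
M = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
signatures = lsh_signatures(M, r=13, t=3)
print(signatures[0] == signatures[1])
```

Identical feature vectors always receive identical signatures, and feature vectors that share most token units differ in few inner products, so they tend to agree on most bits.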

According to the locality-sensitive hashing value \(H(c_i)\) of smart contract \(c_i\), we map the smart contract to one of the buckets \([b_1,...,b_k]\), where \([b_1,...,b_k]\) are the existing buckets [16] and k is the number of buckets. As a result, similar contracts are mapped to the same bucket, and the contracts in the same bucket are likely to involve code clones. We regard the smart contracts in a bucket as a cluster. Figure 3 shows the process of mapping smart contracts to the different buckets.

Fig. 3. Mapping smart contracts to different buckets
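The bucket mapping can be sketched with a dictionary keyed by the hashing sequence. Two points are assumptions of this sketch, not statements from the paper: contracts share a bucket only when their r-bit sequences match exactly, and a bucket holding a single contract is reported as a unique contract rather than a cluster. The example signatures are hypothetical.

```python
from collections import defaultdict


def bucketize(signatures):
    """Group contract indices by their hashing sequence.

    Returns (clusters, unique): clusters is a list of index lists for
    buckets with at least two contracts, unique lists the indices of
    contracts that are alone in their bucket.
    """
    buckets = defaultdict(list)
    for index, signature in enumerate(signatures):
        buckets[signature].append(index)
    clusters = [ids for ids in buckets.values() if len(ids) > 1]
    unique = [ids[0] for ids in buckets.values() if len(ids) == 1]
    return clusters, unique


# Hypothetical 2-bit signatures for five contracts.
signatures = [(1, 0), (1, 0), (0, 1), (1, 0), (0, 0)]
clusters, unique = bucketize(signatures)
print(clusters, unique)
```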

5 Results Analysis

When we apply LSH to cluster similar smart contracts, we set the threshold t to 3 and the hashing sequence length r to 13. We cluster the similar smart contracts in the user-created and contract-created datasets separately.

5.1 Quantitative Analysis

Table 3 shows that LSH generates 1,230 clusters for user-created smart contracts. There are 288 unique contracts, i.e., contracts that do not belong to any of the clusters. The proportion of unique contracts is about 0.04% (i.e., 288/684,029). This result indicates that well over 96% of user-created smart contracts have at least one similar contract in the dataset. For the contract-created contracts, LSH generates 285 clusters, and 93 contract-created smart contracts do not belong to any of the clusters, which means that 99.5% (i.e., 19,443/19,536) of contract-created smart contracts have at least one similar contract in the dataset. Therefore, we conclude that code cloning is a common practice in both user-created and contract-created smart contracts, and the result also reveals the homogeneous nature of the smart contracts on Ethereum.

Table 3. Clusters for user-created and contract-created contracts

Fig. 4. Top 100 clusters of user-created contracts

Figure 4 shows the top 100 clusters for user-created contracts. We can observe that the biggest cluster contains 229,224 contracts. In general, the cluster sizes of user-created contracts follow a long-tail distribution, considering that there are 1,230 clusters in total. Among all the user-created contracts, the top 20 clusters account for 87% of the contracts. The results suggest that the distribution of clusters follows the Pareto principle. Therefore, many smart contracts are concentrated in a few clusters, and the contracts within a cluster have similar code.

Figure 5 shows the top 100 clusters for contract-created contracts. The biggest cluster contains 5,994 contracts. The distribution of clusters also follows the Pareto principle, i.e., the top 20 clusters account for 90% of the smart contracts.

Fig. 5. Top 100 clusters of contract-created contracts
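The concentration figures reported in this subsection (the share of contracts covered by the top 20 clusters) correspond to a simple top-k share computation, sketched below. The cluster sizes in the example are hypothetical and are not the sizes from our dataset.

```python
def top_k_share(cluster_sizes, k):
    """Fraction of all clustered contracts that fall in the k largest clusters."""
    ordered = sorted(cluster_sizes, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical long-tailed cluster sizes (100 contracts in total).
sizes = [50, 30, 10, 5, 3, 2]
share = top_k_share(sizes, k=2)
print(share)
```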

5.2 Qualitative Analysis

Since all the collected contracts are open source, we manually check the clusters and label them according to the source code of the smart contracts they contain. The largest clusters mainly fall into the following categories:

ERC Related Clusters. ERC related contracts make up the majority of the popular clusters. The ERC standards include ERC-20, ERC-721, ERC-825, and ERC-223. For example, to “issue currency”, a smart contract should implement the “interface” of ERC20. If a contract wants to implement the ERC20 interface, it needs to implement six functions, i.e., totalSupply(), balanceOf(), transfer(), transferFrom(), approve(), allowance(). As a result, all the smart contracts that implement the ERC20 interface have similar source code. Famous tokens implementing the ERC20 interface include Huobi Token, FTX Token, USD Coin, etc.
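As a rough illustration of why ERC20 contracts end up looking alike, the sketch below checks whether a contract declares all six ERC20 functions listed above. It compares function names only and ignores parameter and return types, which is a simplifying assumption; the two example contracts are hypothetical.

```python
# The six functions of the ERC20 interface named in the text.
ERC20_FUNCTIONS = frozenset({
    "totalSupply", "balanceOf", "transfer",
    "transferFrom", "approve", "allowance",
})


def implements_erc20(function_names):
    """Name-only check that all six ERC20 functions are declared."""
    return ERC20_FUNCTIONS <= set(function_names)


# Hypothetical function lists of two contracts.
token_contract = ["totalSupply", "balanceOf", "transfer", "transferFrom",
                  "approve", "allowance", "mint"]
lottery_contract = ["bet", "draw", "withdraw"]

print(implements_erc20(token_contract), implements_erc20(lottery_contract))
```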

Gambling Related Clusters. Many clusters are related to gambling contracts. There are many gambling contracts on Ethereum, and they often implement very simple and similar logic. Developers can therefore directly copy and paste existing open source contracts to create similar gambling contracts. As a result, the gambling contracts are clustered together.

Other Clusters. We also observe other types of clusters, such as game related and social related clusters. These clusters have a strong industry orientation: contracts belonging to the same industry are more likely to cluster together. These results suggest that the smart contracts on Ethereum are highly homogeneous.

6 Related Work

Clone detection for smart contracts can be divided into static [6, 7, 13, 29] and dynamic approaches [8, 30]. He et al. [7] revealed that a large number of smart contracts on Ethereum are similar, which suggests that smart contracts are highly homogeneous. Our study differs from theirs in the clustering approach. He et al. linked any contract pair whose similarity score is greater than 0.7. They then built a weighted undirected graph by treating each contract as a node, traversed the graph, and considered each connected component as a cluster. Kiffer et al. [6] found that the smart contracts on Ethereum exhibit extensive code reuse. They first compute the frequency of the 5-grams in the opcode sequence of a contract, so that each contract corresponds to a vector of 5-gram frequencies. The similarity of two contracts is then computed as the cosine similarity of the two vectors. Gao et al. [13, 29] utilized a code embedding technique to encode the code elements of a smart contract: each code element is converted into a numerical vector that preserves its syntactic and semantic information. The embedding of a code fragment is then the sum of the embedding vectors of the tokens within it. Finally, the similarity between two fragments is computed from the Euclidean distance between their vectors.
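For comparison with our token-unit vectors, the n-gram idea described for Kiffer et al. [6] can be sketched as follows. This is only an illustration of cosine similarity over 5-gram frequency vectors, not their implementation; the opcode sequences are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(opcodes, n=5):
    """Frequency of each n-gram in an opcode sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors (Counters)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical opcode sequences.
seq_a = ["PUSH1", "PUSH1", "MSTORE", "CALLVALUE", "DUP1", "ISZERO", "JUMPI"]
seq_b = ["CALLER", "SLOAD", "ADD", "SSTORE", "STOP", "POP", "RETURN"]

same = cosine_similarity(ngram_counts(seq_a), ngram_counts(seq_a))
different = cosine_similarity(ngram_counts(seq_a), ngram_counts(seq_b))
print(same, different)
```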

In addition, Liu et al. employed a dynamic approach to detect code clones in smart contracts [8, 30]. They proposed ECLONE to detect semantic clones of smart contracts. ECLONE extracts a set of critical semantic properties generated from the symbolic transaction of a smart contract, and these semantic properties are then normalized into a numeric vector. Finally, the clone detection problem is modeled as a similarity computation over the numeric vectors. In summary, our approach differs from the existing studies: we extract the code syntactic tokens from each smart contract, employ the LSH method to cluster similar smart contracts, and further analyze the code clones in smart contracts.

7 Conclusion and Future Work

Code cloning is an essential part of modern software development. Although the study of code clones has a long research history, we are the first to employ the LSH technique to analyze the similarity of user-created and contract-created contracts separately. To evaluate our approach, we collect a dataset that contains more than 700,000 smart contracts from Ethereum. The quantitative analysis shows that over 96% of the smart contracts are similar to other contracts. The qualitative analysis reveals that the majority of the popular clusters consist of ERC related contracts. Our future research agenda mainly focuses on extending the scale of the dataset. First, we will take more open source smart contracts into consideration. Second, we will try to identify code clones in closed source smart contracts.