1 Introduction

Following the popularity of Bitcoin [2, 9], other crypto-currencies (e.g., Ethereum, Litecoin, Ripple, Monero) have also experienced a huge increase in acceptance and use. Hundreds of new crypto-currencies (coins and tokens) have been offered to the market, currently amounting to slightly fewer than two thousand proposals. Crypto-currencies are no longer confined to darknet markets (footnote 1) or technology enthusiasts; they are nowadays a matter of discussion and an investment product known to a large part of the population that has access to ICT.

However, due to the pseudo-anonymity offered to users, Bitcoin (footnote 2) payments have also become an attractive and frequently used means for criminals to collect money from illegal activities. For instance, Bitcoin payments are requested by most recent ransomware, such as WannaCry [5] and Petya (footnote 3). Other activities include demanding payments for illegal services or goods, such as software exploits or Ransomware-as-a-Service (RaaS) targeting a desired victim. A new frontier could be the use of crypto-currencies as tax havens.

After introducing Bitcoin in Sect. 2, in the remainder of the paper we describe a work-in-progress suite of software tools whose aim is to facilitate the analysis of bitcoin flows and to let the forensic scientist extract and visualise useful insights on target (pools of) addresses. Results on specific case studies have already been presented in [4,5,6]. We name the whole suite BlockChainVis, after the visualisation module [6]. Section 5 wraps up the paper with final conclusions.

2 Bitcoin

The white-paper on Bitcoin (footnote 4) appeared in late 2008 [9], under the pseudonym “Satoshi Nakamoto”. Bitcoin is an open-source, peer-to-peer, digital currency. Money transactions do not require a third-party intermediary: the payer and payee interact directly without using their real identities, and no personal information is transferred from one to the other. However, unlike a fully anonymous transaction, a complete record of every bitcoin transfer and of every Bitcoin user’s encrypted identity is maintained on a public ledger, called the block-chain. For this reason, Bitcoin transactions are pseudonymous rather than completely anonymous: Bitcoin addresses are pseudonyms of real individuals, and a user may have several pseudonyms.

The only way to create new bitcoins is through the mining process: miners are the nodes that verify the transactions and add them to the block-chain, grouped into “blocks” of information. The amount of bitcoins (footnote 5) created each time a miner discovers a new block represents a reward for its work. In addition, the fees of all the transactions in the mined block go to the miner.

Transactions. Transactions are the basic building block of the Bitcoin network: they represent the mechanism that allows a user to transfer money to another user, e.g., from a buyer to a seller. This mechanism is possible thanks to Bitcoin addresses. A Bitcoin address is an identifier of 26–35 alphanumeric characters, and it is derived from the hash of a generated public key (pubkey in the following) [2]. A private key is a random 256-bit number, and the corresponding pubkey is derived from it on the elliptic curve used by the Elliptic Curve Digital Signature Algorithm (ECDSA [7]).
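As an illustration of how an address is derived from a hash, the following minimal sketch applies Base58Check encoding to a 20-byte public-key hash. It assumes the classic pay-to-public-key-hash format (version byte 0x00); the function name `base58check_encode` and the all-zero placeholder hash are ours, and the RIPEMD-160/SHA-256 hashing of the pubkey itself is deliberately left out because RIPEMD-160 is not available in every Python build.

```python
import hashlib

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58check_encode(version: bytes, payload: bytes) -> str:
    """Encode version + payload with a 4-byte double-SHA-256 checksum."""
    data = version + payload
    checksum = hashlib.sha256(hashlib.sha256(data).digest()).digest()[:4]
    raw = data + checksum

    # Convert the byte string to a big integer, then to base 58.
    num = int.from_bytes(raw, "big")
    encoded = ""
    while num > 0:
        num, rem = divmod(num, 58)
        encoded = BASE58_ALPHABET[rem] + encoded

    # Each leading zero byte is represented by the character '1'.
    leading_zeros = len(raw) - len(raw.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


# Placeholder 20-byte public-key hash (all zeros), used only for illustration.
pubkey_hash = bytes(20)
address = base58check_encode(b"\x00", pubkey_hash)
print(address, len(address))
```

The version byte 0x00 is what makes such addresses start with the character 1.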

A transaction input needs to store the proof that it belongs to the address that wants to re-transfer the money received in a previous transaction. The output of a transaction instead contains the next destination of the bitcoins. Thus, the ownership of the coins is expressed and verified through links to previous transactions. For example, in order to send 3 bitcoins (BTC) to Bob, Alice must refer to previous transactions in which she received at least 3 BTC in total.
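The Alice-and-Bob example can be sketched as follows. This is a simplified model under our own assumptions: the `Output` record, the `build_payment` helper, and the first-fit selection order are hypothetical, amounts are in satoshi, and fees and signatures are ignored.

```python
from dataclasses import dataclass

SATOSHI_PER_BTC = 100_000_000


@dataclass(frozen=True)
class Output:
    txid: str    # transaction that created this output
    index: int   # position of the output in that transaction
    owner: str   # address that can spend it
    value: int   # amount in satoshi


def build_payment(unspent, sender, receiver, amount):
    """Pick previous outputs of `sender` until `amount` is covered."""
    selected, total = [], 0
    for out in unspent:
        if out.owner != sender:
            continue
        selected.append(out)
        total += out.value
        if total >= amount:
            break
    if total < amount:
        raise ValueError("insufficient funds")

    outputs = [(receiver, amount)]
    change = total - amount
    if change > 0:
        # The remainder goes back to the sender as change (fees ignored).
        outputs.append((sender, change))
    return selected, outputs


unspent = [
    Output("tx-a", 0, "Alice", 2 * SATOSHI_PER_BTC),
    Output("tx-b", 1, "Alice", 3 * SATOSHI_PER_BTC // 2),
    Output("tx-c", 0, "Carol", 5 * SATOSHI_PER_BTC),
]
inputs, outputs = build_payment(unspent, "Alice", "Bob", 3 * SATOSHI_PER_BTC)
print([o.txid for o in inputs], outputs)
```

Here Alice references two earlier outputs (2 BTC and 1.5 BTC), pays 3 BTC to Bob, and receives 0.5 BTC back as change.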

Block-chain. Miners keep the block-chain consistent, complete, and unalterable: they repeatedly verify and collect newly broadcast transactions into a new block of transactions. Each block header contains information that chains it to the previous block in the block-chain, namely the hash of the previous block. Thanks to this field, a block (and consequently the block-chain) is computationally impractical to modify, since every block after it would also have to be regenerated. Another field of the header, the nonce, is obtained by miners through the computation of the proof-of-work. Once the header is filled with the nonce, its hash has to be less than a target number (footnote 6). This proof is easy to verify (one hash operation), but extremely time-consuming to generate.
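The asymmetry between generating and verifying the proof-of-work can be sketched with a toy example. The header layout below (previous hash, payload, nonce) and the very easy target are our simplifications and do not reproduce the real Bitcoin header format or difficulty.

```python
import hashlib


def header_hash(prev_hash: bytes, payload: bytes, nonce: int) -> int:
    """Double SHA-256 of a toy header, read as an integer."""
    header = prev_hash + payload + nonce.to_bytes(8, "little")
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return int.from_bytes(digest, "big")


def mine(prev_hash: bytes, payload: bytes, target: int) -> int:
    """Try nonces one by one until the header hash falls below the target."""
    nonce = 0
    while header_hash(prev_hash, payload, nonce) >= target:
        nonce += 1
    return nonce


# Toy target: the hash must start with 12 zero bits (about 4096 attempts on average).
TARGET = 1 << (256 - 12)
PREV_HASH = bytes(32)                # placeholder hash of the previous block
PAYLOAD = b"toy block transactions"  # stands in for the rest of the header

nonce = mine(PREV_HASH, PAYLOAD, TARGET)
# Verification costs a single hash computation.
print(nonce, header_hash(PREV_HASH, PAYLOAD, nonce) < TARGET)
```

Changing `PREV_HASH` or `PAYLOAD` invalidates the nonce, which is why modifying a block forces the regeneration of every block after it.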

Fig. 1. A graphical summary of the BlockChainVis suite of tools.

3 System Design and Implementation

The BlockChainVis architecture is designed as a modular and expandable framework for building complex applications for the forensic analysis of the Bitcoin block-chain. Figure 1 summarises the tools.

The entire database of transactions is stored within PostgreSQL (footnote 7), although it is possible to use OrientDB (footnote 8) as well. Moreover, we are currently moving the database to Accumulo (footnote 9). The back-end of this suite is implemented on a machine with 512 GB of RAM and two 8-core Intel(R) Xeon(R) E5-2620 v4 CPUs at 2.10 GHz (for a total of 32 threads); in particular, the implementation consists of three different virtual machines running (i) Bitcore, (ii) PostgreSQL, and (iii) the software dedicated to the visualisation Web applications. In the remainder of this section we describe some of the modules in Fig. 1.

3.1 Bitcore Node

Bitcore is a “full node” (footnote 10) Bitcoin client. The raw block-chain can be queried by using the Insight API, and the result is presented to the user as a JavaScript Object Notation (JSON, footnote 11) file, i.e., a simple text document whose basic structures are a set of name-value pairs and an ordered list of values.

Fig. 2. Getrawtransaction query.

In Fig. 2 we can see the output of a getrawtransaction query, which takes a transaction hash as parameter and returns all the information about that transaction; for instance, its block number, all the inputs and outputs, and the number of blocks following it in the block-chain (i.e., its confirmations).
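A response of this kind can be processed with a few lines of code. The sketch below parses a hypothetical, heavily trimmed JSON document: the field names (`txid`, `vin`, `vout`, `confirmations`) follow the verbose transaction format of Bitcoin Core, which we assume here and which may differ from what the Insight API returns, while all identifiers and amounts are placeholders.

```python
import json
from decimal import Decimal

# Hypothetical, trimmed response: identifiers and amounts are placeholders.
SAMPLE_RESPONSE = """
{
  "txid": "tx-c",
  "confirmations": 120,
  "vin":  [{"txid": "tx-a", "vout": 0}, {"txid": "tx-b", "vout": 1}],
  "vout": [{"n": 0, "value": 3.0}, {"n": 1, "value": 0.4999}]
}
"""


def summarise(raw_json: str) -> dict:
    """Extract inputs, outputs and confirmations from a transaction document."""
    # Decimal avoids floating-point rounding on bitcoin amounts.
    tx = json.loads(raw_json, parse_float=Decimal)
    outputs = [(out["n"], out["value"]) for out in tx["vout"]]
    return {
        "txid": tx["txid"],
        "confirmations": tx["confirmations"],
        # Each input points to an output of a previous transaction.
        "spent_outputs": [(vin["txid"], vin["vout"]) for vin in tx["vin"]],
        "outputs": outputs,
        "total_out": sum(value for _, value in outputs),
    }


summary = summarise(SAMPLE_RESPONSE)
print(summary)
```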

3.2 Bitcoin Addresses Scraper

The Bitcoin addresses scraper crawls the Web for Bitcoin addresses to be associated with real users or with Web URLs. The aim is to fully de-anonymise addresses where possible.

We use a set of scrapers [15] that crawl specific data from Web sites connected to the Bitcoin world:

  • user-names on the Bitcoin Talk (footnote 12) forum and the Bitcoin-OTC (footnote 13) marketplace;

  • physical coins created by Casascius (https://www.casascius.com) along with their Bitcoin value and status (opened, untouched);

  • known scammers, by automatically identifying users that have significant negative feedback on the Bitcoin-OTC and Bitcoin Talk trust systems;

  • name tags on blockchain.info (footnote 14), e.g., “Wannacry ransomware 1”.

The tool helps users build lists of gambling addresses, online wallet addresses, mining pool addresses, and addresses which were subject to seizure by law enforcement authorities. All these addresses are entered in the database and they are used to de-anonymise further addresses by using the heuristics in Sect. 3.6.

3.3 Database of Transactions

Originally we stored the whole block-chain in an OrientDB database. OrientDB is a widely used, open-source, NoSQL multi-model database. Unlike relational databases, a graph database does not use foreign keys or “join” operations. Instead, all relationships are natively stored as edges of a graph. This results in deep traversal capabilities, increased flexibility, and enhanced agility. However, preliminary tests showed that this database is quite demanding in terms of RAM usage: the available memory was not sufficient to calculate all the islands of transactions present in the block-chain, i.e., the strongly connected components.

For this reason, we are also testing a PostgreSQL database. PostgreSQL is an Object-Relational Database Management System (ORDBMS) with an emphasis on extensibility and standards compliance. As a database server, its primary functions are to store data securely and to return that data in response to requests from other software applications.

3.4 Mixing Services Detector

Bitcoin is often regarded as a good way to stay anonymous while making payments. Nevertheless, Bitcoin transactions are never truly anonymous: all Bitcoin activity is recorded and publicly available via the block-chain. When a Bitcoin user pays for some service or good, she will typically need to provide her name and address to the seller for billing or delivery purposes. This means that a third party can trace her transactions and associate her Bitcoin address with her name. To avoid this, mixing services (also called tumblers) [13] provide the ability to interrupt a direct money-flow from one user to another by using addresses that do not belong to the original owner. Mixing services are used to mix one’s funds with other people’s money, with the intent of confusing the trail back to the original source. In traditional financial systems, the equivalent would be moving funds through banks located in countries with strict bank-secrecy laws, such as the Cayman Islands (footnote 15).

The goal of this module (see Fig. 1) is to find mixing services in the Bitcoin network; in particular, to extract the related behavioural patterns in terms of payments, and consequently to understand how a mixing service works. In practice, this allows a desired bitcoin flow to be tracked even through a mixing service.

Table 1. Characteristics of some mixing services.

To find such patterns experimentally, we prepared some real bitcoin payments using different mixing services: the final goal is to identify the addresses that belong to tumblers. Table 1 reports the characteristics of the mixing services used. We extracted two databases to proceed with the investigation: one with all the transactions sending money to and receiving money from tumbler addresses, and the other with all the other transactions performed in the same time interval. The features of the two datasets are shown in Table 2.

Table 2. Dataset characteristics.

Finally, we studied the behaviour of these addresses with Machine Learning, in particular by using hierarchical clustering techniques on the following features: input addresses, output addresses, balance, average balance, transaction ID, time of creation, number of inputs, and number of outputs.

Unfortunately, in this way we were not able to spot a different behaviour between the two datasets. Hence, in a second experiment we focused on a data-mining analysis instead; in the tumblers dataset we noticed that \(4.9\%\) of CoinMixer transactions generate \(89.7\%\) of the edges, and that 14 transactions generated more than 1000 output addresses each. These transactions have the following features: (i) the number of input addresses is equal to 2, (ii) the number of output addresses is in the range [2530, 2534], (iii) they were collected one per day, for 14 consecutive days.
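The three features above translate directly into a filter over transaction records. The following sketch uses synthetic records, and the field names (`day`, `n_inputs`, `n_outputs`), the start date, and both helper functions are hypothetical.

```python
from datetime import date, timedelta


def matches_pattern(tx: dict) -> bool:
    """Features (i) and (ii): two inputs and 2530 to 2534 outputs."""
    return tx["n_inputs"] == 2 and 2530 <= tx["n_outputs"] <= 2534


def one_per_consecutive_day(txs: list) -> bool:
    """Feature (iii): exactly one transaction per day, on consecutive days."""
    days = sorted(tx["day"] for tx in txs)
    return all(b - a == timedelta(days=1) for a, b in zip(days, days[1:]))


# Synthetic data: 14 daily transactions that fit the pattern, plus one ordinary payment.
start = date(2018, 1, 1)  # arbitrary date, used only for illustration
transactions = [
    {"day": start + timedelta(days=d), "n_inputs": 2, "n_outputs": 2530 + d % 5}
    for d in range(14)
]
transactions.append({"day": start, "n_inputs": 1, "n_outputs": 2})

suspects = [tx for tx in transactions if matches_pattern(tx)]
print(len(suspects), one_per_consecutive_day(suspects))
```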

Table 3. Similarity of address sets (first six).
Table 4. Similarity of address sets (second eight).

We decided to compare the sets of output addresses of these transactions, and we noticed that the similarity between the sets decreased day by day (see Tables 3 and 4). This allowed us to conclude that the output addresses are gradually renewed over time, with new addresses that work in the same way as those removed. These results clearly identify a behavioural pattern of the CoinMixer service, generated by a specific internal algorithm. Hence, through an analysis of transactions, and in particular of their output addresses, it is also possible to recover all similar past and future transactions.
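One way to quantify this gradual renewal is a set-similarity index. The sketch below uses the Jaccard index, which is our assumption, since the measure behind Tables 3 and 4 is not specified here; the address sets are tiny synthetic placeholders.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Synthetic output-address sets for three consecutive daily transactions:
# each day one old address is dropped and one new address is introduced.
daily_outputs = [
    {"addr1", "addr2", "addr3", "addr4"},
    {"addr2", "addr3", "addr4", "addr5"},
    {"addr3", "addr4", "addr5", "addr6"},
]

# Similarity of every day with respect to the first one.
similarities = [jaccard(daily_outputs[0], day) for day in daily_outputs]
print(similarities)
```

With a steady replacement of addresses, the similarity to the first day decreases monotonically, which is the behaviour described above.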

3.5 BlockChainVis (Visualisation)

BlockChainVis [6] is a module dedicated to the visual analysis of flows of Bitcoin transactions. The aim of this module is to help analyse desired transaction flows in depth. The block-chain can be considered as Big Data. For this reason we turned our attention to Visual Analytics [16] (VA), that is, the science of analytical reasoning facilitated by interactive visual interfaces. The main objective of VA is to help cope with problems such as data size and complexity: the goal is to rapidly visualise only the data of interest. Since VA is task-oriented [16], we have identified nine main tasks: (i) find miners; (ii) find transaction sources and understand how they are connected; (iii) find the main addressees of transactions; (iv) find the “richest” and “poorest” addresses; (v) find the addresses with a break-even budget; (vi) find bitcoin flows from an arbitrary address; (vii) find bitcoin flows from a set of different addresses; (viii) filter the block-chain on intervals of time or block identifiers; (ix) filter the block-chain on specific transaction amounts of bitcoins, or on the number of addresses involved.

To accomplish these tasks, in the initial window it is possible to select among three different kinds of visualisation: Single Transaction, Address Transactions, and Archipelago. The first view allows the user to manually insert the hash of one desired transaction, and then the tool shows the input addresses (in the following simply “inputs”) and the output addresses (in the following, “outputs”) as a graph. The second option is the dual of the former: it is possible to type in an address, and the tool shows all the transactions that have this address as an output, together with all the inputs of these transactions. Hence, the first two options offer a more targeted view: the user already has some initial information. The Archipelago view is the third and most complex one: it displays all the islands of the archipelago of Bitcoin transactions.

An island is a connected component of the block-chain graph: each pair of its nodes is connected through a path, and none of its nodes is connected to any other vertex of the super-graph of the block-chain. Currently, we can visualise the Archipelago over time, as shown in Fig. 4. By clicking on any island of the archipelago, a summary of its statistics pops up. Then, it is possible to enter an island and visualise all its transactions, as shown in Fig. 3.
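The notion of island can be illustrated with a small breadth-first search. In this sketch we assume an undirected graph whose nodes are both transactions and addresses, with an edge between a transaction and each of its inputs and outputs; this graph model, the `find_islands` helper, and the toy data are ours and do not describe the implementation of the tool.

```python
from collections import defaultdict, deque


def find_islands(transactions: dict) -> list:
    """Return the connected components of the transaction/address graph.

    `transactions` maps a transaction id to a pair (inputs, outputs),
    each being a list of addresses.
    """
    neighbours = defaultdict(set)
    for tx_id, (inputs, outputs) in transactions.items():
        tx_node = ("tx", tx_id)
        neighbours[tx_node]  # register the transaction even if it has no address
        for address in list(inputs) + list(outputs):
            addr_node = ("addr", address)
            neighbours[tx_node].add(addr_node)
            neighbours[addr_node].add(tx_node)

    seen, islands = set(), []
    for start in list(neighbours):
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue, island = deque([start]), set()
        while queue:
            node = queue.popleft()
            island.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        islands.append(island)
    return islands


# Toy data: t1 and t2 share address B, whereas t3 is isolated.
toy_transactions = {
    "t1": (["A"], ["B", "C"]),
    "t2": (["B"], ["D"]),
    "t3": (["X"], ["Y"]),
}
islands = find_islands(toy_transactions)
print(sorted(len(island) for island in islands))
```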

Fig. 3. Island visualisation.

Fig. 4. Time visualisation.

3.6 Bitcoin Addresses Clusteriser

The goal of the Bitcoin addresses clusteriser module is to find groups of addresses that belong to the same user. It incrementally reads the block-chain transactions from the DB and generates/updates clusters of addresses by using the following heuristics:

  • The Multi-Input Heuristic [12] considers only transactions with more than one input. All the inputs of such a transaction are considered to belong to the same owner. Formally: let \(t (S_i \rightarrow R_i ) \in T\) be a transaction and \(S_i = \{a_1, a_2, \dots ,a_n\}\) its set of input addresses. If \(|S_i| = n\) with \(n > 1\), then all the addresses in \(S_i\) belong to the same owner.

  • The Shadow Heuristic [1] exploits the way in which clients manage the change, i.e., every time a transaction has a change, a new address (called shadow) is automatically created and used to collect the change. This address belongs to the same owner as the input address.

  • The Consumer Heuristic [10] uses the concept of “consumer wallet”, i.e., a client that by default allows bitcoins to be sent to a single address, so that it assembles transactions that have exactly 2 outputs. Given a transaction with 2 outputs, if there is an address appearing only in transactions that have 1 or 2 outputs, then this address and the input addresses are associated with the same owner.

  • The Optimal Change Heuristic [10] is based on the assumption that clients try to use the outputs whose sum is closest to the value to be sent, so the change must be less than the values of all the input addresses. Considering a transaction that has more than 2 outputs, if there is only one output whose value is lower than all the input values, then this address and the input addresses are associated with the same owner.

  • The One-to-one Heuristic considers only transactions having one input and one output. Given one of these transactions, and given a pool of addresses belonging to an exchange service (footnote 16), if the two addresses do not belong to the pool, they are considered to belong to the same owner.

  • The Multisig-one Heuristic is based on multi-signature transactions (footnote 17). Consider a transaction with a multisig output that lists a pool of addresses: if unlocking it requires control of either one or all of the private keys referred to by this pool of addresses, then the entire pool belongs to the same owner.

  • The Multisig-two Heuristic considers multi-signature transactions that are already spent. Given a transaction that has a spent multi-signature output containing a set of addresses, where unlocking it requires control of more than one private key, the subset of addresses that was actually used to unlock the output belongs to the same owner. This heuristic is not applied to 2-of-3 multi-signature transactions because they could correspond to an escrow (footnote 18).
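As an example of how such heuristics can update clusters incrementally, the following sketch implements only the Multi-Input Heuristic with a union-find structure. The `UnionFind` class, the `cluster_multi_input` function, and the toy address names are hypothetical and are not taken from the module itself.

```python
class UnionFind:
    """Disjoint-set structure over hashable items."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            # Path halving keeps the trees shallow.
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_multi_input(transactions):
    """Merge the input addresses of every transaction with more than one input."""
    uf = UnionFind()
    for inputs in transactions:
        for address in inputs:
            uf.find(address)  # register the address, possibly as a singleton
        if len(inputs) > 1:
            for other in inputs[1:]:
                uf.union(inputs[0], other)

    clusters = {}
    for address in list(uf.parent):
        clusters.setdefault(uf.find(address), set()).add(address)
    return list(clusters.values())


# Each entry lists the input addresses of one transaction (toy data).
toy_inputs = [["a1", "a2"], ["a2", "a3"], ["b1"], ["c1", "c2"]]
clusters = cluster_multi_input(toy_inputs)
print(sorted(sorted(c) for c in clusters))
```

Because a2 appears in two multi-input transactions, a1, a2, and a3 end up in the same cluster; the other heuristics could be added as further `union` calls on the same structure.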

Table 5. Percentage of Bitcoin addresses that can be clustered.

Table 5 shows how many addresses can be clustered using individual heuristics and their compositions. With only four heuristics, \(76.29\%\) of the total addresses in the block-chain can be clustered (last row in Table 5).

3.7 Transaction Information

The last module, i.e., Transaction information, is focused on providing additional information about transactions. First, it classifies transactions into standard and non-standard types, according to the Bitcore function isStandard() (footnote 19). Then, it shows the distributions of standard and non-standard transactions in the block-chain. Our first results on classification are provided in [4].

This module also allows the user to interact with a Bitcoin scripting compiler. The Bitcoin transaction language Script is a Forth-like [11] stack-based execution language. Script requires minimal processing and is intentionally not Turing-complete (no loops), in order to lighten and secure the verification process of transactions. An interpreter executes a script by processing each item from left to right. Data items are pushed onto the stack, while operations pop one or more parameters from the execution stack, operate on them, and possibly push their result back onto the stack.
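The left-to-right, stack-based evaluation can be sketched with a toy interpreter. It handles only integer literals and three operations (OP_DUP, OP_ADD, OP_EQUAL) written as text tokens, which is our simplification of the real byte-encoded Script language.

```python
def run_script(script: str) -> bool:
    """Evaluate a whitespace-separated toy script; succeed if the top item is non-zero."""
    stack = []
    for item in script.split():
        if item == "OP_DUP":
            # Duplicate the top stack item.
            stack.append(stack[-1])
        elif item == "OP_ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif item == "OP_EQUAL":
            b, a = stack.pop(), stack.pop()
            stack.append(1 if a == b else 0)
        else:
            # Anything else is treated as an integer literal and pushed.
            stack.append(int(item))
    return bool(stack) and stack[-1] != 0


print(run_script("2 3 OP_ADD 5 OP_EQUAL"))
print(run_script("2 OP_DUP OP_ADD 5 OP_EQUAL"))
```

The first script leaves a non-zero value on top of the stack and succeeds; the second computes 2 + 2, compares it with 5, and fails.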

4 Related Works

Analysing and understanding the Bitcoin block-chain is as complicated as it is interesting and critical, and several analysis tools have been developed. BitIodine [15] is a modular framework that parses the block-chain, clusters and labels addresses, and visualises portions of the transaction graph. BitConeView [3] is a tool for the visual analysis of how and when a flow of bitcoins mixes with other flows in the transaction graph. BlockSci [8] is a block-chain analysis tool that allows information to be extracted from the transaction graph. In [14] the authors present visualisation mechanisms for taint propagation in Bitcoin that display how cyber criminals launder money. Chainalysis (footnote 20) is a commercial Bitcoin forensic suite that allows crypto-currency laundering and fraud to be detected and investigated.

5 Conclusion and Future Work

In this paper, we have provided a preliminary report on the BlockChainVis suite of tools, which is a modular framework to investigate the Bitcoin block-chain, cluster addresses, identify mixing services, visualise information about transactions, and support the use of scripting languages. In simple terms, we are currently developing several integrated tools that simplify the life of the forensic scientist, by automating some of the tasks performed to keep track of money flows and their sources and destinations.

Bitcore Node: We plan to build a graphical interface to display the most interesting information of Bitcore in real time, such as the currently broadcast transactions (called the memory pool). Some examples of what we want to do can be found in already-existing tools such as bitcoind-status (footnote 21), MyPHP Bitcoin Node Status (footnote 22), and Satoshi.info (footnote 23).

Database of Transactions: Since the bitcoin block-chain currently contains 320 million transactions, we plan to test a third database system to reduce the response time of queries: Accumulo (see footnote 9). Apache Accumulo is a highly scalable, sorted, distributed key-value store based on Google’s Bigtable (footnote 24). Our idea is to have two different databases: a Postgres DB storing all the needed information, and an Accumulo DB with Graphulo (footnote 25) to store the graph structure of transactions. In this way, queries concerning the topology of the graph do not need to load unneeded data into memory.

Mixing Services Detector: At the moment, this module is not fully automated: in the future we plan to guide the user in reconnecting a broken flow of bitcoins and to visualise it by using the tool in Sect. 3.5.

Bitcoin Addresses Clusteriser: We are currently implementing all the aforementioned heuristics and updating the database accordingly: we will highlight all the addresses of the same user in the visualisation.

Transaction Information: We are developing a compiler to study a particular type of transaction, called pay-to-script-hash (P2SH): such transactions are sent to a script hash (address starting with 3) instead of a public-key hash (address starting with 1). The aim is to investigate such scripts.
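The distinction by leading character can be expressed as a simple check. This sketch relies only on the prefixes mentioned above; the `address_kind` function and the second example string (a placeholder, not a valid address) are hypothetical, and no checksum validation is performed.

```python
def address_kind(address: str) -> str:
    """Classify an address by its first character only (no checksum validation)."""
    if address.startswith("1"):
        return "public-key hash"
    if address.startswith("3"):
        return "script hash (P2SH)"
    return "other"


print(address_kind("1111111111111111111114oLvT2"))
print(address_kind("3PlaceholderScriptHashAddress"))  # placeholder string
```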

Miner Analysis: We plan to build a new module that shows information about miners, e.g., the relationship between miners and hashrate.

In addition to what is described in this section, we will also extend the power of BlockChainVis by making it able to analyse not only Bitcoin, but also other crypto-currencies, such as Ethereum.