1 Introduction

Taxes are divided into two types, namely direct taxes and indirect taxes. The major difference between the two is the way in which they are collected. Direct taxes are collected from individuals and corporations; income tax and gift tax are examples. Indirect taxes are imposed on the goods and services consumed. In this work, we aim to detect evasion in the indirect taxation system. Value-added Tax (VAT) [26] and Goods and Services Tax (GST) [5] are indirect taxes. They are collected by a third party (e.g., a shopkeeper) from the consumer who purchases the goods. Ultimately, it is the consumer who bears the burden of the tax payment.

Recent tax reforms in developing countries have opted for indirect taxation to expand the tax base. Determining the “point of levy” is an involved task in indirect taxation. A simple approach is to levy and collect the tax at a single point in the value chain, for example, the point of final consumption. The retail sales tax (RST) in the United States of America is an example. A single point of levy is easy to administer, but it has a few flaws. Many developing countries have a high concentration of informal economic activities at the consumption points. For example, the market share of informal economic activities in India is almost 50%. In these countries, there is a major risk of losing tax at the final consumption point. Sales tax can be sidestepped by taking the goods out of the value chain right at the onset. This results in a parallel economy, in which a major part of the value chain stays outside the regulatory authority’s watch. One approach to handling this problem is a multipoint taxation system, such as the Value-added Tax (VAT) and the Goods and Services Tax (GST) [5]. The Goods and Services Tax, which has been implemented in India since July 2017, is a comprehensive, multi-stage, destination-based tax that is levied on every value addition. This tax has replaced many indirect taxes that previously existed in India.

1.1 Multipoint Taxation System (VAT and GST)

In this system, the tax is levied incrementally at each stage of production, depending upon the value added to the goods in that stage. The tax paid on purchases (input tax) is set off against the tax levied on sales (output tax) [8]. Figure 1 shows how the tax is collected incrementally in this system.

Fig. 1. Multipoint taxation system

  • In this example, the manufacturer purchases raw material of value 1000$ from the raw material dealer, paying 100$ as tax at a 10% tax rate. The raw material dealer remits to the government the tax amount that he has collected.

  • Then the retailer purchases the processed goods from the manufacturer for, say, 1200$. An amount of 120$ is then paid to the manufacturer as tax. The manufacturer pays the government the difference between the tax he has collected (from the retailer) and the tax he has paid (to the raw material dealer), i.e., 120$ − 100$ = 20$.

  • The consumer then buys the finished goods from the retailer for 1500$, paying a tax of 150$. By the same argument as in the previous steps, the retailer pays 30$ (i.e., 150$ − 120$) to the government.

It can be easily calculated that the total tax received by the government is 150$ (100$ + 20$ + 30$), and that it is indirectly paid in full by the consumer of the goods. Hence, the raw material dealer, the manufacturer, and the retailer act as agents of the government in collecting the tax.
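The arithmetic of the example above can be reproduced with a short Python sketch. The function name `vat_remittances` and the use of an integer percentage rate are illustrative choices, not part of the original work.

```python
def vat_remittances(sale_values, rate_percent):
    """Return the net tax each seller in the chain remits to the government.

    sale_values  -- successive sale prices along the value chain (integers)
    rate_percent -- flat tax rate in percent (integer, to keep the arithmetic exact)
    """
    remitted = []
    input_tax = 0  # the first seller in the chain has paid no input tax
    for value in sale_values:
        output_tax = value * rate_percent // 100  # tax collected on this sale
        remitted.append(output_tax - input_tax)   # set off the input tax
        input_tax = output_tax                    # becomes the buyer's input tax
    return remitted


# Raw material dealer, manufacturer, retailer, as in the example above.
payments = vat_remittances([1000, 1200, 1500], 10)
print(payments, sum(payments))
```

The sum of the remittances equals the tax paid by the final consumer, which is the property the multipoint system relies on.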

This method creates a market-driven compliance regime, which is otherwise difficult to achieve. At each node in the value chain, the purchaser and the seller have conflicting goals regarding their tax liabilities: the seller tries to understate his sales, while the purchaser tries to overstate his purchases. This conflict provides market-driven checks and balances.

1.2 Tax Evasion Methods in Multipoint Taxation System

Circular Trading in GST and VAT: In GST and VAT, the market-driven checks and balances did not work as expected. In the majority of tax evasion cases, business dealers deliberately manipulate their actual business transactions in their monthly tax returns, motivated by the profit gained by evading tax. Invoice trading is a method of tax evasion [2] in which a dealer sells goods to the end user without issuing an invoice but still collects the tax. Later, the dealer issues a fake invoice to a third party, who uses it to increase their input tax credit. This minimizes the amount of tax the third party has to pay to the government in cash (the difference between the tax collected at the time of sales and the tax paid at the time of purchases). These tax manipulations can be spotted by tax enforcement officials. To hide them, malicious dealers create a well-entrenched “racket,” in which a large number of bogus firms (shell firms) are created to manipulate the title of goods in the first place, followed by fake transactions among them to evade easy systemic detection. Malicious dealers show huge fake sales and purchases among themselves and dummy dealers (shell firms) without any significant value addition, as shown in Fig. 2.

In Fig. 2, illegitimate transactions corresponding to fake invoices (invoice trading) are shown using red lines. These are from dealer x to q, x to z, and q to z. Dealers q and z use these fake invoices to minimize their tax liability. To confuse the tax enforcement officers, these dealers superimpose several fake transactions (dummy transactions) on these illegitimate transactions; these are shown using gray lines. Note that the dealers superimpose fake transactions such that the tax liability of each dealer due to the fake transactions is zero, i.e., the amount of tax paid on the fake purchases is equal to the amount of tax collected on the fake sales.

Fig. 2. Circular flow of sales/purchases

Since the value addition due to the fake transactions is equal to zero, the dealers do not pay any tax on these fake transactions; rather, the transactions confuse the tax officials about the illegitimate transactions. It is important to note that the volume of fake sales and purchase transactions among malicious and dummy dealers is huge compared to their genuine sales and purchase transactions with others. This tax evasion technique is known as circular trading [9, 10, 21]. In this way, the malicious dealers complicate the process of detecting their illegitimate transactions (invoice trading).

Carousel Fraud: Carousel fraud is a method of stealing public money by exploiting the VAT-free trade arrangements between European Union member countries. Organized crime groups import goods from another country and then sell them, charging VAT to the customer but absconding with the VAT instead of passing it to the government. To make this process hard to detect, these groups buy and sell the goods multiple times between bogus companies before the final transaction in which the VAT is stolen [4, 18].

Carousel fraud and circular trading have a lot of characteristics in common. The solutions for circular trading can be extended to carousel fraud. In this paper, we work on circular trading.

1.3 Motivation for This Work

It is impractical for tax officials to manually detect illegitimate transactions in circular trading due to the enormous size of the tax department’s database, the complicated sequences of sales and purchase transactions created by the malicious dealers, the unknown identity of the traders doing these manipulations, etc. These challenges call for sophisticated big data and graph-theoretic techniques. We used the RHadoop framework [6] for implementation.

The rest of the paper is structured as follows. In Sect. 2, we describe several existing approaches that are used to perform cluster analysis on problems similar to ours. In Sect. 3, the problem is formulated as detecting communities in social networks and removing cycles created by fake transactions. In Sect. 4, we outline the experimental setup and the results obtained from this work. We implemented these algorithms for the Commercial Taxes Department, Government of Telangana, India.

2 Related Work

Circular trading is a notorious problem in stock markets. In [21], Palshikar et al. proposed a highly customized algorithm for identifying colluding sets in stock trading. In [27], Wang et al. proposed an algorithm to identify colluding sets in the instrument of future markets. In [13], Islam et al. gave an algorithm for identifying collusion sets and cross-trading collusion sets.

In [20], Nigrini et al. suggested statistical methods that can be used in the initial stages of auditing. These techniques are based on Benford’s law, a unique characteristic of tabulated numbers. This law gives the expected probability of the digits in tabulated data. In [1], Arben Asllani et al. proposed a method that can be used by chartered accountants to detect accounting fraud.

In [16], Klymko et al. gave an undirected edge weighting method based on directed triangles to detect communities in directed networks. They proposed a new measure of the quality of communities in social networks, depending on the number of 3-cycles that span communities. They showed that the resulting communities have fewer 3-cycle cuts. In [14, 23], the authors showed the significance of triangles in community detection in undirected networks. In [15], Khadivi et al. showed that a proper assignment of weights to the edges of a social network can improve community detection. They used this weighting as an initial step for the Newman greedy modularity optimization algorithm. In [17], the authors proposed a method that identifies classification rules to detect fraudulent samples. They discovered spatial relationships between fraud and non-fraud financial statements. In [12], the authors proposed a clustering-based data mining algorithm to find outliers in taxation data. In [11], the authors used clustering algorithms to identify groups of taxpayers and then used several classification models to detect potential users of false invoices in a given year.

In [6], Dean and Ghemawat explained the MapReduce programming model for processing large data sets. In [3], Behera et al. explained the implementation of a random-walk-based graph clustering algorithm using the MapReduce framework. In [25], Rajaraman et al. gave algorithms to handle massive data sets.

3 Problem Statement and Solution

It is impractical for the tax officials to detect illegitimate transactions in circular trading manually. Our objective is to design an algorithm to detect illegitimate transactions and the set of dealers doing these transactions. We follow the four-step approach below to solve the problem.

  • Step 1: Construct an edge-weighted directed graph from the waybill database, where vertices correspond to dealers, and the weights of directed edges are defined by the number of fake transactions, which are identified by Benford’s analysis.

  • Step 2: Convert this edge-weighted directed graph into an edge-weighted undirected graph.

  • Step 3: Identify the groups of dealers who perform excessive trade among themselves, as compared to their sales and purchases with other dealers. The problem is formulated as finding fraudulent communities in a social network.

  • Step 4: Remove cycles formed by fake transactions within each group of these dealers.

3.1 Step 1: Construction of Sales Transaction Graphs

Waybill Database: Table 1 is a sample of a waybill database. Each row corresponds to a sales transaction and contains the seller name, purchaser name, time of sale, and value of sale.

Table 1. Waybill database

The actual database contains many more details, such as the type of goods, the rate of tax, the quantity of goods, the vehicle used for transporting the goods, vehicle number, transporter name, invoice number, UOM (unit of measure), inserted date, etc. The data we used contains several million rows.

Benford Analysis: Benford’s law, which is also known as the first-digit law, underlies a statistical technique for fraud detection [1, 7, 20]. This law has intrigued mathematicians for over a century. It gives the probability of the leading digit in naturally occurring numerical data.

Benford’s law states that for any numerical data with a distribution of numbers spanning several orders of magnitude (an order of magnitude is an approximate measure of the number of digits that a number has in the commonly used base-ten number system), the probability of a number starting with the digit d is given by \(\log _{10}(1+1/d)\), where \(d \in \{1, 2, \dots ,9\}\).

Mean absolute deviation (MAD) is a statistical measure that can be used to determine whether the first digits of the data follow the probability distribution given by Benford’s law. The mean absolute deviation is \(MAD = \frac{1}{m}\sum _{j=1}^{m}|OP_j-EP_j|\), where \(OP_j\) is the observed probability of the \(j^{th}\) bin, \(EP_j\) is the expected probability of the \(j^{th}\) bin, and m is the total number of bins (in this case, it is equal to 9). Based on the MAD value, we can assess the conformity between the expected and observed probabilities as given below [19].

  • MAD between 0.000 and 0.004 indicates “Close conformity”

  • MAD between 0.004 and 0.008 indicates “Acceptable conformity”

  • MAD between 0.008 and 0.012 indicates “Marginally acceptable conformity”

  • MAD greater than 0.012 indicates “Nonconformity”
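As a rough illustration of the first-digit test described above, the following Python sketch computes the expected Benford probabilities, the MAD of a list of values, and the conformity label. The function names are hypothetical, and the handling of values that fall exactly on a threshold is an assumption, since the ranges in the list above share their endpoints.

```python
import math
from collections import Counter

# Expected first-digit probabilities log10(1 + 1/d) for d = 1..9.
EXPECTED = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(value):
    """Leading non-zero digit of a non-zero number."""
    # Stripping leading zeros and the decimal point leaves the first
    # significant digit at the front of the string.
    return int(str(abs(value)).lstrip("0.")[0])


def benford_mad(values):
    """Mean absolute deviation between observed and expected first-digit bins."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return sum(abs(counts[d] / total - EXPECTED[d - 1]) for d in range(1, 10)) / 9


def conformity(mad):
    """Map a MAD value to the bands listed above.

    Assumption: a value exactly on a threshold is placed in the
    closer-conformity band.
    """
    if mad <= 0.004:
        return "Close conformity"
    if mad <= 0.008:
        return "Acceptable conformity"
    if mad <= 0.012:
        return "Marginally acceptable conformity"
    return "Nonconformity"


# Values that all start with the same digit are far from Benford's law.
print(conformity(benford_mad([500, 510, 520, 530, 540])))
```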

Sales Transaction Graph: We use the waybill database to construct an edge-weighted directed social network denoted by \(G_d=(V_d,E_d)\), where \(V_d\) is the vertex set (each dealer corresponds to a vertex) and \(E_d\) denotes the set of weighted directed edges. We call this social network the sales transaction graph. Below we propose a method to assign weights to the edges.

Let m be the number of sales transactions in the waybill database from dealer vertex x to dealer vertex y and \(v_1,v_2,v_3,\dots ,v_m\) be values of these sales. Let \(\beta (xy)\) be the MAD value of the first digit Benford’s analysis on \(v_1,v_2,v_3,\dots ,v_m\). Based on the value of \(\beta (xy)\), we can establish the conformity between expected and observed distribution.

The weight w(xy) of the edge from vertex x to vertex y in graph \(G_d\) is given by \(w(xy)=\frac{m\cdot \sum _{i=1}^m v_i}{m+\sum _{i=1}^m v_i}\cdot e^{1000\cdot \beta (xy)}\). Note that smaller weights are assigned to edges with fewer transactions or a smaller total transaction value [24]. The weight of the edge xy increases exponentially with \(\beta (xy)\), i.e., more weight is assigned when the conformity between the expected distribution and the observed distribution is lower.
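The edge weight formula can be sketched in Python as follows. The function name `edge_weight` is illustrative, and the MAD value \(\beta (xy)\) is taken as an input rather than recomputed here.

```python
import math


def edge_weight(values, beta):
    """Weight of the directed edge x -> y.

    values -- sale values v_1..v_m of the transactions from x to y
    beta   -- MAD of the first-digit Benford analysis on those values
    """
    m = len(values)
    total = sum(values)
    # (m * sum) / (m + sum) stays small when either the number of
    # transactions or their total value is small.
    base = (m * total) / (m + total)
    # The exponential factor amplifies edges that deviate from Benford's law.
    return base * math.exp(1000 * beta)


# The same 25 transactions with perfect conformity and with a MAD of 0.012.
values = [100] * 25
print(edge_weight(values, 0.0), edge_weight(values, 0.012))
```

Because of the factor \(e^{1000\cdot \beta (xy)}\), moving from a MAD of 0 to a MAD of 0.012 multiplies the weight by \(e^{12}\), so nonconforming edges dominate the graph.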

3.2 Step 2: Construction of Weighted Undirected Graph

The majority of work in community detection has been done on undirected graphs. In this paper, we propose a method to convert an edge-weighted directed graph into an edge-weighted undirected graph.

There are several metrics to measure the quality of a community. One major idea is that flow tends to stay within the community. Hence, cycles in a graph play an important role in community detection. In detecting communities in circular trading, 2-cycles and 3-cycles play an important role. We propose a weighting scheme to turn an edge-weighted directed graph into an edge-weighted undirected graph. The weight given to an edge is based on the triangles and 2-cycles in which this edge is involved [15, 16, 23].

In the following, we explain how to construct an edge-weighted undirected graph \(G_u=(V_u,E_u)\) from the edge-weighted directed graph \(G_d\) described in Subsect. 3.1. Let \(C=(a,b,c)\) be a 3-cycle in \(G_d\). Cycle C can be any one of the four types of cycles shown in Fig. 3. The weight of cycle C is defined as follows.

Fig. 3. Different types of 3-cycles

  • If it is of Type a, then the weight of the cycle C is given by \(W(C)=\min \{w(ab),w(bc),w(ca)\}\cdot 1\)

  • If it is of Type b, then the weight of the cycle C is given by \(W(C)=\min \{w(ab),w(bc),w(ca)\}\cdot 1.5\)

  • If it is of Type c, then the weight of the cycle C is given by \(W(C)=\min \{w(ab),w(bc),w(ca)\}\cdot 1.75\)

  • If it is of Type d, then the weight of the cycle C is given by \(W(C)=\min \{w(ab),w(bc),w(ca)\}\cdot 1.875\)

The multipliers 1, 1.5, 1.75 and 1.875 are given to 3-cycles of types a, b, c and d, respectively. They were chosen empirically based on the clustering performance. The number of reciprocal edges in a triangle conveys the strength of circular trading. Hence, we give more weight to a 3-cycle with more reciprocal edges.

Suppose directed edge ab is in cycles \(C_1,C_2,\dots ,C_m\) and directed edge ba is in cycles \(D_1,D_2,\dots ,D_n\). Then the weight of the undirected edge ab in graph \(G_u\) is given by the maximum among the following three values:

  • \(\max \{W(C_1),W(C_2),\dots ,W(C_m)\}\)

  • \(\max \{W(D_1),W(D_2),\dots ,W(D_n)\}\)

  • \(\min \{w(ab),w(ba)\}\cdot 2\)
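The conversion can be sketched in Python as below. Fig. 3 is not reproduced here, so the sketch assumes that Types a, b, c and d correspond to directed 3-cycles with zero, one, two and three reciprocal edges, respectively; this mapping, the function name `undirected_weights`, and the decision to omit vertex pairs that lie on no 2-cycle or 3-cycle are assumptions. The brute-force enumeration over vertex triples is only suitable for small graphs.

```python
from itertools import permutations

# Multipliers indexed by the number of reciprocal edges in the triangle.
# Assumption: Types a-d of Fig. 3 have 0, 1, 2 and 3 reciprocal edges.
MULTIPLIER = (1.0, 1.5, 1.75, 1.875)


def undirected_weights(w):
    """Convert directed edge weights into undirected edge weights.

    w -- dict mapping a directed edge (x, y) to its weight w(xy)
    Returns a dict mapping frozenset({x, y}) to the undirected weight.
    """
    nodes = {v for edge in w for v in edge}
    best = {}

    def bump(x, y, value):
        # Keep the maximum candidate weight seen for the pair {x, y}.
        key = frozenset((x, y))
        best[key] = max(best.get(key, 0.0), value)

    # Directed 3-cycles a -> b -> c -> a (each is visited three times,
    # which is harmless because only the maximum is kept).
    for a, b, c in permutations(nodes, 3):
        if (a, b) in w and (b, c) in w and (c, a) in w:
            reciprocal = sum(e in w for e in ((b, a), (c, b), (a, c)))
            cycle_weight = min(w[a, b], w[b, c], w[c, a]) * MULTIPLIER[reciprocal]
            for x, y in ((a, b), (b, c), (c, a)):
                bump(x, y, cycle_weight)

    # 2-cycles x <-> y contribute twice the smaller of the two weights.
    for x, y in w:
        if (y, x) in w:
            bump(x, y, 2 * min(w[x, y], w[y, x]))
    return best


directed = {("a", "b"): 4, ("b", "c"): 6, ("c", "a"): 5, ("b", "a"): 1}
for pair, weight in undirected_weights(directed).items():
    print(sorted(pair), weight)
```

In this example the triangle has one reciprocal edge, so its weight is min{4, 6, 5} · 1.5 = 6, which exceeds the 2-cycle value 2 · min{4, 1} = 2 for the pair {a, b}.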

3.3 Step 3: Community Detection

We use the WalkTrap algorithm to detect communities. WalkTrap is a hierarchical agglomerative clustering algorithm that uses a distance measure based on random walks [22]. It is based on the assumption that a random walker spends a longer time inside a strong community due to the high density of edges within the community. The algorithm measures the similarity between vertices and between communities by defining a distance between them. This distance is calculated from the probabilities that the random walker moves from one vertex to another in a fixed number of steps.

Distance Between Communities. Let us consider random walks of a given length t on graph G. Let \(p^t_{ij}\) be the probability of reaching vertex j from vertex i in a random walk of length t. The value of t should be large enough to capture the community structure of G, but not so large that the walk reaches the stationary distribution. Generally, the value of t is between three and six. The basic idea behind this algorithm is that two vertices of the same community tend to see all the other vertices in the same way. Thus, if vertices i and j are in the same community, we can expect that \(\forall k,~p^t_{ik} \cong p^t_{jk}\). The distance between vertices i and j can then be defined as \(\sqrt{\sum _{k=1}^n\frac{(p^t_{ik}-p^t_{jk})^2}{d(k)}}\), where d(k) is the degree of vertex k [22]. One can generalize the distance between vertices to a distance between communities in a straightforward way.
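The vertex distance above can be illustrated with a small pure-Python sketch. It covers only the distance computation, not the full WalkTrap clustering, and it assumes a connected graph given as a symmetric weight matrix, with d(k) taken as the weighted degree of vertex k. The function names are hypothetical.

```python
import math


def walk_probabilities(adj, t):
    """Return (P^t, degrees), where P = D^{-1} A is the random-walk matrix.

    adj -- symmetric weight matrix as a list of lists, no isolated vertices
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    step = [[adj[i][j] / deg[i] for j in range(n)] for i in range(n)]
    # Start from the identity matrix and multiply by P a total of t times.
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        power = [[sum(power[i][k] * step[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return power, deg


def walk_distance(adj, i, j, t=4):
    """Random-walk distance between vertices i and j for walks of length t."""
    power, deg = walk_probabilities(adj, t)
    return math.sqrt(sum((power[i][k] - power[j][k]) ** 2 / deg[k]
                         for k in range(len(adj))))


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

print(walk_distance(adj, 0, 1), walk_distance(adj, 0, 5))
```

Vertices 0 and 1 sit in the same triangle and end up much closer to each other than vertices 0 and 5, which lie in different triangles; this is the property the agglomerative step exploits.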

3.4 Step 4: Removing Cycles in Each Cluster

Consider any community (cluster) C given by the community detection algorithm. Using the waybill database explained earlier, we construct a directed edge-labeled multi-graph called sales and purchase graph, denoted by \(G_{sp}=(U,E,\gamma )\), where U is the set of vertices (each vertex corresponds to a dealer in C), E is the set of labeled directed edges (an edge from vertex x to vertex y corresponds to a sales transaction from x to y) and \(\gamma \) is the function that associates a 2-tuple for each labeled edge, where the first element of the tuple is the time of sales of this transaction and the second element is the value of sales of this transaction.

We use the following notation. Note that each edge in the graph has two parameters: the time of sale and the value of sale. The end_time of a cycle is defined as the time of the most recent sales transaction among all the sales transactions corresponding to the edges in the given cycle. The start_time of a cycle is defined as the time of the least recent transaction among all the sales transactions corresponding to the edges in the given cycle. The time_gap of a cycle is defined as the difference between end_time and start_time. The maxval of a cycle is defined as the maximum value among the values of all the transactions corresponding to the edges in the given cycle. The minval of a cycle is defined as the minimum value among the values of the transactions corresponding to the edges in the given cycle. The valgap of a cycle is defined as the difference between maxval and minval. The trust score of a cycle is defined as time_gap * valgap.

From in-depth research by taxation authorities, it is observed that time_gap and valgap of any fake sales cycle are very small, which means the trust score of any fake cycle is small. Our motive is to remove all fake cycles from the sales and purchase graph. Then the remaining graph will be a directed acyclic graph (DAG). Note that the resultant directed acyclic graph contains all suspicious transactions. This makes fraud detection process simpler which allows us to do a deeper analysis on suspicious transactions to identify tax evaders. Below we give a brief sketch of the fake cycles removal algorithm.

  1. Select a cycle D in the sales and purchase graph \(G_{sp}\) with the following conditions:

    • Condition 1: end_time of D is minimum among all the cycles in \(G_{sp}\)

    • Condition 2: Among the cycles satisfying Condition 1, the trust score of D is minimum

    • Condition 3: Among the cycles satisfying Condition 2, the length of D is minimum

  2. Let y be the minimum of the values of sales of all the edges in D. Subtract y from the values of sales of all edges in D.

  3. Remove any edge from D whose value of sales becomes zero.

  4. Repeat steps one to three as long as \(G_{sp}\) contains a cycle.
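The four steps above can be sketched in Python as follows. This is a brute-force illustration for a small community, not the implementation used in the case study: cycles are re-enumerated from scratch in every round, transaction values are assumed to be positive integers, and times are assumed to be plain numbers. The function names are hypothetical.

```python
def simple_cycles(edges):
    """List every simple directed cycle once, as a list of edge indices.

    edges -- list of [seller, buyer, time, value]; parallel edges are allowed.
    """
    out = {}
    for idx, (seller, _, _, _) in enumerate(edges):
        out.setdefault(seller, []).append(idx)
    nodes = sorted({e[0] for e in edges} | {e[1] for e in edges})
    cycles = []
    for start in nodes:
        # Only walk through vertices larger than `start`, so each cycle is
        # reported exactly once, rooted at its smallest vertex.
        stack = [(start, [], {start})]
        while stack:
            node, path, seen = stack.pop()
            for idx in out.get(node, []):
                nxt = edges[idx][1]
                if nxt == start:
                    cycles.append(path + [idx])
                elif nxt > start and nxt not in seen:
                    stack.append((nxt, path + [idx], seen | {nxt}))
    return cycles


def remove_fake_cycles(transactions):
    """Cancel cycles until the graph is acyclic.

    Returns (remaining edges, list of (cycle as seller/buyer pairs, amount)).
    """
    edges = [list(t) for t in transactions]
    removed = []

    def priority(cycle):
        times = [edges[i][2] for i in cycle]
        values = [edges[i][3] for i in cycle]
        trust = (max(times) - min(times)) * (max(values) - min(values))
        # Conditions 1-3: end_time, then trust score, then length.
        return (max(times), trust, len(cycle))

    while True:
        cycles = simple_cycles(edges)
        if not cycles:
            return edges, removed
        cycle = min(cycles, key=priority)
        y = min(edges[i][3] for i in cycle)          # step 2
        for i in cycle:
            edges[i][3] -= y
        removed.append(([(edges[i][0], edges[i][1]) for i in cycle], y))
        edges = [e for e in edges if e[3] > 0]       # step 3


waybill = [("x", "y", 1, 100), ("y", "z", 2, 120), ("z", "x", 3, 110), ("x", "q", 4, 500)]
remaining, removed = remove_fake_cycles(waybill)
print(remaining)
print(removed)
```

Each round removes at least one edge, so the loop terminates after at most as many rounds as there are transactions.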

3.5 Algorithms

Detecting and Managing Outliers: According to Benford’s law, the probability of nine occurring as the first digit is 0.046 [19]. We therefore need at least twenty-two transactions between any pair of dealers to get a valid Benford’s score, so that the digit nine is expected to occur at least once (1/0.046 ≈ 21.7). As part of data cleansing, we remove the sales transactions between a pair of dealers (vertices) if the number of sales transactions between them is less than twenty-two. If the value of any sales transaction in the waybill database is more than the third quartile plus 1.5 times the interquartile range of the values of sales transactions, then we replace the value of this sales transaction by the third quartile plus 1.5 times the interquartile range [2].
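A minimal Python sketch of this cleansing step is given below. It assumes that the quartiles are computed over all transaction values in the database before the low-count pairs are removed, which the text does not specify; the function name `cleanse` and the use of the default quartile method of `statistics.quantiles` are likewise assumptions.

```python
import statistics
from collections import Counter

MIN_TRANSACTIONS = 22  # minimum number of sales per dealer pair


def cleanse(transactions):
    """Cap outlying values and drop dealer pairs with too few transactions.

    transactions -- list of (seller, buyer, value)
    """
    values = [value for _, _, value in transactions]
    # Quartiles over all values (assumption: computed before filtering).
    q1, _, q3 = statistics.quantiles(values, n=4)
    cap = q3 + 1.5 * (q3 - q1)
    counts = Counter((seller, buyer) for seller, buyer, _ in transactions)
    return [
        (seller, buyer, min(value, cap))        # replace outliers by the cap
        for seller, buyer, value in transactions
        if counts[(seller, buyer)] >= MIN_TRANSACTIONS
    ]


# 22 sales from x to y (one of them an outlier) and a single sale from p to q.
sample = [("x", "y", 100 + i) for i in range(21)]
sample += [("x", "y", 10_000), ("p", "q", 105)]
cleaned = cleanse(sample)
print(len(cleaned), max(value for _, _, value in cleaned))
```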

Algorithms: Algorithm 1 is a community detection algorithm. First, we apply this algorithm to detect communities. Removing outliers and constructing the directed graph are highly time-consuming operations in this algorithm due to the millions of purchase and sales transactions. These operations are parallelized using the MapReduce framework. Later, we apply Algorithm 2 to identify illegitimate transactions. Note that both algorithms run in polynomial time.

Algorithm 1. Community detection

Algorithm 2. Identification of illegitimate transactions

4 Case Study

4.1 Experimental Setup

We used the R programming language for data mining and the Hadoop framework for storing data. We used the RHadoop open-source analytics solution to integrate the R programming language with Hadoop.

4.2 Identifying Communities

In our data set, there are 0.6 million dealers. The size of our data set is 1.5 TB. Figure 4 shows the business among some of these dealers. We applied Algorithm 1 to this data set and obtained several communities that are doing heavy circular trade. Figure 5 shows a few of the communities obtained. We used two measures, namely modularity and coverage, to validate the clustering. The modularity and coverage are 0.74 and 0.82, respectively.
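The exact definitions of the two validation measures are not spelled out here. The Python sketch below uses the standard ones for a weighted undirected graph, which we assume are those intended: coverage is the fraction of the total edge weight that lies inside communities, and modularity is the Newman modularity. The function name is hypothetical.

```python
def coverage_and_modularity(edges, community):
    """Coverage and Newman modularity of a partition.

    edges     -- iterable of (u, v, weight) for an undirected graph, no self-loops
    community -- dict mapping each vertex to its community label
    """
    total = 0.0   # total edge weight m
    intra = {}    # edge weight inside each community
    degree = {}   # sum of weighted degrees of each community
    for u, v, weight in edges:
        total += weight
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0.0) + weight
        degree[cv] = degree.get(cv, 0.0) + weight
        if cu == cv:
            intra[cu] = intra.get(cu, 0.0) + weight
    coverage = sum(intra.values()) / total
    modularity = sum(
        intra.get(c, 0.0) / total - (d / (2 * total)) ** 2
        for c, d in degree.items()
    )
    return coverage, modularity


# Two triangles joined by one edge, each triangle being one community.
edges = [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 1)]
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(coverage_and_modularity(edges, labels))
```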

Fig. 4. Complex network of sales and purchases

4.3 Identifying Illegitimate Transactions

We took one community with four dealers, which is shown in Fig. 6. These four dealers are doing heavy circular trade among themselves. Their sales, purchase and tax details are shown in Fig. 7. The total tax paid by these four dealers is Indian rupees 0.03 million, as shown in column seven. The tax they collected on sales (output tax) is Indian rupees 367 million, as shown in column six. They set off almost this entire amount against the tax they paid on purchases (input tax), which is shown in column four. In a genuine iron and steel business, the ratio between the input tax and the output tax will be less than 0.95, but here it is almost one. We applied Algorithm 2 to this community to remove fake cycles and identify illegitimate transactions. When the tax authorities physically visited the premises of these companies, they found that these are shell companies.

Fig. 5. Experimental result

Fig. 6. Cluster of four dealers

Fig. 7. Business details

5 Conclusion

Here we studied a widely practiced tax evasion method in GST called circular trading. Circular trading is a tax evasion practice in which a set of malicious dealers carry out heavy fake sales and purchase transactions among themselves that go around in a circular manner in a very short time without any meaningful value addition. They practice this technique to hide illegitimate transactions. We addressed the problem of identifying the clusters of dealers who do excessive fake trade among themselves, together with the illegitimate transactions performed by them. We implemented this technique using the RHadoop big data framework for the Commercial Taxes Department, Government of Telangana, India. Our results are helping the tax authorities to identify illegitimate transactions more easily and to take legal action against those who carry them out. As future work, we plan to develop more sophisticated algorithms that detect colluding communities by exploiting the different patterns made by the fraudulent dealers.