1 Introduction

Corruption is a common problem that damages the competitiveness and the economy of many countries. As a flagrant breach of laws, agreements, or codes of conduct, corruption degrades the performance of state institutions and has a negative impact on citizens’ participation (and confidence) in the management of public affairs. In public administrations where corruption is widespread, public procurement usually suffers from additional costs that worsen the purchasing conditions of goods and services and deter market competition; ultimately, the quality of the services provided to citizens is significantly damaged.

Transparency International (Footnote 1) categorises corruption into three main groups [7]: grand (acts committed at a high level of government), petty (everyday abuses of entrusted power by low- and mid-level public officials in their relations with ordinary citizens) and political (manipulations in procedures or policies). The impact of all these types of corruption on society can be assessed in a myriad of ways. The cost of corruption is usually divided into four main categories: political, social, environmental and economic. Regarding the economic category, an analysis from the European Parliament in 2016 revealed that corruption throughout Europe costs around 1000 billion euros a year. This astounding sum equates to 6.3% of overall EU-28 GDP [4], which makes clear the serious implications for economic development.

On the other hand, Transparency International also includes, in its yearly reports, a number of recommendations aimed at fighting corruption. Among others, the organisation suggests that representatives from government, business and civil society work together to develop standards and procedures to reduce and mitigate fraud in public administration. A key initiative in support of this goal is the implementation of electronic administration services. Electronic administration (e-administration) and electronic government (e-government) have expanded widely in public administrations in order to provide efficient and transparent public services to citizens [3, 6].

Public transparency and open government are only possible when citizens have the right to access the documents and proceedings of the government, allowing for effective public oversight. This is also possible because, nowadays, even the smallest detail concerning public procedures must be registered and stored in databases and other electronic information repositories. This massive amount of data provides a magnificent source of knowledge when the appropriate exploitation tools are employed [2, 5, 9, 10]. For instance, this information can be analysed in order to induce common fraudulent (behavioural) patterns from corruption cases. However, at the same time, people with criminal or illicit intentions could also try to commit fraud by altering the normal course of events (common patterns) in a procurement process, as well as by identifying loopholes or arguments that protect them from being discovered. Examples of this include the abuse of non-competitive procedures on the basis of legal exceptions, deficient supervision or suspicious selection, bid rotation or bid rigging, collusive tendering and market sharing, etc. Such strategies would complicate the detection and investigation task enormously.

With the aim of detecting and preventing bad practices and fraud in public administration, in this work we present SALER (“Rapid Alert System”, named after its Spanish initials), a data science-based framework launched in 2017. SALER is a joint development between the Universitat Politècnica de València (research and innovation) and Generalitat Valenciana (governance and public administration) which, by making better use of the expertise available within the public institutions, seeks to improve the discovery (and prevention) of irregularities, fraud and other malpractices in the area of public management. SALER also aims to improve the involvement of civil society in public processes by further increasing transparency and participation in the review and control of public administration. As a technical solution, besides providing statistics, visualisations and risk scores, SALER brings together a number of statistical models to analyse, describe and predict potential patterns of fraud and corruption. Thanks to the official transparency framework established recently, SALER is allowed to access and use the government’s internal (public and non-public) data sources which, in turn, are enriched by means of a number of external sources, mainly the Spanish Mercantile Registry, as well as the chamber of notaries and even information from social networks. Finally, given the open and transparent nature in which SALER has been conceived, it is meant to be shared with other government bodies in Spain (and Europe); it is therefore being developed following the idea that it should be freely available to everyone for use or adaptation, without restrictions from copyright, patents or other mechanisms of control.

The paper is organised as follows. Section 2 introduces and describes the SALER project and its main developments. Section 3 describes the software solution developed and includes some examples of its use. Section 4 briefly outlines the most relevant related work. Finally, Sect. 5 closes with the conclusions and future work.

2 SALER Project

SALER aims to be a flexible and robust data science-based solution that, starting from relevant information regarding public expenditure, enables a quantitative and qualitative analysis of this information based on requirements and specifications set by the audit, control and governing bodies in the city of Valencia, while ensuring better compliance with policies, rules and regulations. Specifically, SALER is being designed to analyse and respond to a series of specific questions and requirements regarding fraud and corruption. To do so, we need to perform “intelligent mining” of internal and external data sources from administrative procedures by developing data analysis algorithms that cross-check data from various public and private institutions. Machine learning, pattern detection, data analysis and intelligent mining, as well as other procurement monitoring and analytic techniques, are being used to try to identify projects susceptible to risks of fraud, conflicts of interest or irregularities.

The effective integration of all these tools and techniques into the e-governance and e-procurement practices of the Valencia government would not only enhance decision making and further control of public expenditure, but also bring greater transparency through the simplification of audit and inspection tasks.

2.1 Scenario and Users

In Valencia, Spain, the law “on the general inspection of services and the alert system for the prevention of bad practices in the administration and public sector” (Footnote 2) is intended to bring important advances in preventing and combating corruption and fraud in public administration, as well as to provide legal support and security to the development of the SALER project in terms of data, means and instruments. This law establishes a new framework that affects the fulfilment of a number of ethical values in the public administration, in which the General Inspection of Services of Valencia (Footnote 3) plays an important role regarding fiscal intelligence: it promotes prevention and investigation functions and provides the investigators with full autonomy as well as with a full range of administrative and other support services.

In this context, the General Inspection of Services of Valencia is responsible, jointly with the Universitat Politècnica de València and among other related duties, for planning, coordinating and developing an ecosystem of intelligent analytical tools (i.e., SALER) for detecting and correcting bad practices and fraud in public administration. Investigators and analysts from this government body would be the end users of the final system. The cases detected as potential instances of corruption or fraud would be selected for further investigation through various mechanisms such as audits, lawsuits and cross-checking, among others. It is intended that the final data science solution will join the existing mechanisms for selecting suspected fraud cases, which can benefit from the models, advanced analytics and knowledge generated by SALER. Besides detecting anomalies related to fraud and corruption, SALER seeks to be a preventive and transversal approach, helping with the development of a risk assessment map and individual self-evaluation plans, as well as promoting collaboration between different public agencies.

2.2 Data Collection and Processing

With respect to the data sources used, cases of fraud and other irregularities are analysed and assessed by aggregating data from public expenditure as well as information regarding senior public appointments and other related internal data sources. The main sources of internal (public and non-public) information we are using include:

  • Economic and budgetary information:

    • Public procurement, tenders, modifications, procedures of adjudication, agreements, commissions.

    • Grants and public subsidies.

    • Financing information, public debt, average payment term, inventory of goods.

    • Treasury, petty cash and other disbursement of funds.

    • Public service (bank) accounts.

  • Organisational structure:

    • Information from senior officials and government, positions, incompatibilities.

    • Budgets, salaries, annual accounts.

Since all the previous information stems from different organisations, institutions and public administrations (which sometimes use different information management tools), the very first phase consists in retrieving, combining, cleaning and processing all the information from the government’s databases using appropriate ETL tools and libraries.

On the other hand, since using only internal data sources might be considered insufficient when trying to analyse corruption and fraud (due to their limited scope), we have also focused on the use of several (unstructured) external data sources, which provide a greater range of possibilities for data analysis. Examples of external data sources include information from the Spanish Mercantile Registry (BORME, Footnote 4), notarial acts (e.g., from IVAT, Footnote 5), state aids (e.g., from IVACE, Footnote 6) or even information from social networks. One of the main objectives of using this sort of external source has been the discovery of the various relationships that exist among the different entities (natural persons, legal persons or companies) that take part in public contracts and competitions. This way we can analyse conflicts of interest between public functions and private interests, overlaps between highest-level positions, common stakeholders and shareholders, etc.

Fig. 1. Example of information (current and historical positions and companies) extracted and processed from BORME. This data (stored in JSON format) can be processed to extract relationships between different entities (natural persons, legal persons or other companies).

By means of document scraping and report mining techniques (e.g., data from BORME is usually provided in PDF), we are able to extract useful information (entities and actions) about hundreds of thousands of companies in Spain that are officially registered in these external data sources. For this task, we extended and improved some Python libraries such as bormeparser (Footnote 7) and LibreBORME (Footnote 8), allowing us to download, parse and obtain relevant information (ids, dates, locations, sections, announcements, acts, registration data, etc.) not only from BORME files, but also from other unstructured sources such as notarial acts. This functionality is also supported by several existing Python libraries for PDF manipulation (PyPDF, Footnote 9), for extracting information from PDF documents (PDFMiner, Footnote 10), and for processing XML and HTML files (lxml, Footnote 11).

Fig. 2. Overlaps and conflicts of interest (regarding management positions) between different natural persons, legal persons or other companies. The graph shows the current (solid edges) and historical (dashed edges) positions (gray nodes) held by “Rodrigo de Rato y Figaredo” and “Miguel Blesa de la Parra” (yellow nodes) in the same companies (red nodes). Information extracted, structured and processed from the Spanish Mercantile Registry. (Color figure online)

Fig. 3. Inter-enterprise relations (management positions in common). The graph shows the overlap between two companies (“Orange Market” and “Aparcamiento de Leon”): “Ramón Blanco Balín” holds several managerial positions (secretary and advisor) in both companies. Same legend as in Fig. 2. Information extracted, structured and processed from the Spanish Mercantile Registry. (Color figure online)

The information extracted from the BORME files (see Fig. 1) is then processed and enriched with information from notarial reports and databases to obtain relationships between the different entities (natural persons, legal persons or other companies) based on the knowledge extracted from the legal acts published (company formation/dissolution, new appointments and designations, depositing dues, etc.). These relations are stored in non-relational (graph-oriented) databases such as Neo4j (Footnote 12), which are particularly suitable for handling, analysing and querying this sort of highly interrelated data [11]. Figures 2 and 3 show, respectively, examples of relationships between pairs of people and pairs of companies, which were obtained by processing BORME (Footnote 13).
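
As a minimal illustration of the kind of overlap query behind Figs. 2 and 3, the following Python sketch finds the companies in which two persons both hold, or have held, a position. It is only a sketch under stated assumptions: the record layout (`person`, `company`, `role`, `current`) and all names are hypothetical, and it reproduces neither the actual BORME JSON schema nor the Neo4j data model used in SALER.

```python
from collections import defaultdict

# Hypothetical position records; the field names are assumptions made for
# illustration and do not reproduce the real BORME JSON layout.
POSITIONS = [
    {"person": "Person A", "company": "Company X", "role": "president", "current": True},
    {"person": "Person B", "company": "Company X", "role": "advisor", "current": False},
    {"person": "Person A", "company": "Company Y", "role": "secretary", "current": False},
    {"person": "Person B", "company": "Company Z", "role": "advisor", "current": True},
]


def build_index(records):
    """Map each person to {company: [(role, current), ...]}."""
    index = defaultdict(lambda: defaultdict(list))
    for rec in records:
        index[rec["person"]][rec["company"]].append((rec["role"], rec["current"]))
    return index


def shared_companies(index, person_a, person_b):
    """Return the sorted companies where both persons hold or held a position."""
    companies_a = set(index.get(person_a, {}))
    companies_b = set(index.get(person_b, {}))
    return sorted(companies_a & companies_b)


index = build_index(POSITIONS)
overlap = shared_companies(index, "Person A", "Person B")
print(overlap)
```

In a graph database the same question becomes a short path query between two person nodes; the in-memory index above only mirrors that idea on a small scale.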

2.3 Data Analysis Requirements

We partnered with Generalitat Valenciana, through the General Inspection of Services, to customise a solution that helps analyse fraud cases by means of the definition of specific requirements in terms of questions and data analyses. Furthermore, we defined a number of risk scores and indicators aimed at bringing to light unusual behaviours in expenditures, abnormal patterns of services contracted, collusive tendering and market sharing, and several other factors. Their definition may be based on (a) the application of procedure policies, rules and regulations (e.g., in restricted procedures, contracting authorities may limit the number of candidates meeting the selection criteria that they will invite to tender to three); or on (b) the application of knowledge and experience extracted from real cases of fraud and other irregularities that have previously occurred (e.g., the practice of splitting contracts and the use of dubious procedures to avoid open procedures).
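
To make the idea of an experience-based indicator more concrete, the sketch below flags possible contract splitting: several contracts, each below an award threshold, given by the same authority to the same supplier within a short period and jointly exceeding that threshold. The threshold, the time window, the field names and the sample data are all assumptions chosen for illustration; they are not the rules or parameters actually used in SALER.

```python
from collections import defaultdict
from datetime import date

THRESHOLD = 15_000.0  # assumed direct-award limit, illustrative only
WINDOW_DAYS = 90      # assumed time window, illustrative only

# Hypothetical contract records.
CONTRACTS = [
    {"authority": "A1", "supplier": "S1", "amount": 9_000.0, "date": date(2018, 1, 10)},
    {"authority": "A1", "supplier": "S1", "amount": 8_500.0, "date": date(2018, 2, 20)},
    {"authority": "A1", "supplier": "S2", "amount": 4_000.0, "date": date(2018, 3, 1)},
    {"authority": "A2", "supplier": "S1", "amount": 14_000.0, "date": date(2018, 1, 5)},
]


def splitting_flags(contracts, threshold=THRESHOLD, window_days=WINDOW_DAYS):
    """Flag (authority, supplier) pairs whose sub-threshold contracts,
    awarded within `window_days` of each other, jointly exceed `threshold`."""
    groups = defaultdict(list)
    for c in contracts:
        if c["amount"] < threshold:
            groups[(c["authority"], c["supplier"])].append(c)

    flagged = {}
    for key, items in groups.items():
        items.sort(key=lambda c: c["date"])
        start, total, best = 0, 0.0, 0.0
        for end, c in enumerate(items):
            total += c["amount"]
            # Shrink the sliding window until it spans at most `window_days`.
            while (c["date"] - items[start]["date"]).days > window_days:
                total -= items[start]["amount"]
                start += 1
            if end > start:  # at least two contracts inside the window
                best = max(best, total)
        if best > threshold:
            flagged[key] = best
    return flagged


flags = splitting_flags(CONTRACTS)
print(flags)
```

A flag of this kind is only a pointer for an investigator; whether a flagged group is an actual irregularity still has to be checked case by case.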

All the integrated models and analyses performed can be grouped according to four main risk categories. In the following, we briefly describe each of them (Footnote 14).

  • Bid rigging in public procurement: This refers to competitors agreeing to coordinate bids or engaging in collusive tendering. In this case, we analyse a number of collusive patterns, thereby drawing investigators’ attention to the entities involved. Examples of “red flags” in this group include: the same company wins most of the time (here we can analyse the number and fraction of wins, by region, by sector, ...), cartels (association rules to find frequent tenderers, as well as correlations, associations, or causal structures), bid rotation (time series descriptors of winners), few or no new participants (histograms of participation distribution), bidding does not erode target price or artificial bids (bid-to-reserve-price statistics), participant withdrawals (revocation and cancellation counts by bid and participant), etc.

  • Irregularities relating to public contracts: Here we assess, model and highlight several variables and patterns from the bidding phase to the contract execution and payment. Specifically, we analyse the use of (non-)competitive procedures (lack of proper justification), the abuse of non-competitive procedures on the basis of legal exceptions (contract splitting, abuse of extreme urgency, non-supported modifications), suspicious selection criteria (not objectively defined or not established in advance), procurement information not disclosed or made public (informal agreement on contract, absence of public notice for the invitation to bid, and evaluation and award criteria not announced), bids higher than projected overall costs, etc.

  • Conflicts of interest: Here we evaluate several indicators based both on procurement data (bidders, evaluation process, contract amendments, contract fulfilments, project audits, etc.) and on external databases that can provide us with information about links and connections between all the parties involved. Once the data has been processed and structured, we use graph data analysis techniques, such as community detection, pattern recognition and centrality measures, to discover and analyse potential relationships between beneficiaries, project partners, contractors/consortium members, sub-contractors, etc.

  • Abuses and other complex manipulations in performing the contracts: We also perform further analysis and pattern discovery focusing on the quality, price and timing of the contracts. In particular, we implement indicators and tests that look for substantial changes in contract conditions after award (e.g., time or price allowance for the bidder, product substitution or sub-standard work or service not meeting contract specifications), theft of new assets before delivery to the end user or before being recorded, deficient supervision or collusion between supervising officials, preferred supplier indications, subcontractors and partners chosen in a non-transparent way or not kept accountable, late payments, etc.
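
Two of the bid-rigging red flags listed above, the fraction of wins per company and the bid-to-reserve-price statistics, can be sketched in a few lines of Python. The tender records, field names and values below are hypothetical and serve only to show how such descriptors might be computed; they do not come from SALER.

```python
from collections import Counter
from statistics import mean

# Hypothetical tender records: reserve price, bids per company and winner.
TENDERS = [
    {"id": "T1", "reserve_price": 100.0, "bids": {"C1": 98.0, "C2": 99.0}, "winner": "C1"},
    {"id": "T2", "reserve_price": 200.0, "bids": {"C1": 190.0, "C3": 198.0}, "winner": "C1"},
    {"id": "T3", "reserve_price": 50.0, "bids": {"C2": 40.0, "C3": 45.0}, "winner": "C2"},
    {"id": "T4", "reserve_price": 80.0, "bids": {"C1": 79.0, "C2": 80.0, "C3": 80.0}, "winner": "C1"},
]


def win_shares(tenders):
    """Fraction of the tenders each company took part in that it won."""
    participations = Counter()
    wins = Counter()
    for t in tenders:
        participations.update(t["bids"].keys())
        wins[t["winner"]] += 1
    return {company: wins[company] / n for company, n in participations.items()}


def mean_bid_to_reserve(tenders):
    """Average ratio between the winning bid and the reserve price.

    Values close to 1 suggest that bidding barely erodes the target price."""
    return mean(t["bids"][t["winner"]] / t["reserve_price"] for t in tenders)


shares = win_shares(TENDERS)
ratio = mean_bid_to_reserve(TENDERS)
print(shares, ratio)
```

In practice these figures would be broken down by region, sector or period, as the list above indicates, before being turned into a risk score.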

Fig. 4. Example of association rules discovered for tenderers (IDs) in public procurements. Investigators can use this type of analysis to discover frequent if/then patterns and use the support and confidence criteria to identify the most important relationships.

A straightforward approach to implementing some of the previous aggregated analyses is via summary tables and charts, in reports and dashboards. Specific visualisations can be generated to present these statistics split by various dimensions (e.g. bar charts) or to show their evolution (e.g. line charts, timelines). The geographical dimension is best presented on maps where detailed data can be shown as pointers with tooltips. More sophisticated analyses can be provided by statistical and data mining tools, which automatically interrelate multiple views on data, often based on contingency tables. As an example, Fig. 4 shows a fragment of an analysis of procurement data for finding frequent tenderers and identifying the most important relationships.
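
The support and confidence criteria mentioned for Fig. 4 can be illustrated with a small, self-contained Python sketch. Each tender is represented as the set of tenderer IDs that took part in it; the data, the thresholds and the restriction to rules between single tenderers are assumptions made to keep the example short, not a description of the mining procedure used in SALER.

```python
from itertools import permutations

# Hypothetical tenders, each given as the set of tenderer IDs taking part.
TENDERS = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
    {"A", "B", "D"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Rules X -> Y between single tenderers as (X, Y, support, confidence).

    support(X -> Y)    = share of tenders containing both X and Y
    confidence(X -> Y) = support(X and Y) / support(X)
    """
    items = sorted(set().union(*transactions))
    rules = []
    for x, y in permutations(items, 2):
        s_xy = support({x, y}, transactions)
        s_x = support({x}, transactions)
        if s_xy >= min_support and s_xy / s_x >= min_confidence:
            rules.append((x, y, s_xy, s_xy / s_x))
    return rules


rules = pair_rules(TENDERS)
for x, y, s, c in rules:
    print(f"{x} -> {y}: support={s:.2f}, confidence={c:.2f}")
```

With this toy data, the rule "C -> A" has confidence 1.0 because every tender in which C takes part also includes A, which is the kind of if/then pattern an investigator would inspect further.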

Once implemented, all the questions, data analyses, risk scores and indicators are validated by official government bodies and, in particular, by the investigators of the General Inspection of Services, who, based on their experience, establish which further analyses and modifications are to be considered.

3 Implementation

The results from the different questions and analyses performed are delivered in an easy-to-use visualisation and data analysis tool called SALER Analytics (Footnote 15). Figure 5 shows the launch screen. The tool is being developed in R [8], using caret (Footnote 16) for creating machine learning workflows when necessary, ggplot2 (Footnote 17) for (interactive) visualisations and, finally, Shiny (Footnote 18) to provide a web application framework. In a nutshell, SALER Analytics fuses data from multiple data systems to create a unified, intuitive view with the context required by analysts and investigators to make important case decisions. The tool presents a number of different analytics and assessment reports (including multidimensional data, interactive plots and narratives) focusing on specific models or patterns as described in the previous list.

Fig. 5. Launch screen of SALER Analytics (cropped), where a number of global statistics and visualisations are shown, and the user may select different data sources and filters and navigate the different questions and data analyses performed through the menu on the left.

Furthermore, the tool provides several risk metrics in a list view that bring to light potential cases of fraud and corruption, ranging from the lowest (coloured in green) to the highest risk (coloured in red). As an example, Fig. 6 shows the risk values provided by the SALER Analytics tool when analysing the splitting of contracts.

Fig. 6. Example of a (cropped) report provided by SALER Analytics for analysing the splitting of contracts in public procurement and the possible relationships between the different contractors (specific case of bid rigging). Contracts are grouped based on the different similarities (CPV hierarchies, object, dates, contracting authorities, etc.), and colour-highlighted depending on the total number of contracts, total bid amount or total contract amount. Contracts can be analysed in detail with charts of data over time, summaries, and relationship graphs. (Color figure online)

SALER Analytics enables investigators from the General Inspection of Services to explore data aggregated by service providers, claimants, tenders, and services, as well as to drill down to transaction details. They can also view and explore charts of data over time, geographic map presentations, and networks of providers based on common claimants. In the example shown in Fig. 6, apart from visualisations showing lists of contract awards during specific periods of time, as well as summaries and contract details, the investigators are provided with graphs and matrices that let them track the relationships between tenderers (bid rigging), showing that a group of companies compete against each other in different groups of similar public contracts (grouped based on the similarity of CPV codes according to the hierarchical tree, textual similarities when comparing the titles/objects of the contracts, publication dates, contracting authority, etc.).
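
The grouping criteria just described can be approximated with two simple similarity measures: the share of leading digits that two CPV codes have in common (a proxy for their distance in the hierarchical tree) and the Jaccard similarity between the words of two contract titles. The sketch below is a simplification under stated assumptions; the functions, thresholds and sample contracts are hypothetical and are not the similarity measures implemented in SALER Analytics.

```python
def cpv_similarity(code_a, code_b, digits=8):
    """Share of leading digits two CPV-like codes have in common (0.0 to 1.0)."""
    common = 0
    for a, b in zip(code_a[:digits], code_b[:digits]):
        if a != b:
            break
        common += 1
    return common / digits


def title_similarity(title_a, title_b):
    """Jaccard similarity between the lower-cased word sets of two titles."""
    words_a = set(title_a.lower().split())
    words_b = set(title_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def similar(contract_a, contract_b, cpv_min=0.5, title_min=0.5):
    """True when two contracts share authority and are close in CPV and title."""
    return (
        contract_a["authority"] == contract_b["authority"]
        and cpv_similarity(contract_a["cpv"], contract_b["cpv"]) >= cpv_min
        and title_similarity(contract_a["title"], contract_b["title"]) >= title_min
    )


# Hypothetical contracts; codes and titles are illustrative values only.
c1 = {"authority": "A1", "cpv": "45233120", "title": "Road repair works north district"}
c2 = {"authority": "A1", "cpv": "45233140", "title": "Road repair works south district"}
c3 = {"authority": "A1", "cpv": "30192000", "title": "Office supplies"}

print(similar(c1, c2), similar(c1, c3))
```

Pairs judged similar in this way could then be linked into groups, so that the tenderers competing within each group can be compared, as in the report of Fig. 6.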

It should also be mentioned that all the results, analytics and risk calculations in SALER are stored electronically in databases. Analysts can assess and check the coherence and correctness of the data to which they have access via dashboards for projects, contracts, contractors and beneficiaries. They can also export selected data or save printable reports.

Finally, SALER Analytics is not solely confined to the analysis of corruption in the city of Valencia. Many government bodies and professionals from other cities in Europe and South America have already approached us, interested in using the system with their own databases. In principle, all information regarding public procurement and expenditure around the world can be displayed in SALER Analytics, although, depending on the respective national laws, the level of detail of the information as well as the calculation of the risk indicators may vary. On the other hand, it should also be noted that the use of SALER Analytics is optional, not mandatory. Managing authorities have to put in place effective and proportionate anti-fraud measures, and this system can represent one complementary element of these measures.

4 Similar Initiatives

The work conducted within the context of the SALER project has not been directly inspired by other existing systems or projects. How SALER has been conceived, designed and developed strongly depends both on the databases available and on the modus operandi of the Spanish public administration. However, its purpose and main objectives are similar to those of other tools and solutions developed by companies and public administrations [1, 12]. In the following, we briefly describe those we consider most important and relevant in the fight against corruption.

zIndex (Footnote 19) [2] is a public procurement benchmarking tool for rating contracting authorities which is being developed in the Czech Republic by researchers from the Charles University of Prague. Thanks to this tool, public institutions can be compared according to how they manage public money. It uses real data to measure each contracting authority’s rate of transparency, efficiency and corruption potential in public procurement. In a nutshell, the zIndex measures the contracting authority’s compliance with best practice recommendations defined by international organisations, the Czech Ministry of Regional Development, and other non-governmental organisations.

Another similar initiative is Arachne (Footnote 20) [9, 10]. Defined as an integrated IT tool for data mining and data enrichment developed by the European Commission, its main objective is to support managing authorities in their administrative controls and management checks in the area of Structural Funds (European Social Fund and European Regional Development Fund). Considered by the European Commission as a good tool amongst anti-fraud measures, this powerful risk-scoring tool generates more than 100 risk indicators sorted into specific risk categories to help managing authorities and intermediate bodies to prevent and detect errors and irregularities among projects, beneficiaries, contracts and contractors. Arachne is already operational and, as of September 2015, was being used or tested by 21 Member States; input and feedback from users on functional improvements can continuously enrich its development for the benefit of all users.

We also find a wide range of local initiatives and platforms for the dissemination of public procurement activity (including the respective contractor profiles and authorities) in Spain (Footnotes 21, 22, 23 and 24). All these solutions give the different entities (companies and public authorities) a set of on-line tools and electronic services that allow them to access all the relevant information they consider necessary. These initiatives are also a major source of information related to public procurement in a digital and well-structured format, which may facilitate further analysis. As with the SALER project, the development of these platforms is usually supported by a state law which provides legal support and security to the development.

Like SALER, all the previous systems are focused on public procurement and expenditure data to a greater or lesser extent. However, due to its preventive nature, SALER tries to cover all the phases of the entire procurement process, providing an accurate picture of the different steps involved in a procedure as well as a general overview. SALER is also based both on existing data and on personal interviews, successful case studies and additional data queries, thus providing a flexible and dynamic tool which follows a knowledge-based development and incremental methodology. Finally, not all tools are of universal application because regulations and public procurement cultures are different. With SALER we are making a great effort in this regard by making use of open contracting data standards (Footnote 25) as well as methodologies and best practices for software development and open data publication.

5 Conclusions

The tools of e-Government for fighting corruption and fraud are praised for their positive impact on cost reduction, accessibility, quality and transparency of the public administration. In this paper, we have introduced the SALER project, an ongoing official project developed by the Universitat Politècnica de València (Spain), which aims at detecting and preventing bad practices and fraud in public administration in the city of Valencia, Spain. The main contribution of the project has been the development of a data science-based solution (SALER Analytics) to help detect fraud and corruption cases by means of the definition of specific questions and data analyses, as well as risk indicators and other anomaly patterns. Several internal and external data sources have been analysed and assessed to explore different potential cases of fraud, corruption and other irregularities in budget and cash management, public service accounts, salaries, disbursement, grants, subsidies, etc.

What is the added value for the managing authorities of using SALER Analytics? The system systematically assists managing authorities and intermediate bodies in increasing the effectiveness and efficiency of their management verifications, and in putting in place effective and proportionate anti-fraud measures that take into account the risks identified. Furthermore, since prevention and detection are better than any correction of an irregularity, managing authorities can use SALER Analytics in each step of the public procurement cycle, notably even before project approval, grant agreement or contract signature, thus enabling them to perform investigations with greater effectiveness and efficiency than authorities not using the system. Finally, unlike other similar approaches, SALER Analytics is able to detect fraud patterns such as collusive behaviours, conflicts of interest, accumulation of public contracts, state aid, subsidies or grants, etc., while providing a wide range of functionalities for the assessment of risks related to projects, contracts, contractors and beneficiaries.

The next steps in our research shall include further development and consolidation of the project, as well as the inclusion of more detailed and comprehensible analyses, functionalities and risk scores, following not only the requirements but also the feedback from managing authorities. We believe that SALER will not only help to improve the situation in public procurement and expenditure, but also highlight the importance of detection and prevention practices for open and accountable public spending.