An intelligent system for focused crawling from Big Data sources
Introduction
Within the current data-driven era and the fourth industrial revolution, most companies and public administrations produce huge volumes of data as a result of their ordinary activities. Data are generated by many different devices around us: mobile devices, remote sensors, software logs, cameras, and so on, resulting in an exponentially increasing volume of data (Hilbert & López, 2011). Thus, companies and public administrations aim to develop the capability to analyse such data in order to infer novel knowledge and use it to improve their efficiency, productivity, and competitiveness.
According to the McKinsey report (Manyika, 2011), the proper management and exploitation of Big Data can pave the way for significant growth of the world economy and of citizens’ satisfaction with their public administrations. For instance, the European community has estimated that the use of knowledge discovery from Big Data could potentially reduce the expenditure of European administrative activities, increasing the generated value from 223 to 446 billion, or even more. To this end, the international Open Government Partnership1 and the Open Data Charter2 have been issued, and many countries have joined them. Among these, Italy has been one of the pioneer countries and has consequently issued a strategy for open data in public administrations to meet the demand of civil society to improve the quality and availability of information, to strengthen transparency, and to encourage the reuse of released data.3 More specifically, the Digital Administration Code4 demands that data related to the public administrations be freely available according to the terms of a license or a regulatory provision, and be accessible through information and communication technologies (Carloni, 2005). In addition, to reduce the expenditure of administrative activities within public sectors, full access is provided to the functions of the public administrations and their related data/documents through the Web. A concrete example of this trend is represented by the e-procurement area, which relies on calls for tender, whose artefacts are easily accessible on the Web through the web sites of the public administrations or the web site of the Italian government gazette, so that proposals for a given call can be submitted electronically by means of certified emails.
As in other domains, the volume of public procurement data is extremely large, since procurement is typically carried out at all levels of the public administration. In the case of Italian procurement, there is a pool of over 30,000 contracting authorities, including national ministries, national agencies, and publicly-owned companies. Public procurement has been progressively digitalised, so that more and more calls for tender are published daily on the web, covering several different scopes, ranging from the construction of facilities to their maintenance or revamping. However, searching within big data is an extremely complex task, especially for operators without computer science skills, who are hence unable to properly exploit complex and sophisticated search languages. Another challenging aspect concerns the visualisation of the inferred knowledge, so as to enhance the capability of a human operator to grasp valuable insights. Therefore, the ability to carry out timely analysis of the data and to display results are crucial points to which researchers are devoting considerable effort. Almost every big company and public administration is increasingly investing money in data-mining and data-visualisation projects, whose findings are key enablers for applications that can improve the quality of life in today’s society. The application of these concepts is particularly challenging within the context of call for tender management in public procurement. Moreover, even when calls are published through digital documents, not all of them are fully structured, and even the structured ones do not always abide by the same format or schema, since many different characterisations can be used by public administrations. This represents a considerable burden for private organisations willing to re-use their queries across multiple sources sharing calls for tender.
As an example, each Italian public administration provides a customised search system and a different data format, even though each call shares a similar structure imposed by the National Anti-Corruption Authority (ANAC).
In general, after identifying a set of calls for tender of interest, a company needs to quickly come up with its own bid meeting the requirements expressed in a call. To this end, it would be highly desirable to exploit past experiences and lessons learned from participation in past similar procurement opportunities. Although the experience of employees concerning past projects is of pivotal importance, relying on human experts is not always convenient, since they might make decisions based on intuition, without systematically analysing the characteristics of past projects and the company’s current workload. On the contrary, the participation in past calls for tender provides huge volumes of artefacts that can be analysed by means of Big Data technologies, in order to select those calls more similar to the current one, which could provide precious hints to prepare a successful bid.
Starting from an extensive experience in the e-procurement domain, our research has focused on two problems that are common to other application domains beyond e-procurement, such as software development, cloud service provision, office supplies, and so on: (i) crawling artefacts from the web whose informative content matches specific topics of interest, trying to overcome possible linguistic ambiguities of contents written in natural language; (ii) matching the characteristics of crawled artefacts against data and knowledge stored within local sources. In order to solve them, we have investigated several big data, machine learning, and natural language processing techniques, some of which have been extended and adapted. For instance, we have derived an extension of the K-means clustering algorithm aiming to prevent its convergence to local minima. In fact, in the experimental section we show that the proposed extension does not get trapped in a local optimum, as also shown in Bifulco and Cirillo (2018). These techniques have been implemented in the intelligent system Crawling Artefacts of Interest and Matching them Against eNterprise Sources (CAIMANS).
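The general idea of guarding K-means against local minima by drawing multiple random starting points can be sketched as follows. This is a minimal illustration of the multi-restart strategy, not the authors' actual implementation: `kmeans_multi_restart` and all of its parameters are hypothetical names, and the code simply runs the standard Lloyd iteration from several random initialisations and keeps the solution with the lowest within-cluster sum of squares (inertia).

```python
import numpy as np

def kmeans_multi_restart(X, k, n_restarts=10, max_iter=100, seed=0):
    """K-means with multiple random starting points: keep the run
    that achieves the lowest inertia across all restarts."""
    rng = np.random.default_rng(seed)
    best_inertia, best_labels, best_centroids = np.inf, None, None
    for _ in range(n_restarts):
        # Random starting point: k distinct samples drawn from X.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Assignment step: each point joins its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each centroid to the mean of its cluster;
            # keep the old centroid if its cluster became empty.
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        inertia = ((X - centroids[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_inertia, best_labels, best_centroids = inertia, labels, centroids
    return best_labels, best_centroids, best_inertia
```

Since each restart is independent, a poor initialisation can at worst waste one run; the retained solution is the best of all attempts, which is why the strategy mitigates convergence to a local minimum.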
CAIMANS is composed of several modules. The first one is a novel web crawler capable of extracting and pre-processing unstructured data from the web. The output of this component is formatted in a suitable way to enable further analysis against a set of query terms, so as to find the artefacts that are most pertinent to the enterprise goals and capabilities. The second module relies on enterprise data and knowledge sources. A third module aims to find semantic matches between the crawled artefacts and the knowledge stored within the enterprise sources. Finally, the last module is responsible for visualising the crawled artefacts. These modules can also work in isolation, thanks to a set of RESTful web services and a proper Graphical User Interface. The approach underlying CAIMANS has also been conceived to work off-line, by running the analysis in batch mode and visualising the results at the most convenient time for the user. To this end, we have also conceived a dynamic page detection approach, in order to remove the contents of dynamic pages from the answer set, since their contents might no longer be pertinent by the time the crawled artefacts are analysed.
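One simple way to realise dynamic page detection in an off-line setting is to fingerprint a page's content at two different crawl times and flag the page as dynamic when the fingerprints differ. The following is an illustrative sketch under that assumption; `content_fingerprint` and `is_dynamic` are hypothetical helpers, not part of CAIMANS.

```python
import hashlib

def content_fingerprint(html: str) -> str:
    """Hash the page content after collapsing whitespace, so that
    trivial reformatting does not change the fingerprint."""
    normalised = " ".join(html.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def is_dynamic(snapshot_a: str, snapshot_b: str) -> bool:
    """A page whose content differs between two crawls taken at
    different times is flagged as dynamic, so it can be dropped from
    the answer set before the batch analysis runs."""
    return content_fingerprint(snapshot_a) != content_fingerprint(snapshot_b)
```

For example, a page embedding a live counter or rotating advertisements would yield different fingerprints across the two snapshots and be excluded, while a static call-for-tender document would be retained.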
In the context of the e-procurement domain, CAIMANS has been experimentally used to support two phases of the call for tender management process: crawling new calls from the web and searching the enterprise data and knowledge sources to extract useful information from past calls and corresponding bids, in order to support the preparation of a suitable proposal for the call for tender being examined. The aim has been to support a human operator in the overall process of formulating a competitive response to some calls of interest in a relatively short time and in a more effective way, by leveraging positive and negative past experiences stored within the company’s data and knowledge sources.
The key contributions of our proposal are the following ones:
- An extension of the K-means clustering algorithm, relying on multiple random starting points (Bifulco & Cirillo, 2018), which tackles the well-known drawback of K-means of converging to a local minimum; experiments revealed this to be particularly critical in the surveyed domains. Based on this enhanced version of K-means, we have built an advanced crawler capable of effectively segmenting web search results into clusters, increasing the capability of end-users to analyse them.
- A semantic module relying on natural language processing techniques and the cosine similarity to analyse the contents of the crawled web pages and extract information that is pertinent to a domain of interest for the company. This module also includes a customisable component exploiting search patterns to select the most relevant data, aiming to show them through a structured representation.
- An extended validation, accomplished in cooperation with industrial stakeholders from the e-procurement domain on real-size use cases, and a comparative analysis with respect to existing systems across several application domains, proving the effectiveness and usability of CAIMANS.
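The cosine similarity underlying the semantic module can be illustrated with a minimal term-frequency sketch. This is an assumption-laden simplification (the paper's module also applies NLP pre-processing and search patterns): `tokenize` and `cosine_similarity` are hypothetical names, and documents are compared as raw term-frequency vectors.

```python
import math
import re
from collections import Counter

def tokenize(text: str):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine_similarity(doc: str, query: str) -> float:
    """Cosine of the angle between the term-frequency vectors of a
    crawled document and a set of enterprise query terms."""
    a, b = Counter(tokenize(doc)), Counter(tokenize(query))
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

A crawled page sharing many terms with the query vector scores close to 1 and is retained; unrelated pages score near 0 and are filtered out, which is how the module ranks crawled artefacts by pertinence to the company's domain of interest.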
The paper is organised as follows. Section 2 contains the description of the relevant state of the art on the addressed topics and challenges. Section 3 presents the case study of call for tender search and the way CAIMANS has been exploited to solve typical problems of this domain. In Section 4, we describe the assessment we have conducted on CAIMANS. Section 5 presents our conclusions and future directions.
Problem statement and analysis of the related works
The main goal of CAIMANS is to crawl contents from the web, supporting the selection of contents of interest, and matching them against the contents of the enterprise data and knowledge sources. Thus, in this section existing semantic search engines and crawlers are mainly surveyed and analysed. Generally speaking, a search engine operates through three main processes: scanning (crawling) the Web, indexing, and ranking (sorting) the results obtained from a crawler, and visualising results. A
The proposed intelligent system
Current Web search engines require users to search for artefacts of interest by mainly entering query strings. Generally, this limits the search and does not guarantee that correct results will be obtained immediately. Human experts must carry out many manual searches in order to obtain useful results. In fact, often within the Search Engine Results Page (SERP) there are many pages outside the search scope. To this end, CAIMANS can reduce the overall search time, by increasing the number of
Experimental results
The prototype of the proposed system has been developed in .NET framework 4.7.1,11 based on the Model-View-Controller (MVC) architectural pattern. Each module has been developed standalone and combined into a single .NET solution connected by project references. A RESTful API service has been integrated into the solution to simplify enquiries to both the crawler and the CMS.
The screenshot in Fig. 6 shows the results produced by CAIMANS for the e-procurement case
Conclusion
In this paper, we have proposed CAIMANS, an intelligent system aiming to effectively support companies in the process of selecting artefacts of interest from the web, verifying how they match company’s backgrounds and expertise stored in its data and knowledge sources. Thus, CAIMANS relies on an advanced crawler to search artefacts of interest from the web. An extended validation has been carried out in cooperation with industrial stakeholders on real size use cases from the e-procurement
CRediT authorship contribution statement
Ida Bifulco: Conceptualization, Methodology, Validation, Writing – original draft. Stefano Cirillo: Conceptualization, Methodology, Software, Validation, Writing – original draft. Christian Esposito: Conceptualization, Methodology, Writing – review & editing. Roberta Guadagni: Software, Validation. Giuseppe Polese: Conceptualization, Methodology, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work has been supported by the research project named “PROBIM”, funded by the Italian Minister of Economic Development, and these activities have been conducted as consultancy for the regional research centre named CeRICT scrl.
References (58)
- et al. (2019). Torank: Identifying the most influential suspicious domains in the tor network. Expert Systems with Applications.
- et al. (2018). Combining different evaluation systems on social media for measuring user satisfaction. Information Processing & Management.
- et al. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems.
- et al. (2007). Architecture of a grid-enabled web search engine. Information Processing & Management.
- et al. (1999). Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks.
- et al. (2015). An improved focused crawler based on semantic similarity vector space model. Applied Soft Computing.
- et al. (2019). An efficient page ranking approach based on vector norms using snorm (p) algorithm. Information Processing & Management.
- et al. (2020). Eras: Improving the quality control in the annotation process for natural language processing tasks. Information Systems.
- et al. (2008). Towards effective document clustering: A constrained K-means based approach. Information Processing & Management.
- et al. (2020). Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling. Expert Systems with Applications.