An intelligent system for focused crawling from Big Data sources

https://doi.org/10.1016/j.eswa.2021.115560

Highlights

  • A crawler to retrieve unstructured resources from the Web by exploiting semantic technologies.

  • K-means clustering enhanced with multiple random starting points to escape from local minima.

  • Cosine similarity and NLP techniques used to select the documents pertinent to certain keywords.

  • A proof-of-concept prototype validated within the context of public procurement.

Abstract

Nowadays, the proper management of data is a key business enabler and booster for companies, allowing them to increase their competitiveness. Typically, companies hold massive amounts of data within their servers, which might include previously offered services, proposals, bids, and so on. They rely on their expert managers to manually analyse such data in order to make strategic decisions. However, given the huge amount of information to be analysed and the necessity of making timely decisions, they often exploit only a small portion of the available data, which often does not yield effective choices. For instance, this happens in the e-procurement domain, where bids for new calls for tender are often formulated by looking at some past proposals of a company. Driven by an extensive experience in the e-procurement domain, in this paper we propose an intelligent system to support organisations in the focused crawling of artefacts (calls for tender, BIMs, equipment, policies, market trends, and so on) of interest from the web, semantically matching them against internal Big Data and knowledge sources, so as to let company analysts make better strategic decisions. The novel contribution consists of a proper extension of the K-means algorithm used by a web crawler within the proposed system, and of a semantic module exploiting search patterns to find relevant data within the crawled artefacts. The proposed solution has been implemented and extensively assessed in the e-procurement domain, and has subsequently been extended to other domains, such as robot programming and cloud provisioning. Since, to the best of our knowledge, no similar systems exist in the literature, in order to prove its effectiveness we have compared its crawling component against similar crawlers, by plugging them into our system.

Introduction

Within the current data-driven era and the fourth industrial revolution, most companies and public administrations produce huge volumes of data as a result of their ordinary activities. Data are generated by many different devices around us: mobile devices, remote sensing, software logs, cameras, and so on, yielding an exponentially increasing volume of data (Hilbert & López, 2011). Thus, companies and public administrations aim to develop the capability to analyse such data in order to infer novel knowledge and use it to improve their efficiency, productivity, and competitiveness.

According to McKinsey’s report in Manyika (2011), the proper management and exploitation of Big Data can pave the way for an important growth of the world economy and of citizens’ satisfaction with their public administrations. For instance, the European community has estimated that the use of knowledge discovery from Big Data could potentially reduce the expenditure of European administrative activities, increasing the generated value from 223 to 446 billion, or even more. To this end, the international Open Government Partnership and the Open Data Charter have been issued, and many countries have joined them. Among these, Italy has been one of the pioneer countries and has consequently issued a strategy for open data in public administrations, to meet the demand of civil society to improve the quality and availability of information, to strengthen transparency, and to encourage the reuse of released data. More specifically, the Digital Administration Code demands that data related to the public administrations be freely available according to the terms of a license or a regulatory provision, and be accessible through information and communication technologies (Carloni, 2005). In addition, to reduce the expenditure of administrative activities within public sectors, full access is provided to the functions of the public administrations and their related data/documents through the Web. A concrete example of such a trend is represented by the e-procurement area, which relies on calls for tender, whose artefacts are easily accessible on the Web through the web sites of the public administrations or the web site of the Italian government gazette, so that proposals for a given call can be submitted electronically by means of certified emails.

As in other domains, the volume of public procurement data is extremely large, since procurement is typically carried out at all levels of the public administration. Considering the case of Italian procurement, we can witness a pool of over 30,000 contracting authorities, including national ministries, national agencies, and publicly-owned companies. Public procurement has been progressively digitalised, so that more and more calls for tender are published on the web daily, concerning several different scopes, spanning from building facilities to their maintenance or revamping. However, searching within big data is an extremely complex task, especially for operators without computer science skills, who are hence unable to properly exploit complex and sophisticated search languages. Another challenging aspect concerns the visualisation of the inferred knowledge, so as to enhance the capability of a human operator to grasp valuable insights. Therefore, the ability to carry out timely analyses of the data and to display their results are crucial points to which researchers are devoting considerable efforts. Almost every big company and public administration is increasingly investing money in data-mining and data-visualisation projects, whose findings are key enablers for applications that can improve the quality of life in today’s society. The application of these concepts is particularly challenging within the context of call for tender management in public procurement. Moreover, even when calls are published through digital documents, not all of them are fully structured, and even the structured ones do not always abide by the same format or schema, since many different characterisations can be used by public administrations. This represents a considerable burden for private organisations willing to re-use their queries across multiple sources sharing calls for tender. As an example, each Italian public administration provides a customised search system and a different data format, even though each call shares a similar structure imposed by the National Anti-Corruption Authority (ANAC).

In general, after identifying a set of calls for tender of interest, a company needs to quickly come up with its own bid meeting the requirements expressed in a call. To this end, it would be highly desirable to exploit past experiences and lessons learned from participation in past similar procurement opportunities. Although the experience of employees concerning past projects is of pivotal importance, relying on human experts is not always convenient, since they might make decisions based on their intuitions, without systematically analysing the characteristics of past projects and the company’s current workload. On the contrary, participation in past calls for tender provides huge volumes of artefacts that can be analysed by means of Big Data technologies, in order to select the calls most similar to the current one, which could provide precious hints for preparing a successful bid.

Starting from an extensive experience in the e-procurement domain, our research has focused on two problems that are common to other application domains beyond e-procurement, such as software development, cloud provisioning, office supplies, and so on: (i) crawling artefacts from the web whose informative content matches specific topics of interest, trying to overcome possible linguistic ambiguities of contents written in natural language; (ii) matching the characteristics of crawled artefacts against data and knowledge stored within local sources. In order to solve them, we have investigated several big data, machine learning, and natural language processing techniques, some of which have been extended and adapted. For instance, we have derived an extension of the K-means clustering algorithm aiming to prevent its convergence to local minima. Indeed, in the experimental section we show that our proposed extension does not fall into a local optimum, as also shown in Bifulco and Cirillo (2018). Such techniques have been implemented in the intelligent system Crawling Artefacts of Interest and Matching them Against eNterprise Sources (CAIMANS).
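To make the restart strategy concrete, the following is a minimal sketch of K-means with multiple random starting points, in which the run with the lowest within-cluster distance is kept. It only illustrates the general idea; the actual extension of Bifulco and Cirillo (2018) may generate and compare restarts differently.

```python
# Minimal sketch of K-means with multiple random restarts; illustrative only,
# the authors' actual extension may differ in how restarts are handled.
import numpy as np

def kmeans_once(X, k, rng, n_iters=100):
    """One K-means run from a random initialisation.
    Returns (inertia, labels, centroids)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids, keeping the old one for empty clusters.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    inertia = ((X - centroids[labels]) ** 2).sum()
    return inertia, labels, centroids

def kmeans_multi_start(X, k, n_starts=10, seed=0):
    """Run K-means from several random starting points and keep the
    lowest-inertia solution, reducing the risk of a poor local minimum."""
    rng = np.random.default_rng(seed)
    return min((kmeans_once(X, k, rng) for _ in range(n_starts)),
               key=lambda run: run[0])

# Example usage:
# X = np.random.default_rng(1).normal(size=(200, 5))
# inertia, labels, centroids = kmeans_multi_start(X, k=4, n_starts=10)
```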

CAIMANS is composed of several modules. The first one is a novel web crawler capable of extracting and pre-processing unstructured data from the web. The output of this component is formatted in a suitable way to enable further analysis against a set of query terms, so as to find the artefacts that are most pertinent to the enterprise goals and capabilities. The second module relies on enterprise data and knowledge sources. A third module aims to find semantic matches between the crawled artefacts and the knowledge stored within the enterprise sources. Finally, the last module is responsible for visualising the crawled artefacts. These modules can also work in isolation, thanks to a set of RESTful web services and a proper Graphical User Interface. The approach underlying CAIMANS has also been conceived to work off-line, by running the analysis in batch mode and visualising the results at the most convenient time for the user. To this end, we have also conceived a dynamic page detection approach, in order to remove the contents of dynamic pages from the answer set, since their contents might no longer be pertinent by the time the crawled artefacts are analysed.
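This excerpt does not detail the dynamic page detection approach; the following is an illustrative sketch of one common heuristic, namely fetching a page twice and comparing content hashes, offered only as an assumption of how such a filter could work.

```python
# Illustrative sketch only: flag a page as dynamic if two fetches of the same
# URL yield different content. The detection approach actually used by
# CAIMANS is not described in this excerpt.
import hashlib
import time
import requests  # third-party HTTP client

def looks_dynamic(url: str, delay_seconds: float = 2.0) -> bool:
    """Return True if two fetches of `url` yield different content,
    suggesting the page is generated dynamically."""
    first = requests.get(url, timeout=10).content
    time.sleep(delay_seconds)  # wait so time-dependent content can change
    second = requests.get(url, timeout=10).content
    return hashlib.sha256(first).digest() != hashlib.sha256(second).digest()

# Example: drop dynamic pages from a crawled answer set.
# answer_set = [u for u in crawled_urls if not looks_dynamic(u)]
```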

In the context of the e-procurement domain, CAIMANS has been experimentally used to support two phases of the call for tender management process: crawling new calls from the web, and searching the enterprise data and knowledge sources to extract useful information from past calls and the corresponding bids, in order to support the preparation of a suitable proposal for the call for tender being examined. The aim has been to support a human operator in the overall process of formulating a competitive response to calls of interest in a relatively short time and in a more effective way, by leveraging positive and negative past experiences stored within the company’s data and knowledge sources.

The key contributions of our proposal are the following ones:

  • An extension of the K-means clustering algorithm, relying on multiple random starting points (Bifulco & Cirillo, 2018), in order to tackle the well-known tendency of K-means to converge to a local minimum, which experiments revealed to be particularly critical in the surveyed domains. Based on this enhanced version of K-means, we have built an advanced crawler capable of effectively segmenting web search results into clusters, increasing the capability of end-users to analyse them.

  • A semantic module relying on natural language processing techniques and cosine similarity to analyse the contents of the crawled web pages and extract information pertinent to a domain of interest for the company. This module also includes a customisable component exploiting search patterns to select the most relevant data, aiming to show them through a structured representation (a minimal sketch of the similarity-based ranking is given after this list).

  • An extensive validation, accomplished in cooperation with industrial stakeholders from the e-procurement domain on real-size use cases, and a comparative analysis with respect to existing systems across several application domains, proving the effectiveness and usability of CAIMANS.
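As a reference for the second contribution, the sketch below shows how crawled documents can be ranked against query keywords via TF-IDF vectors and cosine similarity. It is a minimal illustration using scikit-learn, not the actual CAIMANS implementation (which is written in C#.NET).

```python
# Minimal illustration of ranking crawled documents against query terms with
# TF-IDF and cosine similarity (scikit-learn); not the actual CAIMANS code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_documents(documents: list[str], query: str) -> list[tuple[int, float]]:
    """Return (document index, similarity) pairs sorted by decreasing
    cosine similarity between each document and the query."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(documents)  # one row per document
    query_vector = vectorizer.transform([query])      # same vocabulary
    scores = cosine_similarity(query_vector, doc_matrix).ravel()
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Example: select documents pertinent to a call-for-tender topic.
docs = ["call for tender for road maintenance services",
        "weather forecast for the weekend",
        "public procurement notice for building revamping"]
print(rank_documents(docs, "public procurement call for tender"))
```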

The paper is organised as follows. Section 2 describes the relevant state of the art on the addressed topics and challenges. Section 3 presents the case study of call for tender search and the way CAIMANS has been exploited to solve typical problems of this domain. In Section 4, we describe the assessment we have conducted on CAIMANS. Section 5 presents our conclusions and future directions.

Section snippets

Problem statement and analysis of the related works

The main goal of CAIMANS is to crawl contents from the web, supporting the selection of contents of interest, and to match them against the contents of the enterprise data and knowledge sources. Thus, in this section existing semantic search engines and crawlers are mainly surveyed and analysed. Generally speaking, a search engine operates through three main processes: scanning (crawling) the Web, indexing, and ranking (sorting) the results obtained from a crawler, before visualising the results.
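To make the scanning phase concrete, the following is a generic breadth-first crawl loop, a textbook sketch (cf. the breadth-first search reference below) rather than the CAIMANS crawler; `fetch` and `extract_links` are assumed callbacks supplied by the caller.

```python
# Generic breadth-first crawl loop, shown only as a textbook sketch of the
# "scanning" phase of a search engine; the CAIMANS crawler is more elaborate.
from collections import deque
from urllib.parse import urljoin

def crawl_bfs(seed_urls, fetch, extract_links, max_pages=100):
    """Visit pages level by level starting from `seed_urls`.
    `fetch(url)` returns the page content; `extract_links(url, content)`
    returns the outgoing links found in that content."""
    queue = deque(seed_urls)
    visited = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        content = fetch(url)
        pages[url] = content
        for link in extract_links(url, content):
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in visited:
                visited.add(absolute)
                queue.append(absolute)
    return pages
```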

The proposed intelligent system

Current Web search engines require users to search for artefacts of interest mainly by entering query strings. Generally, this limits the search and does not guarantee that correct results will be obtained immediately. Human experts must carry out many manual searches in order to obtain useful results. In fact, the Search Engine Results Page (SERP) often contains many pages outside the search scope. To this end, CAIMANS can reduce the overall search time, by increasing the number of

Experimental results

The prototype of the proposed system has been developed with the C#.NET framework 4.7.1, based on the Model-View-Controller (MVC) architectural pattern. Each module has been developed as a standalone component, and all of them have been combined into a single .NET solution connected by project references. A RESTful API service has been integrated into the solution to simplify enquiries to both the crawler and the CMS.
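The excerpt does not specify the API surface; the following shows a hypothetical enquiry to such a RESTful service, with host, endpoint, and parameter names that are illustrative assumptions, not the actual CAIMANS API.

```python
# Hypothetical example of querying a RESTful crawler service; the host,
# endpoint, and parameter names below are illustrative assumptions.
import requests

BASE_URL = "http://localhost:8080/api"  # assumed host of the .NET solution

response = requests.get(
    f"{BASE_URL}/crawler/results",  # assumed endpoint name
    params={"keywords": "call for tender", "maxResults": 20},
    timeout=10,
)
response.raise_for_status()
for artefact in response.json():  # assumed JSON list of crawled artefacts
    print(artefact)
```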

The screenshot in Fig. 6 shows the results produced by CAIMANS for the e-procurement case

Conclusion

In this paper, we have proposed CAIMANS, an intelligent system aiming to effectively support companies in the process of selecting artefacts of interest from the web, verifying how they match the company’s background and expertise stored in its data and knowledge sources. Thus, CAIMANS relies on an advanced crawler to search for artefacts of interest on the web. An extended validation has been carried out in cooperation with industrial stakeholders on real-size use cases from the e-procurement

CRediT authorship contribution statement

Ida Bifulco: Conceptualization, Methodology, Validation, Writing – original draft. Stefano Cirillo: Conceptualization, Methodology, Software, Validation, Writing – original draft. Christian Esposito: Conceptualization, Methodology, Writing – review & editing. Roberta Guadagni: Software, Validation. Giuseppe Polese: Conceptualization, Methodology, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work has been supported by the research project named “PROBIM”, funded by the Italian Ministry of Economic Development; the related activities have been conducted as consultancy for the regional research centre CeRICT scrl.

References (58)

  • Lakshmi, R., et al. (2019). Novel term weighting schemes for document representation based on ranking of terms and fuzzy logic with semantic relationship of terms. Expert Systems with Applications.

  • Langari, R. K., et al. (2020). Combined fuzzy clustering and firefly algorithm for privacy preserving in social networks. Expert Systems with Applications.

  • Lempel, R., et al. (2000). The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks.

  • Liu, C.-L., et al. (2013). Clustering tagged documents with labeled and unlabeled documents. Information Processing & Management.

  • Lozano, S., et al. (2019). Efficiency ranking using dominance network and multiobjective optimization indexes. Expert Systems with Applications.

  • Son, J., et al. (2017). Content-based filtering for recommendation systems using multiattribute networks. Expert Systems with Applications.

  • Acharya, S., et al. (2010). The process of information extraction through natural language processing. International Journal of Logic and Computation (IJLP).

  • Arumawadu, H. I., et al. (2015). K-means clustering for segment web search results.

  • Bidoki, A. M. Z., et al. (2008). DistanceRank: An intelligent ranking algorithm for web pages. Information Processing & Management.

  • Bifulco, I., & Cirillo, S. (2018). Discovery multiple data structures in big data through global optimization and...

  • Borg, I., et al. (2003). Modern multidimensional scaling: Theory and applications. Journal of Educational Measurement.

  • Bundy, A., et al. Breadth-first search.

  • Carloni, E. (2005). Codice dell’amministrazione digitale commento al d. lgs (in Italian).

  • Carpineto, C., et al. (2012). A survey of automatic query expansion in information retrieval. ACM Computing Surveys.

  • Caruccio, L., et al. Learning effective query management strategies from big data.

  • Cavaness, C. (2006). Quartz job scheduling framework: Building open source enterprise applications.

  • Choy, D., et al. (2010). Content management interoperability services (CMIS), version 1.0.

  • De Souza, C. R. (2012). A tutorial on principal component analysis with the Accord.NET framework.

  • Dhingra, V., et al. SemCrawl: Framework for crawling ontology annotated web documents for intelligent information retrieval.
