1 Introduction

Motivated by the high potential benefits of a data-rich society, in February 2015 a public-private partnership launched the Big Data Center of Excellence in Barcelona to assist and foster a data culture in all types of organisations operating in Catalonia. One of its main activities since then has been connecting the dots among the relevant stakeholders of the data economy in the region while promoting meaningful initiatives that demonstrate the advantages of bringing together significant datasets, still a highly challenging task nowadays.

Led by prominent actors constituted as an advisory board, debates took place to identify meaningful societal questions that could nowadays be addressed by data-driven projects. Vocational Education and Training (VET) was selected both for its relevance to current societal challenges (e.g., it is key to fighting youth unemployment [1]) and for the existing datasets from which valuable insights could be derived. After six months spent interviewing all relevant stakeholders, selecting the right questions to answer (mapped to five concrete research topics), and signing the collaboration agreements required to access the data, we could kick off the BDFP project in mid-2018, where BDFP stands for Big Data en Formació Professional, which translates to Big Data in VET.

The present work proposes a data-driven approach that combines different data analysis techniques, including data mining, rule-based systems and machine learning, to analyse the evolution of the labour market (extracted from the free-text job vacancies posted on the leading job portal) and the VET offering (extracted from the official training curricula across the whole territory) in Catalonia during the last few years. Special attention is devoted to two strategic market sectors for the region, namely ICT and Industry 4.0, and to the transformation of the particular skill set demanded by employers.

The paper is organised as follows. Section 2 describes similar initiatives, and Sect. 3 contextualises the work and introduces the research topics posed. Section 4 describes the data sources included and the exploratory tools developed. In Sect. 5, we describe the main challenges of the proposed approach and how we overcame them by applying ML techniques. Section 6 describes some preliminary results obtained using the tools developed, and Sect. 7 presents conclusions and future work.

2 Similar Initiatives

Analysing the relationship between the labour market and VET is a broad field of study due to its relevance for society. In the scientific community, we can find [2], which compiles various works analysing VET skills from different perspectives, such as the role of policymakers, territorial deployment, or the impact of changes in demographic and employment patterns.

Focusing on the link between labour demand and VET supply, we can find similar works focused on different regions. In [3], the authors analyse the situation of VET in India and the position of VET graduates in the labour market. In [5], the relationship between VET and jobs in Germany is studied in depth from different points of view, such as gender or type of training. In [4], a similar study was carried out in Australia, but performing the analysis at the level of the skills acquired in training and how these skills translate to the labour market.

Most of the previous works base their studies on surveys and official sources of information. In this sense, the present work is more aligned with initiatives like the e-skills match project, which analyses the labour market from the demands posted on job portals. In this study, we follow the same approach but extend it to other types of studies and skills, such as those related to Industry 4.0.

3 Context, Methodology and Goals

The BDFP project has been promoted by the Big Data Center of Excellence in Barcelona, an initiative led by Eurecat (the primary Technology Center of Catalonia) with the support of the Catalan government, the City Council of Barcelona and Oracle. Thanks to the commitment and contacts of its advisory board, a consortium was put together with the required datasets as well as the technological and domain knowledge. Namely, the following organizations and teams have contributed significantly to this project: three departments of the Catalan government (Digital Policy and Public Administration; Education; and Business and Knowledge), the Catalan Occupation Service, the local development agency of the Barcelona City Council (Barcelona Activa), the GIPE research group of the Universitat Autònoma de Barcelona, the consulting company Everis, the job portal Infojobs, the Bertelsmann Foundation (aimed at improving youth employment), the Fundació Barcelona Formació Professional (aimed at improving VET), and the Big Data and Data Science department of Eurecat.

This project was launched with the aim of achieving two very different goals. First, it provides valuable insights about the evolution of the skill sets offered in the VET contents across the Catalan territory, as well as those demanded by the job market during the same period, 2015–2018. This information is of great value for the Catalan Education Department, which is in charge of designing the curricula of the VET courses and tries to satisfy the near-future requirements of the local job market. Second, the project also aims to showcase the need for, and the benefits of, joining efforts across institutions with different knowledge and assets (datasets, data science experts, technology providers, domain experts, decision-makers and facilitators) to solve societal challenges. On the one hand, all these roles are required to launch and successfully execute such a project, and they can rarely be found in a single organization. On the other hand, the potential benefits of data sharing and novel analytic approaches can only be efficiently assimilated by decision-makers when knowledge-sharing processes are established among the different actors working on common challenges; only then will they be inserted into the near-future agenda of the relevant stakeholders.

As a matter of fact, the first six months of the project were dedicated to holding one-to-one meetings with every organization to understand their perspectives and goals around VET in Catalonia, the existing challenges, the main questions they would find most relevant to answer (taking into consideration their feasibility), and how they could help. The five agreed research topics are the following:

  • T1: Geographic characterization of the labour demand, evolution of the labour demand based on the contracts registered in Catalonia during the last 5 years.

  • T2: Analysis of the relationship between Dual VET and sectorial labour demand, a particular case study of the type of VET training that includes an internship in a company.

  • T3: Temporal evolution of the VET skills, evolution of the demanded skills focusing on ICT and Industry 4.0.

  • T4: Comparison of labour supply and demand in ICT and Industry 4.0, checking the degree of educational coverage of the VET offering against the real demand observed in job posts.

  • T5: Identification of overqualification, i.e., cases where the qualification required in a job post exceeds the responsibilities described.

During the following eight months, a core team mainly formed by data analysts and data scientists was established; they explored the data and created the models explained in the next sections. On-demand meetings were held with domain experts from the organizations when needed to set the basis, agree on standard definitions, and interpret the results at every step. Furthermore, after each significant step, meetings were arranged with high-level representatives to report and validate the partial results and define the next steps. Finally, a great effort was made together with volunteers from most organizations, first to define the rules used to categorize the skills from the free text, and afterwards to label 100 skills across 1,117,729 job posts to train the models.

4 Exploratory Analysis

4.1 Data Collection and Processing

Concerning the datasets used in the project, we split them into two categories: the labour demand and the VET offer.

First, we start by studying the VET offer in Catalonia. In Spain, there are three types of VET: FP Inicial, which corresponds to initial VET studies combining theoretical subjects with practical training; FP Dual, a type of VET that includes in-company training; and FP Ocupacional, which is meant for professionals who are unemployed or who want to improve their careers. For each type of VET, we obtained the number of students enrolled and graduates by family, studies and year (for the period 2013–2017, except for Dual VET, where only 2016–2017 was available). We obtained this data from the Education Department of the local government and the local Employment Service, SOC.

We also analyse the labour demand from two sources. On the one hand, the original contracts registered during 2013–2017 in the public Employment Service, including the contract duration, occupation code, company activity code and geographical location, plus some demographic information such as gender or age. On the other hand, for the same period, we collected the jobs posted on two job portals: Infojobs, one of the leading job portals in Spain, and Feina Activa, the job portal hosted by the local Employment Service.

Beyond gathering all the data and applying standard processing to the different datasets (handling outliers and missing values, or normalizing some attributes), there were some individual cases where additional processing was required.

For the VET data sources, the first issue to solve was the identification of the studies to be included. The posterior analysis works at two levels of granularity: on the one hand, the characterization of the demand, including all the VET studies; on the other, the analysis of the matching between demand and supply, focused on two families of studies: ICT and Industry 4.0. To characterize those studies and be able to link them to the labour demand, a set of training skill definitions was selected, following the official definitions provided by the National Ministry of Education for each of the studies included.

In the case of the contracts, we had to deal with different types of contracts depending on their duration, ranging from hours to years, so merely counting the number of contracts would have led to wrong conclusions. To mitigate this issue, we complemented the contracts with an index of labour turnover based on the annual reports of the Labour and Productive Model Observatory and a study carried out by the Observatory of Industry, allowing us to approximate the effective hiring from the total contracting. Besides, we selected the training skills as the link between both domains to match the contracts with the VET supply. Together with SOC, the local public Employment Service, we developed a dictionary mapping National Occupational Codes (CNO) to training skills.
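As an illustration, the following sketch shows how effective hiring can be approximated from raw contract counts and how contracts can be linked to training skills through the CNO dictionary. The column names, turnover indices and skill mappings are hypothetical placeholders, not the values used in the project.

```python
import pandas as pd

# Hypothetical turnover indices per occupation code (CNO): contracts per effective hire.
turnover_index = {"3820": 1.4, "7403": 2.1}                     # illustrative values only
cno_to_skills = {"3820": ["Set up and manage a database"],
                 "7403": ["Operate machining tools"]}           # illustrative mapping

contracts = pd.DataFrame({
    "cno": ["3820", "3820", "7403"],
    "n_contracts": [120, 80, 200],
})

# Approximate effective hiring by deflating raw contract counts with the turnover index.
contracts["effective_hires"] = contracts.apply(
    lambda r: r["n_contracts"] / turnover_index.get(r["cno"], 1.0), axis=1)

# Link each contract record to the training skills associated with its occupation code.
contracts["training_skills"] = contracts["cno"].map(cno_to_skills)
print(contracts)
```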

Moreover, an additional processing step was needed to merge the information of the job adverts coming from the private portal, Infojobs, and the public one, Feina Activa. To have a single source of information for the demand, we defined a data model for job posts that included a common set of attributes present in both portals. To merge some of the attributes, we developed individual dictionaries of equivalences for fields such as the duration of the contract, the educational level or the type of working day.
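A minimal sketch of such a harmonisation step is shown below; the field names and the equivalence values are invented for illustration and do not reflect the portals' actual schemas.

```python
# Hypothetical equivalence dictionaries: portal-specific values -> common data model.
WORKDAY_EQUIV = {
    "infojobs": {"Jornada completa": "full_time", "Jornada parcial": "part_time"},
    "feina_activa": {"Completa": "full_time", "Parcial": "part_time"},
}

def to_common_model(raw_post: dict, portal: str) -> dict:
    """Project a raw job post from either portal onto the shared attribute set."""
    return {
        "portal": portal,
        "title": raw_post.get("title") or raw_post.get("titol"),
        "description": raw_post.get("description") or raw_post.get("descripcio"),
        "working_day": WORKDAY_EQUIV[portal].get(raw_post.get("working_day"), "unknown"),
    }

post = {"title": "Analyst in SQLServer and SSIS",
        "description": "Administración de bases de datos",
        "working_day": "Jornada completa"}
print(to_common_model(post, "infojobs"))
```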

4.2 Exploration

The different sources of information included in the project, both for the definition of the labour demand and for the supply of VET, are stored as different Elasticsearch indexes in the system. We chose Elasticsearch because it is an indexing system that offers a powerful search API and comes with Kibana, an excellent tool for building useful and intuitive dashboards, both of which are open source.
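As a rough sketch of how a processed job post ends up in such an index (assuming the official elasticsearch Python client with its v8-style API, a local cluster, and an illustrative index name and document layout):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # local cluster assumed

job_post = {
    "title": ".NET junior developer",
    "portal": "infojobs",
    "posted_at": "2017-03-14",
    "family": "ICT",                           # added later by the classification models
    "skills": ["Implement, verify and document web apps"],
}

# Index the document; Kibana dashboards are then built on top of the "job_posts" index.
es.index(index="job_posts", document=job_post)
```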

We developed an exploratory dashboard for each of the sources (contracts, the different types of VET, and job posts) in order to deliver an exploratory tool that allows the analysis of the data sources and a first answer to the research questions posed, especially for non-technical users. An example of such a dashboard is shown in Fig. 1.

Fig. 1. Example of a dashboard created using Kibana, including some indicators related to the job posts.

In the case of the job posts, the results obtained by the different classification models complemented the information included in the original data sources, i.e., we were able to add the family they belong to, the skills and languages required, and whether there was overqualification.

For some of the research topics, it was necessary to join several data sources, for example, to compare demand and supply. We found Kibana quite limiting in this respect, as it did not allow joining different indexes to produce combined visualizations. Thus, for some of the analyses, we had to develop custom visualizations to show some of the indicators required in the study.

5 Scientific Challenges

This section explains the two main challenges encountered during the execution of the project and how they were addressed thanks to knowledge discovery, data mining and machine learning techniques.

5.1 Challenge 1: Mapping Job Posts with Training Skills

Mapping skills like “Implement, verify and document web apps” or “Manage relational databases” to jobs like “.NET junior developer” or “Analyst in SQLServer and SSIS” is not straightforward. Although some of the fields included in the job advert can help establish this link, they can sometimes be misleading (for example, the sector of the company may differ from the position to cover) or may be left empty if they are marked as optional.

The significant volume of job posts included in the study, more than one million, prevented the team from manually analysing them, making it necessary to adopt smarter approaches. Leveraging the presence of domain experts in the team, we used a combination of rule-based models and ML models trained on annotated data.

The Large Scale Labelling Problem. A semi-automatic tool for large-scale labelling was required to automate the process of assigning job offerings to a family of studies and tagging the skills demanded. This semi-automatic labelling mechanism combines a first, fully automatic identification of instances based on a set of rules with a labelling tool that allowed human experts to perform fine-grained annotation of a subset of the job adverts.

Automatic Labelling: The first step taken in the project to tackle the different classification tasks was the development of a set of rules that allow classifying job descriptions according to whether they require a given training skill or not. To develop these rules, we asked the domain experts to define a set of keywords and expressions that are usually associated with each training skill. For example, the skill “Set up and manage a database” relates to concepts like DBA, SQL, MySQL, Oracle, and similar words.

Once this set of rules was defined, we implemented them using the spaCy library. This library delivers models for different NLP tasks such as POS tagging, NER, lemmatization, tokenization, and rule-based matching, among others. In this case, we made use of the rule-based matching tool and the POS, lemmatization and tokenization models available for the Spanish language.
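As a minimal sketch of how one of these rules could look (the pattern below is illustrative rather than one of the actual project rules, and assumes spaCy v3 with a Spanish pipeline such as es_core_news_sm installed):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("es_core_news_sm")           # Spanish pipeline (assumed installed)
matcher = Matcher(nlp.vocab)

# Keywords the domain experts associate with "Set up and manage a database".
db_terms = ["dba", "sql", "mysql", "oracle", "postgresql"]
matcher.add("SKILL_MANAGE_DATABASE", [[{"LOWER": {"IN": db_terms}}]])

doc = nlp("Buscamos analista con experiencia en SQL y administración Oracle.")
for match_id, start, end in matcher(doc):
    # Each match suggests the advert requires the associated training skill.
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
```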

This rule-based method provides a first selection of skills and, by grouping these skills by the family of studies they belong to, it also provides a first tagging of the family of each job post. The same approach was used in the project to infer other characteristics of the job posts, such as the requirement of a foreign language and other soft skills, by developing ad-hoc rules for that purpose. The main drawback of this approach is that it depends on how exhaustive the definition of the rules is, making it impossible to capture all the terms and expressions that might be related to each concept.

Labelling Tool: To overcome the limitations of the rule-based model, we complemented it with an ML classification system. In order to obtain labelled data, we developed a labelling tool in the context of the project, aimed at enabling the domain experts to evaluate job descriptions, identify the family of studies and the training skills required, and determine whether there was overqualification or not.

The annotation process ran for two months with the invaluable help of more than 30 experts, who annotated more than 3,000 job posts. Despite this significant effort, the annotation of some skills was not sufficient: some skills had fewer than ten labels, preventing the corresponding classifier from training properly. To mitigate the lack of labels, we added positive examples drawn from the most representative examples classified by the rule-based system for each skill. Besides complementing the positive examples with those obtained from the rule-based system, we also limited the study to those skills with more than 100 labels.

Fig. 2. Models used in the project: (a) deep learning model used for Challenge 1; (b) uncertainty Bayesian wrapper for deep black-box models used in Challenge 2.

Job Advertisement Topic Prediction and Skills Automatic Tagging. After the annotation process, we first trained a classifier for topic prediction with three possible labels: ICT, Industry 4.0, and Others. Once a job advert is classified into a family, the original intention was to apply a multi-label classifier for the skills. However, the fact that the labelling tool stored a separate label for each skill, with no way to link them afterwards, forced us to train a classifier for each of them. At this step, we trained classifiers for the most annotated skills, 15 for ICT and 30 for Industry 4.0.

The model used in all cases is shown in Fig. 2(a). The input for the models is the description of the job, including different attributes such as the level of education, minimum requirements, description of the position, title or previous knowledge. In order to mitigate the lack of positive labels for some of the skills, we tried different combinations of those fields to build several samples out of the same labelled job. We also combined samples obtained from the labelling tool with samples from the rule-based model to obtain balanced datasets. The resulting text for each job position is transformed into a sequence of words with a maximum length of 120 words, using left padding for shorter texts.

We used an artificial neural network whose first layer is an embedding layer. In this work, we use a pre-trained Word2vec embedding [6] trained on a billion Spanish words [7]. We average the embeddings of the words included in the job description to obtain a latent representation of it. This latent representation is the input to a fully connected layer with a softmax activation, used to obtain the probabilities of each predicted class. For some of the classifiers, we substituted the average of the embeddings with an LSTM layer plus a hidden dense layer, which showed better accuracy scores. Sometimes, though, this led to overfitting despite a 50% dropout rate, probably due to the lack of sufficient data samples, so we kept the original model.
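A minimal Keras sketch of this architecture is given below, assuming the pre-trained embedding matrix and the tokenised, left-padded sequences described above; the vocabulary size and embedding dimensionality are illustrative assumptions, and the skill classifiers would use two output classes instead of three.

```python
import numpy as np
import tensorflow as tf

MAX_LEN = 120            # sequences are left-padded to 120 words, as described above
EMB_DIM = 300            # dimensionality of the pre-trained Word2vec vectors (assumption)
vocab_size = 50_000      # illustrative vocabulary size
embedding_matrix = np.zeros((vocab_size, EMB_DIM), dtype="float32")  # load real vectors here

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, EMB_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        mask_zero=True, trainable=False),
    tf.keras.layers.GlobalAveragePooling1D(),          # average of the word embeddings
    tf.keras.layers.Dense(3, activation="softmax"),    # ICT / Industry 4.0 / Others
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```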

From the original 100 skills selected, 29 for ICT and 71 for Industry 4.0, we finally kept only 38, 15 for ICT and 23 for Industry 4.0, corresponding to those that received enough labels to train the classifier properly. Each classifier was trained using a specific training dataset. The number of examples for each classifier depended on the number of labels and on the results obtained from the rule-based model for that specific skill, varying from 1,000 to almost 4,000 examples.

Besides the architecture, some hyperparameters needed to be tuned for each problem, with learning rates for the Adam optimiser varying from 1e−3 to 5e−4, or a different number of epochs. Each training dataset was split into three sets: the training set itself, a validation set used to adjust the parameters of the optimiser and validate the architecture, and a test set used to obtain the performance metrics. The proportions were 90% for training and validation (split in turn into 90% and 10%) and 10% for testing. In each training process, we trained the model until the training loss converged, preventing overfitting by monitoring the validation loss.
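For reference, this nested split amounts to roughly 81% training, 9% validation and 10% testing overall; a sketch with scikit-learn (using placeholder arrays in place of the project data) could look as follows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 300)              # illustrative feature matrix
y = np.random.randint(0, 2, size=1000)     # illustrative binary skill labels

# 10% held out for testing.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

# The remaining 90% is split again into 90% training / 10% validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.10, stratify=y_trainval, random_state=42)
```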

As a result, the family/topic classifier obtained an accuracy of 94.75%. Figure 3 shows the accuracy obtained by the 38 skill classifiers. In addition to the accuracy, we computed a confusion matrix to analyse the behaviour of each classifier with regard to false positives and negatives, and we manually checked some examples to carry out a qualitative evaluation of the models. Even though the majority of the classifiers achieved accuracies over 80%, we observed that those with lower values correspond to skills that are hard to model using rules. For those cases, obtaining a larger set of labels could help improve the results.

Fig. 3. Accuracies obtained on the test set for the 38 skill classifiers.

Once trained, we applied the classifiers to the whole set of job adverts. First, we applied the family-of-studies classifier to isolate the adverts belonging to ICT and Industry 4.0. Then, we applied the 15 ICT skill classifiers to the ICT jobs and the 23 Industry 4.0 ones to the Industry 4.0 jobs, tagging the adverts of each family with the skills required for the position. As a result, each ICT and Industry 4.0 job advert is enriched with a list of required skills that can be used in the later analysis.

5.2 Challenge 2: Job Overqualification Analysis

A second valuable insight from this project consists of understanding the degree of matching between job demands and job offerings. In this second challenge, we take advantage of the data gathered to classify each job advert according to whether its description includes overqualification or not. As in the case of detecting the skills present in the job offers, detecting overqualification requires dealing with unstructured textual information. This motivates the use of advanced NLP models.

Using a strategy similar to the one explained in Challenge 1, the prediction accuracy only reaches 72.28%. This value, though acceptable in some contexts, is inadequate for a careful analysis of the problem. Our hypothesis for this result is the difficulty of defining the concept of overqualification and the subjectivity involved in those definitions. There are different levels of overqualification: for example, asking for a university degree for a job that a VET graduate can perform, or asking for a higher VET level when the job can be carried out by medium-level VET graduates.

These two properties introduce noise into the labelling process and, consequently, affect the performance of the machine learning classifiers. In this setting, the degree of confidence in the prediction can help to refine the results. This is addressed by modelling uncertainty.

At first glance, when using artificial neural networks, observing the entropy of the probabilities resulting from the softmax output can provide an idea of how confident the predictions are. That is, high probabilities for the target class, close to 1, suggest confident predictions. The problem with this approach, however, is that for data points with few occurrences in the training dataset or with ambiguous semantics, the model can yield overconfident predictions [8] and thus mislead further analysis.

Estimating the Uncertainty. When talking about the uncertainty of machine learning techniques, we find two different concepts of uncertainty depending on its source:

  • Epistemic uncertainty, which corresponds to the uncertainty originated by the model. It reflects the extent to which our model is able to describe the distribution that generated the data. It has two different causes: whether the model has been trained with enough data, and whether the expressiveness of the model can capture the complexity of the distribution. When using a sufficiently expressive model, this type of uncertainty can be reduced by including more samples during the training phase.

  • Aleatoric uncertainty, which belongs to the data. This uncertainty is inherent to the data and cannot be reduced by adding more data to the training process. We can further divide this uncertainty into two classes:

    • Homoscedastic: measures the level of noise that is derived from the measurement process. This uncertainty remains constant for all the data.

    • Heteroscedastic: measures the level of uncertainty caused by the data. In the case of NLP, this can be explained by the ambiguity of some words or sentences.

The different types of uncertainty must be measured differently. Consider a dataset, \(D = \{x_i,y_i\}, \; i =1\dots N\), composed of pairs of data points and labels. Given a new sample \(x^*\), we want to predict its label \(y^*\). Our goal is to capture the distribution that generated the outputs by using a model with parameters W. Under the Bayesian setting, this corresponds to the following marginal equation,

$$\begin{aligned} p(y^* | x^*, D) = \int _{W} p(y^*|f^W(x^*))p(W|D) dW \end{aligned}$$
(1)

In Eq. 1 one can see that the distribution of the output depends on two terms: one that depends on the application of the model to the input data, the aleatoric component, and a second one that measures how the model may vary depending on the training data, the epistemic component. For the epistemic uncertainty, we model the weights as random variables by introducing Gaussian perturbations and Flipout, as introduced in [10], and estimate the conditional probability using Monte Carlo (MC) sampling. The model trained up to this point is deterministic except for the epistemic components; as such, it does not allow inferring the aleatoric component of the uncertainty. Thus, for computing the aleatoric heteroscedastic uncertainty, we build an assistance deep neural model that complements the former ANN. We assume that the aleatoric component can be modelled using a latent layer of random variables following Gaussian distributions [9]. While the ANN output stands for the mean value of that distribution, we let the assistance NN work out the standard deviation corresponding to each input sample. The details of the implementation can be found in [13]. As in the former case, we use Monte Carlo approximations to sample from the latent layer and approximate the output distribution.

As a result of both Monte Carlo samplings, a probability density function of the output values is obtained, which serves to compute the overall uncertainty.
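A minimal sketch of such a wrapper, assuming TensorFlow Probability, is shown below; the layer sizes, input dimensionality and sampling counts are illustrative rather than the project's actual configuration, and the training objective would additionally require the KL terms contributed by the Flipout layers and the heteroscedastic likelihood of [9].

```python
import tensorflow as tf
import tensorflow_probability as tfp

N_CLASSES = 2            # overqualification: yes / no
INPUT_DIM = 300          # e.g. an averaged embedding of the job description (assumption)

inputs = tf.keras.Input(shape=(INPUT_DIM,))
# Epistemic part: weights are random variables, trained with Flipout perturbations [10].
hidden = tfp.layers.DenseFlipout(64, activation="relu")(inputs)
logits_mean = tfp.layers.DenseFlipout(N_CLASSES)(hidden)
# Aleatoric part: an assistance head predicts a per-sample log-std for the logits [9].
logits_log_std = tf.keras.layers.Dense(N_CLASSES)(hidden)
model = tf.keras.Model(inputs, [logits_mean, logits_log_std])

def mc_predictive_probs(model, x, n_weight_samples=20, n_logit_samples=20):
    """Nested Monte Carlo: each forward pass re-samples the Flipout weights
    (epistemic), then Gaussian logits are sampled around the predicted mean
    (aleatoric); the softmax outputs are averaged to estimate E[p_c]."""
    probs = []
    for _ in range(n_weight_samples):
        mean, log_std = model(x, training=False)
        std = tf.exp(log_std)
        for _ in range(n_logit_samples):
            logits = mean + std * tf.random.normal(tf.shape(mean))
            probs.append(tf.nn.softmax(logits, axis=-1))
    return tf.reduce_mean(tf.stack(probs), axis=0)
```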

Computing the Uncertainty Score. Figure 2(b) illustrates the classifier trained to estimate both epistemic and aleatoric uncertainties. Remarkably, just by training the model with the new architecture for 20 epochs, the resulting accuracy increases to 84.25%.

The model in Fig. 2(b) is also used to predict the uncertainty score. In this article, we use the predictive entropy as defined in [8]. The predictive entropy, introduced by [11], measures the dispersion of the predictions around the mode. In this case, we combine the computation of the aleatoric and epistemic uncertainty; thus, it is necessary to combine the random variables that learned the variability of the model, the epistemic component, with the random variables assigned to the output logits that model the aleatoric uncertainty.

In order to obtain the uncertainty score, we carry out two consecutive Monte Carlo simulations: first, we sample a model W, and then we use this model to sample the output prediction for each logit, following the distribution parameterized by the outputs of the ANN and the assistance NN. Using the resulting probabilities, we compute the predictive entropy as follows,

$$\begin{aligned} \mathbb {H}[y|\mathbf {x}, D_{train}] := -\sum _{c} \mathbb {E}(p_c)\log {\mathbb {E}(p_c)} \end{aligned}$$
(2)
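In code, given the MC-averaged class probabilities \(\mathbb {E}(p_c)\) produced by the sampling procedure above, Eq. 2 amounts to the following small helper (a sketch; the epsilon is only there for numerical stability):

```python
import numpy as np

def predictive_entropy(mean_probs, eps=1e-12):
    """Eq. 2: H[y|x, D_train] = -sum_c E[p_c] log E[p_c],
    where mean_probs holds the softmax outputs averaged over the MC samples
    (shape: [n_samples, n_classes])."""
    return -np.sum(mean_probs * np.log(mean_probs + eps), axis=-1)
```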

Uncertainty-Based Rejection Classifier. Despite the increase in accuracy obtained after training the ANN with uncertainty, we can further exploit the uncertainty score to improve the results by filtering out the least confident predictions. As a result, the quality of the filtered predictions increases, as the system only outputs confident predictions. This score can thus be used as a rejection mechanism for the classification.

Different measurements can be defined for rejection. For example, the test dataset can be sorted by the rejection value, from higher to lower scores. If there is a correlation between the uncertainty score and the misclassification of the input, then by discarding the most uncertain data points we increase the performance of the classification system. We consider different rejection points corresponding to descending values of the rejector, from including all points to discarding all of them in the last iteration. The analysis of the three performance measures [12] for each rejection point must consider a trade-off between the number of samples rejected and the quality of the classifications.
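As an illustration of this mechanism (a sketch covering only the kept-point accuracy measure; the function and variable names are hypothetical), sorting by uncertainty and sweeping the rejection fraction can be done as follows:

```python
import numpy as np

def rejection_curve(y_true, y_pred, uncertainty, fractions=np.linspace(0.0, 0.9, 10)):
    """For each rejection fraction, discard the most uncertain predictions and
    report the accuracy on the kept points."""
    order = np.argsort(-uncertainty)                 # most uncertain first
    n = len(y_true)
    curve = []
    for frac in fractions:
        kept = order[int(frac * n):]                 # drop the top `frac` uncertain points
        acc = float(np.mean(y_true[kept] == y_pred[kept]))
        curve.append((frac, acc, len(kept)))
    return curve
```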

Fig. 4. Performance measures showing the accuracy of the kept points, how correct predictions are kept and wrong ones discarded, and the ability to reject wrong samples.

Figure 4 compares the results obtained after applying three different models for computing uncertainty: the predictive entropy of the aleatoric model, as proposed in this work; the predictive entropy of the original model, used as a baseline; and an additional uncertainty score, variation ratios, as described in [8]. The plot shows that, by rejecting only the 10% of predictions with the highest uncertainty scores, the model increases its performance to 88%, and it reaches 90% prediction accuracy when rejecting the 20% most uncertain predictions.

6 Project Results Insights

The present section shows preliminary results obtained from the application of the models and techniques introduced in the previous sections. The goal of the section is to describe how these results complement the original information and allow domain experts to analyse them in order to obtain insights and explanations for the relationship between labour demand and VET supply.

6.1 Overall Analysis of the Labour Market

To answer T1, the geographic characterization of the labour demand, we looked at the demand at a general level, not focusing on the selected families but considering the full picture of hiring in the region. To do so, we used the contracts data source, connecting it to the VET studies through the mapping developed. For this research topic, we also included additional data sources, such as demographic information and data about companies (e.g., the number of companies per sector and their number of employees), to contextualize the hiring data and its evolution, enabling comparisons between the number of contracts, the population, and the number or size of companies.

For T2, we analyse the relationship between Dual VET and sectorial labour demand, as a specific case of T1 focusing on Dual VET. Dual VET is a particular type of training where students have the opportunity to finish their studies with an internship in a company. Here, we focus on aspects like the impact of gender or age range, studying the distribution of Dual VET across the families of studies and the region. For this analysis, we only had access to one year of data instead of the five years that the study covers; the conclusions extracted are therefore preliminary, pending access to a more extended period of data.

6.2 Analysis of the Matching Between VET and Job Market

After applying the family and skill classification models described in Sect. 5.1, thereby tagging each job with the skills required, it is now possible to analyse the temporal evolution of these skills (Fig. 5) and link this information with the rest of the attributes of the job descriptions. The study of T3 will allow the analysis of trends in the demand for the selected skills, delivering new tools for the design of VET curricula that best adapt to the real market needs.

Fig. 5. Evolution of the percentage of the top 10 demanded skills with respect to the total ICT demand.

Fig. 6. Territorial analysis comparing contracts and VET students for system administrators.

The skills assigned to the job positions are also used to match them with VET studies and answer T4, checking the degree of educational coverage of the VET offering according to the real demand obtained from the analysis of job posts.

Figure 6 shows a comparison between the demand for job positions that require a given skill and the enrolments in studies that include that skill, aiming to assess the educational coverage of those skills across the region. Again, this study can be carried out considering different aspects, such as the temporal evolution, the type of job positions, and so forth.

6.3 Adequacy Between Required Functions/Skills and the Level of Studies

Finally, T5 is answered based on the results obtained from the uncertainty-based rejection method described in Sect. 5.2. Taking the 222,957 job posts that correspond to ICT and Industry 4.0 positions, as indicated by the family classifier, we apply a rejection ratio that, at test time, reached 92.5% accuracy. We thus discard 137,157 job descriptions and focus the study on the remaining 85,800. Of those, we estimate that only 17,388 include overqualification in the job position.

This process allows us to study the trends associated with overqualification and its temporal evolution, analysing potential differences between ICT and Industry 4.0 or studying the phenomenon individually in each family. By examining the overqualified job positions, we would like to study the different types and levels of overqualification and determine their causes. Together with domain experts, we continue to analyse these results to obtain the final project insights.

7 Conclusions

In this paper, we have introduced the BDFP project (Big Data per a la Formació Professional, or Big Data for VET in English), an ongoing initiative promoted by the Center of Excellence in Big Data of Barcelona and supported by the local government and a total of 11 public and private institutions. The project aims at analysing the evolution of the jobs market and the corresponding VET offering in the region of Catalonia from 2013 to 2018, with a strong focus on two strategic sectors, ICT and Industry 4.0.

We consider the following three as the main contributions of the project. The first one has been to enable a joint conversation and the definition of common goals, in data terms, among all participating entities. Having debates aimed at social good with the main decision-makers, data owners, domain experts and data technology experts of a region is of paramount relevance and, although not easy, should be energetically encouraged everywhere.

The second main contribution has been the opportunity to break down some information silos and connect the labour market as seen by private companies like Infojobs (with mostly free-text job posts) to the VET offering as defined by the information managed by the local authorities. This significant effort consisted of an active collaboration between data scientists and domain experts, who together designed the classification models by defining the rules and participating in the labelling process applied to more than one million job posts, including hundreds of hours devoted by a team of knowledgeable volunteers. The project has also been an excellent opportunity to validate the effectiveness of NLP techniques based on Deep Learning to extract, from the free text of job posts, the skill sets required by employers as defined in the official VET curricula. Moreover, the complexity inherent in some tasks (e.g., the detection of overqualification) has required the application of advanced methods for improving the performance of the resulting models.

The third main contribution has been the characterisation of the evolution of a relevant part of the skill sets required by employers across the Catalan territory for the two selected sectors during the period of study, and their relationship with the corresponding VET offerings. We have therefore validated the applicability of the data sources and models selected for meaningfully answering the research topics defined in the project. Although we are aware that the job vacancies from the two collaborating portals do not represent the entire job market and are biased towards higher-profile positions, they constitute a significant sample from which conclusive results can be derived.

Currently, together with the domain experts, we are exploring the results of the project to obtain further insights and explanations for the questions posed at the beginning of the project. The complete conclusions of this work will be compiled into a publicly available report that will be used by local policymakers to analyse the evolution of the labour demand, detect new trends in the demand for VET skills and study how to best adapt the training programmes to the current requirements of companies. Finally, we expect to work with some of the many organisations that have approached us during this process in order to extend it to other disciplines beyond ICT and Industry 4.0, and to other regions in Spain and Europe.