
Government performance appraisal refers to the comparison and analysis of actual work against performance indicators covering the economic, political, social, cultural, and ecological aspects of government organizations in the course of performing their functions, so that the fulfillment of the indicators can be monitored in real time and a comprehensive evaluation can be carried out in a timely manner. Because the performance evaluation indicators span economic, political, social, cultural, ecological, and many other aspects in a wide variety of categories, the workload of on-site inspection is huge. Therefore, government departments select the inspection indicators and checkpoints randomly.

Usually, the system selects a specified number of indicators from the index database at random. However, indicators selected in this way do not have distinctive characteristics or representativeness, which indirectly affects the results of performance evaluation. To address this situation, the author analyzes the indicators by means of text categorization, combining a community detection algorithm with the search popularity of keywords. In this way, the indicators can be divided into different communities, providing a more effective recommendation and making the selected indicators more targeted and representative.

1 Related Work

TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique in information retrieval and text mining [1]. TF-IDF is a statistical method for assessing the importance of a word to a specific document in a corpus, and it is often used by search engines as a measure or rating of the relevance between documents and query terms [2].

A community usually reflects part of the behavioral characteristics of specific individuals in a network and the relationships between those individuals. Therefore, analyzing the communities in a network plays a crucial role in understanding the structure and function of the entire network; it also helps us analyze and predict the interactions between the entities in the network. Commonly used community discovery algorithms include methods based on clique filtering, methods based on local expansion and optimization, and overlapping community discovery methods based on line graphs and edge communities. The clique percolation method (CPM) [3] is used to find overlapping communities; a clique is a set of vertices in which any two vertices are connected, also known as a complete subgraph. A Class Association Rule (CAR) [4] is an association rule algorithm that focuses on mining special subsets, with the goal of finding a set of rules in the database for precise classification [5]. The first scan generates local frequent item sets, and the second scan forms global frequent item sets; the disadvantage is that some false frequent item sets are generated.

The basic idea of the label propagation algorithm (LPA) is that each node initially has an independent label, and the labels of unlabeled nodes are then predicted from the information carried by the labeled nodes. Label propagation between nodes is based mainly on label similarity: during propagation, each unlabeled node iteratively updates its own label according to the labels of its adjacent nodes. The more similar the adjacent nodes are, the greater the influence of the labeled weighting value and the more easily the labels of neighboring nodes spread. However, the global optimality of label propagation does not play a significant role in practice, so it may not produce high-quality communities. Moreover, because the characteristics of indicators are not as complex as the human elements of a social network, a non-overlapping community discovery algorithm based on modularity optimization can increase processing speed while preserving quality.

The Louvain algorithm [6] is a graph algorithm based on modularity. Compared with other common algorithms based on modularity and modularity gain, it performs better in both efficiency and effect, and it can also discover hierarchical community structures [7]. Its optimization goal is to maximize the modularity of the overall graph structure (the community network) [8]. The algorithm is very fast, and the clustering effect is especially obvious for graphs with many nodes and few edges, which makes the method well suited to community division. By incorporating the TF-IDF values of keywords and the search index of the keywords into the algorithm, the community division becomes more accurate and the characteristics of the communities become more obvious.

2 Problem Description and Analysis

Indicators of performance evaluation are distributed across various fields such as economy, politics, society, and culture [9]. When the government selects indicators randomly, the selection is therefore arbitrary and contingent: most of the selected indicators may fall into the same categories, or the selected indicators may have little influence and be unrepresentative. In that case, the data collected by inspection and verification at the checkpoints is not representative, which indirectly reduces the efficiency of government performance evaluation.

However, the criteria for measuring an indicator can include many aspects, such as the number of keywords contained in the indicator, the search popularity of each keyword, and the influence of each keyword on the indicator. Combining these metrics alone already gives a certain assessment of the value and importance of an indicator. By further linking the attributes of these keywords with the indicators, we can form a network containing the relationships between indicators and keywords. In this way, deeper relationships between indicators and keywords, and between the indicators themselves, become apparent, which serves the purpose of index recommendation. Because the recommendation is intended for government workers, it is necessary to combine it with data visualization methods so that the results are clearer and more vividly displayed for staff selection.

3 Index Recommendation Algorithm Based on the Louvain Algorithm with Added Keyword Popularity

3.1 Keyword Extraction Strategy

Chinese lexical analysis is the foundation and key point of Chinese information processing. Based on years of accumulated research, the Institute of Computing Technology of the Chinese Academy of Sciences has developed a Chinese lexical analysis system, the NLPIR Chinese word segmentation system (also known as ICTCLAS2013). It mainly provides Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word recognition; it also supports user dictionaries and the GBK, UTF-8, and BIG5 encodings. Later, new segmentation modes, new word discovery, and keyword extraction were added.

TF-IDF is a statistical method for assessing the importance of a word to a specific document in a corpus. The importance of a word increases proportionally with the number of times it appears in the document, but decreases with the frequency at which it appears in the corpus as a whole. The main idea of TF-IDF [10] is that if a word or phrase appears in an article with high frequency (TF) [11] and rarely appears in other articles, the word or phrase is considered to have good class discrimination ability and is therefore suitable for text classification. TF-IDF consists of two parts: term frequency (TF) and inverse document frequency (IDF).

Term frequency (TF) is the frequency with which a given word appears in a document. The raw count is normalized to prevent a bias towards long documents (the same word may have a higher count in a long document than in a short one, regardless of whether the word is important). For a word in a particular document, its importance can be expressed as:

$$ {\text{tf}}_{\text{i,j}} = \frac{{n_{i,j} }}{{\sum\nolimits_{k} {n_{k,j} } }} $$

In the above formula, \( n_{i,j} \) is the number of occurrences of the word \( t_{i} \) in the document \( d_{j} \), and the denominator is the sum of the occurrences of all words in the document \( d_{j} \).

The inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the quotient:

$$ {\text{idf}}_{\text{i}} = \log \frac{\left| D \right|}{{\left| {\left\{ {j:t_{i} \in d_{j} } \right\}} \right|}} $$

In the above formula, \( \left| D \right| \) is the total number of documents in the corpus, and \( \left| {\left\{ {j:t_{i} \in d_{j} } \right\}} \right| \) is the number of documents containing the word \( t_{i} \) (i.e., the number of documents for which \( n_{i,j} \ne 0 \)). If the word does not appear in the corpus, this denominator is zero, so in practice \( 1 + \left| {\left\{ {j:t_{i} \in d_{j} } \right\}} \right| \) is commonly used instead.

Then,

$$ {\text{tfidf}}_{\text{i,j}} = {\text{tf}}_{\text{i,j}} \times {\text{idf}}_{\text{i}} $$

A high TF-IDF weight is produced by a high term frequency in a particular document together with a low document frequency of the word across the entire document set. Therefore, TF-IDF tends to filter out common words and retain important words.
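To make the computation concrete, the following is a minimal Java sketch of the TF and IDF formulas above. This is not the paper's actual implementation; the class, method names, and toy corpus are illustrative, and the +1 smoothing in the IDF denominator follows the common practice noted above.

```java
import java.util.*;

public class TfIdf {
    /** tf(i,j): occurrences of the term in the document divided by the document's token count. */
    static double tf(String term, List<String> doc) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    /** idf(i): log(|D| / (1 + number of documents containing the term)); the +1 avoids division by zero. */
    static double idf(String term, List<List<String>> corpus) {
        long docFreq = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / (1 + docFreq));
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        // Toy corpus of three already-segmented "indicators".
        List<List<String>> corpus = List.of(
                List.of("sports", "facility", "construction"),
                List.of("sports", "event", "service"),
                List.of("tea", "industry", "development"));
        System.out.println(tfIdf("sports", corpus.get(0), corpus));
        System.out.println(tfIdf("tea", corpus.get(2), corpus));
    }
}
```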

Indicators are also a kind of short text, so the extraction of indicator keywords mainly includes the following steps:

(1) Read the indicators out of the database and store them in a file, one indicator per line.

(2) Using this file as input, call the NLPIR system of the Chinese Academy of Sciences to segment each indicator.

(3) Remove the stop words and extract the valid words.

(4) Perform a TF-IDF calculation for each keyword in each indicator to assess the importance of the keyword to that indicator.

(5) Determine the specific keywords that correspond to each indicator.

Firstly, the indicator id and the indicator name are read from the database and written into a file. Secondly, NLPIR is used to segment each indicator; the words are separated by spaces and the indicators by newlines. Next, based on the analysis, nouns, verbs, and adjectives are taken as candidate keywords, separated by spaces, with the indicators stored one per line in a separate file. Then, the TF-IDF value of each keyword is calculated for its indicator, and the keywords are sorted in descending order of TF-IDF value: the larger the value, the more influence the keyword has on the indicator and the better the keyword represents the indicator's characteristics. Finally, based on a number of experiments, we take, for each indicator, the keywords whose TF-IDF values are in the top two-thirds; if fewer than five keywords were extracted from the indicator, all of them are kept as the keywords of that indicator. The resulting file is shown in Fig. 1:

Fig. 1. TF-IDF file
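As an illustration, the selection rule at the end of this pipeline (top two-thirds by TF-IDF, or all keywords when fewer than five were extracted) can be sketched as follows. The scoring map is assumed to come from the TF-IDF step above; names are illustrative.

```java
import java.util.*;
import java.util.stream.Collectors;

public class KeywordSelector {
    /**
     * Keeps the top two-thirds of an indicator's keywords by descending TF-IDF;
     * if the indicator yielded fewer than five keywords, keeps them all.
     */
    static List<String> select(Map<String, Double> tfidfByKeyword) {
        List<String> sorted = tfidfByKeyword.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        if (sorted.size() < 5) {
            return sorted;
        }
        int keep = (int) Math.ceil(sorted.size() * 2.0 / 3.0);
        return sorted.subList(0, keep);
    }
}
```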

3.2 Keyword Search Popularity Acquisition Strategy

Jsoup is an HTML parser for Java that can directly parse a URL address or HTML text content. It provides a very convenient API, similar to jQuery, for extracting and manipulating data through the DOM and CSS selectors.

The ‘webmaster’s home’ is a website for querying the search index of a specific keyword across the entire web. For each given keyword, it returns the word's search index on platforms such as 360, Sogou, and WeChat, as well as on mobile, together with a combination of all these indices. The comprehensive whole-network index is used here.

This method is based on the Jsoup package for Java: for each input keyword, the whole-network search index is crawled from the website, and the results are stored in a file in the form “keyword = TF-IDF : whole-network search index”. The file is shown in Fig. 2:

Fig. 2. Index file.
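A minimal Jsoup sketch of this kind of crawl is given below. The query URL and the CSS selector are hypothetical placeholders, since the actual page structure of the site is not described here; only standard Jsoup calls are used.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class IndexCrawler {
    // Hypothetical query endpoint; the real site's URL pattern would go here.
    static final String QUERY_URL = "https://example.com/search-index?keyword=";

    /** Fetches the page for a keyword and extracts its whole-network search index. */
    static String crawlIndex(String keyword) throws IOException {
        Document doc = Jsoup.connect(QUERY_URL + keyword)
                .userAgent("Mozilla/5.0")   // some sites reject the default user agent
                .timeout(10_000)
                .get();
        // ".index-value" is a placeholder selector for the element holding the index.
        return doc.select(".index-value").first().text();
    }
}
```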

3.3 Louvain-Based Community Classification Algorithm

The Louvain algorithm is a community discovery algorithm based on modularity. It performs better in both efficiency and effect than comparable algorithms, and it can discover hierarchical community structure at the same time. Its optimization goal is to maximize the modularity of the attribute structure of the entire graph (the community network).

The core of the algorithm consists of the following two points:

(1) Modularity is defined to measure the quality of the division of a community network: it is the difference between the number of edges that fall within communities and the expected number of such edges in a random network, and its value typically lies in the range (0, 1). It is defined as follows:

    $$ Q = \frac{1}{2m}\sum\limits_{i,j} {[A_{ij} - \frac{{k_{i} k_{j} }}{2m}} ]\delta (c_{i} ,c_{j} ) $$
    $$ \delta (u,v) = \left\{ {\begin{array}{*{20}l} 1 \hfill & {{\text{if}}\;u = v} \hfill \\ 0 \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$

In this formula, \( A_{ij} \) represents the weight of the edge between nodes i and j (taken as 1 when the graph is unweighted); \( k_{i} \) represents the sum of the weights of the edges attached to node i (its degree when the graph is unweighted); m is the sum of all edge weights in the graph; and \( c_{i} \) is the number of the community to which node i belongs.

(2) The modularity gain ΔQ is the change in modularity when an isolated node is merged into a community C. The main points of the calculation are as follows (a code sketch is given after this list):

First, compute the modularity of the isolated node and the modularity of community C.

Then, compute the modularity of the merged new community.

Finally, ΔQ is the modularity of the new community minus the two previous values.

    $$ \begin{aligned} \Delta Q & = \left[ {\frac{{\sum_{in} + k_{i,in} }}{2m} - \left( {\frac{{\sum_{tot} + k_{i} }}{2m}} \right)^{2} } \right] - \left[ {\frac{{\sum_{in} }}{2m} - \left( {\frac{{\sum_{tot} }}{2m}} \right)^{2} - \left( {\frac{{k_{i} }}{2m}} \right)^{2} } \right] \\ & = \frac{1}{2m}\left( {k_{i,in} - \frac{{\sum_{tot} \,k_{i} }}{m}} \right) \\ \end{aligned} $$

Here \( k_{i,in} \) is the sum of the weights of the edges between node i and the nodes in community c, \( \sum_{in} \) is the sum of the weights of the edges inside c, and \( \sum_{tot} \) is the sum of the weights of the edges incident to nodes in c. ΔQ consists of two parts: the first bracket is the modularity after node i joins community c, and the second is the modularity when node i stands as an independent community alongside c.
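As a minimal sketch (not the paper's implementation), the simplified form of ΔQ above can be computed directly; kIIn, sigmaTot, kI, and m follow the definitions in the text.

```java
public class ModularityGain {
    /**
     * Simplified modularity gain from moving isolated node i into community c:
     * deltaQ = (1 / 2m) * (kIIn - sigmaTot * kI / m)
     *
     * @param kIIn     sum of weights of edges between node i and nodes in c
     * @param sigmaTot sum of weights of edges incident to nodes in c
     * @param kI       sum of weights of edges incident to node i
     * @param m        sum of all edge weights in the graph
     */
    static double deltaQ(double kIIn, double sigmaTot, double kI, double m) {
        return (kIIn - sigmaTot * kI / m) / (2 * m);
    }
}
```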

In this method, the nodes are the keywords and the index ids, and the edges connect each keyword with its corresponding indexes. Since the TF-IDF value of a keyword for a specific index represents the influence of the keyword on that index, the larger the value, the greater the influence. Using it as the edge weight makes indexes that share keywords more closely related for the purposes of community division: if two indicators are linked to the same keyword and both TF-IDF values are large, the two indicators are most likely of the same type, and in the community division they are very likely to be placed in the same community compared with other indicators.
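A minimal sketch of how such a weighted keyword-index edge list might be assembled is shown below; the names are illustrative, and the per-index TF-IDF scores are assumed to come from the extraction step in Sect. 3.1.

```java
import java.util.*;

public class NetworkBuilder {
    /** One edge of the keyword-index network, weighted by TF-IDF. */
    record Edge(String keyword, String indexId, double tfidf) {}

    /**
     * Builds the edge list from per-index keyword scores:
     * indexId -> (keyword -> TF-IDF of that keyword for that index).
     */
    static List<Edge> buildEdges(Map<String, Map<String, Double>> scoresByIndex) {
        List<Edge> edges = new ArrayList<>();
        scoresByIndex.forEach((indexId, scores) ->
                scores.forEach((keyword, tfidf) ->
                        edges.add(new Edge(keyword, indexId, tfidf))));
        return edges;
    }
}
```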

The resulting communities are saved as a JSON file for the next step, data visualization. The JSON file is shown in Fig. 3:

Fig. 3. JSON file
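The exact schema of the JSON file is not specified in the text; the sketch below assumes the nodes/links layout conventionally consumed by D3 force-directed graphs, with the community number, search popularity, and TF-IDF weight attached as attributes. The values are toy data.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class GraphWriter {
    /** Writes a toy two-node graph in the assumed D3 force-layout schema. */
    public static void main(String[] args) throws IOException {
        String json = """
            {
              "nodes": [
                {"id": "sports", "community": 1, "popularity": 5200},
                {"id": "1024",   "community": 1, "popularity": 0}
              ],
              "links": [
                {"source": "sports", "target": "1024", "weight": 0.42}
              ]
            }
            """;
        Files.writeString(Path.of("communities.json"), json);
    }
}
```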

3.4 Data Visualization

The purpose of data visualization is to present data so that information can be delivered clearly and effectively. D3 (Data-Driven Documents) is a data-driven JavaScript library used mainly for data visualization; the “documents” are Document Object Model (DOM) documents. D3 allows the user to bind arbitrary data to the DOM and then manipulate the document based on the data to create interactive charts. The code of the D3 project is hosted on GitHub.

Through the above process, we obtain JSON files of the community division, containing the keyword network data and the indicators with their community numbers. The d3 library provides a visual representation of these data: the weight of a node is the search popularity of the keyword corresponding to that node, and the weight of an edge is the TF-IDF value of the keyword for the index.

4 Results Display and Analysis

The indicators are segmented with NLPIR; the stop words are then removed, keywords are extracted, TF-IDF values are calculated, the keywords are sorted according to the results, and the final keywords corresponding to each indicator are determined. Using the modularity-based, non-overlapping Louvain community division algorithm, the network of indicators and keywords is divided into communities, and all keywords and indicators are marked with their community numbers. At the same time, the whole-network search index of each keyword is introduced into the system. Finally, the network is visualized with d3, which is based on JavaScript. The result is shown in Fig. 4 below.

Fig. 4. The results of the current day

In Fig. 4, the nodes labeled with words are the keywords of one or more indexes, and the nodes labeled with numbers are the index ids of the corresponding indexes (because each index name is long, the index id from the database is used as a substitute). On the one hand, nodes of the same color belong to the same community, that is, to the same class; in other words, these keywords are closely linked, and the indexes they represent have very similar application scenarios. On the other hand, the node size indicates the search popularity of the keyword: the larger the radius of the circle, the higher the search index. The width of each edge represents its weight, namely the TF-IDF value between the keyword and the index, which expresses how closely the keyword is tied to the indicator. At the top there is a date slider, which produces different images from the current day back to 30 days earlier (Fig. 4 is based on the search popularity of the current day, and Fig. 5 on the search popularity of 14 days earlier); the daily search popularity of each keyword differs.

Fig. 5. The result of 14 days ago

In the image, the points located at the center are relatively dense, and many lines connect to each central point, indicating that these keywords are related to many indexes and are much more important to them. It is also apparent that most of the indexes associated with these keywords fall into the same community, which means that these indexes are similar or describe similar semantics and are therefore placed in the same category. Several points on the left side of Fig. 4 are not connected to other points; they represent a few keywords corresponding to one particular index that do not appear in other indexes, showing that this index is semantically different from the others and is therefore placed in a category of its own.

Comparing Figs. 4 and 5, some subtle differences are easy to find. For example, the node named “tea” is larger in Fig. 4 than in Fig. 5, which indicates that this keyword was more popular on that day than 14 days before. In fact, the change in a keyword's popularity is not very pronounced over one or two weeks, but it can become obvious over a longer period, and this kind of difference can cause the same index to be assigned to different classes as the date changes.

When government staff select indicators, if they want the selected indicators to cover all kinds of areas and types as comprehensively as possible, they can select indicators with different colors. Among indicators of the same color, the one to check can be chosen according to the search popularity of the keyword (i.e., the size of the circle). For example, the search index of the word “sports” is very high, so when choosing among the indicators in pink, we can choose the indicators linked to “sports” for checking. If the keyword “sports” is associated with multiple indicators belonging to the same community, we can then choose according to the width of the edge between the indicator and the keyword: the wider the edge, the more closely the keyword and the index are associated, and the more representative the indicator is.

5 Conclusion

This paper proposes an index recommendation method based on the Louvain algorithm combined with the popularity of the indicators' keywords, and presents the recommendation results with the d3 data visualization framework. First, each indicator undergoes word segmentation, stop word removal, and initial keyword extraction. Then, TF-IDF is calculated for each keyword of each indicator, and the keywords with higher influence are retained. Next, a network of keywords and indicators is established, and the Louvain algorithm is used to divide the nodes into communities. Finally, the results are presented visually. From the results, we can clearly and intuitively see the relationships between indicators and keywords, the types of indicators, and the popularity of indicators. This is easy for users to understand and lets them select indicators in a much more targeted way according to the needs of the business process. In this way, the selected indicators are much more representative for checking, and the efficiency of government organizations can be greatly improved.