Exploration of rule-based knowledge bases: A knowledge engineer’s support

doi:10.1016/j.ins.2019.02.019

Information Sciences

Volume 485, June 2019, Pages 301-318

https://doi.org/10.1016/j.ins.2019.02.019

Highlights

• A hierarchical structure of rule clusters supports a knowledge engineer in managing large knowledge bases and helps detect rare rules as well as similar ones.
• Clustering similar rules and generating cluster representatives is an effective representation method for knowledge bases with clusters of rules.
• The inference process is optimized for knowledge bases with clusters of rules.
• Experiments were performed on four knowledge bases of varying size.

Abstract

Data exploration helps us understand the investigated reality in a faster and better way. In this paper, the data to be explored are domain knowledge bases with rule-based representation. The specificity of rule representation requires carefully selected analysis methods in order to provide useful new knowledge to both the knowledge engineer and the user of a decision support system with a rule-based knowledge base. Effective exploration of rule-based knowledge bases can be carried out by creating clusters of if... then rules and their representatives using hierarchical methods. This is a new and unique approach to managing domain knowledge bases, as it facilitates the creation of cohesive and well-described clusters and the detection of rare rules (those dissimilar to other rules) while concurrently providing visualization of a knowledge base. In the experiments, four knowledge bases with a varying number of attributes and rules were used. The knowledge bases were explored using four different methods of determining cluster representatives, four clustering methods and nine similarity measures. It turns out that each of these factors substantially influences the size of the resulting clusters, the number of outliers and the frequency of overgeneral and overspecific cluster representatives.

Introduction

Data exploration helps us understand the investigated reality in a faster and better way. The data to be explored are domain knowledge bases with rule-based representation, i.e. if... then chains, which make it possible to conveniently describe a knowledge domain as relations connecting premises with the conclusions arising from the observation of those premises. Such rules can be generated automatically from data with one of the available rule induction methods or can be supplied directly by an expert (or experts). The main advantages of this knowledge representation are easy interpretation and the ability to record knowledge as if... then chains. In addition, such a representation places no limitations on the type of data, as it works well with both numeric and categorical data. For medicine, economics or any other subject, rule representation provides a domain expert with an accessible means of presenting the gathered knowledge. A knowledge engineer just needs to save this knowledge using available IT tools. As a result, both simple and complex decision support systems proliferate. Rule-based knowledge representation requires a careful selection of analysis methods so that the new knowledge extracted from it is useful for a knowledge engineer as well as for an end user of a decision support system. A knowledge engineer looks at a rule-based knowledge base with a view to easy management of rules (searching within the rules, discovering frequent and rare rules or their conditions), whereas for an end user the priority is the efficiency of the inference in which he or she takes part when using a given system. In each of these cases, when there are too many rules in a knowledge base, searching them effectively might be impaired. What is more, with too many rules it is almost impossible to find any relationships or similarities between them, or to find rare rules.
There is a promising prospect of having a tool which could describe a knowledge base by reporting the number of groups of rules with similar premises, the size of such groups and their representatives, as well as the number of rules whose premises are unrelated to any other rule or group of rules; such rules are classified as so-called rare rules and, in consequence, are not clustered with others. Within the field of data mining, this issue is related to the outlier detection approach [19]. The knowledge of rare rules allows for a wider exploration of the domain in a previously unknown area. Effective exploration of rule-based knowledge bases can be carried out by creating clusters of if... then rules and their representatives using hierarchical methods. This is a new and unique approach to managing domain knowledge bases, as it facilitates the creation of cohesive and well-described clusters and the detection of rare rules (those dissimilar to other rules) while concurrently providing visualization of a knowledge base. This issue has recently grown in importance. In almost every aspect of everyday life, we need tools that allow us to swiftly and efficiently manage huge datasets; this applies not only to information search but, primarily, to enabling a user to generalize and visualize data for the purpose of organizing it. The development of computerization has been accompanied by the development of decision support systems based on domain knowledge in almost every aspect of life, from industry to economics and medicine. With the passing of time, knowledge bases set up within such systems have contained more and more rules (while MYCIN contained a few hundred rules, modern systems can contain a few hundred thousand). A domain expert or knowledge engineer is simply unable to manage such a massive dataset efficiently. Hence the need for tools that help to explore modern domain knowledge bases, which are often dispersed and contain data with a complex structure.

Clustering is one of the methods which allow huge datasets to be managed effectively. Depending on the context and the clustering method, the results may vary substantially. Among the available clustering techniques, non-hierarchical (partitional) and hierarchical methods can be used. The subject of clustering is another important factor. There are numerous available papers which discuss the clustering and managing of huge datasets (text documents, images and numerical data). The subject of research in this paper is the representation of specific data such as rule-based knowledge bases.

Even though there are numerous papers which present rule representation as decision tables and association rules and provide methods and tools for their effective management (especially when such sets contain a large number of rules), so far we have not found any papers presenting exploration tools which can deal with large sets of production rules. This has become our main motivation for research on exploration methods and tools for rule-based knowledge bases. Having analyzed numerous clustering methods (partitional, density-based and hierarchical), we decided to focus our efforts on the hierarchical ones as we found them most promising. As a result, we propose a modification of the classic clustering algorithm. The proposed technique clusters rules on the basis of the similarity of their premises and then labels each cluster with a representative. The innovation here involves clustering rule premises only using the maximum similarity criterion (not the distance criterion used by most existing methods and tools), various similarity measures and intra-cluster methods and, in particular, looking for an optimal number of clusters. Besides the knowledge of the structure of the generated clusters (i.e. the number of clusters and the composition of each cluster), the proposed approach provides a descriptive representation of clusters in addition to their visualization. It allows representatives to be designated for the created clusters using the generalization approach, the specification approach or an approach which combines both.
This would undoubtedly provide substantial support for a domain expert or knowledge engineer, who could improve their knowledge of the domain described in a knowledge base (for example, the number of rule clusters with similar conditions, the size of those clusters and the number of rare rules which could not be clustered). The cluster representatives would make it possible for a domain expert or knowledge engineer to find desired rules in order to update them, or to explore a previously unexplored part of the domain.
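To make the idea of clustering rules by premise similarity more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that a premise is a set of attribute-value pairs, uses a simple matching coefficient as the similarity measure, merges the two most similar clusters first (a single-link style choice), and builds a generalization-style representative from the conditions shared by all rules in a cluster. The names premise_similarity, cluster_rules, representative and the min_similarity threshold are all hypothetical.

```python
from itertools import combinations

def premise_similarity(a, b):
    """Simple matching (an assumed measure): share of attributes, taken over
    the union of both premises, that occur in both with the same value."""
    attrs = set(a) | set(b)
    if not attrs:
        return 0.0
    same = sum(1 for k in attrs if k in a and k in b and a[k] == b[k])
    return same / len(attrs)

def cluster_similarity(c1, c2, rules):
    """Single-link style (an assumption): best similarity over member pairs."""
    return max(premise_similarity(rules[i][0], rules[j][0])
               for i in c1 for j in c2)

def cluster_rules(rules, min_similarity=0.3):
    """Agglomerative clustering driven by maximum similarity.
    Merging stops once no pair of clusters reaches min_similarity."""
    clusters = [[i] for i in range(len(rules))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x, y in combinations(range(len(clusters)), 2):
            s = cluster_similarity(clusters[x], clusters[y], rules)
            if s > best:
                best, pair = s, (x, y)
        if best < min_similarity:
            break
        x, y = pair  # x < y, so deleting y leaves index x valid
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

def representative(cluster, rules):
    """Generalization-style representative: conditions shared by all members."""
    premises = [rules[i][0] for i in cluster]
    common = dict(premises[0])
    for p in premises[1:]:
        common = {k: v for k, v in common.items() if p.get(k) == v}
    return common

# Hypothetical toy knowledge base: (premise, conclusion) pairs.
example_rules = [
    ({"fever": "high", "cough": "yes"}, "flu"),
    ({"fever": "high", "cough": "yes", "age": "adult"}, "flu"),
    ({"rash": "yes"}, "allergy"),
]
example_clusters = cluster_rules(example_rules)
```

In this toy example the first two rules end up in one cluster and the third rule stays alone; under this sketch, such a singleton would be a candidate rare rule. A different similarity measure or merging criterion would likely change the outcome, which is the kind of sensitivity the experiments examine.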

The authors claim that clustering similar premises and generating representatives of the resulting clusters optimizes the search for rules to be activated in the inference process. An excessive number of rules in relation to given facts becomes a bottleneck in data-driven inference, and this phenomenon is investigated in this paper. At the same time, a generated cluster of rules with its representative supports a knowledge engineer in managing the knowledge recorded in a knowledge base. We believe that methods and tools for managing knowledge bases improve the effectiveness of every decision support system. Since so much depends on the quality of clusters, we wanted to know the influence of clustering parameters on the final clustering results. For this purpose, in the experiments we analyze the impact of clustering parameters and cluster representation methods on the effectiveness of the investigated knowledge bases. They have been explored with the use of four cluster representative designation methods, four inter-cluster methods and nine intra-cluster similarity measures. It turns out that each of the aforementioned factors substantially influences the size of the resulting clusters, the number of rare rules and the frequency of overgeneral or overspecific representatives of rule clusters.
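The claimed benefit for inference can be illustrated with a small sketch, under the assumption that each cluster representative consists of the conditions shared by all rules in the cluster. In that case a cluster can be skipped whenever the known facts do not satisfy its representative, because no member rule could fire. The function fired_conclusions and the toy data are hypothetical, and the paper's own search strategy may differ (for instance, it may pick the cluster whose representative is most similar to the facts).

```python
def holds(conditions, facts):
    """True when every attribute-value condition is confirmed by the facts."""
    return all(facts.get(k) == v for k, v in conditions.items())

def fired_conclusions(facts, clusters):
    """clusters: list of (representative, rules), where rules are
    (premise, conclusion) pairs and the representative is assumed to be
    contained in every member premise."""
    fired, rules_checked = [], 0
    for rep, rules in clusters:
        if not holds(rep, facts):
            continue  # whole cluster pruned with a single comparison
        for premise, conclusion in rules:
            rules_checked += 1
            if holds(premise, facts):
                fired.append(conclusion)
    return fired, rules_checked

# Hypothetical toy data.
toy_clusters = [
    ({"fever": "high", "cough": "yes"}, [
        ({"fever": "high", "cough": "yes"}, "flu"),
        ({"fever": "high", "cough": "yes", "age": "adult"}, "flu-adult"),
    ]),
    ({"rash": "yes"}, [
        ({"rash": "yes"}, "allergy"),
        ({"rash": "yes", "itch": "yes"}, "dermatitis"),
    ]),
]
toy_facts = {"fever": "high", "cough": "yes", "age": "child"}
toy_fired, toy_checked = fired_conclusions(toy_facts, toy_clusters)
```

Here only the two rules of the first cluster are examined instead of all four, which is the kind of saving the paragraph above refers to. With a specification-style or mixed representative, this pruning test would no longer be guaranteed to be safe.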

The structure of the paper is as follows: In Section 2, the notions of knowledge base and inference processes as the engine of every decision support system are presented. Additionally, Section 2 presents the format of a knowledge base as proposed in this paper. Section 3 contains the description of the proposed approach to rule clustering together with the suggested clustering algorithm and the methods of designating cluster representatives (presented as pseudocode). The most important inter-cluster methods and intra-cluster similarity measures are also presented in brief in this paper. In Section 4, rare rules detection methods are described along with the algorithm used for this purpose. This section also contains a description of the CluVis tool used for clustering, designation of representatives and visualization of the resulting structure of clusters of rules. The results of the experiments are presented in Section 5. This paper concludes with a summary of the results and a description of research to be conducted in the near future.

Rules have been a central form of knowledge representation since the earliest development of intelligent systems. In many past and current rule-based systems, domain experts manually organize rules by labeling them and assigning them to different groups based on their semantics. For example, the Cyc project has an enormous number of rules, often referred to as a sea of assertions, in its comprehensive knowledge base of common sense knowledge for general-purpose reasoning [14]. To organize such a knowledge base, domain experts divide these rules into smaller parts. Although this manual categorization of rules is semantically precise, it is a laborious task, and when the rule base is huge its speed cannot be ensured. There is a need for automated methods to manage rule bases when the number of rules becomes large and the relationships among rules are too complicated for developers and domain experts to comprehend [8]. Several efforts have been made to tackle the problem of summarizing and pruning a huge number of rules. A distinction should be made between rule-related work involving clustering and research on managing rules using other techniques. By other techniques we mean methods of generalization (shortening and joining) of rules or their filtering. Such methods require choosing a quality measure that controls the process of shortening or joining rules (for example, a maximum acceptable decrease of rule quality after shortening or joining), deciding whether the set of joined rules should be joined again later, and deciding whether to create a ranking for special rules before they are joined. This ranking determines the order of the joining process. Only decision rules which show sufficient similarity to the selected basic decision rule are taken into account [23]. In [15] the authors propose to shorten rules by removing elementary conditions using heuristic strategies (for example hill climbing) or exhaustive search.
Rules are shortened until the quality of the shortened rule drops below a certain fixed threshold. Research into the use of clustering for rule-based representations of domain knowledge has resulted in many interesting approaches. However, all of them relate to only two types of rules: fuzzy rules and association rules. These two types of rules are very specific and not always usable in practical applications (fuzzy rules need continuous input data, while association rules expect only nominal data). There are many cases in which production rules are needed. The advantage of representing knowledge with production rules lies in the fact that any type of attribute may be used in the rules, and such rules are usually short and easy to interpret for any kind of system user (end user, knowledge engineer or domain expert). There are many research results related to the clustering of fuzzy rules. The construction of a rule base from fuzzy clusters provides an initial approximation of the data which can be used as a basis for further improvements. An interesting approach is proposed in [22]. The authors propose a rule clustering algorithm which allows the automatic organization of the sets of fuzzy rules of one monolithic fuzzy system into a hierarchical structure with various sub-models. They believe that the readability of fuzzy models is related to their organizational structure and the corresponding rule base, and thus they use clustering to build the structure of the system. The objective of the fuzzy clustering partition is the separation of a set of fuzzy rules into a given number of clusters according to a similarity criterion, finding the optimal cluster centers and the partition matrix.
An approach based on fuzzy clustering, the Fuzzy Clustering of Fuzzy Rules Algorithm (FCFRA), which allows the automatic organization of the sets of fuzzy rules of one fuzzy system into a hierarchical prioritized structure, is presented in [21]. Fuzzy clustering seems to be a very appealing method for learning fuzzy rules since there is a close and canonical connection between fuzzy clusters and fuzzy rules. The idea of deriving fuzzy classification rules from data can be formulated as follows: the training data set is divided into homogeneous groups and a fuzzy rule is associated with each group. The proposed FCFRA algorithm has been successfully applied to the modeling of a nonlinear small scale Pilot Plant Reactor. In paper [30], the authors propose to use the D-AFC(c)-algorithm as a direct possibilistic clustering algorithm, based on the construction of an allotment among an a priori given number c of partially separate fuzzy clusters.

The second approach to clustering rules is the one related to the clustering of association rules. When mining association rules we may find hundreds or thousands of rules corresponding to specific attribute values. In [9] the authors propose a method for grouping and summarizing large sets of association rules according to the items contained in each rule. Hierarchical clustering is used to partition the initial rule set into thematically coherent subsets. This enables the summarization of the rule set by adequately choosing a representative rule for each subset, and helps in the interactive exploration of the rule model by the user. In [20] the authors propose a method to analyze links between binary attributes in a large sparse data set. Initially the variables are clustered to obtain homogeneous clusters of attributes. Association rules are then mined in each cluster. The study shows that the combined use of association rules and classification methods is more relevant, as this approach brings about an important decrease in the number of rules produced.

To the best of our knowledge, it is difficult to find research results on clustering production rules. Although such rules are very simple to build, they bring about problems related to creating coherent and well-separated groups when different attributes are used in their conditional and decisional parts. As they usually create cause and effect chains, it is quite difficult to partition them properly.

The approach proposed in this paper is similar to the ones recently published within the domain of decision making which involves a large number of experts, especially when building a consensus [11]. Both similarity measures and clustering are used to detect the most influential experts. The closeness of experts’ preferences is computed using a similarity function. When there is a multitude of experts, they can be divided into subgroups in such a way that experts placed in the same cluster are more similar to each other than to the ones assigned to different clusters. When using the agglomerative hierarchical clustering algorithm it is possible to find structurally equivalent experts and then, applying the centrality concept to determine a group (or network) leader, to drive advice in the feedback process. An interesting approach is presented in [32], where the authors propose a novel method to select the most influential rules in a fuzzy rule-based model. In such a model, especially when type-2 fuzzy rules are considered, using the back-propagation method will almost certainly suffer from rule redundancy. Therefore, it is necessary to select the most important fuzzy rules and remove the redundant ones from the generated rule base. According to the proposed idea, a rule significance index is assigned to each rule, then the rules are ranked and the influential rules are selected (based on the aforementioned index).

Section snippets

Knowledge base and rule representation

One of the most common and popular methods of domain knowledge representation is knowledge representation as rules i.e. if... then chains, for example: if a premise then a conclusion. Sometimes in the literature the notions of premise and conclusion are replaced with condition and decision [6]. When defining the way of inference with conditions that must be met to take a decision, domain experts convey their knowledge onto rules recorded in a knowledge base. The rules are activated when their

Clustering of rules

Too many rules in the knowledge base can negatively affect the effectiveness of rule management. One of the ways of managing the rules is to cluster them into groups and to describe the groups by their representatives. The notion of cluster analysis indicates that objects in the analyzed dimension are split into clusters which collect the objects most similar to one another and the resulting clusters are as different as possible [12]. It guarantees an optimum internal cohesion and external

Exploration of clusters

This section presents a description of the approach proposed by the authors. It clusters rules which show a similarity in their premise part and then creates representatives of the clusters. By exploration of rule-based knowledge bases, the authors mean both utilizing descriptive information generated from the analysis of the rule clusters structure obtained in the course of clustering with the AHC algorithm i.e. information on a biggest cluster and alternatively on smallest clusters of rare

Experiments

The goal of the experiments was the exploration of rule-based knowledge bases with the structure of clusters. In the course of the experiments, the authors have analyzed the influence of clustering parameters and representative designation methods on the effectiveness of the resulting rule clusters. The authors propose that rules should be grouped into clusters with common rule premises and representatives should be designated (using various methods) for these groups. The resulting structure is of

Conclusions

The subject of the analysis in this paper is cluster-structured rule-based knowledge bases. The authors propose clustering of similar rules as an exploration method for big knowledge bases with rule-based representation. A knowledge engineer has an insight into clusters of similar rules and their representatives and this constitutes a ready-for-use tool that helps manage knowledge effectively and possibly improve it in the future. It has to be emphasized that the presented idea of

References (33)

A.M. Jorgey, Hierarchical clustering for thematic browsing and summarization of large sets of association rules, Proceedings of the 2004 SIAM International Conference on Data Mining (2004)
M. Lichman, UCI Machine Learning Repository,...
R. Slowiński et al., Rough sets in decision making, Encyclopedia of Complexity and Systems Science (2009)
S. Still et al., How many clusters? An information-theoretic perspective, Neural Comput. (2004)
A. Al-Ajlan, The comparison between forward and backward chaining, Int. J. Mach. Learn. Comput. (2015)
J.G. Bazan et al., A new version of rough set exploration system, RSCTC (2002)
S. Boriah et al., Similarity measures for categorical data: a comparative evaluation, SIAM (2008)
A. Dudek, A comparison of the performance of clustering methods using spectral approach, Data Analysis Methods and Its Applications (2012)
J.S. Fu et al., ICA: an incremental clustering algorithm based on OPTICS, Wireless Pers. Commun. (2015)
J.W. Grzymala-Busse, Rule induction from rough approximations
J.W. Grzymala-Busse et al., Three discretization methods for rule induction, Int. J. Intell. Syst. (2001)
S. Hassanpour et al., Clustering rule bases using ontology-based similarity measures, Web Semantics (2014)
Y. Jung et al., A decision criterion for the optimal number of clusters in hierarchical clustering, J. Global Optim. (2003)
N.H. Kamis et al., Preference similarity network structural equivalence clustering based consensus group decision making model, Appl. Soft Comput. (2018)
L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis,...
C. Matuszek et al., An introduction to the syntax and content of Cyc, Proceedings of the 2006 AAAI Spring Symposium on Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering (2006)
