QualityCover: Efficient binary relation coverage guided by induced knowledge quality
Introduction
Formal Concept Analysis (FCA) is a mathematical tool for analyzing data and formally representing conceptual knowledge [16]. FCA forms conceptual structures from data; such structures consist of units that are formal abstractions of concepts of human thought and allow a meaningful and comprehensible interpretation [26]. A distinguishing feature of FCA is its inherent integration of components for the conceptual processing of data and knowledge [4]. Through this integration, FCA's mathematical setting has been shown to provide a powerful theoretical framework for the efficient resolution of many practical problems in data mining, software engineering and information retrieval, to cite but a few [14], [15], [18], [21].
Nevertheless, the overwhelming number of formal concepts that may be drawn from even a reasonably sized context [1] has been a hindrance to the wider adoption of FCA. An interesting way to tackle this issue is to find a coverage of a formal context by a minimal number of formal concepts. This problem instantiates the well-known set cover problem, which asks for the smallest sub-collection of sets that covers a given universe. Although finding an optimal solution to this problem is NP-hard, a greedy algorithm is widely used and typically finds solutions that are close to optimal [36].
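The greedy strategy for set cover mentioned above can be sketched as follows; the universe and subsets below are hypothetical toy data, not taken from the paper:

```python
def greedy_set_cover(universe, subsets):
    """Greedy approximation for set cover: repeatedly pick the subset
    that covers the most still-uncovered elements. This yields a
    logarithmic-factor approximation of the optimal cover."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        # Pick the subset with the largest overlap with the uncovered part.
        best = max(subsets, key=lambda s: len(uncovered & s))
        if not (uncovered & best):
            raise ValueError("universe cannot be covered by the given subsets")
        cover.append(best)
        uncovered -= best
    return cover

universe = {1, 2, 3, 4, 5}
subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
print(greedy_set_cover(universe, subsets))  # → [{1, 2, 3}, {4, 5}]
```

In the FCA setting, the universe is the set of crosses of the context and the candidate subsets are the rectangles (formal concepts); the gain function replaces the plain overlap count.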
In this respect, the dedicated literature offers a substantial number of works [10]. Belkhiter et al. [3] introduced a pertinent rectangular decomposition of a formal context, with an application to documentary databases. Their decomposition is based on the selection of "optimal" formal concepts, where optimality is assessed by maximizing a function that computes the storage space of a formal concept. Later, Khcherif et al. [22] introduced a rectangular decomposition approach based on Riguet's difunctional relation [32]. The computation of this difunctional is reduced to the localization of a set of key points, called isolated points, which have been shown to determine a minimal set of formal concepts covering a given formal context. Belohlavek and Vychodil [7] introduced the GreCond approach and, later, Belohlavek and Trnecka [6] the GreEss approach; both address the same issue by proposing new methods for decomposing a binary matrix into a Boolean product of factors. Notably, both approaches have a close connection with isolated points, as they shed light on "mandatory" formal concepts in the coverage.
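The correspondence between covering a context and decomposing its matrix into a Boolean product of factors can be illustrated on a toy context; the context and the two hand-picked formal concepts below are illustrative assumptions, not data from the paper:

```python
# Toy 3x3 binary context: rows = objects, columns = attributes.
# Entry (i, j) is 1 iff object i has attribute j.
I = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]

# Two formal concepts of this context, written as (extent, intent).
factors = [({0, 1}, {0, 1}), ({1, 2}, {1, 2})]

# Boolean product of the factors: entry (i, j) is 1 iff some factor
# has object i in its extent and attribute j in its intent.
n, m = len(I), len(I[0])
R = [[int(any(i in ext and j in itt for ext, itt in factors))
      for j in range(m)] for i in range(n)]

print(R == I)  # the two concepts exactly cover the context: True
```

A set of formal concepts covers the context exactly when this Boolean product reproduces the original matrix, which is why BMF algorithms such as GreCond and GreEss double as coverage extractors.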
In this paper, we introduce a new approach for the extraction of a pertinent coverage of a formal context. The driving idea is that the chosen formal concepts convey added value through the quality of the knowledge that may be drawn from them. Indeed, the intent part of a formal concept has been shown to play a key role in rule set construction. This rule set is at the origin of a variety of compact subsets of the implication/association rule sets of a context, known as generic bases [8].
Interestingly enough, as stressed in [17], only informative (generic) association rules can be derived from highly correlated patterns. This fact motivates our criterion for selecting the formal concepts kept in the coverage: the selection is based on assessing the correlation of their respective intent parts. By doing so, we aim to improve both the informativeness and the strength of the derived association rules.
It is important to note that, to the best of our knowledge, the introduced approach is the first to tackle this issue from a data mining point of view. Indeed, pioneering approaches in the literature paid attention only to minimizing, respectively, the storage of a documentary database and the number of factors within the Boolean factor analysis framework.
To show the benefits of our approach, extensive comparisons were carried out against pioneering approaches in the literature, with very encouraging results. The validation protocol relies heavily on the compactness of the coverage as well as on common quality metrics, e.g., coupling, cohesion, stability, separation and distance.
The remainder of the paper is organized as follows: The next section recalls the key notions used throughout this paper. Section 3 reviews related work. Then, we thoroughly describe, in Section 4, our algorithm for the extraction of a pertinent coverage from a binary relation, called QualityCover. Section 5 describes the experimental study and the results we obtained. Section 6 concludes the paper and identifies avenues for future work.
Section snippets
Key notions
In this section, we briefly sketch the key notions used in the remainder of this paper.
Related work
The costly computational complexity of extracting the whole set of formal concepts was a main impediment to the wide-scale use of FCA's battery of results on large datasets. To overcome this drawback, the issue of extracting a compact coverage of formal concepts attracted the interest of the research community. At a glance, the dedicated literature comprises two main streams for addressing this task: (i) gain-function-based approaches; and (ii) approaches based on the localization of key points. In the
Extracting a pertinent coverage from a binary relation
In this section, we introduce a new approach, based on a greedy algorithm, for the extraction of a pertinent coverage of a binary relation. The guiding idea of our approach is that the extraction process is mainly driven by the quality of the knowledge that may be drawn from each formal concept. In fact, our gain function is based on assessing the correlation of the intent parts of pertinent formal concepts.
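As an illustration of scoring an intent part by its correlation, the sketch below uses the bond measure (conjunctive support over disjunctive support), one standard correlation measure for attribute sets. The paper's actual gain function is not reproduced in this snippet, and the toy context is hypothetical:

```python
def bond(context, intent):
    """Bond correlation of an attribute set: the number of objects
    having ALL attributes of `intent` divided by the number of objects
    having AT LEAST ONE of them. `context` maps objects to attribute sets."""
    conj = sum(1 for attrs in context.values() if intent <= attrs)
    disj = sum(1 for attrs in context.values() if intent & attrs)
    return conj / disj if disj else 0.0

context = {
    "o1": {"a", "b"},
    "o2": {"a", "b", "c"},
    "o3": {"b", "c"},
}
print(bond(context, {"a", "b"}))  # 2 objects have both, 3 have at least one
```

A greedy coverage loop would rank the remaining candidate concepts by such a correlation score on their intents, rather than by raw covered area alone.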
In the following, we thoroughly describe a new algorithm, called QualityCover, for
Experimental results
In this section, we present results showing the efficiency of our proposed QualityCover algorithm. The solution was implemented and executed on a Core i7 PC with a 2.4 GHz CPU, 16 GB of RAM and an Ubuntu Linux distribution. The experiments mainly concern the compactness as well as the quality of the formal concepts composing the coverage. The quality is assessed through the coupling, cohesion, stability, separation and distance metrics. First, we led a series of experiments to assess the
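Among the quality metrics above, the stability of a formal concept admits a compact, if naive, definition-level sketch: the fraction of subsets of the extent whose set of common attributes equals the intent. The context and the concept below are hypothetical toy data:

```python
from itertools import combinations

def stability(context, extent, intent):
    """Intensional stability of a formal concept (extent, intent):
    the fraction of subsets of the extent whose common attributes
    equal the intent. Naive O(2^|extent|) enumeration sketch."""
    all_attrs = set().union(*context.values())  # derivation of the empty set
    objs = list(extent)
    hits = 0
    for r in range(len(objs) + 1):
        for subset in combinations(objs, r):
            common = all_attrs.copy()
            for o in subset:
                common &= context[o]
            if common == set(intent):
                hits += 1
    return hits / 2 ** len(objs)

context = {"o1": {"a", "b"}, "o2": {"a", "b", "c"}, "o3": {"b", "c"}}
print(stability(context, {"o1", "o2"}, {"a", "b"}))  # 2 of 4 subsets: 0.5
```

Efficient alternatives to this exponential enumeration exist, e.g., the DFSP algorithm cited in the references.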
Conclusion
In this article, we presented a new gain-function-based approach, built on a greedy algorithm called QualityCover, for the extraction of a pertinent coverage of a binary relation. The main thrust of the approach is that the obtained coverage relies on assessing a correlation measure over sets of items to select the formal concepts to be included in the coverage. Extensive experimental work showed that QualityCover obtains very encouraging results versus those obtained by pioneering approaches of
Acknowledgments
The authors are thankful to the anonymous reviewers and to Prof. Peter Eklund (Head of the PhD School, IT University of Copenhagen), who agreed to proofread this paper. We also thank the authors who agreed to provide us with the source code of their algorithms, namely Martin Trnecka for the GreEss algorithm, Vilem Vychodil for the GreCond algorithm and Fethi Ferjani for the GenCoverage algorithm.
References (41)
- et al., From-below approximations in Boolean matrix factorization: geometry and new algorithm, J. Comput. Syst. Sci. (2015)
- et al., Concept lattices reduction: definition, analysis and classification, Expert Syst. Appl. (2015)
- et al., Using minimal generators for composite isolated point extraction and conceptual binary relation coverage: application for extracting relevant textual features, Inf. Sci. (2016)
- et al., Formal context coverage based on isolated labels: an efficient solution for text feature extraction, Inf. Sci. (2012)
- et al., Looking for a structural characterization of the sparseness measure of (frequent closed) itemset contexts, Inf. Sci. (2013)
- et al., Using difunctional relations in information organization, Inf. Sci. (2000)
- et al., Concept learning via granular computing: a cognitive viewpoint, Inf. Sci. (2015)
- 100 years of psychology of concepts: the theoretical notion of concept and its operationalization, Stud. Hist. Phil. Biol. Biomed. Sci. (2007)
- et al., Selecting the right objective measure for association analysis, Inf. Syst. (2004)
- et al., Why concept lattices are large: extremal theory for the number of minimal generators and formal concepts, Ordre et Classification. Algèbre et Combinatoire
- Décomposition rectangulaire optimale d'une relation binaire: application aux bases de données documentaires, INFOR
- Basic level of concepts in formal concept analysis, Proceedings of the 10th International Conference on Formal Concept Analysis (ICFCA 2012), LNCS 7278, Leuven, Belgium
- Discovery of optimal factors in binary data via a novel method of matrix decomposition, J. Comput. Syst. Sci.
- A new generic basis of factual and implicative association rules, Intell. Data Anal.
- Theory of capacities, Ann. l'Inst. Fourier
- DFSP: a new algorithm for a swift computation of formal concept set stability, Proceedings of the 11th International Conference on Concept Lattices and Their Applications (CLA 2014)
- Effective and Efficient Correlation Analysis with Application to Market Basket Analysis and Network Community Detection, Ph.D. Thesis
- Concept similarity and related categories in information retrieval using formal concept analysis, Int. J. Gen. Syst.