1 Introduction

The continuous increase in the number of documents produced on the Web makes it more complex and costly to analyze, categorize and retrieve documents without considering the semantics of each document. One way to represent knowledge of documents is through ontologies.

An ontology, from the computer science perspective, is “an explicit specification of a conceptualization” [1].

Ontologies can be divided into four main categories, according to their generalization levels: generic ontologies, representation ontologies, domain ontologies, and application ontologies. Domain ontologies, or ontologies of restricted domain, specify the knowledge for a particular type of domain, for example: medical, tourism, finance, artificial intelligence, etc. An ontology typically includes the following components: classes, instances, attributes, relations, constraints, rules, events and axioms.

Ontologies are resources that make it possible to capture the explicit knowledge in data through concepts and relationships. In this paper we are interested in the process of discovering and evaluating ontological relations; thus, we focus our attention on the following two types: taxonomic relations and non-taxonomic relations. The first type of relation is normally referred to as an “is-a” (hypernymy/hyponymy or subsumption) or class-inclusion relation.

In this research work we present two variants of an approach for evaluating the concepts and semantic relations of three domain ontologies using Latent Semantic Analysis: the first based on cosine similarity, and the second based on clustering by committee.

The experiments carried out and the results obtained are discussed throughout the remainder of this paper, which is organized as follows: in Sect. 2 we present the related work; in Sect. 3 we present the concept of latent semantic analysis, whereas in Sect. 4 we describe the concept of clustering by committee, both employed in this research work. The proposed method is presented in Sect. 5. The experimental results are shown and discussed in Sect. 6. Finally, in Sect. 7 the conclusions of the work are given.

2 Related Work

Different approaches employing LSA for tasks related to ontologies can be found in the literature. For example, in [2] an automatic method for ontology construction is presented, which uses latent semantic analysis, clustering and WordNet over a collection of documents.

In [3], methods are shown for improving both the recall and the precision of automatic methods for extracting hyponymy (IS-A) relations from raw text. By applying latent semantic analysis (LSA) to filter the extracted hyponymy relations, the authors reduce the error rate of their initial pattern-based hyponymy extraction by 30%, achieving a precision of 58%. By applying a graph-based model of noun-noun similarity, learned automatically from coordination patterns, to previously extracted correct hyponymy relations, they achieve roughly a five-fold increase in the number of correct hyponymy relations extracted.

In [4], the authors describe an approach that extracts hypernym and meronym relations between proper nouns in sentences of a given text. Their approach is based on the analysis of the paths between noun pairs in the dependency parse trees of the sentences.

In [5], machine learning and statistical natural language processing techniques are used to construct a domain concept taxonomy. The authors employ different evaluation measures, such as precision, recall and F-measure. Their work focuses on the integration of knowledge acquisition with machine learning techniques for ontology creation.

We propose to evaluate semantic relationships that have evidence in the domain corpus by means of the latent semantic analysis method. For the evaluation, we use the accuracy measure.

3 LSA

Latent Semantic Analysis (LSA) is a computational model used in natural language processing, considered in its beginnings as a method for representing knowledge [6]. LSA is an unsupervised dimensionality reduction tool, like principal component analysis (PCA) [7]. The rationale behind this model is that words in the same semantic field tend to appear together or in similar contexts [8, 9].

LSA has its origin in an information retrieval technique called Latent Semantic Indexing (LSI), whose purpose is to reduce the size of a term-document matrix using a linear algebra technique called Singular Value Decomposition (SVD). The difference is that LSA uses a word-context matrix, where the context can be a word, a sentence, a paragraph, a document, a text, etc.
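
As an illustration (not from the original paper), the following minimal sketch builds a word-context matrix from a toy three-sentence corpus, reduces it with truncated SVD via scikit-learn, and compares two words in the reduced space:

```python
# Minimal LSA sketch: word-context matrix -> truncated SVD -> cosine similarity.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy contexts (here, sentences). Rows of X are contexts, columns are terms.
contexts = [
    "the ontology defines classes and relations",
    "a taxonomy organizes classes by subsumption",
    "relations connect classes in the ontology",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contexts)             # contexts x terms
word_context = X.T.astype(float)                   # terms x contexts

# Reduce to k latent dimensions (k = 2 for this toy corpus; the paper uses 300).
svd = TruncatedSVD(n_components=2, random_state=0)
word_vectors = svd.fit_transform(word_context)     # one k-dim vector per term

vocab = list(vectorizer.get_feature_names_out())
a, b = vocab.index("ontology"), vocab.index("relations")
print(cosine_similarity(word_vectors[[a]], word_vectors[[b]]))  # similarity in LSA space
```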

Venegas [6] considers that LSA is characterized by being a mathematical-statistical technique that allows the creation of multidimensional vectors for the semantic analysis of the relationships existing among different contexts.

The purpose of dimensionality reduction in LSA is to eliminate the noise present in the relationships between terms and contexts, since the same concept can usually be expressed with different terms.

LSA does not consider the linguistic structure of contexts, but rather the frequency and co-occurrence of terms. Even so, it has been possible in some cases to identify semantic relationships such as synonymy using LSA [8].

This technique is based on the principle that words in the same context tend to have semantic relationships; consequently, the indexing of documents with similar contexts should include the words that appear in those similar contexts, even if a given document does not contain those words.

4 Clustering By Committee

The Clustering By Committee (CBC) algorithm allows the automatic discovery of concepts from text [10, 11]. It first discovers a set of tight clusters, called committees, that are well scattered in the similarity space. The feature vector that represents a cluster is the centroid of the members of its committee, and the clustering method then proceeds to assign elements to their most similar clusters.

The CBC algorithm consists of three phases (a simplified sketch of the assignment phase is given after the list):

  1. To find the most similar elements. In order to compute the most similar words of a word w, the features of w are first ranked according to their mutual information with w.

  2. To discover the committees. Each committee discovered in this phase defines one of the final clusters in the output of the algorithm.

  3. To assign elements to the clusters. Each element is assigned to the cluster containing its most similar committee.
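
The sketch below illustrates only this third phase, under toy assumptions (two hand-made committees and two-dimensional vectors); committee discovery itself, phases 1 and 2, is considerably more involved [10, 11]:

```python
# Simplified sketch of CBC phase 3: each committee is represented by the
# centroid of its members; every element joins its most similar committee.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_to_committees(elements, committees):
    """elements: {name: vector}; committees: list of lists of member vectors."""
    centroids = [np.mean(members, axis=0) for members in committees]
    return {name: int(np.argmax([cosine(vec, c) for c in centroids]))
            for name, vec in elements.items()}

# Toy 2-d vectors standing in for the feature vectors of words.
committees = [
    [np.array([1.0, 0.1]), np.array([0.9, 0.2])],  # committee 0
    [np.array([0.1, 1.0]), np.array([0.2, 0.8])],  # committee 1
]
elements = {"ontology": np.array([0.95, 0.15]), "taxonomy": np.array([0.1, 0.9])}
print(assign_to_committees(elements, committees))  # {'ontology': 0, 'taxonomy': 1}
```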

CBC has also been used to find the meanings of a word w [12] (the algorithm in its flexible version) and for clustering texts (the algorithm in its strong version) [13]. Other authors, such as Chatterjee and Mohan [14], have successfully used the flexible version of this algorithm for the discovery of word meanings, applying Random Indexing to reduce the dimensionality of the context matrix.

5 The Proposed Approach

The proposed approach uses the latent semantic analysis method to identify the semantic relationships between the concepts existing in the ontology, looking for evidence in the domain corpus for their subsequent evaluation.

LSA assumes that words in the same semantic field tend to appear together or in similar contexts; therefore, we consider that concepts that are semantically related can occur in the same sentence, or in different sentences that share common information.

Based on this assumption, we present the following algorithm, which takes into account two variants, (a) cosine similarity and (b) Clustering By Committee (CBC), that assign a weight w to each evaluated relation of the domain ontology. The algorithm performs the following steps:

  1. Pre-processing of the domain corpus and domain ontologies. The domain corpus is divided into sentences and the stop words (such as prepositions, articles, etc.) are removed. The Porter stemming algorithm is applied to the words contained in these sentences [15]. The concepts are also extracted from the ontology, and the same process (stop-word elimination and Porter stemming) is applied to each of them in order to keep the terminology representation consistent.

  2. Application of the LSA algorithm to reduce the dimensionality of the context matrix. In this case, we use the S-Space package and its LSA implementation. The algorithm receives as parameters the sentences of the domain corpus and the number of dimensions K (we use 300 dimensions). The output of the LSA algorithm is a semantic vector of dimension K for each word identified by LSA in the corpus. (A sketch condensing steps 1 and 2 is given after this list.)

  3. Construction of concepts. The words obtained by the LSA method are clustered using cosine similarity to form the concepts of the ontology.

  4. Dimensionality reduction of the vocabulary (vectors) in the LSA matrix. Only the concepts obtained in the previous step are kept; the rest of the words of the original matrix are removed.

  5. Application of variants. At this point two variants are used: cosine similarity for each relation, and the CBC algorithm to cluster concepts. A sketch of the cosine variant is given after this list.

     • Cosine similarity

       (a) Calculation of cosine similarity. The concepts obtained in the previous step are used to determine the degree of similarity between each pair of concepts that form the class-inclusion and non-taxonomic relations.

       (b) Calculation of the threshold u and the weight w assigned to the relation. The threshold u is calculated as the sum of the similarities divided by the total number of relations, divided by 2 (i.e., half the average similarity). If the degree of similarity of a relation is greater than the threshold u, the relation takes the weight \(w=1\); otherwise, \(w=0\).

     • CBC algorithm

       (a) Application of the CBC algorithm in its flexible version. The concepts formed by similarity in the previous step are the input to the CBC algorithm. The output of the algorithm is the set of clustered concepts.

       (b) Identification of the concepts that form the relation in the clusters generated by CBC. If the pair of concepts that form the relation (class-inclusion or non-taxonomic) appear in the same cluster, the relation takes the weight \(w=1\); otherwise, it receives the weight \(w=0\).

  6. Ontology evaluation. We use the accuracy metric to evaluate the concepts and semantic relations obtained with our approach for each input domain ontology.
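
To make steps 1 and 2 concrete, here is a hedged Python sketch. The authors use the Porter stemmer and the S-Space package (a Java library), so NLTK and scikit-learn appear here only as stand-ins for illustration, with K = 300 as in the paper:

```python
# Sketch of steps 1-2: split the corpus into sentences, remove stop words,
# apply Porter stemming, then obtain a K-dimensional vector per word via LSA.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(corpus_text):
    """Step 1: sentences -> lowercase tokens -> stop-word removal -> stems."""
    sentences = nltk.sent_tokenize(corpus_text)
    return [" ".join(stemmer.stem(t) for t in nltk.word_tokenize(s.lower())
                     if t.isalpha() and t not in stop_words)
            for s in sentences]

def lsa_word_vectors(sentences, k=300):
    """Step 2: word-context matrix + truncated SVD -> {word: K-dim vector}."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)            # sentences x terms
    k = min(k, X.shape[0] - 1)                         # SVD needs k < #contexts
    svd = TruncatedSVD(n_components=k, random_state=0)
    vectors = svd.fit_transform(X.T.astype(float))     # one vector per term
    return dict(zip(vectorizer.get_feature_names_out(), vectors))
```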
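
Below is a companion sketch of the cosine variant of step 5 together with the accuracy computation of step 6. The threshold formula encodes our reading of the paper (half the average similarity over all relations), and the names `weight_relations` and `accuracy` are illustrative, not the authors' code:

```python
# Sketch of step 5 (cosine variant) and step 6: score each ontology relation by
# the cosine similarity of its concept vectors, derive the threshold u as half
# the average similarity (our reading of the paper), and assign weights w.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def weight_relations(relations, vectors):
    """relations: list of (concept_a, concept_b); vectors: {concept: ndarray}."""
    sims = {r: cosine(vectors[r[0]], vectors[r[1]]) for r in relations}
    u = sum(sims.values()) / len(sims) / 2         # threshold: half the mean
    return {r: 1 if s > u else 0 for r, s in sims.items()}

def accuracy(weights):
    """Fraction of the ontology's relations found (w = 1) by the approach."""
    return sum(weights.values()) / len(weights)
```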

In the next section, we present the results obtained with this approach.

6 Experimental Results

Below, we present the dataset and the results obtained with the aforementioned approach.

6.1 Dataset

The domains used in the experiments are Artificial Intelligence (AI), the e-learning standard SCORM [16] and the OIL taxonomy.

In Table 1 we present the number of concepts (C), class-inclusion relations (S) and non-taxonomic relations (R) of each ontology evaluated. The characteristics of its reference corpus are also given in the same table: number of documents (D), number of tokens (T), vocabulary dimensionality (V) and number of sentences (O).

Table 1. Datasets

6.2 Results

The number of vectors, or words, retrieved by the LSA algorithm from each domain corpus is shown in Table 2. After discovering concepts by employing cosine similarity, the approach reduces the matrix to the total number of concepts of the domain ontology. For example, from the 1,659 words obtained by LSA for the AI ontology, the matrix is reduced to the 276 concepts included in the ontology (see Table 1).

Table 2. Vocabulary obtained by the LSA algorithm for each domain

The LSA method with cosine similarity obtained favorable results for the three domain ontologies evaluated, finding more than 70% of the class-inclusion relations (see Table 3). The CBC method obtained its best results on the OIL ontology, with 54% accuracy.

Table 3. Experimental results of the LSA approach to class-inclusion relation in each domain ontology

In Table 4 we show the total number of concepts that participate in a class-inclusion relation (CO) in each domain ontology, and the number of these obtained by the LSA approach (Enc) for this type of relation.

The accuracy of the concepts found by the LSA method is greater than 79% with the cosine similarity variant (see Table 4). However, the CBC variant does not report satisfactory results for the first two ontologies. In the case of the OIL ontology it behaved better, achieving 62% accuracy, but without exceeding the result of the cosine variant (79%). The CBC method does not cluster all the concepts, so it was expected that most of the relations would not be found.

Table 4. Experimental results of concepts that maintain only a class-inclusion relation using the LSA approach for each domain ontology

In the case of non-taxonomic relations, the results obtained by the approach are presented in Table 5. Again, the cosine variant obtains better results (78% accuracy) than the CBC variant for this type of relation. Since the CBC variant failed to cluster all the concepts (see Table 6), the approach does not achieve satisfactory accuracy for these relations. A first approximation of this approach was presented in [17], reporting only the concepts found by LSA.

Table 5. Experimental results of the LSA approach to non-taxonomic relations in each domain ontology.

In the case of concepts, the cosine variant obtains 85% accuracy in comparison with that obtained with the CBC variant (see Table 6).

Table 6. Experimental results of concepts that keep a non-taxonomic relation using the LSA approach for each domain ontology

7 Conclusions

The LSA method has been widely used in the state of the art to represent semantics at the context level, and with the proposed approach it was possible to obtain more than 70% of the semantic relations of each domain ontology.

The LSA approach obtained satisfactory results when only the cosine similarity variant was considered. However, when the CBC variant was employed, it was not possible to find all the ontology relations among the clustered concepts (approximately only 10% of the total concepts were clustered).

The CBC method is very costly at runtime and did not produce satisfactory results. We consider that this is because each domain ontology does not provide enough information for this variant to process.

The LSA-based approach requires a robust corpus (in terms of domain and size) with a large vocabulary, since this allows more terms to be clustered. However, the accuracy offered is acceptable for one of the variants presented.

As future work, we plan to increase the number of documents processed by the approach, as well as to review other alternatives for concept clustering in the evaluation of domain ontologies.