Open Access. Published by De Gruyter, February 29, 2016, under the CC BY-NC-ND 3.0 license.

CBER: An Effective Classification Approach Based on Enrichment Representation for Short Text Documents

  • Eman Ismail and Walaa Gad

Abstract

In this paper, we propose a novel approach called Classification Based on Enrichment Representation (CBER) of short text documents. The proposed approach extracts concepts occurring in short text documents and uses them to calculate the weight of the synonyms of each concept. Concepts with the same meanings will increase the weights of their synonyms. However, the text document is short and concepts are rarely repeated; therefore, we capture the semantic relationships among concepts and solve the disambiguation problem. The experimental results show that the proposed CBER is valuable in annotating short text documents to their best labels (classes). We used precision and recall measures to evaluate the proposed approach. CBER performance reached 93% and 94% in precision and recall, respectively.

MSC 2010: 62H30; 68Q55

1 Introduction

Short text document (STD) annotation plays an important role in organizing large amounts of information into a small number of meaningful classes. Annotation of STDs becomes a challenge in many applications such as short message service, online chat, social networks comments, tweets, and snippets.

STDs do not provide sufficient word occurrences, and words are rarely repeated. The traditional methods of classifying such documents are based on a Bag of Words (BOW) [15], which indexes text documents as independent features. Each feature is a single term or word in a document. A document is represented as a vector in feature space, where each entry is a word weight equal to the number of occurrences of that word in the document. Classification based on BOW has many drawbacks. STDs do not provide enough co-occurrence of words or shared context, so the BOW representation of such documents is extremely sparse, with mostly empty weights. This data sparsity leads to low classification accuracy because of the lack of information. The BOW approach treats synonymous words as different features and does not represent the relations between words and documents. Therefore, it fails to solve the disambiguation problem among words (terms).

Therefore, semantic knowledge is introduced as background [6] to increase classification accuracy. Wikipedia [7, 9, 13] and WordNet [3] are two main types of semantic knowledge involved in document classification. Semantic knowledge approaches represent text documents as a bag of concepts (BOC). They treat terms as concepts with semantic weights that depend on the relationships among them. Wikipedia is a large repository on the Internet containing more than 4 million articles at the time of writing. Each page (Wikipage) in Wikipedia describes a single topic. The page title describes a concept in the hierarchically built Wikipedia semantic network. In Ref. [3], the authors used the Wikipedia structure to represent documents as BOC for the classification process. Using WordNet [12, 18], the BOW is enriched with new features representing the topics of the text. The classification performance of both methods is significantly better than that of BOW. However, data enrichment can also introduce noise.

We propose a novel approach, Classification Based on Enrichment Representation (CBER), for classifying documents using WordNet as a semantic background. CBER exploits the hierarchical structure and relations of the WordNet ontology to assign new weights to terms (concepts). The new weights depend on an accurate assessment of the semantic similarities among terms. Moreover, the proposed approach enriches the STDs with semantic weights to solve disambiguation problems such as polysemy and synonymy. We propose two approaches. The first is the Semantic Analysis Based on WordNet model (SAWN), and the second, SAWNWVTF, is a hybrid of SAWN and the traditional document representation. The word vector term frequency (WVTF) is a BOW representation of text documents. SAWN chooses the most suitable synonym for document terms by studying and understanding the surrounding terms in the same document. This is done without increasing the number of document features, as is done in Ref. [18]. SAWNWVTF is a hybrid approach that discovers the hidden information in STDs by identifying the important words.

We applied CBER on short text “web snippets.” These types of text documents are noisy, and terms are always rare. Snippets do not share enough words to overlap well. They contain few words and do not provide enough co-occurrence of terms. The CBER performance is compared to other approaches [3, 11, 18] that use WordNet as semantic knowledge to represent documents. The obtained results are very promising. The CBER performance reaches 93% and 94% in precision and recall, respectively.

The remainder of the paper is organized as follows. Previous work is reviewed in Section 2. The proposed approach, CBER, is described in Section 3. In Section 4, we present the experimental results and the evaluation process. Section 5 concludes the paper.

2 Literature Overview

In recent years, the classification of STDs has been a research focus. Two main types of methods have been proposed: enrichment-based and reduction-based classification. STDs do not have enough co-occurring terms or shared context for classification. Enrichment methods [1, 3, 11, 12, 14, 18, 22] enrich the short text with additional semantic information to increase the number of document terms. In Ref. [11], the enrichment method is based on the BOW representation of text documents. It generates new words derived from an external knowledge base, such as Wikipedia. Wikipedia is crawled to extract different topics associated with document keywords (terms). The newly extracted topics are added to the documents as new semantic features to enrich STDs with new information.

In addition, document enrichment may be done by topic analysis using Latent Dirichlet Allocation (LDA) [3, 5, 12, 18], which uses probabilistic models to perform latent semantic analysis that accounts for synonymy and polysemy. LDA uncovers the hidden topics of an STD and enriches the traditional BOW text document representation with these topics.

In Ref. [18], the Wikipedia knowledge base is used to apply semantic analysis and extract all topics covered by a document. TAGME, a topical annotator, is used to identify different spots from Wikipedia to annotate the text document. Moreover, they use latent topics derived from LSA (Latent Semantic Analysis) or LDA. As in Ref. [3], they annotate all training data with subtopics. They detect the topics occurring in the input texts by using a recent set of information retrieval (IR) tools, called topic annotators [11]. These tools are efficient and accurate in identifying meaningful sequences of terms in a text and linking them to pertinent Wikipedia pages representing their underlying topics. Then, a ranking function selects the highest-ranked topics to represent the documents.

In Ref. [21], the authors map the document terms to topics with different weights using LDA. Each document is represented by topic features rather than term features. In Ref. [1], the authors used LDA to extract document topics; then, semantic relationships are built between the extracted topics of a document and its words.

Moreover, reduction approaches have been proposed to solve the problems of short text classification [9, 14]. These approaches reduce the document features and exchange them for new terms. The new features are selected using WordNet for better classification accuracy. Soucy and Mineau [15] follow a similar approach by extracting some terms as features: terms whose weights are greater than a specific threshold are selected based on the weighting function in Ref. [15].

In Ref. [16], the authors reduce document features by selecting a small set of words to represent a document and its topics. They use the BOW representation and term frequency tf or term frequency-inverse document frequency tf-idf [17] to extract a few words to be used as query words. The words are extracted according to a clarity function that scores words sharing specific topics.

The previous methods have many drawbacks:

  • In enrichment methods [4, 7, 10, 13], new features or words are added to the text, which increases the dimensionality of the document representation and the classification time.

  • In reduction methods [9, 16], documents are represented only by their topics using Wikipedia or WordNet. These methods focus on words that are related to text topics and neglect others.

3 CBER

We propose the CBER model. Figure 1 shows the main modules of CBER. The proposed approach enriches the short text with auxiliary information provided by WordNet. WordNet is a lexical database for the English language [16]. It groups English words into sets of synonyms called synsets, and records relations among these synonym sets or their members. We use the word document to refer to STDs. The proposed CBER consists of

  • Document preprocessing;

  • Document representation using WVTF;

  • Document enrichment using the SAWN approach;

  • Hybrid approach SAWNWVTF using both WVTF and SAWN approaches.

Figure 1: CBER Approach.

3.1 Document Preprocessing

Each text document is introduced to CBER as a line of terms or words. CBER tokenizes each document into terms, removes stop words, and applies stemming using the Porter stemmer algorithm [10]. We apply the stemmer only to words that are not defined in WordNet. Moreover, a pruning step is performed to eliminate rare words. Rare words are terms that appear in only a few documents and are unlikely to be helpful in annotation; such unwanted rare words also increase the size of the BOW. Therefore, we set a pruning threshold to reduce the number of features and discard words whose number of occurrences in the data set falls below the predefined threshold.
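The following is a minimal sketch of how this preprocessing step could be implemented with NLTK; it is not the authors' code, and the pruning threshold value is an illustrative assumption.

```python
# Requires NLTK corpora: nltk.download('punkt'), nltk.download('stopwords'),
# nltk.download('wordnet'). Stemming is applied only to words not defined in WordNet.
from collections import Counter
from nltk.corpus import stopwords, wordnet as wn
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(document):
    """Tokenize, lowercase, drop stop words, and stem terms unknown to WordNet."""
    terms = [w.lower() for w in word_tokenize(document) if w.isalpha()]
    terms = [t for t in terms if t not in STOP]
    return [t if wn.synsets(t) else STEMMER.stem(t) for t in terms]

def prune(documents, threshold=2):
    """Drop rare terms occurring fewer than `threshold` times in the data set."""
    counts = Counter(t for doc in documents for t in doc)
    return [[t for t in doc if counts[t] >= threshold] for doc in documents]
```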

3.2 WVTF

STDs are represented as vectors. Each vector is a set of keywords (terms). Each word has a weight, term frequency tf, which is the number of occurrences of this word in a document.

We apply the term frequency-inverse document frequency, tf-idf [15], weighting function to improve the classification performance. The term weight, tw, is defined as

(1) $tw = \log(tf_{t,d} + 1) \cdot \log\dfrac{n}{D_t},$

where $tf_{t,d}$ is the number of occurrences of term $t$ in document $d$, $n$ is the total number of documents, and $D_t$ is the number of documents that contain term $t$. After that, we normalize each document vector using the L2 norm, also known as the Euclidean norm:

(2) $L2_{norm}(t_i) = \dfrac{tw_i}{\sqrt{tw_0^2 + tw_1^2 + \dots + tw_m^2}},$

where $t_i$ is term $i$ in document $d$, $tw_i$ is the weight of term $i$, and $m$ is the number of terms in document $d$. The norm function helps in computing similarities between documents [6]. We use the cosine function [13] to calculate the semantic similarities among documents.
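A minimal sketch of the WVTF weighting of Eqs. (1)-(2) and of the cosine similarity is shown below, assuming each document is already a list of preprocessed terms; the helper names (wvtf_vector, cosine) are illustrative, not from the paper.

```python
import math
from collections import Counter

def wvtf_vector(doc, documents):
    """Return the L2-normalized tf-idf weights of the terms in `doc`."""
    n = len(documents)
    weights = {}
    for t, f in Counter(doc).items():
        d_t = sum(1 for d in documents if t in d)         # documents containing t
        weights[t] = math.log(f + 1) * math.log(n / d_t)  # Eq. (1)
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}      # Eq. (2)

def cosine(v1, v2):
    """Cosine similarity between two L2-normalized sparse vectors."""
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())
```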

3.3 SAWN

The proposed Semantic Analysis Based on WordNet approach, SAWN, uses WordNet to capture the semantic meaning of documents. SAWN chooses the concepts that best represent a document semantically. WordNet is a database of English words that are linked together by their semantic relationships. It is organized as a hierarchical graph, like a dictionary or thesaurus. It contains 155,327 terms grouped into synsets, for a total of 207,016 term-sense pairs. It groups nouns, verbs, adjectives, and adverbs into sets of synonyms: words sharing the same concept are grouped into synsets. Each synset has a brief definition, called a gloss. WordNet supports different relations such as hypernymy, hyponymy, and the is-a relation.

SAWN enriches STDs with semantic information to understand document meaning and overcome disambiguation problems [8]. The proposed SAWN captures the most meaningful sense of a document's terms by studying and understanding the surrounding terms in the same document.

Many semantic similarity measures [2, 20] are used to calculate the relatedness among terms. Relatedness can be based on gloss overlaps [19]; that is, if the glosses (definitions) of two concepts share words, then the concepts are related, and the more words the two glosses share, the more related the concepts are. We adopt the similarity measure of Wu and Palmer [2, 7] to calculate the relatedness between two senses and solve the disambiguation problem.

Each document $d_j$, $\forall\, 1 \le j \le n$, where $n$ is the number of documents in the data set $D$, is represented as a vector of terms $t_i$, $\forall\, 1 \le i \le m$, where $m$ is the number of terms in document $d_j$. Document $d_j$ is thus defined as a vector of terms: $d_j = \langle t_1, t_2, \dots, t_m \rangle$.

The proposed SAWN searches for the best meaning of a term $t_i$. Each term has many senses, and each sense is given a score. The sense with the highest score is chosen as the best meaning of term $t_i$. For example, the term “dog” has many senses:

  • <Synset(’dog’), Synset(’frump’), Synset(’cad’), Synset(’frank’), Synset(’pawl’), Synset(’andiron’), Synset(’chase’)>

SAWN captures the best synset (sense) of the term “dog” from its context, i.e. from the relatedness between its senses and the senses of the other terms in the document. $SAWN(t_i)$ is defined as

(3) $SAWN(t_i) = \sum_{t_n \ne t_i} \max_m \big( SimDist(S(t_i)_j, S(t_n)_m) \big),$

where $S(t_i)_j$ is sense $j$ of term $t_i$, $SimDist$ is the similarity distance between two term senses, and $\max_m$ selects the highest similarity score over the senses of $t_n$.

SAWN calculates semantic similarities by considering the depths of the two senses, along with the depth of their lowest common ancestor (LCA). A score is given to represent the semantic distance, $0 < score \le 1$. The score cannot be zero because the depth of the LCA is never zero, and it is 1 if the two input senses are the same.

(4) $SimDist(S(t_1), S(t_2)) = \dfrac{2 \cdot Depth(LCA(S(t_1), S(t_2)))}{Depth(S(t_1)) + Depth(S(t_2))}.$

Equation (4) returns a score indicating how similar two senses are, based on the depth of the two senses in the taxonomy and their LCA.
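The sense-selection idea of Eqs. (3)-(4) can be sketched with NLTK's WordNet interface, whose Synset.wup_similarity method implements the Wu-Palmer measure; the function names below (sawn_score, best_sense) are illustrative and the sketch is not the authors' implementation.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def sawn_score(sense, other_terms):
    """Sum, over the other document terms, of the best Wu-Palmer similarity
    between `sense` and any sense of that term (cf. Eq. (3))."""
    total = 0.0
    for term in other_terms:
        sims = [sense.wup_similarity(s) for s in wn.synsets(term)]
        sims = [x for x in sims if x is not None]  # incomparable sense pairs return None
        if sims:
            total += max(sims)
    return total

def best_sense(term, document_terms):
    """Choose the sense of `term` with the highest SAWN score."""
    senses = wn.synsets(term)
    if not senses:
        return None  # term not defined in WordNet
    others = [t for t in document_terms if t != term]
    return max(senses, key=lambda s: sawn_score(s, others))

# Example: in a pet-related snippet, the animal sense of "dog" should win.
print(best_sense("dog", ["dog", "cat", "pet", "puppy"]))
```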

3.4 Hybrid SAWNWVTF Approach

SAWN enhances classification performance if all document terms are defined in WordNet. However, some document terms may not be defined in WordNet; they may be abbreviations or misspelled terms. Such undefined words have no senses, so term frequency should still be considered in the weighting scores. The hybrid SAWNWVTF is proposed to sum the term frequency weight from WVTF and the semantic weight from SAWN. SAWNWVTF calculates the new semantic weight for term $t_i$ as follows:

(5) $SAWNWVTF(t_i) = tw_i + SAWN(t_i),$

where $tw_i$ is the term weight computed by the WVTF approach. Combining the two approaches solves the limitations of previous work by adapting the Lesk dictionary algorithm [16]. Experiments show that CBER, with its two approaches, overcomes the limitations found in other works [9, 18, 21].
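A minimal sketch of the hybrid weighting in Eq. (5) follows. It assumes `wvtf_weights` maps terms to their WVTF (tf-idf) weights and `sawn` is a callable returning the SAWN score of a term in its document context (e.g. built from the previous sketches); both are illustrative assumptions, and terms without WordNet senses fall back to the WVTF weight alone.

```python
def sawnwvtf_weight(term, document_terms, wvtf_weights, sawn):
    tw = wvtf_weights.get(term, 0.0)       # frequency-based weight (WVTF)
    semantic = sawn(term, document_terms)  # 0.0 when the term is not in WordNet
    return tw + semantic                   # Eq. (5)
```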

4 Experimental Results

We evaluate the proposed CBER with its two approaches, SAWN and the hybrid SAWNWVTF, over a snippets data set. We use a snippets data set because it is an example of STDs that is used in Refs. [11, 18].

The snippets data set was created by Phan et al. [11]. It is composed of 12 K snippets drawn from Google. The data set is labeled with eight classes: business, computers, culture-arts, education-science, engineering, health, politics, and sports. Figure 2 shows the eight classes of the snippets data set.

Figure 2: Snippets Data Set Classes.

A naive Bayes classifier [6] is used to assess and evaluate the proposed model. It is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features. The naive Bayes classifier is highly scalable and yields good results on this task. We perform cross-fold validation on the snippets data set to validate the classification process: the data set is partitioned into a training set and a testing set.

Cross-validation measures the classification accuracy over the training data set, with the number of folds varying from 2 to 10. For each fold setting, we measure the relative absolute error of the WVTF approach and of CBER. The comparison of the relative absolute error is shown in Figure 3. We stop cross-validation at 10 folds as the error decreases. In our experiments, the average error rate drops from 24.94% with the WVTF approach to 15.0% with the proposed CBER, an improvement of roughly 10 percentage points in error rate.
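A minimal sketch of this evaluation setup is given below, assuming `snippets` (a list of preprocessed text strings) and `labels` (their classes) have already been loaded from the data set; scikit-learn's MultinomialNB and cross_val_score stand in for the naive Bayes classifier and the varying-fold cross-validation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def evaluate(snippets, labels):
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    # Vary the number of folds from 2 to 10 and report mean accuracy for each.
    for k in range(2, 11):
        scores = cross_val_score(model, snippets, labels, cv=k, scoring="accuracy")
        print(f"{k}-fold accuracy: {scores.mean():.3f}")
```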

Figure 3: Cross-folding Validation on the Training Data Set.

Four measures are used to evaluate the classification performance of the proposed model. They are Precision, Recall, F-measure, and Accuracy. The performance measures are defined as

(6) $Precision = \dfrac{T_p}{T_p + F_p},$
(7) $Recall = \dfrac{T_p}{T_p + F_n},$
(8) $F\text{-}Measure = \dfrac{2 \cdot Precision \cdot Recall}{Precision + Recall},$
(9) $Accuracy = \dfrac{T_p + T_n}{T_p + T_n + F_p + F_n},$

where $T_p$, true positives, is the number of documents correctly assigned to their classes; $F_p$, false positives, is the number of documents incorrectly assigned to a class; $F_n$, false negatives, is the number of documents incorrectly rejected from their classes; and $T_n$, true negatives, is the number of documents correctly rejected from a class.
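For concreteness, the four measures can be computed directly from these counts; the sketch below simply restates Eqs. (6)-(9), assuming the counts have already been tallied from the classifier's predictions.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)
```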

In Figure 4, we run the classifier on different data set sizes, from 1 K to 10 K. We compare the proposed SAWN and SAWNWVTF with WVTF as a baseline. The STD data set shares few common words and is very sparse; therefore, WVTF fails to label the documents with their correct classes [11]. CBER achieves higher classification accuracy, increasing it from 65.75% to 93%. CBER works efficiently with sparse and noisy data sets because it is not based only on frequency weights for terms; it also adds a semantic weight that captures the semantic relatedness of terms in context.

Figure 4: Accuracy of the CBER Approach in Comparison with the Baseline Approach as in Ref. [11].

Figures 5–7 show a detailed comparison of SAWN and the hybrid SAWNWVTF against WVTF. Figure 8 compares SAWN and the hybrid SAWNWVTF with the classifiers found in Refs. [11, 18], where a topical classifier that extracts topics related to documents was proposed.

Figure 5: Evaluation of the CBER Approach in Terms of Precision.

Figure 6: Evaluation of the CBER Approach in Terms of Recall.

Figure 7: Evaluation of the CBER Approach in Terms of F-Measure.

Figure 8: Results of the CBER Approach.

The topical classifier [11] is based on enrichment representation. It enriches documents with document topics using the TAGME tool [18]. The topical annotator [18] is connected to Wikipedia and classifies the extracted topics. Its results are weak because of data sparseness, as parts of the original text are lost, and they are close to those of Phan et al. [11], who built a framework for classifying STDs. That framework obtains document topics using LDA and chooses a large number of words with special characteristics to cover all documents. Then, it uses the topics to classify documents and annotate them.

As shown in Figure 8, we compare the accuracies of the proposed SAWN and SAWNWVTF at different data set sizes with those of Phan et al. [11], the Topical classifier, MaxEnt, and SVM. SAWN increases the accuracy from 81% to 89% compared with Phan et al. SAWNWVTF reaches an accuracy of 93%, compared with the Topical classifier, which achieves a maximum of 81%, while SVM and MaxEnt reach accuracies of 74.93% and 65.75%, respectively.

The results confirm that the proposed CBER outperforms the other approaches in terms of accuracy, precision, recall, and F-measure. The main contribution of this work is to assign a weighting score to document terms that are related in meaning. We adopt Wu and Palmer's method to measure the semantic relatedness among the senses of terms. If terms have no senses because they are not defined in the dictionary, only the traditional weighting score WVTF is considered. Therefore, CBER gives its best results when text documents are written in formal English; if they are not written well, CBER gives the same results as traditional methods such as WVTF.

5 Conclusion

In this paper, we proposed a novel, scalable, and efficient approach for classifying STDs. The proposed CBER focuses on STD enrichment with hidden information. We focused on capturing the semantic context of the STD concepts.

The main contribution of this paper is employing WordNet to solve disambiguation problems in short text classification. CBER consists of two approaches: SAWN and SAWNWVTF. The proposed CBER captures the semantic context of documents by analyzing documents and giving a semantic score to document terms.

The traditional methods are based on term frequency within a document. We applied CBER to short text web snippets; the snippets data set is sparse and noisy, and its documents do not share enough terms to overlap well.

Extensive experimental evaluation shows that the additional semantic information increases the accuracy of the classification results. This performance improvement demonstrates a promising achievement compared to other document classification methods in terms of Precision, Recall, F-measure, and Accuracy.


Corresponding author: Walaa Gad, Faculty of Computers and Information Sciences, Ain Shams University, Abbassia, Cairo 11566, Egypt

Bibliography

[1] A. Bouaziz, C. Dartigues and P. Lloret, Short text classification using semantic random forest, Springer International Publishing, Switzerland, 2014. doi:10.1007/978-3-319-10160-6_26.

[2] A. Budanitsky and G. Hirst, Evaluating WordNet-based measures of lexical semantic relatedness, Comput. Linguist. 32 (2006), 13–47. doi:10.1162/coli.2006.32.1.13.

[3] P. Ferragina and U. Scaiella, TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities), in: Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM '10), pp. 1625–1628, ACM, New York, 2010. doi:10.1145/1871437.1871689.

[4] Y. Genc, Y. Sakamoto and J. Nickerson, Discovering context: classifying tweets through a semantic transform based on Wikipedia, in: FAC 2011, edited by D. D. Schmorrow and C. M. Fidopiastis, LNCS, Springer, Heidelberg, 2011. doi:10.1007/978-3-642-21852-1_55.

[5] J. Hoffart, M. Yosef, I. Bordino, M. Pinkal, M. Spaniol, B. Taneva, S. Thater and G. Weikum, Robust disambiguation of named entities in text, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 782–792, Edinburgh, Scotland, UK, 2011.

[6] V. Korde and C. Mahender, Text classification and classifiers: a survey, Int. J. Artif. Intell. Appl. (IJAIA) 3 (2012). doi:10.5121/ijaia.2012.3208.

[7] C. Makris, Y. Plegas and E. Theodoridis, Improved text annotation with Wikipedia entities, Coimbra, Portugal, 2013. doi:10.1145/2480362.2480425.

[8] R. Navigli, Word sense disambiguation: a survey, ACM Comput. Surv. 41 (2009), 1–69. doi:10.1145/1459352.1459355.

[9] L. Patil and M. Atique, A semantic approach for effective document clustering using WordNet, arXiv preprint arXiv:1303.0489, 2013.

[10] T. Pedersen, S. Patwardhan and J. Michelizzi, WordNet::Similarity: measuring the relatedness of concepts, American Association for Artificial Intelligence, 2004. www.aaai.org. doi:10.3115/1614025.1614037.

[11] X. Phan, L. Nguyen and S. Horiguchi, Learning to classify short and sparse text and web with hidden topics from large-scale data collections, International World Wide Web Conference Committee (IW3C2), ACM, April 2008. doi:10.1145/1367497.1367510.

[12] U. Scaiella, P. Ferragina, A. Marino and M. Ciaramita, Topical clustering of search results, in: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 223–232, ACM, 2012. doi:10.1145/2124295.2124324.

[13] J. Sedding and D. Kazakov, WordNet-based text document clustering, in: Proceedings of the 3rd Workshop on Robust Methods in Analysis of Natural Language Data, Association for Computational Linguistics, 2004. doi:10.3115/1621445.1621458.

[14] G. Song, Y. Ye, X. Du, X. Huang and S. Bie, Short text classification: a survey, J. Multimedia 9 (2014), 635–643. doi:10.4304/jmm.9.5.635-643.

[15] P. Soucy and G. Mineau, Beyond TFIDF weighting for text categorization in the vector space model, in: Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 1130–1135, 2004.

[16] A. Sun, Short text classification using very few words, in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1143–1144, 2012. doi:10.1145/2348283.2348511.

[17] X. Sun, W. Haofen and Y. Yong, Towards effective short text deep classification, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1143–1144, ACM, 2011. doi:10.1145/2009916.2010090.

[18] D. Vitale, P. Ferragina and U. Scaiella, Classification of short texts by deploying topical annotations, in: Advances in Information Retrieval, vol. 7224, pp. 376–387, Springer, Berlin, Heidelberg, 2012. doi:10.1007/978-3-642-28997-2_32.

[19] B. Wang, Y. Huang, W. Yang and X. Li, Short text classification based on strong feature thesaurus, Zhejiang University and Springer-Verlag, Berlin, 2012. doi:10.1631/jzus.C1100373.

[20] M. Warin, Using WordNet and semantic similarity to disambiguate an ontology, vol. 25, University of Stockholm, Stockholm, Sweden, 2004.

[21] L. Yang, C. Li and O. Ding, Combining lexical and semantic features for short text classification, Procedia Comput. Sci. 22 (2013), 78–86. doi:10.1016/j.procs.2013.09.083.

[22] W. Yih and C. Meek, Improving similarity measures for short segments of text, in: AAAI '07: Proceedings of the 22nd National Conference on Artificial Intelligence, vol. 2, pp. 1489–1494, AAAI Press, Palo Alto, CA, USA, 2007.

Received: 2015-6-27
Published Online: 2016-2-29
Published in Print: 2017-4-1

©2017 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
