
CCODM: conditional co-occurrence degree matrix document representation method

Methodologies and Application

Abstract

Document representation is a key problem in document analysis and processing tasks such as document classification, clustering and information retrieval. Especially for unstructured text data, the choice of document representation method directly affects the performance of subsequent algorithms in applications and research. In this paper, we propose a novel document representation method called the conditional co-occurrence degree matrix document representation method (CCODM), which is based on word co-occurrence. CCODM considers not only the co-occurrence of terms but also their conditional dependencies in a specific context, so that more useful structural and semantic information from the original documents is retained. Extensive classification experiments with different supervised and unsupervised feature selection methods show that the proposed method, CCODM, achieves better performance than the vector space model, latent Dirichlet allocation, the general co-occurrence matrix representation method and the document embedding method.
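
For readers unfamiliar with co-occurrence-based representations, the following is a minimal Python sketch of a plain sliding-window co-occurrence count, the kind of statistic CCODM builds on; the window size, function name and toy document are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count ordered term pairs that appear within `window` positions of
    each other in one tokenized document (illustrative baseline only)."""
    counts = defaultdict(int)
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            counts[(t, tokens[j])] += 1
    return counts

# Toy document
print(cooccurrence_counts("the cat sat on the mat".split()))
```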

Acknowledgements

This work was supported in part by the Natural Science Foundation of China [Grant Numbers 71771034, 71501023, 71421001] and the Open Program of the State Key Laboratory of Software Architecture [Item Number SKLSAOP1703]. We are also very grateful to Dr. Deqing Wang (Wang et al. 2016b) for providing the complete code of RP-GSO and to Dr. Xiangzhu Meng for his guidance on the doc2vec experiments. We would like to thank the anonymous reviewers for their constructive comments on this paper.

Author information

Corresponding author

Correspondence to Chonghui Guo.

Ethics declarations

Conflict of interest

Wei Wei, Chonghui Guo and Lin Tang have received research grants from Neusoft Corporation (Shenyang, PR China). Jingfeng Chen and Leilei Sun declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Communicated by V. Loia.

Appendix

Term frequency (TF) The term frequency of term t in the corpus is computed by

$$\begin{aligned} \hbox {TF}(t)=\sum _{k=1}^N{freq(t | d^k)}, \end{aligned}$$
(A.1)

where \(freq(t|d^k)\) denotes the number of occurrences of term t in document \(d^k\) and N is the number of documents in the corpus.
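
A direct reading of Eq. (A.1) in Python; the list-of-token-lists corpus and the function name are illustrative assumptions of this sketch:

```python
from collections import Counter

def term_frequency(term, corpus):
    """TF(t): total occurrences of `term` summed over all N documents.
    `corpus` is assumed to be a list of tokenized documents."""
    return sum(Counter(doc)[term] for doc in corpus)

corpus = [["data", "mining", "data"], ["text", "data"]]
print(term_frequency("data", corpus))  # 3
```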

Document frequency (DF) The document frequency of term t in the corpus is defined as

$$\begin{aligned} \hbox {DF}(t)=\big |\{ d^k : t \in T^k \}\big |/N, \end{aligned}$$
(A.2)

where the numerator represents the number of documents containing the term t in the corpus and \(T^k\) is the feature set of document \(d^k\).
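
A literal translation of the reconstructed Eq. (A.2); it assumes DF is normalized by the corpus size N, as the numerator/denominator wording suggests, and it reuses the corpus layout of the TF sketch above:

```python
def document_frequency(term, corpus):
    """DF(t): fraction of documents whose feature set contains `term`;
    the numerator counts documents containing t, the denominator is N."""
    n_containing = sum(term in set(doc) for doc in corpus)
    return n_containing / len(corpus)

corpus = [["data", "mining", "data"], ["text", "data"], ["graph"]]
print(document_frequency("data", corpus))  # 2 of 3 documents contain "data"
```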

Information gain (IG) The information gain of term t in the corpus is defined as

$$\begin{aligned} \begin{aligned} \hbox {IG}(t)&=\, -\sum _{i=1}^C{p(c_i) \log p(c_i)} + p(t)\sum _{i=1}^C{p(c_i | t) \log p(c_i | t)} \\&\quad \, + p(\bar{t})\sum _{i=1}^C{p(c_i | \bar{t}) \log p(c_i | \bar{t})}, \end{aligned} \end{aligned}$$
(A.3)

where \(c_i\) represents the \(i\hbox {th}\) category in the corpus, C is the total number of category labels in the corpus, and \(\bar{t}\) means that term t does not occur. \(p(c_i)\) is the probability of category \(c_i\) in the corpus, p(t) is the probability that a document contains term t, and \(p(\bar{t})\) is the probability that a document does not contain term t. \(p(c_i | t)\) is the conditional probability of the \(i\hbox {th}\) category given that term t occurs, and \(p(c_i | \bar{t})\) is the conditional probability of the \(i\hbox {th}\) category given that term t does not occur. It is worth emphasizing that these symbols keep the same definitions throughout the rest of the appendix.

For convenience of calculation, we define \(A_i(t)\) as the number of documents containing term t and belonging to category \(c_i\), \(B_i(t)\) as the number of documents belonging to category \(c_i\) but not containing term t, and \(C_i(t)\) as the number of documents containing term t but not belonging to category \(c_i\). The information gain of term t in the corpus is then calculated by

$$\begin{aligned} \hbox {IG}(t)&= -\sum _{i=1}^C{p(c_i) \log p(c_i)} + p(t)\sum _{i=1}^C{p(c_i | t) \log p(c_i | t)} \\&\quad + p(\bar{t})\sum _{i=1}^C{p(c_i | \bar{t}) \log p(c_i | \bar{t})} \\&= -\sum _{i=1}^C{\{ (A_i(t) +B_i(t))/N \} \log {\{ (A_i(t) +B_i(t))/N \}}} \\&\quad + \{ (A_i(t)+C_i(t))/N \} \sum _{i=1}^C{\{ A_i(t)/(A_i(t)+C_i(t)) \} \log {\{A_i(t)/(A_i(t)+C_i(t)) \}}} \\&\quad + \{ (N-A_i(t)-C_i(t))/N \} \sum _{i=1}^C{\{B_i(t)/(N-A_i(t)-C_i(t)) \} \log {\{B_i(t)/(N- A_i(t)-C_i(t)) \}}}. \end{aligned}$$
(A.4)
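
The count-based form of Eq. (A.4) can be checked with a short Python sketch; the per-category count lists A, B and C follow the definitions above, while the zero-count guards (to avoid log 0) and the toy numbers are assumptions of this sketch, not part of the paper:

```python
import math

def information_gain(A, B, C, N):
    """IG(t) per Eq. (A.4). A[i], B[i], C[i] are the per-category document
    counts defined in the text; N is the corpus size."""
    def xlogx(x):                      # x * log x, with 0 log 0 := 0
        return x * math.log(x) if x > 0 else 0.0

    p_t = (A[0] + C[0]) / N            # fraction of documents containing t
    p_not_t = 1.0 - p_t
    ig = -sum(xlogx((a + b) / N) for a, b in zip(A, B))
    ig += p_t * sum(xlogx(a / (a + c)) for a, c in zip(A, C) if a + c > 0)
    ig += p_not_t * sum(xlogx(b / (N - a - c))
                        for a, b, c in zip(A, B, C) if N - a - c > 0)
    return ig

# Toy corpus: N = 10 documents, 2 categories
print(information_gain(A=[3, 1], B=[2, 4], C=[1, 3], N=10))
```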

Mutual information (MI) The mutual information between term t and category \(c_i\) is formulated by

$$\begin{aligned} \hbox {MI}(t,c_i)= \log p(t | c_i )/p(t) = \log p(t, c_i )/(p(t) p(c_i)), \end{aligned}$$
(A.5)

where \(p(t, c_i )\) is the joint probability that a document contains term t and belongs to category \(c_i \). Moreover, the MI of term t with respect to the whole corpus can be expressed as the average of the MI of term t with each category in the corpus, formulated by

$$\begin{aligned} \begin{aligned} \hbox {MI}(t)&= \sum _{i=1}^C{p(c_i)\log p(t | c_i )} /p(t) \\&= \sum _{i=1}^C{p(c_i)\log p(t,c_i )} /(p(t) p(c_i)). \end{aligned} \end{aligned}$$
(A.6)

For convenience of calculation, we use the same definitions of \(A_i(t)\), \(B_i(t)\) and \(C_i(t)\) as in the calculation of IG above. The MI of term t in the corpus is therefore formulated by

$$\begin{aligned} \hbox {MI}(t)&= \sum _{i=1}^C{p(c_i)\log p(t,c_i )} /(p(t) p(c_i)) \\&= \sum _{i=1}^C{\{ (A_i(t) +B_i(t))/N \} \log {\{A_i(t) \cdot N/\{ (A_i(t) +B_i(t)) \cdot (A_i(t) +C_i(t)) \} \}}}. \end{aligned}$$
(A.7)
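
Eq. (A.7) translates almost line for line into Python; again the zero-count guard and the toy numbers are assumptions of this sketch:

```python
import math

def mutual_information(A, B, C, N):
    """Category-averaged MI(t) per Eq. (A.7), using the same per-category
    counts A[i], B[i], C[i] and corpus size N as for IG."""
    mi = 0.0
    for a, b, c in zip(A, B, C):
        if a > 0:                      # skip categories where t never occurs
            mi += ((a + b) / N) * math.log(a * N / ((a + b) * (a + c)))
    return mi

print(mutual_information(A=[3, 1], B=[2, 4], C=[1, 3], N=10))
```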

Expected cross-entropy (ECE) The expected cross-entropy of term t in the corpus is defined as

$$\begin{aligned} \hbox {ECE}(t)= p(t) \sum _{i=1}^C{p(c_i|t)\log p(c_i | t)/p(c_i)}. \end{aligned}$$
(A.8)

Using the definitions of \(A_i(t)\), \(B_i(t)\) and \(C_i(t)\) from the calculations of IG and MI above, the ECE of term t in the corpus can be rewritten as

$$\begin{aligned} \begin{aligned} \hbox {ECE}(t)&= p(t) \sum _{i=1}^C{p(c_i|t)\log p(c_i | t)/p(c_i)} \\ \quad&= \sum _{i=1}^C{p(c_i, t)\log p(t | c_i)/p(t)} \\ \quad&= \sum _{i=1}^C{p(c_i, t)\log p(t, c_i)/(p(t)p(c_i))} \\ \quad&= \sum _{i=1}^C{ \{A_i(t) /N\}} \\&\quad \cdot \log { \{ A_i(t) \cdot N/\{ (A_i(t) +B_i(t)) \cdot (A_i(t) +C_i(t))\}\}}. \end{aligned} \end{aligned}$$
(A.9)
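
Eq. (A.9) differs from the MI sketch above only in the weighting term, as a short sketch under the same assumptions makes explicit:

```python
import math

def expected_cross_entropy(A, B, C, N):
    """ECE(t) per Eq. (A.9): like MI, but each category term is weighted
    by the joint probability p(t, c_i) = A[i] / N."""
    ece = 0.0
    for a, b, c in zip(A, B, C):
        if a > 0:
            ece += (a / N) * math.log(a * N / ((a + b) * (a + c)))
    return ece

print(expected_cross_entropy(A=[3, 1], B=[2, 4], C=[1, 3], N=10))
```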

Random projection and Gram–Schmidt orthogonalization (RP-GSO) The original paper (Wang et al. 2016b) gives a detailed description of this unsupervised feature selection model, so we do not reproduce it here.

Cite this article

Wei, W., Guo, C., Chen, J. et al. CCODM: conditional co-occurrence degree matrix document representation method. Soft Comput 23, 1239–1255 (2019). https://doi.org/10.1007/s00500-017-2844-8
