
CCODM: conditional co-occurrence degree matrix document representation method

Methodologies and Application

Abstract

Document representation is a key problem in document analysis and processing tasks such as document classification, clustering and information retrieval. Especially for unstructured text data, the choice of document representation method directly affects the performance of subsequent algorithms in applications and research. In this paper, we propose a novel document representation method called the conditional co-occurrence degree matrix document representation method (CCODM), which is based on word co-occurrence. CCODM considers not only the co-occurrence of terms but also their conditional dependencies in a specific context, so that more useful structural and semantic information from the original documents is retained. Extensive classification experiments with different supervised and unsupervised feature selection methods show that the proposed method, CCODM, achieves better performance than the vector space model, latent Dirichlet allocation, the general co-occurrence matrix representation method and the document embedding method.
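
For readers unfamiliar with co-occurrence-based representations, the following is a minimal Python sketch of a plain sliding-window co-occurrence count, the kind of statistic CCODM builds on; the window size, function name and toy document are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count ordered term pairs that appear within `window` positions of
    each other in one tokenized document (illustrative baseline only)."""
    counts = defaultdict(int)
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            counts[(t, tokens[j])] += 1
    return counts

# Toy document
print(cooccurrence_counts("the cat sat on the mat".split()))
```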

Acknowledgements

This work was supported in part by the Natural Science Foundation of China [Grant Numbers 71771034, 71501023, 71421001] and the Open Program of the State Key Laboratory of Software Architecture [Item Number SKLSAOP1703]. We are also very grateful to Dr. Deqing Wang (Wang et al. 2016b) for providing the complete code of RP-GSO and to Dr. Xiangzhu Meng for his guidance on the doc2vec experiments. We would like to thank the anonymous reviewers for their constructive comments on this paper.

Author information

Corresponding author

Correspondence to Chonghui Guo.

Ethics declarations

Conflict of interest

Wei Wei, Chonghui Guo and Lin Tang have received research grants from Neusoft Corporation (Shenyang, PR China). Jingfeng Chen and Leilei Sun declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Communicated by V. Loia.

Appendix

Term frequency (TF) The term frequency of term t in the corpus is computed by

$$\begin{aligned} \hbox {TF}(t)=\sum _{k=1}^N{freq(t | d^k)}, \end{aligned}$$
(A.1)

where \(freq(t|d^k)\) denotes the number of occurrences of term t in document \(d^k\) and N is the number of documents in the corpus.
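
A direct reading of Eq. (A.1) in Python; the list-of-token-lists corpus and the function name are illustrative assumptions of this sketch:

```python
from collections import Counter

def term_frequency(term, corpus):
    """TF(t): total occurrences of `term` summed over all N documents.
    `corpus` is assumed to be a list of tokenized documents."""
    return sum(Counter(doc)[term] for doc in corpus)

corpus = [["data", "mining", "data"], ["text", "data"]]
print(term_frequency("data", corpus))  # 3
```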

Document frequency (DF) The document frequency of term t in the corpus is defined as

$$\begin{aligned} \hbox {DF}(t)=\big |\{ d^k : t \in T^k \}\big |/N, \end{aligned}$$
(A.2)

where the numerator represents the number of documents containing the term t in the corpus and \(T^k\) is the feature set of document \(d^k\).
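
A literal translation of the reconstructed Eq. (A.2); it assumes DF is normalized by the corpus size N, as the numerator/denominator wording suggests, and it reuses the corpus layout of the TF sketch above:

```python
def document_frequency(term, corpus):
    """DF(t): fraction of documents whose feature set contains `term`;
    the numerator counts documents containing t, the denominator is N."""
    n_containing = sum(term in set(doc) for doc in corpus)
    return n_containing / len(corpus)

corpus = [["data", "mining", "data"], ["text", "data"], ["graph"]]
print(document_frequency("data", corpus))  # 2 of 3 documents contain "data"
```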

Information gain (IG) The information gain of term t in the corpus is defined as

$$\begin{aligned} \begin{aligned} \hbox {IG}(t)&=\, -\sum _{i=1}^C{p(c_i) \log p(c_i)} + p(t)\sum _{i=1}^C{p(c_i | t) \log p(c_i | t)} \\&\quad \, + p(\bar{t})\sum _{i=1}^C{p(c_i | \bar{t}) \log p(c_i | \bar{t})}, \end{aligned} \end{aligned}$$
(A.3)

where \(c_i\) represents the \(i\hbox {th}\) category in the corpus, C is the total number of category labels in the corpus, and \(\bar{t}\) means that term t does not occur. \(p(c_i)\) is the probability of category \(c_i\) in the corpus, p(t) is the probability that a document contains term t, and \(p(\bar{t})\) is the probability that a document does not contain term t. \(p(c_i | t)\) is the conditional probability of the \(i\hbox {th}\) category given that term t occurs, and \(p(c_i | \bar{t})\) is the conditional probability of the \(i\hbox {th}\) category given that term t does not occur. It is worth emphasizing that these symbols keep the same definitions throughout the rest of the appendix.

For convenience of calculation, we define \(A_i(t)\) as the number of documents containing term t and belonging to category \(c_i\), \(B_i(t)\) as the number of documents belonging to category \(c_i\) but not containing term t, and \(C_i(t)\) as the number of documents containing term t but not belonging to category \(c_i\). The information gain of term t in the corpus is then calculated by

$$\begin{aligned} \hbox {IG}(t)&= -\sum _{i=1}^C{p(c_i) \log p(c_i)} + p(t)\sum _{i=1}^C{p(c_i | t) \log p(c_i | t)} \\&\quad + p(\bar{t})\sum _{i=1}^C{p(c_i | \bar{t}) \log p(c_i | \bar{t})} \\&= -\sum _{i=1}^C{\{ (A_i(t) +B_i(t))/N \} \log {\{ (A_i(t) +B_i(t))/N \}}} \\&\quad + \{ (A_i(t)+C_i(t))/N \} \sum _{i=1}^C{\{ A_i(t)/(A_i(t)+C_i(t)) \} \log {\{A_i(t)/(A_i(t)+C_i(t)) \}}} \\&\quad + \{ (N-A_i(t)-C_i(t))/N \} \sum _{i=1}^C{\{B_i(t)/(N-A_i(t)-C_i(t)) \} \log {\{B_i(t)/(N- A_i(t)-C_i(t)) \}}}. \end{aligned}$$
(A.4)
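
The count-based form of Eq. (A.4) can be checked with a short Python sketch; the per-category count lists A, B and C follow the definitions above, while the zero-count guards (to avoid log 0) and the toy numbers are assumptions of this sketch, not part of the paper:

```python
import math

def information_gain(A, B, C, N):
    """IG(t) per Eq. (A.4). A[i], B[i], C[i] are the per-category document
    counts defined in the text; N is the corpus size."""
    def xlogx(x):                      # x * log x, with 0 log 0 := 0
        return x * math.log(x) if x > 0 else 0.0

    p_t = (A[0] + C[0]) / N            # fraction of documents containing t
    p_not_t = 1.0 - p_t
    ig = -sum(xlogx((a + b) / N) for a, b in zip(A, B))
    ig += p_t * sum(xlogx(a / (a + c)) for a, c in zip(A, C) if a + c > 0)
    ig += p_not_t * sum(xlogx(b / (N - a - c))
                        for a, b, c in zip(A, B, C) if N - a - c > 0)
    return ig

# Toy corpus: N = 10 documents, 2 categories
print(information_gain(A=[3, 1], B=[2, 4], C=[1, 3], N=10))
```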

Mutual information (MI) The mutual information between term t and category \(c_i\) is formulated by

$$\begin{aligned} \hbox {MI}(t,c_i)= \log p(t | c_i )/p(t) = \log p(t, c_i )/(p(t) p(c_i)), \end{aligned}$$
(A.5)

where \(p(t, c_i )\) is the joint probability that a document contains term t and belongs to category \(c_i \). Moreover, the MI of term t with respect to the whole corpus can be expressed as the average of the MI of term t with each category in the corpus, formulated by

$$\begin{aligned} \begin{aligned} \hbox {MI}(t)&= \sum _{i=1}^C{p(c_i)\log p(t | c_i )} /p(t) \\&= \sum _{i=1}^C{p(c_i)\log p(t,c_i )} /(p(t) p(c_i)). \end{aligned} \end{aligned}$$
(A.6)

For convenience of calculation, we use the same definitions of \(A_i(t)\), \(B_i(t)\) and \(C_i(t)\) as in the calculation of IG above. The MI of term t in the corpus is therefore formulated by

$$\begin{aligned} \hbox {MI}(t)&= \sum _{i=1}^C{p(c_i)\log p(t,c_i )} /(p(t) p(c_i)) \\&= \sum _{i=1}^C{\{ (A_i(t) +B_i(t))/N \} \log {\{A_i(t) \cdot N/\{ (A_i(t) +B_i(t)) \cdot (A_i(t) +C_i(t)) \} \}}}. \end{aligned}$$
(A.7)
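
Eq. (A.7) translates almost line for line into Python; again the zero-count guard and the toy numbers are assumptions of this sketch:

```python
import math

def mutual_information(A, B, C, N):
    """Category-averaged MI(t) per Eq. (A.7), using the same per-category
    counts A[i], B[i], C[i] and corpus size N as for IG."""
    mi = 0.0
    for a, b, c in zip(A, B, C):
        if a > 0:                      # skip categories where t never occurs
            mi += ((a + b) / N) * math.log(a * N / ((a + b) * (a + c)))
    return mi

print(mutual_information(A=[3, 1], B=[2, 4], C=[1, 3], N=10))
```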

Expected cross-entropy (ECE) The expected cross-entropy of term t in the corpus is defined as

$$\begin{aligned} \hbox {ECE}(t)= p(t) \sum _{i=1}^C{p(c_i|t)\log p(c_i | t)/p(c_i)}. \end{aligned}$$
(A.8)

Using the definitions of \(A_i(t)\), \(B_i(t)\) and \(C_i(t)\) from the calculations of IG and MI above, the ECE of term t in the corpus can be rewritten as

$$\begin{aligned} \begin{aligned} \hbox {ECE}(t)&= p(t) \sum _{i=1}^C{p(c_i|t)\log p(c_i | t)/p(c_i)} \\ \quad&= \sum _{i=1}^C{p(c_i, t)\log p(t | c_i)/p(t)} \\ \quad&= \sum _{i=1}^C{p(c_i, t)\log p(t, c_i)/(p(t)p(c_i))} \\ \quad&= \sum _{i=1}^C{ \{A_i(t) /N\}} \\&\quad \cdot \log { \{ A_i(t) \cdot N/\{ (A_i(t) +B_i(t)) \cdot (A_i(t) +C_i(t))\}\}}. \end{aligned} \end{aligned}$$
(A.9)
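
Eq. (A.9) differs from the MI sketch above only in the weighting term, as a short sketch under the same assumptions makes explicit:

```python
import math

def expected_cross_entropy(A, B, C, N):
    """ECE(t) per Eq. (A.9): like MI, but each category term is weighted
    by the joint probability p(t, c_i) = A[i] / N."""
    ece = 0.0
    for a, b, c in zip(A, B, C):
        if a > 0:
            ece += (a / N) * math.log(a * N / ((a + b) * (a + c)))
    return ece

print(expected_cross_entropy(A=[3, 1], B=[2, 4], C=[1, 3], N=10))
```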

Random projection and Gram–Schmidt orthogonalization (RP-GSO) The original paper (Wang et al. 2016b) gives a detailed description of this unsupervised feature selection model, so we do not reproduce it here.

Cite this article

Wei, W., Guo, C., Chen, J. et al. CCODM: conditional co-occurrence degree matrix document representation method. Soft Comput 23, 1239–1255 (2019). https://doi.org/10.1007/s00500-017-2844-8
