Information-theoretic term weighting schemes for document clustering and classification

International Journal on Digital Libraries

Abstract

We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering and classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed least information theory provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities for document representation: (1) LI Binary (LIB), which quantifies information due to the observation of a term’s (binary) occurrence in a document; and (2) LI Frequency (LIF), which measures information for the observation of a randomly picked term from the document. The two quantities are computed based on terms’ prior distributions in the entire collection and posterior distributions in a document. LIB and LIF can be used individually or combined to represent documents for text clustering and classification. Experiments on four benchmark text collections demonstrate strong performance of the proposed methods compared to classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF on multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering and classification.
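
To make the weighting pipeline concrete, here is a minimal sketch in Python, not the paper’s implementation: it computes the classic TF*IDF baseline for pre-tokenized documents, and its combine helper is only a hypothetical illustration of how two per-term scores such as LIB and LIF could be multiplied into a single weight. The actual least-information formulas for LIB and LIF are not reproduced in this preview.

import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf*idf} dict per tokenized document (natural-log IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

def combine(weights_a, weights_b):
    """Multiply two per-term weight dicts term by term (cf. the LIB*LIF combination)."""
    shared = weights_a.keys() & weights_b.keys()
    return {term: weights_a[term] * weights_b[term] for term in shared}

if __name__ == "__main__":
    docs = [["least", "information", "theory"],
            ["information", "retrieval", "theory", "theory"],
            ["document", "clustering", "information"]]
    for vector in tf_idf(docs):
        print(vector)
    # If lib_vec and lif_vec held a document's per-term LIB and LIF scores,
    # the LIB*LIF scheme would combine them analogously: combine(lib_vec, lif_vec).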

Notes

  1. The literature uses a variety of names for KL divergence (its standard definition is recalled after these notes). While Kullback preferred discrimination information, as in the principle of minimum discrimination information (MDI) [21], it is also commonly referred to as divergence information or relative entropy.

  2. Inference probabilities are never perfectly independent of one another, given the degrees of freedom involved. To simplify the discussion and formulation, however, we adopt the independence assumption.

  3. The term opposite does not refer to true vs. false information. Opposite information semantics essentially mean increasing vs. decreasing the probability of an inference, e.g., good news vs. bad news about a candidate that may influence the outcome of an election.
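
For reference, note 1 concerns the Kullback–Leibler (KL) divergence; its standard definition, for a posterior distribution P = (p_1, ..., p_n) and a prior distribution Q = (q_1, ..., q_n), is

  D_KL(P || Q) = Σ_i p_i log(p_i / q_i),

which equals zero exactly when P = Q and grows as the posterior departs from the prior.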

References

  1. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Mach. Learn. 6(1), 37–66 (1991). doi:10.1023/A:1022689900470

  2. Aizawa, A.: The feature quantity: an information theoretic perspective of TFIDF-like measures. In: SIGIR’00, pp. 104–111 (2000). doi:10.1145/345508.345556

  3. Amati, G., van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4), 357–389 (2002)

  4. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: SIAM’07, pp. 1027–1035 (2007)

  5. Aslam, J.A., Yilmaz, E., Pavlu, V.: The maximum entropy method for analyzing retrieval measures. In: SIGIR’05, pp. 27–34 (2005). doi:10.1145/1076034.1076042

  6. Baierlein, R.: Atoms and Information Theory: An Introduction to Statistical Mechanics. W.H. Freeman and Company, New York (1971)

  7. Berry, M.W.: Survey of Text Mining: Clustering, Classification, and Retrieval. Springer, New York (2004)

  8. Clinchant, S., Gaussier, E.: Information-based models for Ad Hoc IR. In: SIGIR’11, pp. 234–241 (2011)

  9. Cover, T.M., Thomas, J.A.: Entropy, Relative Entropy and Mutual Information. Wiley, New York, pp. 12–49 (1991)

  10. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S.: Learning to extract symbolic knowledge from the world wide web. In: AAAI’98, pp. 509–516 (1998). http://dl.acm.org/citation.cfm?id=295240.295725

  11. Dhillon, I.S., Mallela, S., Kumar, R.: A divisive information theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 3, 1265–1287 (2003). http://dl.acm.org/citation.cfm?id=944919.944973

  12. Fast, J.D.: Entropy: The Significance of the Concept of Entropy and Its Applications in Science and Technology. McGraw-Hill, New York (1962)

  13. Fox, C.: Information and misinformation: an investigation of the notions of information, misinformation, informing, and misinforming. In: Contributions in Librarianship and Information Science. Greenwood Press, Westport (1983). http://books.google.com/books?id=TNHgAAAAMAAJ

  14. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999). doi:10.1145/331499.331504

  15. Jaynes, E.T.: Information theory and statistical mechanics. II. Phys. Rev. 108, 171–190 (1957). doi:10.1103/PhysRev.108.171

  16. Ji, X., Xu, W.: Document clustering with prior knowledge. In: SIGIR’06, pp. 405–412 (2006). doi:10.1145/1148170.1148241

  17. Kantor, P.B., Lee, J.J.: The maximum entropy principle in information retrieval. In: SIGIR’86, pp. 269–274 (1986). doi:10.1145/253168.253225

  18. Ke, W.: Least Information Modeling for Information Retrieval. ArXiv preprint arXiv:1205.0312 (2012)

  19. Ke, W., Mostafa, J., Fu, Y.: Collaborative classifier agents: studying the impact of learning in distributed document classification. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, New York, JCDL ’07, pp. 428–437 (2007). doi:10.1145/1255175.1255263

  20. Knight, K.: Mining online text. Commun. ACM 42(11), 58–61 (1999). doi:10.1145/319382.319394

  21. Kullback, S.: Letters to the editor: the Kullback–Leibler distance. Am. Stat. 41(4), 338–341 (1987). http://www.jstor.org/stable/2684769

  22. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951). doi:10.1214/aoms/1177729694

  23. Lang, K.: Newsweeder: learning to filter netnews. In: Proceedings of the 12th International Conference on Machine Learning, pp. 331–339 (1995)

  24. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004). http://dl.acm.org/citation.cfm?id=1005332.1005345

  25. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37(1), 145–151 (1991). doi:10.1109/18.61115

  26. Liu, T., Liu, S., Cheng, Z., Ma, W.Y.: An evaluation on feature selection for text clustering. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003). AAAI Press, Washington, DC, pp. 488–495 (2003)

  27. Lovins, J.B.: Development of a stemming algorithm. Mech. Transl. Comput. Linguist. 11, 22–31 (1968)

  28. MacKay, D.M.: Information, Mechanism and Meaning. The M.I.T. Press, Cambridge (1969)

  29. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)

  30. Rapoport, A.: What is information? ETC Rev. Gen. Semant. 10(4), 5–12 (1953)

  31. Robertson, S.: Understanding inverse document frequency: on theoretical arguments for IDF. J. Doc. 60, 503–520 (2004)

  32. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3(4), 333–389 (2009). doi:10.1561/1500000019

  33. Sandhaus, E.: The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia (2008)

  34. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002). doi:10.1145/505282.505283

  35. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423; 623–656 (1948)

  36. Siegler, M., Witbrock, M.: Improving the suitability of imperfect transcriptions for information retrieval from spoken documents. In: ICASSP’99, IEEE Press, pp. 505–508 (1999)

  37. Snickars, F., Weibull, J.W.: A minimum information principle: theory and practice. Reg. Sci. Urban Econ. 7(1), 137–168 (1977). doi:10.1016/0166-0462(77)90021-7

  38. Spärck Jones, K.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. 60, 493–502 (2004)

  39. Witten, I.H., Frank, E., Hall, M.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann, San Francisco (2011)

  40. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: ICML’97, pp. 412–420 (1997). http://dl.acm.org/citation.cfm?id=645526.657137

  41. Zhang, D., Wang, J., Si, L.: Document clustering with universum. In: SIGIR’11, pp. 873–882 (2011). doi:10.1145/2009916.2010033

  42. Zhao, Y., Karypis, G.: Evaluation of hierarchical clustering algorithms for document datasets. In: CIKM’02, pp. 515–524 (2002). doi:10.1145/584792.584877

Author information

Corresponding author

Correspondence to Weimao Ke.

About this article

Cite this article

Ke, W. Information-theoretic term weighting schemes for document clustering and classification. Int J Digit Libr 16, 145–159 (2015). https://doi.org/10.1007/s00799-014-0121-3
