Information-theoretic term weighting schemes for document clustering and classification

International Journal on Digital Libraries

Abstract

We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering and classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed least information theory provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities for document representation: (1) LI Binary (LIB), which quantifies information due to the observation of a term’s (binary) occurrence in a document; and (2) LI Frequency (LIF), which measures information for the observation of a randomly picked term from the document. The two quantities are computed based on terms’ prior distributions in the entire collection and posterior distributions in a document. LIB and LIF can be used individually or combined to represent documents for text clustering and classification. Experiments on four benchmark text collections demonstrate strong performance of the proposed methods compared to classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF on multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering and classification.
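
To make the weighting pipeline concrete, here is a minimal sketch in Python, not the paper’s implementation: it computes the classic TF*IDF baseline for pre-tokenized documents, and its combine helper is only a hypothetical illustration of how two per-term scores such as LIB and LIF could be multiplied into a single weight. The actual least-information formulas for LIB and LIF are not reproduced in this preview.

import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf*idf} dict per tokenized document (natural-log IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

def combine(weights_a, weights_b):
    """Multiply two per-term weight dicts term by term (cf. the LIB*LIF combination)."""
    shared = weights_a.keys() & weights_b.keys()
    return {term: weights_a[term] * weights_b[term] for term in shared}

if __name__ == "__main__":
    docs = [["least", "information", "theory"],
            ["information", "retrieval", "theory", "theory"],
            ["document", "clustering", "information"]]
    for vector in tf_idf(docs):
        print(vector)
    # If lib_vec and lif_vec held a document's per-term LIB and LIF scores,
    # the LIB*LIF scheme would combine them analogously: combine(lib_vec, lif_vec).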

Notes

  1. The literature uses a variety of names for KL divergence (its standard definition is recalled after these notes). While Kullback preferred discrimination information, as in the principle of minimum discrimination information (MDI) [21], it is also commonly referred to as divergence information or relative entropy.

  2. Inference probabilities are never perfectly independent of one another, given the degrees of freedom involved. To simplify the discussion and formulation, however, we adopt the independence assumption.

  3. The term opposite does not refer to true vs. false information. Opposite information semantics essentially mean increasing vs. decreasing the probability of an inference, e.g., good news vs. bad news about a candidate that may influence the outcome of an election.
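
For reference, note 1 concerns the Kullback–Leibler (KL) divergence; its standard definition, for a posterior distribution P = (p_1, ..., p_n) and a prior distribution Q = (q_1, ..., q_n), is

  D_KL(P || Q) = Σ_i p_i log(p_i / q_i),

which equals zero exactly when P = Q and grows as the posterior departs from the prior.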

References

  1. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Mach. Learn. 6(1), 37–66 (1991). doi:10.1023/A:1022689900470

  2. Aizawa, A.: The feature quantity: an information theoretic perspective of TFIDF-like measures. In: SIGIR’00, pp. 104–111 (2000). doi:10.1145/345508.345556

  3. Amati, G., van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4), 357–389 (2002)

  4. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: SIAM’07, pp. 1027–1035 (2007)

  5. Aslam, J.A., Yilmaz, E., Pavlu, V.: The maximum entropy method for analyzing retrieval measures. In: SIGIR’05, pp. 27–34 (2005). doi:10.1145/1076034.1076042

  6. Baierlein, R.: Atoms and Information Theory: An Introduction to Statistical Mechanics. W.H. Freeman and Company, New York (1971)

  7. Berry, M.W.: Survey of Text Mining: Clustering, Classification, and Retrieval. Springer, New York (2004)

  8. Clinchant, S., Gaussier, E.: Information-based models for Ad Hoc IR. In: SIGIR’11, pp. 234–241 (2011)

  9. Cover, T.M., Thomas, J.A.: Entropy, Relative Entropy and Mutual Information. Wiley, New York, pp. 12–49 (1991)

  10. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S.: Learning to extract symbolic knowledge from the world wide web. In: AAAI’98, pp. 509–516 (1998). http://dl.acm.org/citation.cfm?id=295240.295725

  11. Dhillon, I.S., Mallela, S., Kumar, R.: A divisive information theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 3, 1265–1287 (2003). http://dl.acm.org/citation.cfm?id=944919.944973

  12. Fast, J.D.: Entropy: The Significance of the Concept of Entropy and Its Applications in Science and Technology. McGraw-Hill, New York (1962)

  13. Fox, C.: Information and misinformation: an investigation of the notions of information, misinformation, informing, and misinforming. In: Contributions in Librarianship and Information Science. Greenwood Press, Westport (1983). http://books.google.com/books?id=TNHgAAAAMAAJ

  14. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999). doi:10.1145/331499.331504

  15. Jaynes, E.T.: Information theory and statistical mechanics. II. Phys. Rev. 108, 171–190 (1957). doi:10.1103/PhysRev.108.171

  16. Ji, X., Xu, W.: Document clustering with prior knowledge. In: SIGIR’06, pp. 405–412 (2006). doi:10.1145/1148170.1148241

  17. Kantor, P.B., Lee, J.J.: The maximum entropy principle in information retrieval. In: SIGIR’86, pp. 269–274 (1986). doi:10.1145/253168.253225

  18. Ke, W.: Least Information Modeling for Information Retrieval. ArXiv preprint arXiv:1205.0312 (2012)

  19. Ke, W., Mostafa, J., Fu, Y.: Collaborative classifier agents: studying the impact of learning in distributed document classification. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, New York, JCDL ’07, pp. 428–437 (2007). doi:10.1145/1255175.1255263

  20. Knight, K.: Mining online text. Commun. ACM 42(11), 58–61 (1999). doi:10.1145/319382.319394

  21. Kullback, S.: Letters to the editor: the Kullback–Leibler distance. Am. Stat. 41(4), 338–341 (1987). http://www.jstor.org/stable/2684769

  22. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951). doi:10.1214/aoms/1177729694

  23. Lang, K.: Newsweeder: learning to filter netnews. In: Proceedings of the 12th International Conference on Machine Learning, pp. 331–339 (1995)

  24. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004). http://dl.acm.org/citation.cfm?id=1005332.1005345

  25. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37(1), 145–151 (1991). doi:10.1109/18.61115

  26. Liu, T., Liu, S., Cheng, Z., Ma, W.Y.: An evaluation on feature selection for text clustering. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003). AAAI Press, Washington, DC, pp. 488–495 (2003)

  27. Lovins, J.B.: Development of a stemming algorithm. Mech. Transl. Comput. Linguist. 11, 22–31 (1968)

  28. MacKay, D.M.: Information, Mechanism and Meaning. The M.I.T. Press, Cambridge (1969)

  29. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)

  30. Rapoport, A.: What is information? ETC Rev. Gen. Semant. 10(4), 5–12 (1953)

  31. Robertson, S.: Understanding inverse document frequency: on theoretical arguments for IDF. J. Doc. 60, 503–520 (2004)

  32. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3(4), 333–389 (2009). doi:10.1561/1500000019

  33. Sandhaus, E.: The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia (2008)

  34. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002). doi:10.1145/505282.505283

  35. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423; 623–656 (1948)

  36. Siegler, M., Witbrock, M.: Improving the suitability of imperfect transcriptions for information retrieval from spoken documents. In: ICASSP’99, IEEE Press, pp. 505–508 (1999)

  37. Snickars, F., Weibull, J.W.: A minimum information principle: theory and practice. Reg. Sci. Urban Econ. 7(1), 137–168 (1977). doi:10.1016/0166-0462(77)90021-7

  38. Spärck Jones, K.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. 60, 493–502 (2004)

  39. Witten, I.H., Frank, E., Hall, M.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann, San Francisco (2011)

  40. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: ICML’97, pp. 412–420 (1997). http://dl.acm.org/citation.cfm?id=645526.657137

  41. Zhang, D., Wang, J., Si, L.: Document clustering with universum. In: SIGIR’11, pp. 873–882 (2011). doi:10.1145/2009916.2010033

  42. Zhao, Y., Karypis, G.: Evaluation of hierarchical clustering algorithms for document datasets. In: CIKM’02, pp. 515–524 (2002). doi:10.1145/584792.584877

Author information

Corresponding author

Correspondence to Weimao Ke.

About this article

Cite this article

Ke, W. Information-theoretic term weighting schemes for document clustering and classification. Int J Digit Libr 16, 145–159 (2015). https://doi.org/10.1007/s00799-014-0121-3
