Abstract
We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering and classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed least information theory provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities for document representation: (1) LI Binary (LIB), which quantifies information due to the observation of a term’s (binary) occurrence in a document; and (2) LI Frequency (LIF), which measures information for the observation of a randomly picked term from the document. Both quantities are computed from terms’ prior distributions in the entire collection and posterior distributions in a document. LIB and LIF can be used individually or combined to represent documents for text clustering and classification. Experiments on four benchmark text collections demonstrate strong performance of the proposed methods compared to classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF on multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering and classification.
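The LIB and LIF formulas are derived in the body of the paper and are not reproduced in this excerpt. For orientation, the classic TF*IDF baseline that the experiments compare against can be sketched as below; this is a minimal raw-TF, log-IDF variant (many weighting variants exist), and the function name and toy corpus are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute raw-TF * log-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Weight each term by its in-document frequency times log(N / df).
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["least", "information", "theory"],
        ["information", "retrieval"],
        ["text", "clustering"]]
w = tfidf(docs)
# "information" appears in 2 of the 3 documents, so its IDF is log(3/2),
# lower than that of terms occurring in a single document (log 3).
```

A term concentrated in few documents thus receives a higher weight than one spread across the collection, which is the same intuition the LI-based weights refine through prior vs. posterior term distributions.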
Notes
The literature has used a variety of names in reference to KL divergence. While Kullback preferred discrimination information, as in the principle of minimum discrimination information (MDI) [21], it has often been called divergence information or relative entropy.
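As a point of reference for the note above, the KL divergence (discrimination information) between two discrete distributions can be computed with a short sketch; it assumes the two distributions share a support and that q is strictly positive wherever p is.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats.

    Assumes p and q are discrete distributions over the same support,
    with q[i] > 0 wherever p[i] > 0 (0 * log 0 is taken as 0).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
# D(p || p) = 0, and D(p || q) grows as q departs from p.
```

Note that KL divergence is asymmetric, D(p || q) != D(q || p), which is one reason "distance" is a misnomer despite its occasional use in the literature.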
Inference probabilities are never perfectly independent of one another, given the degrees of freedom. To simplify the discussion and formulation, however, we adopt the independence assumption.
The term opposite does not indicate true vs. false information. Rather, the semantics of opposite information are to increase vs. to decrease the probability of an inference, e.g., good news vs. bad news about a candidate that may influence the outcome of an election.
References
Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Mach. Learn. 6(1), 37–66 (1991). doi:10.1023/A:1022689900470
Aizawa, A.: The feature quantity: an information theoretic perspective of TFIDF-like measures. In: SIGIR’00, pp. 104–111 (2000). doi:10.1145/345508.345556
Amati, G., van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4), 357–389 (2002)
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: SODA’07, pp. 1027–1035 (2007)
Aslam, J.A., Yilmaz, E., Pavlu, V.: The maximum entropy method for analyzing retrieval measures. In: SIGIR’05, pp. 27–34 (2005). doi:10.1145/1076034.1076042
Baierlein, R.: Atoms and Information Theory: An Introduction to Statistical Mechanics. W.H. Freeman and Company, New York (1971)
Berry, M.W.: Survey of Text Mining: Clustering, Classification, and Retrieval. Springer, New York (2004)
Clinchant, S., Gaussier, E.: Information-based models for Ad Hoc IR. In: SIGIR’11, pp. 234–241 (2011)
Cover, T.M., Thomas, J.A.: Entropy, Relative Entropy and Mutual Information. Wiley, New York, pp. 12–49 (1991)
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S.: Learning to extract symbolic knowledge from the world wide web. In: AAAI’98, pp. 509–516 (1998). http://dl.acm.org/citation.cfm?id=295240.295725
Dhillon, I.S., Mallela, S., Kumar, R.: A divisive information theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 3, 1265–1287 (2003). http://dl.acm.org/citation.cfm?id=944919.944973
Fast, J.D.: Entropy: The Significance of the Concept of Entropy and Its Applications in Science and Technology. McGraw-Hill, New York (1962)
Fox, C.: Information and misinformation: an investigation of the notions of information, misinformation, informing, and misinforming. In: Contributions in Librarianship and Information Science. Greenwood Press, Westport (1983). http://books.google.com/books?id=TNHgAAAAMAAJ
Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999). doi:10.1145/331499.331504
Jaynes, E.T. : Information theory and statistical mechanics. II. Phys. Rev. 108, 171–190 (1957). doi:10.1103/PhysRev.108.171
Ji, X., Xu, W.: Document clustering with prior knowledge. In: SIGIR’06, pp. 405–412 (2006). doi:10.1145/1148170.1148241
Kantor, P.B., Lee, J.J. :The maximum entropy principle in information retrieval. In: SIGIR’86, pp. 269–274 (1986). doi:10.1145/253168.253225
Ke, W.: Least Information Modeling for Information Retrieval. ArXiv preprint arXiv:1205.0312 (2012)
Ke, W., Mostafa, J., Fu, Y.: Collaborative classifier agents: studying the impact of learning in distributed document classification. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, New York, JCDL ’07, pp. 428–437 (2007). doi:10.1145/1255175.1255263
Knight, K.: Mining online text. Commun. ACM 42(11), 58–61 (1999). doi:10.1145/319382.319394
Kullback, S.: Letters to the editor: the Kullback–Leibler distance. Am. Stat. 41(4), 338–341 (1987). http://www.jstor.org/stable/2684769
Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951). doi:10.1214/aoms/1177729694
Lang, K.: Newsweeder: learning to filter netnews. In: Proceedings of the 12th International Conference on Machine Learning, pp. 331–339 (1995)
Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004). http://dl.acm.org/citation.cfm?id=1005332.1005345
Lin, J.: Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37(1), 145–151 (1991). doi:10.1109/18.61115
Liu, T., Liu, S., Cheng, Z., Ma, W.Y.: An evaluation on feature selection for text clustering. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003). AAAI Press, Washington, DC, pp. 488–495 (2003)
Lovins, J.B.: Development of a stemming algorithm. Mech. Transl. Comput. Linguist. 11, 22–31 (1968)
MacKay, D.M.: Information, Mechanism and Meaning. The M.I.T. Press, Cambridge (1969)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Rapoport, A.: What is information? ETC Rev. Gen. Semant. 10(4), 5–12 (1953)
Robertson, S.: Understanding inverse document frequency: on theoretical arguments for IDF. J. Doc. 60, 503–520 (2004)
Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Found. Trends® Inf. Retr. 3(4), 333–389 (2009). doi:10.1561/1500000019
Sandhaus, E.: The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia (2008)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002). doi:10.1145/505282.505283
Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423; 623–656 (1948)
Siegler, M., Witbrock, M.: Improving the suitability of imperfect transcriptions for information retrieval from spoken documents. In: ICASSP’99, IEEE Press, pp. 505–508 (1999)
Snickars, F., Weibull, J.W.: A minimum information principle: theory and practice. Reg. Sci. Urban Econ. 7(1), 137–168 (1977). doi:10.1016/0166-0462(77)90021-7
Spärck Jones, K.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. 60, 493–502 (2004)
Witten, I.H., Frank, E., Hall, M.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann, San Francisco (2011)
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: ICML’97, pp. 412–420 (1997). http://dl.acm.org/citation.cfm?id=645526.657137
Zhang, D., Wang, J., Si, L.: Document clustering with universum. In: SIGIR’11, pp. 873–882 (2011). doi:10.1145/2009916.2010033
Zhao, Y., Karypis, G.: Evaluation of hierarchical clustering algorithms for document datasets. In: CIKM’02, pp. 515–524 (2002). doi:10.1145/584792.584877
Cite this article
Ke, W. Information-theoretic term weighting schemes for document clustering and classification. Int J Digit Libr 16, 145–159 (2015). https://doi.org/10.1007/s00799-014-0121-3