TF–IDF

doi:10.1007/978-1-4899-7687-1_832

TF–IDF (term frequency–inverse document frequency) is a term weighting scheme commonly used to represent textual documents as vectors (for purposes of classification, clustering, visualization, retrieval, etc.). Let T = {t₁, …, t_n} be the set of all terms occurring in the document corpus under consideration. Then a document d_i is represented by an n-dimensional real-valued vector \(\mathbf{x}_{i} = (x_{i1},\ldots,x_{in})\) with one component for each possible term from T.
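As a minimal sketch of this representation, the following builds the term set T from a toy corpus and maps each document to an n-dimensional vector with one component per term. Whitespace tokenization, the raw-count components, and the helper names `build_vocabulary` and `count_vector` are assumptions made for illustration, not part of the entry.

```python
from collections import Counter


def build_vocabulary(docs):
    """Return the distinct terms of all tokenized documents, sorted for a stable order."""
    return sorted({term for doc in docs for term in doc})


def count_vector(doc, vocabulary):
    """Map one tokenized document to a vector with one raw-count component per term."""
    counts = Counter(doc)
    return [float(counts[term]) for term in vocabulary]


# Toy corpus (hypothetical); tokens are obtained by splitting on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
]

vocabulary = build_vocabulary(docs)  # plays the role of T
vectors = [count_vector(doc, vocabulary) for doc in docs]  # one x_i per d_i
```

Each vector has length n = |T|, and most components are zero for terms that do not occur in the document, which is why sparse representations are usually preferred for realistic corpora.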

The weight x_ij corresponding to term t_j in document d_i is usually a product of three parts: one which depends on the presence or frequency of t_j in d_i, one which depends on t_j’s presence in the corpus as a whole, and a normalization part which depends on d_i. The most common TF–IDF weighting is defined by \(x_{ij} = \mathrm{TF}_{ij} \cdot \mathrm{IDF}_{j} \cdot (\sum_{k}(\mathrm{TF}_{ik}\,\mathrm{IDF}_{k})^{2})^{-1/2}\), where TF_ij is the term frequency (i.e., number of occurrences) of t_j in d_i, and IDF_j is...
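The three-part weighting above can be sketched as follows: a raw term count, a corpus-level IDF factor, and a per-document normalization that scales each vector to unit Euclidean length. Because the definition of IDF is cut off in this preview, the code assumes the widely used form log(N / DF_j), where N is the number of documents and DF_j is the number of documents containing t_j. That formula, the function name `tfidf_vectors`, and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with L2-normalized TF-IDF weights per document."""
    vocabulary = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)

    # DF_j: number of documents in which term t_j occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))

    # ASSUMPTION: IDF_j = log(N / DF_j); the source text is truncated here.
    idf = {term: math.log(n_docs / df[term]) for term in vocabulary}

    vectors = []
    for doc in docs:
        tf = Counter(doc)  # TF_ij: occurrences of t_j in d_i
        raw = [tf[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw))
        # Guard against an all-zero vector (every term occurs in every document).
        vectors.append([w / norm if norm > 0 else 0.0 for w in raw])
    return vocabulary, vectors


# Toy corpus (hypothetical), already tokenized.
example_docs = [["a", "b", "b"], ["a", "c"]]
example_vocabulary, example_vectors = tfidf_vectors(example_docs)
```

In this toy corpus the term "a" occurs in every document, so under the assumed IDF it receives weight zero, and each document vector ends up concentrated on the term that is specific to it.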

Editor information

Editors and Affiliations

The University of New South Wales, Sydney, NSW, Australia
Claude Sammut
Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
Geoffrey I. Webb

About this entry

Cite this entry

(2017). TF–IDF. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7687-1_832

DOI: https://doi.org/10.1007/978-1-4899-7687-1_832
Published: 14 April 2017
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4899-7685-7
Online ISBN: 978-1-4899-7687-1
eBook Packages: Computer Science, Reference Module Computer Science and Engineering
