A new term‐weighting scheme for naïve Bayes text categorization

Marcelo Mendoza (Computer Science Department, Universidad Técnica Federico Santa María, Santiago, Chile)

International Journal of Web Information Systems

ISSN: 1744-0084

Article publication date: 30 March 2012

Abstract

Purpose

Automatic text categorization has applications in several domains, including e‐mail spam detection, sexual content filtering, directory maintenance, and focused crawling. Most information retrieval systems contain several components that use text categorization methods. One of the first text categorization methods was designed using a naïve Bayes representation of the text. A number of variations of naïve Bayes have since been discussed. The purpose of this paper is to evaluate naïve Bayes approaches to text categorization and to introduce new, competitive extensions to previous approaches.

Design/methodology/approach

The paper introduces a new Bayesian text categorization method based on an extension of the naïve Bayes approach. Some modifications to document representations are introduced, based on the well‐known BM25 text information retrieval method. The performance of the method is compared with that of several extensions of naïve Bayes using benchmark datasets designed for this purpose. The method is also compared with training‐based methods such as support vector machines and logistic regression.
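As a rough illustration of the general idea of combining BM25 with naïve Bayes, the following minimal sketch replaces raw term counts in a multinomial naïve Bayes classifier with the standard BM25 term‐frequency weight. It is not the paper's weighting scheme, which the abstract does not specify. The function names (`bm25_tf`, `train`, `predict`), the parameter defaults `k1=1.2` and `b=0.75`, and the Laplace smoothing constant `alpha` are assumptions made for this example only.

```python
import math
from collections import Counter, defaultdict


def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Standard BM25 term-frequency component: saturating and length-normalized.

    k1 and b are the usual BM25 parameters; the defaults are common choices,
    not values taken from the paper.
    """
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


def train(docs, labels, k1=1.2, b=0.75, alpha=1.0):
    """Fit multinomial naive Bayes on BM25-weighted pseudo-counts (assumed scheme)."""
    avg_len = sum(len(d) for d in docs) / len(docs)
    vocab = {t for d in docs for t in d}
    weights = defaultdict(Counter)  # class -> term -> summed BM25 weight
    class_counts = Counter(labels)
    for doc, y in zip(docs, labels):
        for term, tf in Counter(doc).items():
            weights[y][term] += bm25_tf(tf, len(doc), avg_len, k1, b)
    model = {}
    for y, n_docs in class_counts.items():
        # Laplace-style smoothing over the vocabulary.
        total = sum(weights[y].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(docs))
        log_like = {t: math.log((weights[y][t] + alpha) / total) for t in vocab}
        model[y] = (log_prior, log_like)
    return model, avg_len


def predict(model, avg_len, doc, k1=1.2, b=0.75):
    """Return the class with the highest log-posterior score for a tokenized doc."""
    counts = Counter(doc)
    scores = {}
    for y, (log_prior, log_like) in model.items():
        scores[y] = log_prior + sum(
            bm25_tf(tf, len(doc), avg_len, k1, b) * log_like[t]
            for t, tf in counts.items()
            if t in log_like  # terms unseen in training are ignored
        )
    return max(scores, key=scores.get)


# Toy usage with a hypothetical four-document collection.
docs = [
    ["cheap", "pills", "cheap"],
    ["meeting", "agenda", "notes"],
    ["cheap", "offer"],
    ["project", "meeting"],
]
labels = ["spam", "ham", "spam", "ham"]
model, avg_len = train(docs, labels)
print(predict(model, avg_len, ["cheap", "offer", "pills"]))
```

In this sketch the BM25 weight caps the influence of repeated terms and discounts long documents, while training and prediction remain single passes over the term counts, as in plain naïve Bayes.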

Findings

The proposed text categorizer outperforms state‐of‐the‐art methods without introducing new computational costs. It also achieves performance very similar to that of more complex methods based on criterion function optimization, such as support vector machines or logistic regression.

Practical implications

The proposed method scales well with the size of the collection involved. The presented results demonstrate the efficiency and effectiveness of the approach.

Originality/value

The paper introduces a novel naïve Bayes text categorization approach based on the well‐known BM25 information retrieval model, which offers a set of good properties for this problem.

Citation

Mendoza, M. (2012), "A new term‐weighting scheme for naïve Bayes text categorization", International Journal of Web Information Systems, Vol. 8 No. 1, pp. 55-72. https://doi.org/10.1108/17440081211222591

Publisher

Emerald Group Publishing Limited

Copyright © 2012, Emerald Group Publishing Limited
