
Higher order feature selection for text classification

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

In this paper, we present the MIFS-C variant of the mutual information feature-selection (MIFS) algorithms. We present an algorithm for finding the optimal value of the redundancy parameter, a key parameter in MIFS-type algorithms. Furthermore, we present an algorithm that speeds up the execution of all the MIFS variants. Overall, the presented MIFS-C achieves classification accuracy comparable with, and in some cases better than, that of the other MIFS algorithms, while running faster. We also compared this feature selector with other feature selectors and found that it performs better in most cases. MIFS-C performed especially well on the break-even point and F-measure, because the algorithm can be tuned to optimise these evaluation measures.
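The greedy MIFS criterion that the MIFS-C variant builds on scores each candidate feature by its relevance to the class minus a redundancy penalty weighted by the redundancy parameter. The sketch below is a minimal illustration of that Battiti-style selection rule, not code from the paper; the `mi` helper, the `beta` default, and the toy data are all illustrative assumptions.

```python
# Minimal sketch of greedy MIFS-style feature selection.
# Score for a candidate f: I(f; C) - beta * sum of I(f; s) over selected s.
from collections import Counter
from math import log2

def mi(xs, ys):
    """Empirical mutual information I(X;Y) for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mifs(features, labels, k, beta=0.7):
    """Greedily pick k features; beta is the redundancy parameter."""
    selected, remaining = [], dict(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mi(remaining[name], labels)
            redundancy = sum(mi(remaining[name], features[s]) for s in selected)
            return relevance - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

labels = [0, 0, 1, 1, 0, 1]
feats = {"informative": [0, 0, 1, 1, 0, 1],   # perfectly predictive of labels
         "noisy":       [1, 0, 1, 0, 0, 1]}   # only weakly related
print(mifs(feats, labels, 1))  # selects "informative" first
```

The redundancy parameter `beta` in this sketch is the quantity whose optimal value the MIFS-C tuning algorithm described in the abstract is meant to find.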



Author information

Corresponding author

Correspondence to Jan Bakus.

Additional information

Jan Bakus received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1996 and 1998, respectively, and the Ph.D. degree in systems design engineering in 2005. He is currently working at Maplesoft, Waterloo, ON, Canada as an applications engineer, where he is responsible for the development of application-specific toolboxes for the Maple scientific computing software.

His research interests are in the area of feature selection for text classification, text classification, text clustering, and information retrieval. He is the recipient of the Carl Pollock Fellowship award from the University of Waterloo and the Datatel Scholars Foundation scholarship from Datatel.

Mohamed S. Kamel holds a Ph.D. in computer science from the University of Toronto, Canada. He is at present Professor and Director of the Pattern Analysis and Machine Intelligence Laboratory in the Department of Electrical and Computing Engineering, University of Waterloo, Canada. Professor Kamel holds a Canada Research Chair in Cooperative Intelligent Systems.

Dr. Kamel's research interests are in machine intelligence, neural networks and pattern recognition with applications in robotics and manufacturing. He has authored and coauthored over 200 papers in journals and conference proceedings, 2 patents and numerous technical and industrial project reports. Under his supervision, 53 Ph.D. and M.A.Sc. students have completed their degrees.

Dr. Kamel is a member of ACM, AAAI, CIPS and APEO and was named a Fellow of the IEEE in 2005. He is the editor-in-chief of the International Journal of Robotics and Automation, Associate Editor of the IEEE SMC, Part A, the International Journal of Image and Graphics, and Pattern Recognition Letters, and is a member of the editorial board of Intelligent Automation and Soft Computing. He has served as a consultant to many companies, including NCR, IBM, Nortel, VRP and CSA. He is a member of the board of directors and cofounder of Virtek Vision International in Waterloo.

Cite this article

Bakus, J., Kamel, M.S. Higher order feature selection for text classification. Knowl Inf Syst 9, 468–491 (2006). https://doi.org/10.1007/s10115-005-0209-6
