Abstract
This paper presents an improved version of a decision tree-based filter algorithm for attribute selection. The algorithm can be seen as a pre-processing step for induction algorithms in machine learning and data mining tasks. The filter was evaluated on thirty medical datasets with respect to execution time, data compression ability, and AUC (Area Under the ROC Curve) performance. On average, our filter was faster than Relief-F but slower than both CFS and Gain Ratio. However, for low-density (high-dimensional) datasets, our approach selected fewer than 2% of all attributes while producing no performance degradation in a further evaluation with five different machine learning algorithms.
References
Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507
Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery: an overview. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining. American Association for Artificial Intelligence, Menlo Park, pp 1–30
Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I et al (2006) Machine learning in bioinformatics. Brief Bioinform 7(1):86–112
Foithong S, Pinngern O, Attachoo B (2011) Feature subset selection wrapper based on mutual information and rough sets. Expert Syst Appl
Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques. Morgan Kaufmann
Ditzler G, Morrison J, Lan Y, Rosen G (2015) Fizzy: feature subset selection for metagenomics. BMC Bioinformatics 16(1):358. Available from: http://www.biomedcentral.com/1471-2105/16/358
Mandal M, Mukhopadhyay A, Maulik U (2015) Prediction of protein subcellular localization by incorporating multiobjective PSO-based feature subset selection into the general form of Chou’s PseAAC. Med Biol Eng Comput 53(4):331–344. Available from: doi:10.1007/s11517-014-1238-7
Purkayastha P, Rallapalli A, Bhanu Murthy NL, Malapati A, Yogeeswari P, Sriram D (2015) Effect of feature selection on kinase classification models. In: Muppalaneni NB, Gunjan VK (eds) Computational intelligence in medical informatics springerbriefs in applied sciences and technology. Springer, Singapore, pp 81–86. Available from: doi:10.1007/978-981-287-260-9_8
Devaraj S, Paulraj S (2015) An efficient feature subset selection algorithm for classification of multidimensional dataset. Sci World J 2015 (Article ID 821798), 9 p. Available from: doi:10.1155/2015/821798
Govindan G, Nair AS (2014) Sequence features and subset selection technique for the prediction of protein trafficking phenomenon in Eukaryotic non membrane proteins. International Journal of Biomedical Data Mining 3(2):1–9. Available from: http://www.omicsonline.com/open-access/sequence-features-and-subset-selection-technique-for-the-prediction-of-protein-trafficking-phenomenon-in-eukaryotic-non-membrane-proteins-2090-4924.1000109.php?aid=39406
Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. AI 97 (1–2):245–271
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324. Available from: http://www.sciencedirect.com/science/article/pii/S000437029700043X
Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1-3):389–422. Available from: doi:10.1023/A:1012487302797
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182. Available from: http://dl.acm.org/citation.cfm?id=944919.944968
Uncu Ö, Türkşen IB (2007) A novel feature selection approach: combining feature wrappers and filters. Inf Sci 177(2):449–466. Available from: http://www.sciencedirect.com/science/article/pii/S0020025506000806
Min H, Fangfang W (2010) Filter-wrapper hybrid method on feature selection. In: 2010 2nd WRI global congress on intelligent systems (GCIS), vol 3. IEEE, pp 98–101
Lan Y, Ren H, Zhang Y, Yu H, Zhao X (2011) A hybrid feature selection method using both filter and wrapper in mammography CAD. In: Proceedings of the 2011 IEEE international conference on image analysis and signal processing (IASP), pp 378–382
Estévez PA, Tesmer M, Perez CA, Zurada JM (2009) Normalized mutual information feature selection. IEEE Trans Neural Netw 20(2):189–201
Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Machine learning international conference, vol 20, p 856. Available from: http://www.public.asu.edu/~huanliu/papers/icml03.pdf
Kira K, Rendell LA (1992) The feature selection problem: traditional methods and a new algorithm. In: Proceedings of the 10th national conference on artificial intelligence. AAAI’92. AAAI Press, pp 129–134. Available from: http://dl.acm.org/citation.cfm?id=1867135.1867155
Hall MA, Smith LA (1998) Practical feature subset selection for machine learning. In: McDonald C (ed) Proceedings of the 21st Australasian computer science conference (ACSC’98), Perth, 4–6 February. Springer, Berlin, pp 181–191
Hall MA (2000) Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the 17th international conference on machine learning. ICML ’00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; pp 359–366. Available from: http://dl.acm.org/citation.cfm?id=645529.657793
Gao K, Khoshgoftaar T, Van Hulse J (2010) An evaluation of sampling on filter-based feature selection methods. In: Proceedings of the 23rd international florida artificial intelligence research society conference, pp 416–421
Efron B, Tibshirani R (1997) Improvements on cross-validation: the 632+ bootstrap method. J Am Stat Assoc 92(438):548–560
Netto OP, Nozawa SR, Mitrowsky RAR, Macedo AA, Baranauskas JA, Lins CUN (2010) Applying decision trees to gene expression data from DNA microarrays: a Leukemia case study. In: XXX congress of the Brazilian computer society, X workshop on medical informatics, p 10
Netto OP, Baranauskas JA (2012) An iterative decision tree threshold filter. In: XXXII congress of the Brazilian computer society, X workshop on medical informatics, p 10
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco
Oshiro TM, Perez PS, Baranauskas JA (2012) How many trees in a random forest?. In: Proceedings of the 8th international conference on machine learning and data mining in pattern recognition. MLDM’12. Springer-Verlag, Berlin Heidelberg, pp 154–168. Available from: doi:10.1007/978-3-642-31537-4_13
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann
Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86–92
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57:289–300
Hall MA, Smith LA (1997) Feature subset selection: a correlation based filter approach. In: 1997 international conference on neural information processing and intelligent information systems. Springer, pp 855–858
Wang Y, Makedon F (2004) Application of Relief-F feature filtering algorithm to selecting informative genes for cancer classification using microarray data. In: Proceeding of the computational systems bioinformatics conference, 2004. CSB 2004, IEEE, pp 497–498
Baranauskas JA, Monard MC (1999) The \(\mathcal {MLL}++\) wrapper for feature subset selection using decision tree, production rule, instance based and statistical inducers: some experimental results. ICMC-USP vol 87 Available from: http://dcm.ffclrp.usp.br/augusto/publications/rt_87.pdf
Lee HD, Monard MC, Baranauskas JA (1999) Empirical comparison of wrapper and filter approaches for feature subset selection. ICMC-USP vol 94. Available from: http://dcm.ffclrp.usp.br/augusto/publications/rt_94.pdf
Kantardzic M (2011) Data mining: concepts, models, methods, and algorithms. Wiley-IEEE Press, Wiley
Frank A, Asuncion A (2010) UCI machine learning repository. Available from: http://archive.ics.uci.edu/ml
Broad Institute (2010) Cancer program data sets. Available from: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
Acknowledgements
This work was partially funded by a joint grant between the National Research Council of Brazil (CNPq) and the Amazon State Research Foundation (FAPEAM) through the National Institutes of Science and Technology Program, INCT ADAPTA Project (Centre for Studies of Adaptations of Aquatic Biota of the Amazon). We are thankful to Cynthia M. Campos Prado Manso for thoroughly reading the draft of this paper.
Appendix: Datasets
The experiments reported here used 30 datasets, all representing real medical data, such as gene expressions, surveys, and diagnoses. The medical domain often imposes difficult obstacles on learning algorithms: high dimensionality, a huge or very small number of instances, several possible class values, unbalanced classes, etc. This sort of data is well suited to filters, not only because of its large dimension but also because filters are computationally more efficient than wrappers [36]. Table 5 shows a summary of the datasets, none of which has missing values for the class attribute.
Since the number of attributes and instances of each dataset can influence the results, we used the density metric \(D_3\) proposed by [28], partitioning the datasets into 8 low-density (Density ≤ 1) and 22 high-density (Density > 1) datasets. We computed density as:

$$D_3 = \log_{A \cdot c} N = \frac{\log N}{\log (A \cdot c)}$$

where N represents the number of instances, A the number of attributes, and c the number of classes.
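The partition above can be reproduced with a few lines of code. This is a minimal sketch, assuming density is the logarithm of the number of instances taken in base A·c (an assumption consistent with the variables N, A, and c and with the unit threshold separating low- from high-density datasets); the example dataset sizes (Leukemia: 72 instances, 7129 attributes, 2 classes; Pima Diabetes: 768 instances, 8 attributes, 2 classes) come from the UCI/Broad descriptions of those datasets.

```python
import math

def density(n_instances: int, n_attributes: int, n_classes: int) -> float:
    """Density of a dataset: log of the number of instances in
    base (attributes x classes). Values <= 1 mean the dataset has
    fewer instances than attribute-class combinations (low density,
    i.e. high-dimensional)."""
    return math.log(n_instances) / math.log(n_attributes * n_classes)

# Gene-expression data: few instances, thousands of attributes.
leukemia = density(72, 7129, 2)   # about 0.45 -> low density
# Survey-style data: many instances, few attributes.
pima = density(768, 8, 2)         # about 2.40 -> high density

print(f"Leukemia: {leukemia:.2f}, Pima: {pima:.2f}")
```

Under this definition the threshold Density = 1 corresponds exactly to N = A·c, which is why gene-expression datasets such as Leukemia fall on the low-density side while clinical-survey datasets such as Pima Diabetes fall on the high-density side.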
Next we provide a brief description of each dataset. Breast Cancer, Lung Cancer, CNS (Central Nervous System Tumour Outcome), Colon, Lymphoma, Leukemia, Leukemia nom., WBC (Wisconsin Breast Cancer), WDBC (Wisconsin Diagnostic Breast Cancer), Lymphography and H. Survival (H. stands for Haberman’s) are all related to cancer, and their attributes consist of clinical, laboratory and gene expression data. Leukemia and Leukemia nom. represent the same data, but the second one had its attributes discretized [25]. C. Arrhythmia (C. stands for Cardiac), Heart Statlog, HD Cleveland, HD Hungarian and HD Switz. (Switz. stands for Switzerland) are related to heart diseases, and their attributes represent clinical and laboratory data. Allhyper, Allhypo, ANN Thyroid, Hypothyroid, Sick and Thyroid 0387 are a series of datasets related to thyroid conditions. Hepatitis and Liver Disorders are related to liver diseases, whereas C. Method (C. stands for Contraceptive), Dermatology, Pima Diabetes (Pima Indians Diabetes) and P. Patient (P. stands for Postoperative) are other datasets related to human conditions. Splice Junction is related to the task of predicting boundaries between exons and introns. E. Coli is related to protein localization sites. Datasets were obtained from the UCI Repository [37]; Leukemia and Leukemia nom. were obtained from [38].
Cite this article
Baranauskas, J., Netto, O., Nozawa, S. et al. A tree-based algorithm for attribute selection. Appl Intell 48, 821–833 (2018). https://doi.org/10.1007/s10489-017-1008-y