Abstract
Machine learning can be broadly divided into supervised and unsupervised learning (Hastie et al. in The elements of statistical learning, Springer, New York, 2009). In supervised learning, also known as classification, a classifier learns from objects with known class labels and later assigns class labels to unknown objects based on the acquired knowledge (Kotsiantis et al. in Proceedings of the 2007 conference on emerging artificial intelligence applications in computer engineering: real word AI systems with applications in eHealth, HCI, information retrieval and pervasive technologies, http://dl.acm.org/citation.cfm?id=1566770.1566773, 2007). In unsupervised learning, objects are grouped without any class information (Jain et al. in ACM Comput Surv (CSUR) 31(3):264–323, 1999). In high-dimensional applications such as gene expression data analysis, machine learning becomes more challenging (Brown et al. in Proc Natl Acad Sci 97(1):262–267, 2000; Sturn et al. in Bioinformatics 18(1):207–208, 2002; Ahmed et al. in IEEE/ACM Trans Comput Biol Bioinform 6:1239–1252, 2014; Mahanta et al. in BMC Bioinform 13(Suppl 13):S4, 2012). Feature selection is therefore a very important preprocessing task in supervised learning, especially on high-dimensional datasets, and a number of feature subset evaluation measures have been proposed in the literature. In this paper, we seek an effective feature evaluation measure that satisfies the requirements imposed by various classifiers. We propose a measure named the strew index to evaluate the correspondence between a feature and the class labels, and we find it very effective in evaluating feature subsets. The measure can also evaluate the correlation between a feature and the labels with respect to a particular class. In addition, it handles both numeric and non-numeric features in a dataset without requiring any conversion from non-numeric to numeric types.
A filter approach is used to select relevant features with high strew index values from a number of UCI and gene expression datasets. In terms of the accuracy achieved by different classifiers over different optimal feature subset sizes, the method outperforms its counterparts in most cases on these datasets.
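The filter workflow described above scores each feature against the class labels and retains the top-ranked ones. The strew index formula itself is defined in the full paper, not here; the sketch below illustrates only the generic filter approach, with `abs_corr` as a hypothetical placeholder relevance score standing in for any feature-class correspondence measure.

```python
import numpy as np

def filter_select(X, y, score_fn, k):
    """Generic filter feature selection: score each feature against the
    class labels independently, then keep the k highest-scoring features."""
    scores = np.array([score_fn(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k].tolist()  # indices of the top-k features

def abs_corr(col, y):
    """Placeholder relevance score (NOT the strew index): absolute
    correlation between a numeric feature and binarized class labels."""
    y_num = (y == y[0]).astype(float)
    if col.std() == 0 or y_num.std() == 0:
        return 0.0
    return abs(np.corrcoef(col, y_num)[0, 1])

# Toy data: feature 0 separates the two classes; feature 1 is constant.
X = np.array([[1.0, 5.0], [2.0, 5.0], [8.0, 5.0], [9.0, 5.0]])
y = np.array(["a", "a", "b", "b"])
print(filter_select(X, y, abs_corr, k=1))  # -> [0]
```

Any per-feature measure (mutual information, consistency, or the strew index) can be plugged in as `score_fn` without changing the selection loop, which is what makes filter methods classifier-independent.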
References
Aha D, Bankert R (1994) Feature selection for case-based classification of cloud types: an empirical comparison. In: Proceedings of the 1994 AAAI workshop on case-based reasoning, pp 106–112
Ahmed HA, Mahanta P, Bhattacharyya DK, Kalita JK (2014) Shifting-and-scaling correlation based biclustering algorithm. IEEE/ACM Trans Comput Biol Bioinform 6:1239–1252
Alcalá J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2010) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Multiple-Valued Log Soft Comput 17(2–3):255–287
Almuallim H, Dietterich T (1991) Learning with many irrelevant features. In: Proceedings of the ninth national conference on artificial intelligence, vol 2, pp 547–552
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci 96(12):6745–6750
Anastassiou D (2007) Computational analysis of the synergy among multiple interacting genes. Mol Syst Biol 3(1). doi:10.1038/msb4100124
Bache K, Lichman M (2014) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed Nov 2014
Borah P, Ahmed HA, Bhattacharyya DK (2014) A statistical feature selection technique. Netw Model Anal Health Inform Bioinform 3(1):1–13
Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci 97(1):262–267
Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’10. ACM, New York, pp 333–342. doi:10.1145/1835804.1835848
Chen Y, Li Y, Cheng XQ, Guo L (2006) Survey and taxonomy of feature selection algorithms in intrusion detection system. In: Information security and cryptology. Springer, pp 153–167
Czerniak J, Zarzycki H (2003) Application of rough sets in the presumptive diagnosis of urinary system diseases. Springer, US, pp 41–51. doi:10.1007/978-1-4419-9226-0_5
Dash M, Liu H (2003) Consistency-based search in feature selection. Artif Intell 151(1):155–176
Ding C, Peng H (2005) Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol 3(02):185–205
Forina M, Leardi R, Armanino C, Lanteri S (1991) Parvus—an extendible package for data exploration, classification and correlation
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537
Hall MA, Smith LA (1997) Feature subset selection: a correlation based filter approach. In: Proceedings of the international conference on neural information processing and intelligent information systems. Springer, pp 855–858
Hastie T, Tibshirani R, Friedman J, Franklin J (2005) The elements of statistical learning: data mining, inference and prediction. Math Intell 27(2):83–85
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning, vol 2. Springer, New York
He X, Cai D, Niyogi P (2006) Laplacian score for feature selection. Adv Neural Inf Process Syst 18:507
Hu Q, Yu D, Liu J, Wu C (2008) Neighborhood rough set based heterogeneous feature subset selection. Inf Sci 178(18):3577–3594
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv (CSUR) 31(3):264–323
Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C et al (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 7(6):673–679
Kira K, Rendell L (1992) The feature selection problem: traditional methods and a new algorithm. In: Proceedings of the tenth national conference on artificial intelligence, pp 129–134
Kirkby R, Frank E, Reutemann P (2007) Weka explorer user guide for version 3-5-5. University of Waikato
Kohavi R, John G (1997) Wrappers for feature subset selection. Artif Intell 97(1):273–324
Kononenko I, Šimec E, Robnik-Šikonja M (1997) Overcoming the myopia of inductive learning algorithms with ReliefF. Appl Intell 7(1):39–55
Kotsiantis SB (2007) Supervised machine learning: a review of classification techniques. In: Proceedings of the 2007 conference on emerging artificial intelligence applications in computer engineering: real word AI systems with applications in eHealth, HCI, information retrieval and pervasive technologies. IOS Press, Amsterdam, pp 3–24. http://dl.acm.org/citation.cfm?id=1566770.1566773
Li T, Zhang C, Ogihara M (2004) A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20(15):2429–2437
Mahanta P, Ahmed HA, Bhattacharyya DK, Kalita JK (2012) An effective method for network module extraction from microarray data. BMC Bioinform 13(Suppl 13):S4
Martins DC, Braga-Neto UM, Hashimoto RF, Bittner ML, Dougherty ER (2008) Intrinsically multivariate predictive genes. IEEE J Sel Topics Signal Process 2(3):424–439
Martins DC, De Oliveira EA, Braga-Neto UM, Hashimoto RF, Cesar RM (2013) Signal propagation in Bayesian networks and its relationship with intrinsically multivariate predictive variables. Inf Sci 225:18–34
Min F, Hu Q, Zhu W (2014) Feature selection with test cost constraint. Int J Approx Reason 55(1):167–179
Mitra P, Murthy C, Pal S (2002) Unsupervised feature selection using feature similarity. IEEE Trans Pattern Anal Mach Intell 24(3):301–312
Molina LC, Belanche L, Nebot À (2002) Feature selection algorithms: a survey and experimental evaluation. In: Proceedings of the 2002 IEEE international conference on data mining (ICDM 2002). IEEE, pp 306–313
Narendra P, Fukunaga K (1977) A branch and bound algorithm for feature subset selection. IEEE Trans Comput C-26(9):917–922
Ng AY, Jordan MI, Weiss Y et al (2002) On spectral clustering: analysis and an algorithm. Adv Neural Inf Process Syst 2:849–856
Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517
Sreeja N, Sankar A (2015) Pattern matching based classification using ant colony optimization based feature selection. Appl Soft Comput 31:91–102
Sturn A, Quackenbush J, Trajanoski Z (2002) Genesis: cluster analysis of microarray data. Bioinformatics 18(1):207–208
Sugumaran V, Muralidharan V, Ramachandran K (2007) Feature selection using decision tree and classification through proximal support vector machine for fault diagnostics of roller bearing. Mech Syst Signal Process 21(2):930–942
Zhang M, Peña J, Robles V (2009) Feature selection for multi-label naive Bayes classification. Inf Sci 179(19):3218–3229
Zhao G, Wu Y, Chen F, Zhang J, Bai J (2015) Effective feature selection using feature vector graph for classification. Neurocomputing 151:376–389
Zhong N, Dong J, Ohsuga S (2001) Using rough sets with heuristics for feature selection. J Intell Inf Syst 16(3):199–214
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare.
Cite this article
Ahmed, H.A., Bhattacharyya, D.K. & Kalita, J.K. Strew index. Netw Model Anal Health Inform Bioinforma 4, 24 (2015). https://doi.org/10.1007/s13721-015-0097-y