Multi-label symbolic value partitioning through random walks
Introduction
Multi-label learning has attracted much attention in recent years. It is rooted in various fields, including image processing [1], [2], text classification [3], and bioinformatics [4], [5]. There are different multi-label learning problems [6], [7], [8], including classification [9], [10], [11], label distribution learning [12], [13], [14], and dimensionality reduction [15], [16]. Consequently, many multi-label learning algorithms have been proposed, including multi-label k-nearest neighbor (ML-kNN) [17], weighted linear loss multiple birth support vector machine based on information granulation [18], support vector machine for multi-label classification based on rank [19], support vector machine for multi-label classification with a zero label [20], multi-label learning with global and local correlation [21], multi-label dimensionality reduction via dependence maximization [22], and multi-label feature selection based on max-dependency and min-redundancy (MDMR) [23].
As with single-label learning, multi-label learning encounters the curse of dimensionality [24], [25]. Multi-label data, such as text [26], images [27], [28], and gene sequences [29], are represented by high-dimensional feature vectors. High dimensionality makes the sample distribution sparse, which increases computational complexity and degrades the performance of the classification model. Hence, a number of dimensionality reduction techniques [30], [31] have been developed, including feature selection [32], [33] and feature extraction [34], [35]. The former selects an optimal subset of the original features according to specific criteria, whereas the latter maps the original features to a low-dimensional space, generating new features through specific transformations.
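The distinction between the two families can be sketched in a few lines; the data, the chosen column indices, and the projection matrix below are all hypothetical stand-ins, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))      # toy data: 6 samples, 4 features

# Feature selection: keep a subset of the original columns.
selected = [0, 2]                # indices chosen by some importance criterion
X_sel = X[:, selected]           # still the original, interpretable features

# Feature extraction: project onto a low-dimensional space.
# A random 4x2 projection stands in for a learned mapping such as PCA.
W = rng.normal(size=(4, 2))
X_ext = X @ W                    # new, transformed features

print(X_sel.shape, X_ext.shape)  # (6, 2) (6, 2)
```

Both reduce the dimension from 4 to 2, but only selection preserves the original feature semantics, which is why it is often preferred when interpretability matters.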
Multi-label feature selection [36], [37] can be broadly divided into two categories [38]: problem transformation (PT) and algorithm adaptation (AA). PT methods transform the multi-label learning problem into a series of single-label learning problems, so they can directly reuse existing single-label feature selection algorithms. Researchers designed two transformation approaches, namely binary relevance (BR) [39], [40] and label power-set (LP) [41], and two importance indicators, namely ReliefF (RF) and information gain (IG), yielding four PT-based feature selection algorithms (RF-BR, RF-LP, IG-BR, and IG-LP). AA methods modify existing single-label feature selection algorithms so that they can be applied directly to multi-label data. MLNB [42] adapts traditional naïve Bayes classifiers to multi-label feature selection. NRPS [43], inspired by similarity preservation, uses a neighborhood relationship preserving score for multi-label feature selection.
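The two transformation approaches are simple to state concretely. A minimal sketch with toy data (the feature vectors and label sets below are invented for illustration):

```python
# Toy multi-label data: each instance has a feature vector and a set of labels.
X = [[1, 0], [0, 1], [1, 1]]
Y = [{"a"}, {"a", "b"}, {"b"}]
labels = ["a", "b"]

# Binary relevance (BR): one independent binary problem per label,
# asking "does this instance carry label l?"
br_problems = {
    l: [(x, int(l in y)) for x, y in zip(X, Y)] for l in labels
}

# Label power-set (LP): one multi-class problem whose classes are
# entire label sets, so label correlations are preserved.
lp_problem = [(x, frozenset(y)) for x, y in zip(X, Y)]

print(br_problems["a"])  # [([1, 0], 1), ([0, 1], 1), ([1, 1], 0)]
print(lp_problem[1][1])  # frozenset({'a', 'b'})
```

After either transformation, any single-label importance measure (e.g. ReliefF or information gain) can score features on the resulting problems, which is exactly how the four PT-based selectors are obtained.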
Symbolic value partitioning is another knowledge reduction technique. Like discretization [44], it decreases the size of the attribute domains; the difference is that it applies to symbolic rather than numeric data. In fact, symbolic value partitioning is more general than both discretization and feature selection. Several symbolic value partitioning algorithms have been proposed for single-label data. Nguyen et al. [45] proposed a rough set approach that converts the symbolic value partitioning problem into a graph coloring problem. Wen and Min [46] proposed a granular computing framework with adaptive granule construction and selection. However, to the best of our knowledge, the corresponding multi-label problem has not yet been studied.
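To make the notion concrete: a partition groups the values of a symbolic attribute into blocks, and the reduced attribute takes one value per block. The attribute, its values, and the chosen partition below are invented for illustration only:

```python
# One symbolic attribute with domain {"red", "pink", "blue", "navy"},
# and a partition merging similar values into blocks (toy example).
column = ["red", "pink", "blue", "navy", "pink"]
partition = [{"red", "pink"}, {"blue", "navy"}]

# Map each value to the index of its block; the domain shrinks from 4 to 2.
block_of = {v: i for i, block in enumerate(partition) for v in block}
reduced = [block_of[v] for v in column]

print(reduced)  # [0, 0, 1, 1, 0]
```

Note how this subsumes discretization (blocks of adjacent numeric intervals) and feature selection (partitioning an attribute's whole domain into one block effectively removes it), which is why symbolic value partitioning is the more general reduction.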
In this paper, we propose the multi-label symbolic value partitioning through random walks (MSPR) algorithm. Fig. 1 illustrates the framework of the new algorithm, which consists of two stages: graph construction and clustering. In the graph construction stage, an undirected weighted graph is built for each attribute, where the weight of each edge represents the similarity between two attribute values. This similarity is computed from the local information provided by the attribute and all labels. In the clustering stage, a random walk algorithm clusters the attribute values on each weighted graph. A key parameter is the separating threshold, which determines whether an edge crosses a cluster boundary and should therefore be cut; it is obtained by iterating the separation process on each graph. Moreover, a neighborhood similarity method is used for edge separation.
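A highly simplified sketch of the second stage may help fix ideas. This is not the authors' algorithm: the similarities are given directly rather than computed from labels, and the random walk with iterated threshold estimation is replaced by a plain threshold cut followed by connected components. All values and weights are hypothetical:

```python
from collections import defaultdict

# Toy weighted graph over the values of one attribute; in MSPR the edge
# weight would be a similarity derived from the attribute and all labels.
weights = {
    ("red", "pink"): 0.9,
    ("blue", "navy"): 0.8,
    ("pink", "blue"): 0.2,
}
threshold = 0.5  # separating threshold: edges below it are cut

# Cut weak edges, then treat the connected components of the remaining
# graph as the clusters (i.e., the blocks of the value partition).
adj = defaultdict(set)
for (u, v), w in weights.items():
    if w >= threshold:
        adj[u].add(v)
        adj[v].add(u)

nodes = {n for edge in weights for n in edge}
seen, clusters = set(), []
for n in sorted(nodes):
    if n in seen:
        continue
    stack, comp = [n], set()
    while stack:
        u = stack.pop()
        if u in comp:
            continue
        comp.add(u)
        stack.extend(adj[u] - comp)
    seen |= comp
    clusters.append(comp)

print(clusters)  # two clusters: {red, pink} and {blue, navy}
```

The weak "pink"–"blue" edge is cut, so the attribute values split into two blocks, exactly the partition an attribute-value clustering should produce.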
The main contributions of this paper are threefold. First, we introduce the multi-label symbolic value partition problem, which is important in several applications. For instance, in medicine, when multiple labels represent multiple diagnoses (so-called comorbidity), simplifying the attribute values can provide a better understanding of the problem and faster, more effective decisions for new patients [47], [48]. Second, we convert the new problem into a clustering problem over attribute values. To the best of our knowledge, this is the first time clustering techniques have been applied to attribute value partitioning. Third, we apply a graph-clustering random walk algorithm to the new problem. The constructed graph indicates the degree of association between attribute values.
Experiments were performed on 13 benchmark multi-label datasets to quantify the performance of the MSPR algorithm. The datasets were selected from different application areas: bioinformatics, video, images, semantic scene analysis, and text categorization. The number of instances ranged from 194 to 48,536, the number of attributes ranged from 19 to 500, and the number of labels ranged from 6 to 174. We compared the MSPR algorithm with four feature selection methods [38] using ReliefF and information gain as feature importance measures: ReliefF-Binary Relevance (RF-BR), ReliefF-Label Powerset (RF-LP), Information Gain-Binary Relevance (IG-BR), and Information Gain-Label Powerset (IG-LP). We also compared the MSPR algorithm with two feature selection methods using information-theoretic approaches as feature importance measures: MDMR [23] and fast information-theoretic multi-label feature ranking (FIMF) [49]. Additionally, we compared the MSPR algorithm with an embedded multi-label feature selection method with manifold regularization called manifold regularized discriminative feature selection for multi-label learning (MDFS) [50]. The results demonstrated that the MSPR algorithm outperformed the other algorithms on most datasets.
The remainder of this paper is organized as follows: In Section 2, we describe some basic concepts that are used throughout the paper. In Section 3, we present and analyze the MSPR algorithm. In Section 4, we present experimental results with analysis. Finally, in Section 5, we draw some conclusions and discuss future work.
Preliminaries
In this section, we review the main concepts that will be used in the discussion, including multi-label decision system, partition, clustering, and graph. We also redefine certain concepts in terms of attribute value partitioning. Table 1 lists notation used throughout the paper.
Algorithm
In this section, we first define the MSPR problem. Then we discuss the general framework of our approach using two subroutines. Finally, we discuss the heuristic function of weighted graph construction and the clustering method based on a random walk algorithm.
Experiments
We conducted experiments on 13 multi-label datasets to verify the performance of our proposed method and compared the results with those of seven other feature selection methods.
Conclusions
In this paper, we proposed a solution to the multi-label symbolic value partition problem. An efficient MSPR algorithm that consists of two stages was proposed to solve this issue. The goal of the proposed algorithm is to enhance the generalization ability and, simultaneously, help the classifier to obtain good classification performance. We compared MSPR with seven popular feature selection algorithms on 13 datasets. The experimental results demonstrated that the MSPR algorithm achieved better classification performance than the compared algorithms on most datasets.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China [grant numbers 61573321, 41631179, 41604114, 61976194]; the Key Laboratory of Oceanographic Big Data Mining and Application of Zhejiang Province [grant number OBDMA201601]; and the Scientific Research Starting Project of SWPU [grant number 2018QHR007].
Liu-Ying Wen received her M.S. degree from the School of Computer, Central China Normal University, Wuhan, China, in 2009, and her Ph.D. degree from the School of Petroleum Engineering and Technology, Southwest Petroleum University, Chengdu, China, in 2017. She is currently a lecturer at Southwest Petroleum University, Chengdu, China. Her current research interests include dimensionality reduction, granular computing, and data mining.
References (53)
- et al., Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification, Inf. Sci. (2019)
- et al., A comparison study of similarity measures for covering-based neighborhood classifiers, Inf. Sci. (2018)
- et al., Weighted linear loss multiple birth support vector machine based on information granulation for multi-class classification, Pattern Recognit. (2017)
- An efficient multi-label support vector machine with a zero label, Expert Syst. Appl. (2012)
- et al., Multi-label feature selection based on max-dependency and min-redundancy, Neurocomputing (2015)
- et al., Multi-label learning with label-specific feature reduction, Knowl. Based Syst. (2016)
- et al., Neighborhood rough sets based multi-label classification for automatic image annotation, Int. J. Approximate Reasoning (2013)
- et al., Hierarchical attribute reduction algorithms for big data using MapReduce, Knowl. Based Syst. (2015)
- et al., Cost-sensitive feature selection based on adaptive neighborhood granularity with multi-level confidence, Inf. Sci. (2016)
- et al., A multiway p-spectral clustering algorithm, Knowl. Based Syst. (2019)
- A multi-label feature extraction algorithm via maximizing feature variance and feature-label dependence simultaneously, Knowl. Based Syst.
- A comparison of multi-label feature selection methods using the problem transformation approach, Electron. Notes Theor. Comput. Sci.
- Label ranking by learning pairwise preferences, Artif. Intell.
- Ensemble methods for multi-label classification, Expert Syst. Appl.
- Feature selection for multi-label naive Bayes classification, Inf. Sci.
- A flexible data-driven comorbidity feature extraction framework, Comput. Biol. Med.
- Fast multi-label feature selection based on information-theoretic feature ranking, Pattern Recognit.
- Manifold regularized discriminative feature selection for multi-label learning, Pattern Recognit.
- Rough sets approach to symbolic value partition, Int. J. Approximate Reasoning
- Feature selection with test cost constraint, Int. J. Approximate Reasoning
- CNN-RNN: a unified framework for multi-label image classification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Automatic image annotation via local multi-label classification, Proceedings of the International Conference on Content-Based Image and Video Retrieval
- Document transformation for multi-label feature selection in text categorization, Proceedings of the Seventh IEEE International Conference on Data Mining
- A review of feature selection techniques in bioinformatics, Bioinformatics
- Mining multi-label data, Data Mining and Knowledge Discovery Handbook
- Large-scale multi-label learning with missing labels, Proceedings of the International Conference on Machine Learning
Cited by (4)
- An incremental random walk algorithm for sampling continuous fitness landscapes, Neurocomputing (2023)
- KGA: integrating KPCA and GAN for microbial data augmentation, International Journal of Machine Learning and Cybernetics (2023)
- Hierarchical multilabel classification by exploiting label correlations, International Journal of Machine Learning and Cybernetics (2022)
- Decision-Theoretic Rough Set: A Fusion Strategy, IEEE Access (2020)
Chao-Guang Luo is a graduate student at the School of Computer Science, Southwest Petroleum University. His current research interests include multi-label learning and feature selection.
Wei-Zhi Wu received the B.Sc. degree in mathematics from Zhejiang Normal University, Jinhua, China, in 1986, the M.Sc. degree in mathematics from East China Normal University, Shanghai, China, in 1992, and the Ph.D. degree in applied mathematics from Xi'an Jiaotong University, Xi'an, China, in 2002. He is currently a Professor with the School of Mathematics, Physics, and Information Science, Zhejiang Ocean University, Zhejiang, China. His current research interests include approximate reasoning, rough sets, random sets, formal concept analysis, and granular computing.
Fan Min received the M.S. and Ph.D. degrees from the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China, in 2000 and 2003, respectively. He visited the University of Vermont, Burlington, Vermont, from 2008 to 2009. He is currently a professor with Southwest Petroleum University, Chengdu. He has published more than 100 refereed papers in various journals and conferences, including Information Sciences, International Journal of Approximate Reasoning, and Knowledge-Based Systems. His current research interests include data mining, recommender systems, active learning, and granular computing.