Stochastic nonparallel hyperplane support vector machine for binary classification problems and no-free-lunch theorems

  • Research Paper
  • Published:
Evolutionary Intelligence

Abstract

In this paper, the binary classification problem is considered and a classification model is proposed for its solution, based on the genetic algorithm (GA) and the nonparallel hyperplane support vector machine (NHSVM), termed the stochastic nonparallel hyperplane support vector machine (SNHSVM). Since the GA provably violates the non-revisiting condition of the no-free-lunch theorems for optimization (NFLO), SNHSVM has the natural property that the NFLO do not apply to it. All the experiments are performed in a scenario in which the no-free-lunch theorems for machine learning (NFLM) do not apply to any of the compared machines. The hypothesis is that in such a scenario some classifiers can perform better than others. The experiments are performed on real-world UCI datasets, and SNHSVM is compared with state-of-the-art support-vector-based classifiers with accuracy as the performance measure. SNHSVM achieves the highest accuracy in 100% of the cases, and the Friedman test confirms its better performance on all of the datasets used. These results validate the hypothesis empirically, while the NFLM re-emerge for the other compared classifiers.


References

  1. Schölkopf B, Smola AJ (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge

  2. Haykin SS (2009) Neural networks and learning machines. Pearson, Upper Saddle River

  3. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  4. Alzubi J, Nayyar A, Kumar A (2018) Machine learning from theory to algorithms: an overview. J Phys Conf Ser 1142(1):012012

  5. Zhang Y, Zhao Y (2014) Applications of support vector machines in astronomy. In: Astronomical data analysis software and systems XXIII, vol 485, p 239

  6. Díaz J, Acosta J, González R, Cota J, Sifuentes E, Nebot À (2018) Modeling the control of the central nervous system over the cardiovascular system using support vector machines. Comput Biol Med 93:75–83

  7. Li H, Liang Y, Xu Q (2009) Support vector machines and its applications in chemistry. Chemom Intell Lab Syst 95(2):188–198

  8. Azzam M, Awad M, Zeaiter J (2018) Application of evolutionary neural networks and support vector machines to model NOx emissions from gas turbines. J Environ Chem Eng 6:1044–1052

  9. Yan J, Jin J, Chen F, Yu G, Yin H, Wang W (2018) Urban flash flood forecast using support vector machine and numerical simulation. J Hydroinform 20(1):221–231

  10. Wang F, Liu S, Ni W, Xu Z, Qiu Z, Wan Z, Pan Z (2019) Imbalanced data classification algorithm with support vector machine kernel extensions. Evol Intell 12(3):341–347

  11. Mangasarian OL, Wild EW (2006) Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Trans Pattern Anal Mach Intell 28(1):69–74

  12. Khemchandani R, Chandra S (2007) Twin support vector machines for pattern classification. IEEE Trans Pattern Anal Mach Intell 29(5):905–910

  13. Shao YH, Zhang CH, Wang XB, Deng NY (2011) Improvements on twin support vector machines. IEEE Trans Neural Netw 22(6):962–968

  14. Qi Z, Tian Y, Shi Y (2013) Robust twin support vector machine for pattern classification. Pattern Recognit 46(1):305–316

  15. Qi Z, Tian Y, Shi Y (2012) Laplacian twin support vector machine for semi-supervised classification. Neural Netw 35:46–53

  16. Shao YH, Chen WJ, Deng NY (2014) Nonparallel hyperplane support vector machine for binary classification problems. Inf Sci 263:22–35

  17. Ding S, Zhang X, Yu J (2016) Twin support vector machines based on fruit fly optimization algorithm. Int J Mach Learn Cybern 7(2):193–203

  18. Wang Z, Shao YH, Wu TR (2013) A GA-based model selection for smooth twin parametric-margin support vector machine. Pattern Recognit 46(8):2267–2277

  19. Zhang X, Qiu D, Chen F (2015) Support vector machine with parameter optimization by a novel hybrid method and its application to fault diagnosis. Neurocomputing 149:641–651

  20. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82

  21. Wolpert DH (2002) The supervised learning no-free-lunch theorems. In: Roy R, Köppen M, Ovaska S, Furuhashi T, Hoffmann F (eds) Soft computing and industry. Springer, London, pp 25–42

  22. Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural Comput 8(7):1341–1390

  23. Wolpert DH (1996) The existence of a priori distinctions between learning algorithms. Neural Comput 8(7):1391–1420

  24. Liu F, Zhou Z (2015) A new data classification method based on chaotic particle swarm optimization and least square-support vector machine. Chemom Intell Lab Syst 147:147–156

  25. Zhai S, Jiang T (2015) A new sense-through-foliage target recognition method based on hybrid differential evolution and self-adaptive particle swarm optimization-based support vector machine. Neurocomputing 149:573–584

  26. Zhai S, Jiang T (2014) A novel particle swarm optimization trained support vector machine for automatic sense-through-foliage target recognition system. Knowl Based Syst 65:50–59

  27. Pai PF, Hong WC (2005) Support vector machines with simulated annealing algorithms in electricity load forecasting. Energy Convers Manag 46(17):2669–2688

  28. Zhai S, Pan J, Luo H, Fu S, Chen H (2016) A new sense-through-foliage target recognition method based on hybrid particle swarm optimization-based wavelet twin support vector machine. Measurement 80:58–70

  29. Sartakhti JS, Afrabandpey H, Saraee M (2017) Simulated annealing least squares twin support vector machine (SA-LSTSVM) for pattern classification. Soft Comput 21(15):4361–4373

  30. Yang XS (2014) Nature-inspired optimization algorithms. Elsevier, Amsterdam

  31. Nayyar A, Le DN, Nguyen NG (eds) (2018) Advances in swarm intelligence for optimizing problems in computer science. CRC Press, Boca Raton

  32. Xing B, Gao WJ (2016) Innovative computational intelligence: a rough guide to 134 clever algorithms. Springer, Berlin, p 105

  33. Khemchandani R, Chandra S (2016) Twin support vector machines: models, extensions and applications. Springer, Berlin

  34. Holland JH (1975) Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor

  35. De Jong KA (1975) Analysis of the behavior of a class of genetic adaptive systems. University of Michigan, Technical Report No 185

  36. Nayyar A, Nguyen NG (2018) Introduction to swarm intelligence. Adv Swarm Intell Optim Probl Comput Sci 3:53–78

  37. Nayyar A, Garg S, Gupta D, Khanna A (2018) Evolutionary computation: theory and algorithms. In: Nayyar A, Le D-N, Nguyen NG (eds) Advances in swarm intelligence for optimizing problems in computer science. Chapman and Hall/CRC, pp 1–26

  38. Wright AH, Zhao Y (1999) Markov chain models of genetic algorithms. In: Proceedings of the 1st annual conference on genetic and evolutionary computation, vol 1. Morgan Kaufmann Publishers Inc, pp 734–741

  39. Suzuki J (1995) A Markov chain analysis on simple genetic algorithms. IEEE Trans Syst Man Cybern 25(4):655–659

  40. Nix AE, Vose MD (1992) Modeling genetic algorithms with Markov chains. Ann Math Artif Intell 5(1):79–88

  41. Goldberg DE, Segrest P (1987) Finite Markov chain analysis of genetic algorithms. In: Proceedings of the second international conference on genetic algorithms, vol 1, p 1

  42. Sewell M, Shawe-Taylor J (2012) Forecasting foreign exchange rates using kernel methods. Expert Syst Appl 39(9):7652–7662

  43. Aggarwal CC (2015) Data mining: the textbook. Springer, Berlin

  44. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol (TIST) 2(3):1–27

  45. Blake CL, Merz CJ (1998) UCI repository for machine learning databases. Department of Information and Computer Science, University of California, Irvine [Online]. http://www.ics.uci.edu/~mlearn/MLRepository.htm

  46. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  47. He J, Kang L (1999) On the convergence rates of genetic algorithms. Theor Comput Sci 229(1–2):23–39

  48. Rudolph G (1994) Convergence analysis of canonical genetic algorithms. IEEE Trans Neural Netw 5(1):96–101


Acknowledgements

I am highly thankful to Mr. Arvind Upadhyay (IES-IPSA, India), the entire computer science department of IES-IPSA, and Dr. D. H. Wolpert for their helpful suggestions. I am heartily indebted to the anonymous reviewers, whose invaluable comments pushed me beyond my thinking limits.

Author information

Corresponding author

Correspondence to Ashish Sharma.

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

Here, I analyze the GA presented in Algorithm-I mathematically and prove the theorems used in the paper.

Consider the optimization problem (the same as (16)):

$$\max \left\{ {f\left( x \right);\; x \in X} \right\}$$
(24)

where \(X\) is a bounded and compact measurable space and \(f\left( x \right)\) is called the objective function. The set \(S_{OPT}\) is the set of all \(x\) at which \(f\left( x \right)\) attains the maximum of \(f\); formally:

$$S_{OPT} = \left\{ {x;\left| {f\left( x \right) - f_{max} } \right| = 0} \right\},$$

where \(f_{max}\) is the maximum value of \(f\). In this analysis it is assumed that \(S_{OPT}\) is not empty. Let \(\phi \left( \cdot \right)\) be a measure on \(X\); then \(\phi \left( {S_{OPT} } \right) = 0\), as \(S_{OPT}\) usually contains only a few elements. So, for ease of analysis, the set \(S_{OPT - \varepsilon } = \left\{ {x;\left| {f\left( x \right) - f_{max} } \right| \le \varepsilon } \right\}\), where \(\varepsilon\) is a small positive number, is what I mean by \(S_{OPT}\) henceforth.
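The relaxed optimal set can be checked directly; a minimal sketch, assuming a hypothetical objective f on a small finite search space X (not the paper's classification objective):

```python
# Sketch: membership test for the relaxed optimal set S_OPT-eps,
# using a hypothetical objective f on a small finite search space X.

def in_s_opt_eps(x, f, f_max, eps):
    """True if |f(x) - f_max| <= eps, i.e. x is in S_OPT-eps."""
    return abs(f(x) - f_max) <= eps

# Toy example: X = {0, ..., 7}, f(x) = -(x - 5)**2 (maximum at x = 5).
X = range(8)
f = lambda x: -(x - 5) ** 2
f_max = max(f(x) for x in X)          # 0, attained at x = 5
s_opt_eps = [x for x in X if in_s_opt_eps(x, f, f_max, eps=1.5)]
print(s_opt_eps)                      # [4, 5, 6]: all satisfy |f(x)| <= 1.5
```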

Problems (24) and (16) are the same, and they are solved with the genetic algorithm (GA) described previously in the paper as Algorithm-I. For that GA, a mathematical model is now presented for its convergence and performance analysis. The GA is modeled as a Markov chain next.

A Markov chain is a random process that is completely described by its state transition probability matrix. Therefore, to describe a Markov chain one needs to specify its states and the state transition probabilities. Let the GA operate on binary strings (solutions), each of fixed length \(l\). Then the total number of possible strings is \(r = 2^{l}\). In what follows, italic p stands for probability, while non-italic p stands for population (the context also makes this distinction clear). Let the set of all possible strings be \(\varOmega \left( { = X} \right)\) and \(\varOmega^{n}\) be the set of all possible n-string sets, i.e. let \(\text p = \left\{ {x_{1} ,x_{2} \ldots x_{n} ;x_{i} \in \varOmega \;for\;all\;1 \le i \le n} \right\}\); then \(\varOmega^{n} = \left\{ {\text p;\text p \subseteq \varOmega } \right\}\). The set p is called a population of size \(n\), and the number of possible populations is N. Let p be represented as \(\varvec{x}\) (bold \(\varvec{x}\)); the fitness of \(\varvec{x}\) is defined as:

$$f\left( \varvec{x} \right) = max \left\{ {f\left( {x_{i} } \right); x_{i} \in \varvec{x}} \right\}.$$

The set of all populations that contains the global optimal string is defined as:

$$S_{OPT}^{n} = \left\{ {\varvec{x};\left( {\exists x_{i} \in \varvec{x}} \right)\left( {x_{i} \in S_{OPT} } \right)} \right\}.$$

Definition 1

Let Z be an \({\text{r}} \times N\) matrix (rows of Z correspond to all possible strings and columns to all possible populations) defined as \(Z = \left[ {z_{i,j} } \right]\), where \(z_{i,j}\) is the number of occurrences of string i in population j.

It is clear that some enumeration of all possible strings and populations must exist so that Z is well defined; one can agree on any possible enumeration. One interpretation based on Z is as follows: the strings are numbered from \(0\) to \(r - 1\), and, letting \(p_{i}\) and \(p_{j}\) be any two populations as column vectors of Z, \(p_{i}\) comes before \(p_{j}\) if interpreting the transposes of \(p_{i}\) and \(p_{j}\) as integers yields \(p_{i} < p_{j}\).

Definition 2

The number of possible populations is

$$N = \left( {\begin{array}{*{20}c} {n + r - 1} \\ {r - 1} \\ \end{array} } \right).$$
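Definition 2 is the standard stars-and-bars count of size-n multisets drawn from r strings; a quick sketch of the count, with hypothetical values of n and l:

```python
# Sketch: counting the N possible populations of Definition 2 for
# binary strings of length l (so r = 2**l) and population size n.
from math import comb

def num_populations(n, l):
    r = 2 ** l                         # number of distinct strings
    return comb(n + r - 1, r - 1)      # multisets of size n from r items

print(num_populations(n=4, l=2))       # C(4 + 4 - 1, 3) = C(7, 3) = 35
```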

Definition 3

The states of the Markov chain are the N possible populations.

Definition 4

The transition probabilities are given by an \(N \times N\) transition probability matrix \(\left( {\text{Q}} \right)\) in which an entry \(Q_{ij}\) gives the probability that the kth generation (population) will be \({\text{p}}_{j}\) given that the \(k - 1{\text{st}}\) generation is \({\text{p}}_{i}\).

\(Q_{ij}\) can be calculated as follows with respect to \(p_{i} \left( y \right)\), where \(p_{i} \left( y \right)\) is the probability of producing string \(y\) in the \(i + 1{\text{th}}\) generation, assuming the current generation is the ith, having population \({\text{p}}_{i}\). Let the next generation be \({\text{p}}_{j}\), having \(z_{y,j}\) occurrences of string y; the probability of such occurrences of string y is (almost) given by \(\left\{ {p_{i} \left( y \right)} \right\}^{{z_{y,j} }}\). We can fill the n vacancies for strings in a population (\({\text{say}}\;{\text{p}}_{j}\)) as follows:

The binomial coefficient gives us the number of ways of choosing \(z_{0,j }\) occurrences of string 0 as:

$$\left( {\begin{array}{*{20}c} n \\ {{\text{z}}_{0,j} } \\ \end{array} } \right).$$

After that, \(n - z_{0,j}\) vacancies remain to be filled, so for string 1 the number of ways of choosing \(z_{1,j}\) occurrences of it is:

$$\left( {\begin{array}{*{20}c} {n - z_{0,j} } \\ {z_{1,j} } \\ \end{array} } \right).$$

Continuing as above for all the strings, the total number of combinations is:

$$\left( {\begin{array}{*{20}c} n \\ {z_{0,j} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {n - z_{0,j} } \\ {z_{1,j} } \\ \end{array} } \right) \cdots \left( {\begin{array}{*{20}c} {n - z_{0,j} - z_{1,j} - \cdots - z_{r - 2,j} } \\ {z_{r - 1,j} } \\ \end{array} } \right) = \frac{n!}{{z_{0,j} !\,z_{1,j} ! \cdots z_{r - 1,j} !}}.$$

So, \(Q_{ij}\) can be written as a multinomial distribution with parameters \(n, p_{i} \left( 0 \right), \ldots , p_{i} \left( {r - 1} \right)\) as:

$$Q_{ij} = n!\mathop \prod \limits_{y = 0}^{r - 1} \frac{{\left\{ {p_{i} \left( y \right)} \right\}^{{z_{y,j} }} }}{{z_{y,j} !}}.$$
(25)
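Eq. (25) can be evaluated numerically; a minimal sketch with hypothetical production probabilities \(p_{i}(y)\) and a toy target population:

```python
# Sketch: the multinomial transition probability of Eq. (25), given
# hypothetical per-string production probabilities p_i(y) and the
# occurrence counts z_{y,j} of the target population p_j.
from math import factorial, prod

def q_ij(p_i, z_j):
    """Multinomial probability of drawing population j from p_i(.)."""
    n = sum(z_j)                       # population size
    return factorial(n) * prod(
        p ** z / factorial(z) for p, z in zip(p_i, z_j)
    )

# r = 2 strings, n = 3 slots, p_i = (0.5, 0.5): probability of 2 copies
# of string 0 and 1 copy of string 1 is C(3,2) * 0.5**3 = 0.375.
print(q_ij((0.5, 0.5), (2, 1)))        # 0.375
```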

Definition 5

(Model-I) [40] The Markov chain model for GA is described by \(Q_{ij}\) whose states are as defined in Definition 3.

Now it can be proved, as Theorem 1, that the GA can break the non-revisiting condition of the NFLO.

Theorem 1

The GA defined in Definition 5 has a positive probability of breaking the non-revisiting condition of the NFLO.

Proof

According to the above discussion and the definition of \(Q_{ij}\), the probability of repeating a string y in the next generation is given by \(p_{i} \left( y \right)\); moreover, the probability that there are \(z_{y,j}\) occurrences of string y in the next generation is given by \(\left\{ {p_{i} \left( y \right)} \right\}^{{z_{y,j} }}\). Note that it cannot be 0, as otherwise \({\text{Q}}_{ij}\) itself would become 0 (and the GA would never progress), which is impossible.

Model-I as defined in Definition 5 explicitly contains \(p_{i} \left( y \right)\), which forces one to incorporate the definitions of the genetic operators (mutation, crossover and selection). I could do this, but I want to be as general as possible, so a Markov chain model independent of any such assumptions about the genetic operators will be very useful. Such a model is defined next.

Let \(\xi \left( t \right)\), \(\xi_{M} \left( t \right),\) \(\xi_{C} \left( t \right)\), \(\xi_{S} \left( t \right)\) and \(\xi \left( {t + 1} \right)\) be random variables on \(\varOmega^{n}\) defined as follows:

  1. (i)

    \(\xi \left( t \right)\) is the population state at time t.

  2. (ii)

    \(\xi_{M} \left( t \right)\) is the population state after the application of mutation operator.

  3. (iii)

    \(\xi_{C} \left( t \right)\) is the population state after the application of crossover operator.

  4. (iv)

    \(\xi_{S} \left( t \right)\) is the population state after the application of selection operator.

  5. (v)

    \(\xi \left( {t + 1} \right) = \xi_{S} \left( t \right)\).

In the GA described in Algorithm-I, all the genetic operators (mutation, crossover and selection) are invariant in time; therefore the transition probability functions are also invariant in time.

Definition 6

GA operators are defined by their respective transition probability functions as follows. Let \(A \subseteq \varOmega^{n}\):

  1. (i)

    Mutation transition probability function defines the mutation operator as follows:

    $$p_{M} \left( {\varvec{x},A} \right) = p_{M} \left( {\xi_{M} \left( t \right) \in A |\xi \left( t \right) = \varvec{x}} \right)$$
  2. (ii)

    Crossover transition probability function defines the crossover operator as follows:

    $$p_{C} \left( {\varvec{x},A} \right) = p_{C} \left( {\xi_{C} \left( t \right) \in A |\xi_{M} \left( t \right) = \varvec{x}} \right)$$
  3. (iii)

    Selection transition probability function defines the selection operator as follows:

    $$p_{S} \left( {\varvec{x},A} \right) = p_{S} \left( {\xi_{S} \left( t \right) \in A |\xi_{C} \left( t \right) = \varvec{x}} \right)$$

Definition 7

(Model-II) [47] The GA is defined mathematically as the Markov chain \(\left\{ {\xi \left( t \right);t \in Z^{ + } } \right\}\) with state space \(\varOmega^{n}\) and transition probability given by the Chapman–Kolmogorov equation as:

$$p\left( {\varvec{x},A} \right) = \int_{\varvec{y}} \int_{\varvec{z}} p_{M} \left( {\varvec{x},d\varvec{y}} \right)p_{C} \left( {\varvec{y},d\varvec{z}} \right)p_{S} \left( {\varvec{z},A} \right).$$
(26)

Note that (25) and (26) are basically the same. Models I and II are partial models of the GA (described in Algorithm-I); the complete model is defined in Definition 10.
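In the finite-state case, the composition in (26) reduces to a product of the three operator transition matrices; a sketch with hypothetical 2-state matrices (chosen only for illustration):

```python
# Sketch: discrete Chapman-Kolmogorov composition of Eq. (26) as a
# product of operator transition matrices (hypothetical 2-state case).
import numpy as np

p_m = np.array([[0.9, 0.1], [0.1, 0.9]])   # mutation
p_c = np.array([[1.0, 0.0], [0.0, 1.0]])   # crossover (identity here)
p_s = np.array([[0.8, 0.2], [0.3, 0.7]])   # selection

p = p_m @ p_c @ p_s                         # one full GA generation
print(p)                                    # rows remain stochastic
```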

Theorem 2

The Markov chains of model-I and model-II are ergodic.

Proof

This fact follows directly from the inclusion of the mutation operator in the chains: because of mutation, it is possible to go from every state to every state. For model-I, I assume the same definition of the mutation operator as for model-II, as given in Definition 6.

Theorem 3

In an ergodic Markov chain the expected transition time from any state to any other state is finite, irrespective of the states.

Proof

This is Theorem 5 in [48].

Next, two properties of great importance are defined:

Definition 8

  1. (i)

    Property 1 (□1): Maintain the best solution found so far over time.

  2. (ii)

    Property 2 (□2): It must be possible to go from any state of the Markov chain to any other state.

□1 says that the best solution found so far must be preserved so that it will not be lost; it holds in the GA by requiring that the GA operators never lose the best solution. □2 is basically the accessibility property, saying that all states must be accessible starting from any state; this property makes the set \(S_{OPT}^{n}\) accessible. In model-II (and therefore in model-I), □2 holds because of Theorem 2, and if □1 is forced to hold then the Markov chain of model-II becomes absorbing, as the states containing the global optimal solutions \(S_{OPT}^{n}\) become absorbing. The following result says that all states can reach the optimal state(s), and in expected finite time.

Theorem 4

In Model-II satisfying □1 and □2, it is possible to reach the global optimal set \(S_{OPT}^{n}\), and the expected transition time from any initial state to any state in \(S_{OPT}^{n}\) is finite.

Proof

By Theorem 2, Model-II is ergodic, and by Theorem 3 the expected time to go from one state to another in model-II is finite, irrespective of the states. □1 makes the states in \(S_{OPT}^{n}\) absorbing, which means that if the Markov chain enters any of the states in \(S_{OPT}^{n}\) then it will not leave \(S_{OPT}^{n}\). □1 does not change the other states, so it is still possible to go from any state to any state in \(S_{OPT}^{n}\) in finite time, which is the basic property of model-II by Theorems 2 and 3. In fact, model-II with □1 and □2 now becomes an absorbing Markov chain. Therefore, \(S_{OPT}^{n}\) is reachable in expected finite time from any initial state.
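The absorbing behaviour argued above can be illustrated by simulation; a toy sketch with a hypothetical fitness and operator choices (not the paper's Algorithm-I), where elitism (□1) plus bit-flip mutation (which gives □2) keeps the maximizer once found:

```python
# Sketch: a toy elitist GA on 3-bit strings. Elitism (property 1) plus
# per-bit mutation (property 2) makes the optimal populations
# absorbing: once the maximizer 111 is found it is never lost.
import random

random.seed(0)
L, N_POP = 3, 4
f = lambda x: x                          # hypothetical fitness: integer value

def mutate(x):                           # flip each bit with prob 0.2
    for b in range(L):
        if random.random() < 0.2:
            x ^= 1 << b
    return x

pop = [random.randrange(2 ** L) for _ in range(N_POP)]
for _ in range(100):
    best = max(pop, key=f)               # elitism: keep the best (property 1)
    pop = [best] + [mutate(random.choice(pop)) for _ in range(N_POP - 1)]

print(max(pop))                          # the global maximizer 7 (= 111)
```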

Definition 9

(Model-III): Model-III is defined as Model-II obeying properties □1 and □2.

Definition 10

(GA): The GA as described in Algorithm-I is mathematically defined by model-III.

Now the definition of convergence is needed so that it can be stated for the GA.

Definition 11

Let \(A \subseteq \varOmega^{n}\) be any measurable set and \(\upmu_{1}\) and \(\upmu_{2}\) be probability measures defined on \(\varOmega^{n}\) (which is a measurable space); then the total variation distance between \(\upmu_{1}\) and \(\upmu_{2}\) is defined as:

$$\left\| {\upmu_{1} -\upmu_{2} } \right\| = \mathop {\sup }\limits_{{A \subseteq \varOmega^{n} }} \left| {\upmu_{1} \left( A \right) -\upmu_{2} \left( A \right)} \right|.$$
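On a finite state space the supremum in Definition 11 equals half the L1 distance between the probability vectors; a quick sketch with hypothetical distributions:

```python
# Sketch: total variation distance on a finite state space, where the
# supremum of Definition 11 equals half the L1 distance between the
# two probability vectors.
mu1 = [0.5, 0.3, 0.2]
mu2 = [0.2, 0.3, 0.5]

tv = 0.5 * sum(abs(a - b) for a, b in zip(mu1, mu2))
print(tv)                                # ~0.3, attained by A = {first state}
```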

Definition 12

Let π be a probability distribution on \(\varOmega^{n}\); then it is invariant for the Markov chain \(\left\{ {\xi \left( t \right);t = 1,2, \ldots } \right\}\) with transition function \(p\left( {\varvec{x},A} \right)\) if and only if

$$\uppi\left( A \right) = \int_{{\varOmega^{n} }} p\left( {\varvec{x},A} \right)\uppi\left( {d\varvec{x}} \right)$$

for all measurable sets \(A \subseteq \varOmega^{n}\).

Definition 13

Let \(\left\{ {\xi \left( t \right);t \in Z^{ + } } \right\}\) be a Markov chain with transition function \(p\left( {\varvec{x},A} \right)\), and let \(\upmu_{t}\) be the probability distribution of \(\xi \left( t \right)\). If there is an invariant probability distribution π for the Markov chain such that \(\mathop {\lim }\limits_{{{\text{t}} \to \infty }} \left| {\left| {\upmu_{\text{t}} -\uppi} \right|} \right| = 0\), then \(\upmu_{t}\) is said to converge to π, and if there exists a set \(A \subseteq \varOmega^{n}\) such that \(\uppi\left( A \right) = 1\) then \(\upmu_{t}\) is said to converge to the set A globally.

Theorem 5

Assume \(\varOmega^{n}\) is a bounded and compact measurable space. The Markov chain \(\left\{ {\xi \left( t \right);t \in Z^{ + } } \right\}\) given by Model-III converges: there exist some \(\delta > 0\), some time \(t_{0} > 0\) and an invariant probability distribution π for the Markov chain such that

$$\pi \left( {S_{OPT}^{n} } \right) = 1,$$

and, starting from any initial distribution \(\xi \left( 1 \right) = \varvec{x}\), it holds for all time \(t > 0\) that

$$\left| {\left| {\mu_{t} - \pi } \right|} \right| \le \left( {1 - \delta } \right)^{{\left[ {\frac{t}{{t_{0} }}} \right] - 1}},$$

where \(\mu_{t}\) is the probability distribution of \(\xi \left( t \right)\).

Proof

This is Theorem 5 in [47].
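The geometric bound of Theorem 5 can be evaluated for illustrative constants (the values of δ and t0 below are hypothetical, not derived from the paper):

```python
# Sketch: the geometric decay bound of Theorem 5,
# (1 - delta)**(floor(t / t0) - 1), for hypothetical delta and t0.
delta, t0 = 0.1, 5

def bound(t):
    return (1 - delta) ** (t // t0 - 1)

# The bound shrinks toward 0, so mu_t converges to pi in total variation.
print([round(bound(t), 4) for t in (5, 50, 500)])
```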

Theorem 6

SNHSVM converges globally.

Proof

This follows directly from the definition of the SNHSVM and Theorem 5. Specifically, SNHSVM is a combination of the GA and NHSVM, but NHSVM is presented to the GA as (19), in which the objective function is (18): an instance of problem (16) (equivalently, (24)), for which Theorem 5 says that the GA converges globally. This completes the proof.

Appendix 2

Here, I demonstrate how the accuracy is calculated for each classification model compared in the paper. The reader should recall the scenario of the experiments, i.e. Ṡ3. In Ṡ3, the important point with respect to the experiments is that each dataset \({\text{D}}\) is divided into two disjoint sets \({\text{D}}_{1}\) and \({\text{D}}_{2}\); training is performed on \({\text{D}}_{1}\) and testing on \({\text{D}}_{2}\). For statistical fairness this division is done 10 times, differently for each dataset, and the average accuracy is recorded. That is, a dataset is divided into parts exactly as in tenfold cross validation (for training and testing purposes), but the difference here is that for each fold the respective model is run and the accuracy is recorded. The accuracies are then averaged (summed over the folds and divided by 10) to get the values recorded in Table 3 (linear case) and Table 4 (non-linear case).

This is not tenfold cross validation, in which a model with the same parameter setting is run on all ten folds; rather, in the experimental setting of this paper a model is run on each fold separately, i.e. the parameters are optimized for each fold. To avoid confusion, I will call a fold a section. The example below, the calculation of the SNHSVM accuracy on the Heart-c dataset in the linear case, will clarify the above picture.

Section               C1        C2        Accuracy
1                     1.6829    0.5684    80.0000
2                     1.8657    0.1667    86.6667
3                     2.2970    3.6835    76.6667
4                     8.0610    0.2241    73.3333
5                     1.4677    1.1306    86.6667
6                     8.5830    0.2286    80.0000
7                     8.3979    3.7509    90.0000
8                     4.0602    6.7324    96.6667
9                     4.3920    6.4008    90.0000
10                    1.8827    3.1817    93.9394
Average accuracy                          85.39395
Standard deviation                         7.6419

All accuracies for all models are rounded to two decimal places in the standard way, so the average accuracy of SNHSVM on the Heart-c dataset in the linear case is \(85.39\% \pm 7.64\).
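The averaging above can be reproduced directly; a sketch using the per-section Heart-c accuracies, where the garbled section-4 entry is taken as 73.3333, the value consistent with the reported mean and sample standard deviation:

```python
# Sketch: per-section averaging for the Heart-c example. The section-4
# accuracy 73.3333 is an assumption inferred from the reported mean
# (85.39395) and sample standard deviation (7.6419).
from statistics import mean, stdev

acc = [80.0000, 86.6667, 76.6667, 73.3333, 86.6667,
       80.0000, 90.0000, 96.6667, 90.0000, 93.9394]

print(round(mean(acc), 2), round(stdev(acc), 2))   # 85.39 7.64
```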


Cite this article

Sharma, A. Stochastic nonparallel hyperplane support vector machine for binary classification problems and no-free-lunch theorems. Evol. Intel. 15, 215–234 (2022). https://doi.org/10.1007/s12065-020-00503-8
