Stochastic nonparallel hyperplane support vector machine for binary classification problems and no-free-lunch theorems

  • Research Paper
  • Published:
Evolutionary Intelligence

Abstract

In this paper, the binary classification problem is considered and a classification model is proposed for its solution, based on the genetic algorithm (GA) and the nonparallel hyperplane support vector machine (NHSVM), termed the stochastic nonparallel hyperplane support vector machine (SNHSVM). Since the GA provably violates the non-revisiting condition of the no-free-lunch theorems for optimization (NFLO), SNHSVM has the natural property that the NFLO do not apply to it. All the experiments are performed in a scenario in which the no-free-lunch theorems for machine learning (NFLM) do not apply to any of the compared machines. The hypothesis is that in such a scenario some classifiers can perform better than others. The experiments are performed on real-world UCI datasets, and SNHSVM is compared with state-of-the-art support-vector-based classifiers with accuracy as the performance measure. SNHSVM achieves the highest accuracy in 100% of the cases, and the Friedman test confirms its better performance on all of the datasets used. These results validate the hypothesis empirically, while the NFLM re-emerge for the other compared classifiers.


References

  1. Schölkopf B, Smola AJ (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge

  2. Haykin SS (2009) Neural networks and learning machines. Pearson, Upper Saddle River

  3. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  4. Alzubi J, Nayyar A, Kumar A (2018) Machine learning from theory to algorithms: an overview. J Phys Conf Ser 1142(1):012012

  5. Zhang Y, Zhao Y (2014) Applications of support vector machines in astronomy. In: Astronomical data analysis software and systems XXIII, vol 485, p 239

  6. Díaz J, Acosta J, González R, Cota J, Sifuentes E, Nebot À (2018) Modeling the control of the central nervous system over the cardiovascular system using support vector machines. Comput Biol Med 93:75–83

  7. Li H, Liang Y, Xu Q (2009) Support vector machines and its applications in chemistry. Chemom Intell Lab Syst 95(2):188–198

  8. Azzam M, Awad M, Zeaiter J (2018) Application of evolutionary neural networks and support vector machines to model NOx emissions from gas turbines. J Environ Chem Eng 6:1044–1052

  9. Yan J, Jin J, Chen F, Yu G, Yin H, Wang W (2018) Urban flash flood forecast using support vector machine and numerical simulation. J Hydroinform 20(1):221–231

  10. Wang F, Liu S, Ni W, Xu Z, Qiu Z, Wan Z, Pan Z (2019) Imbalanced data classification algorithm with support vector machine kernel extensions. Evol Intell 12(3):341–347

  11. Mangasarian OL, Wild EW (2006) Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Trans Pattern Anal Mach Intell 28(1):69–74

  12. Khemchandani R, Chandra S (2007) Twin support vector machines for pattern classification. IEEE Trans Pattern Anal Mach Intell 29(5):905–910

  13. Shao YH, Zhang CH, Wang XB, Deng NY (2011) Improvements on twin support vector machines. IEEE Trans Neural Netw 22(6):962–968

  14. Qi Z, Tian Y, Shi Y (2013) Robust twin support vector machine for pattern classification. Pattern Recognit 46(1):305–316

  15. Qi Z, Tian Y, Shi Y (2012) Laplacian twin support vector machine for semi-supervised classification. Neural Netw 35:46–53

  16. Shao YH, Chen WJ, Deng NY (2014) Nonparallel hyperplane support vector machine for binary classification problems. Inf Sci 263:22–35

  17. Ding S, Zhang X, Yu J (2016) Twin support vector machines based on fruit fly optimization algorithm. Int J Mach Learn Cybern 7(2):193–203

  18. Wang Z, Shao YH, Wu TR (2013) A GA-based model selection for smooth twin parametric-margin support vector machine. Pattern Recognit 46(8):2267–2277

  19. Zhang X, Qiu D, Chen F (2015) Support vector machine with parameter optimization by a novel hybrid method and its application to fault diagnosis. Neurocomputing 149:641–651

  20. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82

  21. Wolpert DH (2002) The supervised learning no-free-lunch theorems. In: Roy R, Köppen M, Ovaska S, Furuhashi T, Hoffmann F (eds) Soft computing and industry. Springer, London, pp 25–42

  22. Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural Comput 8(7):1341–1390

  23. Wolpert DH (1996) The existence of a priori distinctions between learning algorithms. Neural Comput 8(7):1391–1420

  24. Liu F, Zhou Z (2015) A new data classification method based on chaotic particle swarm optimization and least square-support vector machine. Chemom Intell Lab Syst 147:147–156

  25. Zhai S, Jiang T (2015) A new sense-through-foliage target recognition method based on hybrid differential evolution and self-adaptive particle swarm optimization-based support vector machine. Neurocomputing 149:573–584

  26. Zhai S, Jiang T (2014) A novel particle swarm optimization trained support vector machine for automatic sense-through-foliage target recognition system. Knowl Based Syst 65:50–59

  27. Pai PF, Hong WC (2005) Support vector machines with simulated annealing algorithms in electricity load forecasting. Energy Convers Manag 46(17):2669–2688

  28. Zhai S, Pan J, Luo H, Fu S, Chen H (2016) A new sense-through-foliage target recognition method based on hybrid particle swarm optimization-based wavelet twin support vector machine. Measurement 80:58–70

  29. Sartakhti JS, Afrabandpey H, Saraee M (2017) Simulated annealing least squares twin support vector machine (SA-LSTSVM) for pattern classification. Soft Comput 21(15):4361–4373

  30. Yang XS (2014) Nature-inspired optimization algorithms. Elsevier, Amsterdam

  31. Nayyar A, Le DN, Nguyen NG (eds) (2018) Advances in swarm intelligence for optimizing problems in computer science. CRC Press, Boca Raton

  32. Xing B, Gao WJ (2016) Innovative computational intelligence: a rough guide to 134 clever algorithms. Springer, Berlin, p 105

  33. Khemchandani R, Chandra S (2016) Twin support vector machines: models, extensions and applications. Springer, Berlin

  34. Holland JH (1975) Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor

  35. De Jong KA (1975) Analysis of the behavior of a class of genetic adaptive systems. University of Michigan, Technical Report No 185

  36. Nayyar A, Nguyen NG (2018) Introduction to swarm intelligence. Adv Swarm Intell Optim Probl Comput Sci 3:53–78

  37. Nayyar A, Garg S, Gupta D, Khanna A (2018) Evolutionary computation: theory and algorithms. In: Nayyar A, Le D-N, Nguyen NG (eds) Advances in swarm intelligence for optimizing problems in computer science. Chapman and Hall/CRC, pp 1–26

  38. Wright AH, Zhao Y (1999) Markov chain models of genetic algorithms. In: Proceedings of the 1st annual conference on genetic and evolutionary computation, vol 1. Morgan Kaufmann Publishers Inc, pp 734–741

  39. Suzuki J (1995) A Markov chain analysis on simple genetic algorithms. IEEE Trans Syst Man Cybern 25(4):655–659

  40. Nix AE, Vose MD (1992) Modeling genetic algorithms with Markov chains. Ann Math Artif Intell 5(1):79–88

  41. Goldberg DE, Segrest P (1987) Finite Markov chain analysis of genetic algorithms. In: Proceedings of the second international conference on genetic algorithms, vol 1, p 1

  42. Sewell M, Shawe-Taylor J (2012) Forecasting foreign exchange rates using kernel methods. Expert Syst Appl 39(9):7652–7662

  43. Aggarwal CC (2015) Data mining: the textbook. Springer, Berlin

  44. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol (TIST) 2(3):1–27

  45. Blake CL, Merz CJ (1998) UCI repository for machine learning databases. Department of Information and Computer Science, University of California, Irvine [Online]. http://www.ics.uci.edu/~mlearn/MLRepository.htm

  46. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  47. He J, Kang L (1999) On the convergence rates of genetic algorithms. Theor Comput Sci 229(1–2):23–39

  48. Rudolph G (1994) Convergence analysis of canonical genetic algorithms. IEEE Trans Neural Netw 5(1):96–101


Acknowledgements

I am highly thankful to Mr. Arvind Upadhyay (IES-IPSA, India), the entire computer science department of IES-IPSA, and Dr. D. H. Wolpert for their helpful suggestions. I am heartily indebted to the anonymous reviewers, whose invaluable comments pushed me beyond my thinking limits.

Author information

Corresponding author

Correspondence to Ashish Sharma.

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

Here, I analyze the GA presented in Algorithm-I mathematically and prove the theorems used in the paper.

Consider the optimization problem (the same as (16)):

$$\max \left\{ {f\left( x \right);\; x \in X} \right\}$$
(24)

where \(X\) is a bounded and compact measurable space and \(f\left( x \right)\) is called the objective function. The set \(S_{OPT}\) is the set of all \(x\) at which \(f\left( x \right)\) attains the maximum of \(f\); formally:

$$S_{OPT} = \left\{ {x;\left| {f\left( x \right) - f_{max} } \right| = 0} \right\},$$

where \(f_{max}\) is the maximum value of \(f\). In this analysis it is assumed that \(S_{OPT}\) is not empty. Let \(\phi \left( \cdot \right)\) be a measure on \(X\); then \(\phi \left( {S_{OPT} } \right) = 0\), as \(S_{OPT}\) usually contains only a few elements. So, for ease of analysis, the set \(S_{OPT - \varepsilon } = \left\{ {x;\left| {f\left( x \right) - f_{max} } \right| \le \varepsilon } \right\}\), where \(\varepsilon\) is a small positive number, is what I mean by \(S_{OPT}\) henceforth.
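The relaxed optimal set can be checked directly; a minimal sketch, assuming a hypothetical objective f on a small finite search space X (not the paper's classification objective):

```python
# Sketch: membership test for the relaxed optimal set S_OPT-eps,
# using a hypothetical objective f on a small finite search space X.

def in_s_opt_eps(x, f, f_max, eps):
    """True if |f(x) - f_max| <= eps, i.e. x is in S_OPT-eps."""
    return abs(f(x) - f_max) <= eps

# Toy example: X = {0, ..., 7}, f(x) = -(x - 5)**2 (maximum at x = 5).
X = range(8)
f = lambda x: -(x - 5) ** 2
f_max = max(f(x) for x in X)          # 0, attained at x = 5
s_opt_eps = [x for x in X if in_s_opt_eps(x, f, f_max, eps=1.5)]
print(s_opt_eps)                      # [4, 5, 6]: all satisfy |f(x)| <= 1.5
```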

Problems (24) and (16) are the same, and they are solved with the genetic algorithm (GA) described previously in the paper as Algorithm-I. For that GA, a mathematical model is now presented for its convergence and performance analysis. The GA is modeled as a Markov chain next.

A Markov chain is a random process that is completely described by its state transition probability matrix. Therefore, to describe a Markov chain one needs to specify its states and the state transition probabilities. Let the GA operate on binary strings (solutions), each of fixed length \(l\). Then the total number of possible strings is \(r = 2^{l}\). In what follows, italic p stands for probability, while non-italic p stands for population (the context also makes this distinction clear). Let the set of all possible strings be \(\varOmega \left( { = X} \right)\) and \(\varOmega^{n}\) be the set of all possible n-string sets, i.e. let \(\text p = \left\{ {x_{1} ,x_{2} \ldots x_{n} ;x_{i} \in \varOmega \;for\;all\;1 \le i \le n} \right\}\); then \(\varOmega^{n} = \left\{ {\text p;\text p \subseteq \varOmega } \right\}\). The set p is called a population of size \(n\), and the number of possible populations is N. Let p be represented as \(\varvec{x}\) (bold \(\varvec{x}\)); the fitness of \(\varvec{x}\) is defined as:

$$f\left( \varvec{x} \right) = max \left\{ {f\left( {x_{i} } \right); x_{i} \in \varvec{x}} \right\}.$$

The set of all populations that contains the global optimal string is defined as:

$$S_{OPT}^{n} = \left\{ {\varvec{x};\left( {\exists x_{i} \in \varvec{x}} \right)\left( {x_{i} \in S_{OPT} } \right)} \right\}.$$

Definition 1

Let Z be an \({\text{r}} \times N\) matrix (rows of Z correspond to all possible strings and columns to all possible populations) defined as \(Z = \left[ {z_{i,j} } \right]\), where \(z_{i,j}\) is the number of occurrences of string i in population j.

It is clear that some enumeration of all possible strings and populations must exist so that Z is well defined; one can agree on any possible enumeration. One interpretation based on Z is as follows: the strings are numbered from \(0\) to \(r - 1\), and, letting \(p_{i}\) and \(p_{j}\) be any two populations as column vectors of Z, \(p_{i}\) comes before \(p_{j}\) if interpreting the transposes of \(p_{i}\) and \(p_{j}\) as integers yields \(p_{i} < p_{j}\).

Definition 2

The number of possible populations is

$$N = \left( {\begin{array}{*{20}c} {n + r - 1} \\ {r - 1} \\ \end{array} } \right).$$
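Definition 2 is the standard stars-and-bars count of size-n multisets drawn from r strings; a quick sketch of the count, with hypothetical values of n and l:

```python
# Sketch: counting the N possible populations of Definition 2 for
# binary strings of length l (so r = 2**l) and population size n.
from math import comb

def num_populations(n, l):
    r = 2 ** l                         # number of distinct strings
    return comb(n + r - 1, r - 1)      # multisets of size n from r items

print(num_populations(n=4, l=2))       # C(4 + 4 - 1, 3) = C(7, 3) = 35
```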

Definition 3

The states of the Markov chain are the N possible populations.

Definition 4

The transition probabilities are given by an \(N \times N\) transition probability matrix \(\left( {\text{Q}} \right)\) in which an entry \(Q_{ij}\) gives the probability that the kth generation (population) will be \({\text{p}}_{j}\) given that the \(k - 1{\text{st}}\) generation is \({\text{p}}_{i}\).

\(Q_{ij}\) can be calculated as follows with respect to \(p_{i} \left( y \right)\), where \(p_{i} \left( y \right)\) is the probability of producing string \(y\) in the \(i + 1{\text{th}}\) generation, assuming the current generation is the ith, having population \({\text{p}}_{i}\). Let the next generation be \({\text{p}}_{j}\), having \(z_{y,j}\) occurrences of string y; the probability of such occurrences of string y is (almost) given by \(\left\{ {p_{i} \left( y \right)} \right\}^{{z_{y,j} }}\). We can fill the n vacancies for strings in a population (\({\text{say}}\;{\text{p}}_{j}\)) as follows:

The binomial coefficient gives us the number of ways of choosing \(z_{0,j }\) occurrences of string 0 as:

$$\left( {\begin{array}{*{20}c} n \\ {{\text{z}}_{0,j} } \\ \end{array} } \right).$$

After that, \(n - z_{0,j}\) vacancies remain to be filled, so for string 1 the number of ways of choosing \(z_{1,j}\) occurrences of it is:

$$\left( {\begin{array}{*{20}c} {n - z_{0,j} } \\ {z_{1,j} } \\ \end{array} } \right).$$

Continuing as above for all the strings, the total number of combinations is:

$$\left( {\begin{array}{*{20}c} n \\ {z_{0,j} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {n - z_{0,j} } \\ {z_{1,j} } \\ \end{array} } \right) \cdots \left( {\begin{array}{*{20}c} {n - z_{0,j} - z_{1,j} - \cdots - z_{r - 2,j} } \\ {z_{r - 1,j} } \\ \end{array} } \right) = \frac{n!}{{z_{0,j} !\,z_{1,j} ! \cdots z_{r - 1,j} !}}.$$

So, \(Q_{ij}\) can be written as a multinomial distribution with parameters \(n, p_{i} \left( 0 \right), \ldots , p_{i} \left( {r - 1} \right)\) as:

$$Q_{ij} = n!\mathop \prod \limits_{y = 0}^{r - 1} \frac{{\left\{ {p_{i} \left( y \right)} \right\}^{{z_{y,j} }} }}{{z_{y,j} !}}.$$
(25)
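Eq. (25) can be evaluated numerically; a minimal sketch with hypothetical production probabilities \(p_{i}(y)\) and a toy target population:

```python
# Sketch: the multinomial transition probability of Eq. (25), given
# hypothetical per-string production probabilities p_i(y) and the
# occurrence counts z_{y,j} of the target population p_j.
from math import factorial, prod

def q_ij(p_i, z_j):
    """Multinomial probability of drawing population j from p_i(.)."""
    n = sum(z_j)                       # population size
    return factorial(n) * prod(
        p ** z / factorial(z) for p, z in zip(p_i, z_j)
    )

# r = 2 strings, n = 3 slots, p_i = (0.5, 0.5): probability of 2 copies
# of string 0 and 1 copy of string 1 is C(3,2) * 0.5**3 = 0.375.
print(q_ij((0.5, 0.5), (2, 1)))        # 0.375
```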

Definition 5

(Model-I) [40] The Markov chain model for GA is described by \(Q_{ij}\) whose states are as defined in Definition 3.

Now it can be proved, as Theorem 1, that the GA can break the non-revisiting condition of the NFLO.

Theorem 1

The GA defined in Definition 5 has a positive probability of breaking the non-revisiting condition of the NFLO.

Proof

According to the above discussion and the definition of \(Q_{ij}\), the probability of repeating a string y in the next generation is given by \(p_{i} \left( y \right)\); moreover, the probability that there are \(z_{y,j}\) occurrences of string y in the next generation is given by \(\left\{ {p_{i} \left( y \right)} \right\}^{{z_{y,j} }}\). Note that it cannot be 0, as otherwise \({\text{Q}}_{ij}\) itself would become 0 (and the GA would never progress), which is impossible.

Model-I as defined in Definition 5 explicitly contains \(p_{i} \left( y \right)\), which forces one to incorporate the definitions of the genetic operators (mutation, crossover and selection). I could do this, but I want to be as general as possible, so a Markov chain model independent of any such assumptions about the genetic operators will be very useful. Such a model is defined next.

Let \(\xi \left( t \right)\), \(\xi_{M} \left( t \right),\) \(\xi_{C} \left( t \right)\), \(\xi_{S} \left( t \right)\) and \(\xi \left( {t + 1} \right)\) be random variables on \(\varOmega^{n}\) defined as follows:

  1. (i)

    \(\xi \left( t \right)\) is the population state at time t.

  2. (ii)

    \(\xi_{M} \left( t \right)\) is the population state after the application of mutation operator.

  3. (iii)

    \(\xi_{C} \left( t \right)\) is the population state after the application of crossover operator.

  4. (iv)

    \(\xi_{S} \left( t \right)\) is the population state after the application of selection operator.

  5. (v)

    \(\xi \left( {t + 1} \right) = \xi_{S} \left( t \right)\).

In the GA described in Algorithm-I, all the genetic operators (mutation, crossover and selection) are invariant in time; therefore the transition probability functions are also invariant in time.

Definition 6

GA operators are defined by their respective transition probability functions as follows. Let \(A \subseteq \varOmega^{n}\):

  1. (i)

    Mutation transition probability function defines the mutation operator as follows:

    $$p_{M} \left( {\varvec{x},A} \right) = p_{M} \left( {\xi_{M} \left( t \right) \in A |\xi \left( t \right) = \varvec{x}} \right)$$
  2. (ii)

    Crossover transition probability function defines the crossover operator as follows:

    $$p_{C} \left( {\varvec{x},A} \right) = p_{C} \left( {\xi_{C} \left( t \right) \in A |\xi_{M} \left( t \right) = \varvec{x}} \right)$$
  3. (iii)

    Selection transition probability function defines the selection operator as follows:

    $$p_{S} \left( {\varvec{x},A} \right) = p_{S} \left( {\xi_{S} \left( t \right) \in A |\xi_{C} \left( t \right) = \varvec{x}} \right)$$

Definition 7

(Model-II) [47] The GA is defined mathematically as the Markov chain \(\left\{ {\xi \left( t \right);t \in Z^{ + } } \right\}\) with state space \(\varOmega^{n}\) and transition probability given by the Chapman–Kolmogorov equation as:

$$p\left( {\varvec{x},A} \right) = \int_{\varvec{y}} \int_{\varvec{z}} p_{M} \left( {\varvec{x},d\varvec{y}} \right)p_{C} \left( {\varvec{y},d\varvec{z}} \right)p_{S} \left( {\varvec{z},A} \right).$$
(26)

Note that (25) and (26) are basically the same. Models I and II are partial models of the GA (described in Algorithm-I); the complete model is defined in Definition 10.
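In the finite-state case, the composition in (26) reduces to a product of the three operator transition matrices; a sketch with hypothetical 2-state matrices (chosen only for illustration):

```python
# Sketch: discrete Chapman-Kolmogorov composition of Eq. (26) as a
# product of operator transition matrices (hypothetical 2-state case).
import numpy as np

p_m = np.array([[0.9, 0.1], [0.1, 0.9]])   # mutation
p_c = np.array([[1.0, 0.0], [0.0, 1.0]])   # crossover (identity here)
p_s = np.array([[0.8, 0.2], [0.3, 0.7]])   # selection

p = p_m @ p_c @ p_s                         # one full GA generation
print(p)                                    # rows remain stochastic
```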

Theorem 2

The Markov chains of model-I and model-II are ergodic.

Proof

This fact follows directly from the inclusion of the mutation operator in the chains: because of mutation, it is possible to go from every state to every state. For model-I, I assume the same definition of the mutation operator as for model-II, as given in Definition 6.

Theorem 3

In an ergodic Markov chain the expected transition time from any state to any other state is finite, irrespective of the states.

Proof

This is Theorem 5 in [48].

Next, two properties of great importance are defined:

Definition 8

  1. (i)

    Property 1 (□1): Maintain the best solution found so far over time.

  2. (ii)

    Property 2 (□2): It must be possible to go from any state of the Markov chain to any other state.

□1 says that the best solution found so far must be preserved so that it will not be lost; it holds in the GA by requiring that the GA operators never lose the best solution. □2 is basically the accessibility property, saying that all states must be accessible starting from any state; this property makes the set \(S_{OPT}^{n}\) accessible. In model-II (and therefore in model-I), □2 holds because of Theorem 2, and if □1 is forced to hold then the Markov chain of model-II becomes absorbing, as the states containing the global optimal solutions \(S_{OPT}^{n}\) become absorbing. The following result says that all states can reach the optimal state(s), and in expected finite time.

Theorem 4

In Model-II satisfying □1 and □2, it is possible to reach the global optimal set \(S_{OPT}^{n}\), and the expected transition time from any initial state to any state in \(S_{OPT}^{n}\) is finite.

Proof

By Theorem 2, Model-II is ergodic, and by Theorem 3 the expected time to go from one state to another in model-II is finite, irrespective of the states. □1 makes the states in \(S_{OPT}^{n}\) absorbing, which means that if the Markov chain enters any of the states in \(S_{OPT}^{n}\) then it will not leave \(S_{OPT}^{n}\). □1 does not change the other states, so it is still possible to go from any state to any state in \(S_{OPT}^{n}\) in finite time, which is the basic property of model-II by Theorems 2 and 3. In fact, model-II with □1 and □2 now becomes an absorbing Markov chain. Therefore, \(S_{OPT}^{n}\) is reachable in expected finite time from any initial state.
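The absorbing behaviour argued above can be illustrated by simulation; a toy sketch with a hypothetical fitness and operator choices (not the paper's Algorithm-I), where elitism (□1) plus bit-flip mutation (which gives □2) keeps the maximizer once found:

```python
# Sketch: a toy elitist GA on 3-bit strings. Elitism (property 1) plus
# per-bit mutation (property 2) makes the optimal populations
# absorbing: once the maximizer 111 is found it is never lost.
import random

random.seed(0)
L, N_POP = 3, 4
f = lambda x: x                          # hypothetical fitness: integer value

def mutate(x):                           # flip each bit with prob 0.2
    for b in range(L):
        if random.random() < 0.2:
            x ^= 1 << b
    return x

pop = [random.randrange(2 ** L) for _ in range(N_POP)]
for _ in range(100):
    best = max(pop, key=f)               # elitism: keep the best (property 1)
    pop = [best] + [mutate(random.choice(pop)) for _ in range(N_POP - 1)]

print(max(pop))                          # the global maximizer 7 (= 111)
```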

Definition 9

(Model-III): Model-III is defined as Model-II obeying properties □1 and □2.

Definition 10

(GA): The GA as described in Algorithm-I is mathematically defined by model-III.

Now the definition of convergence is needed so that it can be stated for the GA.

Definition 11

Let \(A \subseteq \varOmega^{n}\) be any measurable set and \(\upmu_{1}\) and \(\upmu_{2}\) be probability measures defined on \(\varOmega^{n}\) (which is a measurable space); then the total variation distance between \(\upmu_{1}\) and \(\upmu_{2}\) is defined as:

$$\left\| {\upmu_{1} -\upmu_{2} } \right\| = \mathop {\sup }\limits_{{A \subseteq \varOmega^{n} }} \left| {\upmu_{1} \left( A \right) -\upmu_{2} \left( A \right)} \right|.$$
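On a finite state space the supremum in Definition 11 equals half the L1 distance between the probability vectors; a quick sketch with hypothetical distributions:

```python
# Sketch: total variation distance on a finite state space, where the
# supremum of Definition 11 equals half the L1 distance between the
# two probability vectors.
mu1 = [0.5, 0.3, 0.2]
mu2 = [0.2, 0.3, 0.5]

tv = 0.5 * sum(abs(a - b) for a, b in zip(mu1, mu2))
print(tv)                                # ~0.3, attained by A = {first state}
```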

Definition 12

Let π be a probability distribution on \(\varOmega^{n}\); then it is invariant for the Markov chain \(\left\{ {\xi \left( t \right);t = 1,2, \ldots } \right\}\) with transition function \(p\left( {\varvec{x},A} \right)\) if and only if

$$\uppi\left( A \right) = \int_{{\varOmega^{n} }} p\left( {\varvec{x},A} \right)\uppi\left( {d\varvec{x}} \right)$$

for all measurable sets \(A \subseteq \varOmega^{n}\).

Definition 13

Let \(\left\{ {\xi \left( t \right);t \in Z^{ + } } \right\}\) be a Markov chain with transition function \(p\left( {\varvec{x},A} \right)\), and let \(\upmu_{t}\) be the probability distribution of \(\xi \left( t \right)\). If there is an invariant probability distribution π for the Markov chain such that \(\mathop {\lim }\limits_{{{\text{t}} \to \infty }} \left| {\left| {\upmu_{\text{t}} -\uppi} \right|} \right| = 0\), then \(\upmu_{t}\) is said to converge to π, and if there exists a set \(A \subseteq \varOmega^{n}\) such that \(\uppi\left( A \right) = 1\) then \(\upmu_{t}\) is said to converge to the set A globally.

Theorem 5

Assume \(\varOmega^{n}\) is a bounded and compact measurable space. The Markov chain \(\left\{ {\xi \left( t \right);t \in Z^{ + } } \right\}\) given by Model-III converges: there exist some \(\delta > 0\), some time \(t_{0} > 0\) and an invariant probability distribution π for the Markov chain such that

$$\pi \left( {S_{OPT}^{n} } \right) = 1,$$

and, starting from any initial distribution \(\xi \left( 1 \right) = \varvec{x}\), it holds for all time \(t > 0\) that

$$\left| {\left| {\mu_{t} - \pi } \right|} \right| \le \left( {1 - \delta } \right)^{{\left[ {\frac{t}{{t_{0} }}} \right] - 1}},$$

where \(\mu_{t}\) is the probability distribution of \(\xi \left( t \right)\).

Proof

This is Theorem 5 in [47].
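The geometric bound of Theorem 5 can be evaluated for illustrative constants (the values of δ and t0 below are hypothetical, not derived from the paper):

```python
# Sketch: the geometric decay bound of Theorem 5,
# (1 - delta)**(floor(t / t0) - 1), for hypothetical delta and t0.
delta, t0 = 0.1, 5

def bound(t):
    return (1 - delta) ** (t // t0 - 1)

# The bound shrinks toward 0, so mu_t converges to pi in total variation.
print([round(bound(t), 4) for t in (5, 50, 500)])
```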

Theorem 6

SNHSVM converges globally.

Proof

This follows directly from the definition of the SNHSVM and Theorem 5. Specifically, SNHSVM is a combination of the GA and NHSVM, but NHSVM is presented to the GA as (19), in which the objective function is (18): an instance of problem (16) (equivalently, (24)), for which Theorem 5 says that the GA converges globally. This completes the proof.

Appendix 2

Here, I demonstrate how the accuracy is calculated for each classification model compared in the paper. The reader should recall the scenario of the experiments, i.e. Ṡ3. In Ṡ3, the important point with respect to the experiments is that each dataset \({\text{D}}\) is divided into two disjoint sets \({\text{D}}_{1}\) and \({\text{D}}_{2}\); training is performed on \({\text{D}}_{1}\) and testing on \({\text{D}}_{2}\). For statistical fairness this division is done 10 times, differently for each dataset, and the average accuracy is recorded. That is, a dataset is divided into parts exactly as in tenfold cross validation (for training and testing purposes), but the difference here is that for each fold the respective model is run and the accuracy is recorded. The accuracies are then averaged (summed over the folds and divided by 10) to get the values recorded in Table 3 (linear case) and Table 4 (non-linear case).

This is not tenfold cross validation, in which a model with the same parameter setting is run on all ten folds; rather, in the experimental setting of this paper a model is run on each fold separately, i.e. the parameters are optimized for each fold. To avoid confusion, I will call a fold a section. The example below, the calculation of the SNHSVM accuracy on the Heart-c dataset in the linear case, will clarify the above picture.

Section               C1        C2        Accuracy
1                     1.6829    0.5684    80.0000
2                     1.8657    0.1667    86.6667
3                     2.2970    3.6835    76.6667
4                     8.0610    0.2241    73.3333
5                     1.4677    1.1306    86.6667
6                     8.5830    0.2286    80.0000
7                     8.3979    3.7509    90.0000
8                     4.0602    6.7324    96.6667
9                     4.3920    6.4008    90.0000
10                    1.8827    3.1817    93.9394
Average accuracy                          85.39395
Standard deviation                         7.6419

All accuracies for all models are rounded to two decimal places in the standard way, so the average accuracy of SNHSVM on the Heart-c dataset in the linear case is \(85.39\% \pm 7.64\).
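The averaging above can be reproduced directly; a sketch using the per-section Heart-c accuracies, where the garbled section-4 entry is taken as 73.3333, the value consistent with the reported mean and sample standard deviation:

```python
# Sketch: per-section averaging for the Heart-c example. The section-4
# accuracy 73.3333 is an assumption inferred from the reported mean
# (85.39395) and sample standard deviation (7.6419).
from statistics import mean, stdev

acc = [80.0000, 86.6667, 76.6667, 73.3333, 86.6667,
       80.0000, 90.0000, 96.6667, 90.0000, 93.9394]

print(round(mean(acc), 2), round(stdev(acc), 2))   # 85.39 7.64
```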


Cite this article

Sharma, A. Stochastic nonparallel hyperplane support vector machine for binary classification problems and no-free-lunch theorems. Evol. Intel. 15, 215–234 (2022). https://doi.org/10.1007/s12065-020-00503-8
