
1 Introduction

In this paper we address the problem of obtaining a general algebraic model for a set of given data pairs \( (\vec{x},y) \). By “model” we mean a function \( F(\vec{x}) \) derived from a set of n independent variables and m given data pairs assumed to stem from an unknown function \( y = f(\vec{x}) \), such that the error \( \varepsilon = \left\| {f(\vec{x}) - F(\vec{x})} \right\|_{2} \) is minimized.

In a previous work [1] a methodology is proposed to find a multivariate polynomial model that fits a dataset by generating, instead of the full range of terms in (1), a combination of a preselected number of them. This is accomplished by coupling a genetic algorithm with a regression algorithm, where the individuals of the genetic algorithm represent sets of monomials with different combinations of variables and exponents. The regression algorithm computes the coefficients of the terms in such a way that the minimax (or L∞) error norm is minimized. The current work is based on this methodology; we complement it by finding a neural network (NNt) which yields the minimum number of terms needed.

The rest of the paper is organized as follows. In Sect. 2 we describe

  1. How to select the model’s form.

  2. How to approximate, from an arbitrary training set of pairs \( (\vec{x},y) \), a polynomial model.

  3. How to avoid the exponential growth in the number of terms as the number of variables and the maximum degrees considered increase.

  4. How to determine the degrees d1, …, dn of (1) needed for \( F(\vec{x}) \) to be a universal approximator.

In Sect. 3 we focus on the problem of how to determine the minimum number of terms in \( F(\vec{x}) \). In Sect. 4 we present the application of the methodology to a case study. In Sect. 5 we present our conclusions.

2 Previous Results

In order to obtain a comprehensive methodology leading to the goal of finding closed mathematical (polynomial) models from experimental data we rely, at the outset, on previous works. The reader interested in the details is encouraged to consult the references.

2.1 Selecting the Model’s Form

A basic issue to be considered is how to select the form of \( F(\vec{x}) \). One possible approach to determine such form is suggested by the Weierstrass approximation theorem (WAT), which states that every continuous function defined on a closed interval [a, b] can be uniformly approximated as closely as desired by a polynomial function [2].

From the WAT we assume that there exists a full polynomial approximation

$$ F(\vec{x}) = \sum\limits_{{i_{1} = 0}}^{{d_{1} }} {\sum\limits_{{i_{2} = 0}}^{{d_{2} }} \ldots } \sum\limits_{{i_{n} = 0}}^{{d_{n} }} {c_{{i_{1} i_{2} \ldots i_{n} }} } x_{1}^{{i_{1} }} x_{2}^{{i_{2} }} \ldots x_{n}^{{i_{n} }} $$
(1)

We consider all possible linear combinations of the products of the independent variables x1, …, xn raised to powers up to the given degrees d1, …, dn. The set of allowed degrees d1, …, dn will be denoted by D, the set of coefficients \( c_{{i_{1} i_{2} \ldots i_{n} }} \) by C, and the number of terms of the polynomial by T. The purpose of the algorithm we implemented is to express \( F(\vec{x}) \) as a more compact polynomial. By “algebraic model”, therefore, we mean a function of the form

$$ F(\vec{x}) = \sum\limits_{i = 1}^{T} {K_{i} \,x_{1}^{{t_{i1} }} x_{2}^{{t_{i2} }} \ldots x_{n}^{{t_{in} }} } $$
(2)

2.2 Approximation of a Polynomial Model

The number of monomials in (1) is M = (d1 + 1)…(dn + 1). Consider, for example, the well known (number of web hits = 1,020,829) classification problem in [3]. The data pairs in it (n = 13 and m = 178) may be accommodated in a \( 178 \times 14 \) data base. This is a classification problem where the values of the tuples determine one of three classes. For convenience we shall assume that the number of the class (the dependent variable y) is denoted with 0, 1 and 2 and, furthermore, that it is placed on the last column of the matrix. Even assuming the conservative \( d_{1} = \cdots = d_{13} = 3 \) we find that \( M = 4^{13} \) = 67,108,864. The exponential growth of M has been referred to as “the curse of dimensionality” [4]. With the development of NNs this problem has been successfully circumvented [5], albeit without yielding an explicit model, i.e. one in which the relation between the independent variables is exposed; NNs are frequently cited as suffering from this “black box” disadvantage. The main motivation of this work is to solve this kind of problem without losing the explanatory advantages of an algebraic model. To achieve this we apply a method where the model of (1) is replaced by the one in (3), where the monomials (or “terms”) in (1) are selectively retained or discarded.

$$ F(\vec{x}) = \sum\limits_{{i_{1} = 0}}^{{d_{1} }} {\sum\limits_{{i_{2} = 0}}^{{d_{2} }} \ldots } \sum\limits_{{i_{n} = 0}}^{{d_{n} }} {\mu_{{i_{1} i_{2} \ldots i_{n} }} c_{{i_{1} i_{2} \ldots i_{n} }} } x_{1}^{{i_{1} }} x_{2}^{{i_{2} }} \ldots x_{n}^{{i_{n} }} $$
(3)

where

$$ \mu_{{i_{1} i_{2} \ldots i_{n} }} = \left\{ {\begin{array}{ll} 0 & {\text{if}\;c_{{i_{1} i_{2} \ldots i_{n} }} \;\text{is not retained}} \\ 1 & {\text{if}\;c_{{i_{1} i_{2} \ldots i_{n} }} \;\text{is retained}} \\ \end{array} } \right. $$
(4)

The basic idea is to fix T and D in advance, which immediately allows us to avoid the curse of dimensionality. The way in which T and D are determined is discussed in what follows.
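As a quick illustration (a minimal sketch of our own, not code from [1]), the following snippet contrasts the size of the full expansion (1) with the fixed number of retained terms T:

```python
from math import prod

def full_model_terms(degrees):
    """Number of monomials in the full expansion (1): (d1 + 1) * (d2 + 1) * ... * (dn + 1)."""
    return prod(d + 1 for d in degrees)

# 13 independent variables with maximum degree 3 each, as in the example above
print(full_model_terms([3] * 13))   # 67108864 candidate monomials
# Fixing T (and D) in advance keeps the model size constant regardless of n and the di
T = 13
print(T)                            # at most 13 retained monomials
```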

2.3 Determining the Coefficients of the Monomials

Once D and T are determined we may determine the \( \mu_{{i_{1} i_{2} \ldots i_{n} }} \) indirectly by applying the genetic algorithm (EGA) described in [6], as follows. The individuals of the population of the EGA represent the combinations of powers \( i_{1} , \ldots ,i_{n} \) corresponding to each of the T monomials of (2) which are retained (i.e. those for which \( \mu_{{i_{1} i_{2} \ldots i_{n} }} = 1 \)). This is exemplified in Fig. 1, where n = 3, 0 ≤ di ≤ 9 and T = 5. For simplicity, the coefficients \( c_{{i_{1} i_{2} \ldots i_{n} }} \) have been replaced with C1, …, C5.

Fig. 1. Illustration of a set of polynomials in EGA
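As a sketch of this representation (our own illustration; the exact genome layout of the EGA is described in [6]), an individual may be stored as T tuples of exponents, one per retained monomial:

```python
import random

N_VARS, MAX_DEG, T = 3, 9, 5        # the values used in the example of Fig. 1

def random_individual():
    """One EGA individual: a list of T monomials, each given by its exponents (i1, ..., in)."""
    return [tuple(random.randint(0, MAX_DEG) for _ in range(N_VARS)) for _ in range(T)]

individual = random_individual()
# e.g. [(2, 0, 5), (1, 3, 0), ...]  ->  C1*x1^2*x3^5 + C2*x1*x2^3 + ...
print(individual)
```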

Every individual of the EGA represents a polynomial consisting of T monomials. From it we build a data base of size \( T \times m \) in which every tuple is a map from the original \( \vec{x} \) into the monomials proposed by the EGA, plus the value of the dependent variable, as illustrated in Fig. 2. The data base from which this figure originates consists of 50 \( (\vec{x},y) \) pairs, in this case of the form \( (x_{1} ,x_{2} ,x_{3} ,F) \).

Fig. 2. Example of mapping for one individual of the EGA

From the original data a mapping analogous to the one illustrated in Fig. 2 is obtained.
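A minimal sketch of this mapping (illustrative code of our own; the names are not from the paper) builds, for one individual, the matrix of monomial values from which the coefficients are later fitted:

```python
import numpy as np

def monomial_features(X, individual):
    """Map every tuple (x1, ..., xn) into the T monomials proposed by one EGA individual.

    X          : array of shape (m, n) holding the original independent variables.
    individual : list of T exponent tuples, one per monomial.
    Returns an array of shape (m, T); column j holds x1^i1 * x2^i2 * ... * xn^in.
    """
    Phi = np.ones((X.shape[0], len(individual)))
    for j, exponents in enumerate(individual):
        for var, power in enumerate(exponents):
            Phi[:, j] *= X[:, var] ** power
    return Phi

# Example analogous to Fig. 2: 50 tuples of the form (x1, x2, x3, y)
X = np.random.rand(50, 3)
individual = [(1, 0, 0), (0, 3, 0), (2, 0, 1), (0, 0, 5), (3, 3, 1)]
Phi = monomial_features(X, individual)   # shape (50, 5)
```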

To find C we implemented an approximation algorithm known as the Ascent Algorithm (AA) [7]. Let Cij denote the set of coefficients induced by the i-th individual during the j-th generation. Using the AA, we find Cij and its associated fit error εij. After the evolutionary process the coefficients of the best individual (the one with the smallest εij) yield \( F(\vec{x}) \). The \( \mu_{{i_{1} i_{2} \ldots i_{n} }} \) are never explicitly obtained: the C resulting from the EGA tacitly determines the T retained monomials whose coefficients minimize \( \varepsilon \).
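The Ascent Algorithm itself is detailed in [7]. As a hedged stand-in, the same minimax fit can be posed as a linear program; the sketch below (our own code, assuming scipy is available) returns the coefficients and the associated error for one individual's monomial matrix:

```python
import numpy as np
from scipy.optimize import linprog

def minimax_fit(Phi, y):
    """Coefficients c minimizing the minimax error  max_i |Phi[i] @ c - y[i]|.

    Equivalent LP: minimize eps subject to -eps <= Phi @ c - y <= eps.
    (A stand-in for the Ascent Algorithm of [7], which solves the same L-infinity problem.)
    """
    m, T = Phi.shape
    obj = np.r_[np.zeros(T), 1.0]                       # minimize eps only
    A_ub = np.r_[np.c_[ Phi, -np.ones((m, 1))],         #  Phi @ c - eps <= y
                 np.c_[-Phi, -np.ones((m, 1))]]         # -Phi @ c - eps <= -y
    b_ub = np.r_[y, -y]
    bounds = [(None, None)] * T + [(0, None)]           # c free, eps >= 0
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:T], res.x[T]                          # coefficients C, fit error eps
```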

2.4 Degree of the Terms in the Polynomial

In [1] it was shown that, from the Universal Approximation Theorem [8], any unknown function \( f(\vec{x}) \) may be approximated with a polynomial of the form of (5)

$$ F(\vec{x}) = k + \sum\limits_{i = 1}^{T} {c_{i} \left( {\prod\limits_{j = 1}^{n} {x_{j}^{{d_{ij} }} } } \right)} $$
(5)

if

$$ \sum\limits_{j = 1}^{n} {d_{ij} } \in L\quad \forall i $$
(6)

where \( \sum\nolimits_{j} {d_{ij} } \in L \) means that the maximum degree (maxdeg) of any term (the summation of the degrees associated to each of the independent variables in it) must belong to the set L = {1, 3, 5, 7, 9, 11, 15, 21, 25, 33, 35, 37, 45, 49, 55, 63, 77, 81, 99, 121}.

The way in which L was arrived at may be found in [1]. Notice, however, that although the maxdeg of the i-th monomial must be one of the odd values in L, the powers of the individual variables in it may take any values whose sum belongs to L (e.g. \( x_{1}^{2} x_{2}^{4} x_{3} \) is admissible since its maxdeg is 7).
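A minimal sketch (our own code) of the restriction imposed by (6): only exponent combinations whose total degree belongs to L are admissible. In the EGA this constraint is enforced on the genome itself rather than by exhaustive enumeration, which would reintroduce the exponential blow-up; the enumeration below is only meant for small illustrative cases:

```python
from itertools import product

L = {1, 3, 5, 7, 9, 11, 15, 21, 25, 33, 35, 37, 45, 49, 55, 63, 77, 81, 99, 121}

def allowed_monomials(n_vars, max_deg):
    """Exponent tuples (i1, ..., in) whose total degree sum(ij) belongs to L."""
    return [e for e in product(range(max_deg + 1), repeat=n_vars) if sum(e) in L]

# Example with n = 3 and individual powers up to 9, as in Fig. 1
candidates = allowed_monomials(3, 9)
print(len(candidates), candidates[:5])
```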

The algebraic model of (1) is thus simplified and the EGA can be made to include in its population only individuals whose terms comply with (6). To illustrate this, the data of Fig. 2 [consisting of 50 pairs \( (\vec{x},y) \)] was approximated with the EGA, now considering Eq. (5). The resulting coefficients are shown in Fig. 3. In an “indexed coefficient” such as \( C_{i_{1} \ldots i_{n}} \), the indices denote the degrees associated to each of the variables. The degree of the term is shown in the first column.

Fig. 3. Approximation coefficients for the data in Fig. 2

3 Minimum Number of Terms

Once we have a method to determine, via the EGA, the most adequate coefficient for every term in (5), it only remains to specify the most adequate value of T. The solution to this problem completes the specification of the method. One way to approach it is to appeal to statistical criteria.

The first step was to set practical lower and upper limits on the number of terms. We reasoned that we were not interested in very small values of T, since these are hardly likely to yield good models; therefore, we set a lower limit of T = 3. At the other end of the spectrum we decided to focus on T ≤ 13. Higher values are seldom of practical interest since large sets of coefficients are cumbersome and difficult to analyze. Ultimately, algebraic models allow us to search for patterns in the relations exposed by the monomials; very large values of T make this search too complex in general.

Therefore, we collected 46 datasets from the University of California Machine Learning dataset repository [9] and the Knowledge Extraction Evolutionary Learning dataset repository [10]. To begin with, 32 of these datasets were chosen and subjected to the procedure outlined above. For every \( {\mathbf{T}} \in [3,13] \) a polynomial was found and the number of terms corresponding to the best fit was recorded; a total of \( 352\;(11 \times 32) \) polynomials was therefore calculated. Table 1 shows the best values of T for the 32 selected datasets. For every one of the 32 datasets 11 polynomials (T = 3, …, 13) were found. From these the lower bound on the number of terms was determined, and the best values of T were then used to train NNt. The best value of T was expressed as a function of the following variables: (a) number of attributes, (b) number of tuples, (c) optimally compressed size, (d) compression ratio [the quotient (dataset’s original size)/(optimally compressed size)]. These attributes were selected from a much larger original set of such attributes; the ones we retained were shown to be the most relevant by a Principal Components Analysis.

Table 1. Best values of T for the selected datasets
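As an illustration of how these four inputs may be extracted from a dataset file (a sketch under our own assumptions: the file is a CSV with the dependent variable in the last column, and bz2 is used here as a generic compressor in place of PPMZ [11]):

```python
import bz2
import csv
import os

def nnt_inputs(path):
    """The four NNt inputs for a CSV dataset: number of attributes, number of tuples,
    compressed size (bytes) and compression ratio = original size / compressed size."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    n_attributes = len(rows[0]) - 1            # last column holds the dependent variable
    n_tuples = len(rows)
    original_size = os.path.getsize(path)
    with open(path, "rb") as f:
        compressed_size = len(bz2.compress(f.read()))
    return n_attributes, n_tuples, compressed_size, original_size / compressed_size
```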

The remaining 14 datasets were used as cross validation data for the trained NN model. The corresponding results are shown in Table 2. From the results a statistically highly significant lower bound on T may be inferred. NNt may be fully specified from: (a) Its architecture, (b) The associated weights, (c) The definition of the activation function.

Table 2. Validation stage models

We can calculate the average RMS error and the standard deviation. By Chebyshev’s theorem [13], \( P(\mu_{T} - k\sigma_{T} \le T \le \mu_{T} + k\sigma_{T}) > 1 - 1/k^{2} \). Setting k = 3, T lies within three standard deviations of its average with probability greater than \( 1 - 1/3^{2} \approx \) 89%, which bounds from below the smallest value of T to be expected. Thereafter the mathematical model for the minimum number of terms, as a function of the RMS error within the indicated limits, may be calculated. We trained a NN from these data, yielding NNt. The architecture of NNt is shown in Fig. 4.

Fig. 4. Architecture of NNt (the NN which yields the value of T)

NNt was trained using the backpropagation algorithm. To improve convergence, separate learning rates were used for the hidden and output layers. The learning rate of the hidden layer was tested with the values 0.1, 0.2, 0.3 and 0.25; the learning rate of the output layer was tested with the values 0.1, 0.01 and 0.001.

Activation Function.

Every neuron is a perceptron whose activation function is \( 1/(1 + e^{ - x} ) \).

Momentum.

A momentum term was added, which allows the adjustment of the weights to depend also on the previous changes. If the momentum is equal to 0, the change depends only on the error gradient, while a value of 1 means that the change depends only on the last change. The momentum in the hidden and output layers was tested with the values 0.3, 0.5 and 0.8. The weights were initialized with uniform random numbers in the range [−0.1, 0.1]. The learning strategy was stochastic: each input pattern produces a weight adjustment. The NN was trained for 10,000 epochs with a test every 100 epochs. NNt was validated against the remaining 14 datasets.
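A minimal numpy sketch of the training step just described (our own code; the hidden-layer size is an assumption made for illustration, since the actual architecture of NNt is the one shown in Fig. 4):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden, n_out = 4, 5, 1                    # hidden size assumed for illustration
W1 = rng.uniform(-0.1, 0.1, (n_hidden, n_in))      # weights initialized in [-0.1, 0.1]
W2 = rng.uniform(-0.1, 0.1, (n_out, n_hidden))
V1, V2 = np.zeros_like(W1), np.zeros_like(W2)      # previous changes (momentum terms)

lr_hidden, lr_out, momentum = 0.25, 0.01, 0.5      # one of the tested settings

def train_step(x, t):
    """One stochastic backpropagation update with momentum (per input pattern)."""
    global W1, W2, V1, V2
    h = sigmoid(W1 @ x)                            # hidden activations
    o = sigmoid(W2 @ h)                            # network output
    delta_o = (o - t) * o * (1 - o)                # output-layer error term
    delta_h = (W2.T @ delta_o) * h * (1 - h)       # hidden-layer error term
    V2 = momentum * V2 - lr_out * np.outer(delta_o, h)
    V1 = momentum * V1 - lr_hidden * np.outer(delta_h, x)
    W2 = W2 + V2                                   # weight change = momentum + gradient parts
    W1 = W1 + V1
    return float(((o - t) ** 2).sum())             # squared error for this pattern

# The full loop runs 10,000 epochs over the training patterns, testing every 100 epochs.
```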

The above information was fed to NNt to get T. In Table 3 we see the estimated values of T, which are expected to be near the optimum fit. RMS(T-i) denotes the RMS error when NNt yields i as the value of T. The actual best value of T was always within ±2 of the value suggested by NNt.

Table 3. Terms calculated by NNt for the validation sets

The weights of NNt are shown in Fig. 5. Every neuron is, as stated, a perceptron whose activation function is \( 1/(1 + e^{ - x} ) \). Because of this the input values were scaled into [0, 1). Given the values of the 4 input attributes a simple spreadsheet may easily calculate the suggested number of terms. An example is shown in Fig. 6.

Fig. 5. Weights of NNt

Fig. 6. Calculation of T from a spreadsheet’s implementation of NNt

  • Since the values must be scaled the upper and lower values of the variables are needed. These are shown in the lines titled “Max” and “Min”. The corresponding scaled values are shown in the line “Scaled_01”.

  • “Compressed size” refers to the size (in bytes) of the original dataset after it is compressed using the PPMZ algorithm [11]. This yields a practical approximation to the information contained in the dataset.
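The spreadsheet of Fig. 6 amounts to scaling the four attributes with the Max/Min rows and propagating them through NNt. A sketch of that calculation follows (our own code; the weight matrices, the Max/Min values and the mapping of the network output back onto the range 3 to 13 are placeholders and assumptions, since the actual weights are those of Fig. 5):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def scale_01(values, v_min, v_max):
    """Scale raw attribute values into [0, 1) using the Max/Min rows of Fig. 6."""
    values, v_min, v_max = map(np.asarray, (values, v_min, v_max))
    return (values - v_min) / (v_max - v_min + 1e-9)

def suggested_T(x_raw, v_min, v_max, W1, W2):
    """Forward pass of NNt returning the suggested number of terms."""
    x = scale_01(x_raw, v_min, v_max)
    h = sigmoid(W1 @ x)                      # hidden layer
    out = sigmoid(W2 @ h)                    # output in (0, 1)
    return round(3 + float(out.sum()) * 10)  # assumed mapping of the output onto 3..13

# Usage (all numbers are placeholders, not the actual Max/Min values or weights of Fig. 5):
# T = suggested_T([13, 178, 4096, 2.1], v_min, v_max, W1, W2)
```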

4 Case Study

To illustrate the process we take the data base from the U.S. Census [14]. From the web page we get the data illustrated in Fig. 7.

Fig. 7. Characteristics of the census database

A partial view of the data set is shown in Fig. 8. The data is both numerical and categorical (i.e. non-numerical). The numerical attributes were scaled into [0, 1) and, subsequently, the non-numerical attributes were encoded into [0, 1) using the CESAMO algorithm [15]. The corresponding encoded DB is illustrated in Fig. 9. This DB was further divided into a training data set (TRN) and a test data set (TST).

Fig. 8. Partial view of the data base

Fig. 9. Partial view of the encoded data base

TRN was used to train (find) a polynomial, which was then cross-validated with TST. The number of terms T of the polynomial, as illustrated in Fig. 6, was originally determined to be 9. The resulting polynomial is illustrated in Fig. 10.

Fig. 10. Coefficients and degrees of the multivariate polynomial

EpsiTh denotes the worst case approximation error in TST. In every column, starting with the second row, we find (a) the degree of the monomial, (b) the powers of the independent variables in every monomial, (c) the coefficient associated to the monomial. The value of T is 7 because the EGA may yield monomials of identical degrees and these are merged in the end. As specified, the degrees of every one of the terms are odd (1, 3, 5, 7). The genetic algorithm was allowed to choose from any combination of the degrees in set L. It may be seen that it picked terms whose highest power for any one variable is 3 and whose corresponding highest degree is 7.

When the polynomial was evaluated on the test data we determined that, of the 235 tuples of TST, 231 were correctly classified for an effectiveness ratio of 93.83%.
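A sketch of how such an effectiveness ratio can be computed (our own code; it assumes the predicted class is the one whose label is closest to the polynomial’s output, a decision rule not spelled out in the text):

```python
import numpy as np

def effectiveness(F_values, y_true, classes=(0.0, 1.0)):
    """Fraction of test tuples whose nearest-class prediction matches the true class."""
    F_values = np.asarray(F_values, dtype=float)
    classes = np.asarray(classes, dtype=float)
    predicted = classes[np.argmin(np.abs(F_values[:, None] - classes[None, :]), axis=1)]
    return float(np.mean(predicted == np.asarray(y_true, dtype=float)))

# e.g. effectiveness(F(TST_X), TST_y) over the 235 tuples of TST
```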

5 Conclusions

A method that allows us to find a polynomial model of an unknown function from a set of experimental tuples was explored. The allowed powers of the monomials in such a polynomial were derived from the Universal Approximation Theorem. The coefficients of the monomials were obtained by exploring the space of such monomials with the EGA. The polynomial represented by every individual is fitted with the Ascent Algorithm (AA), which yields the coefficients of best fit under the L∞ norm. From these coefficients the L2 error is calculated and becomes the fitness function of the GA. In it, every individual represents a plausible polynomial in which the number of terms (T) is given by a neural network (NNt) trained with the best values of T (3 ≤ T ≤ 13) found for 32 representative real-world data sets; to this end \( 352\;(11 \times 32) \) polynomials were calculated. NNt yields T when the number of tuples, the number of attributes, the size of the compressed data set and the compression ratio are input. The method was used to solve a case study where the polynomial, evaluated over the test data set, yielded an effectiveness ratio of 93.83%. This illustrates a method which yields the number of terms, the coefficient of each term and the degrees of every one of the independent variables in it, purely by machine learning.