Abstract
A common task in data analysis is to find a data sample whose properties allow us to infer the parameters of the underlying population. The most frequent dilemma related to sampling is how to determine the optimal size of the sample. Typical solutions rely on asymptotic results derived from the Central Limit Theorem. However, the effectiveness of such methods is limited by several considerations, such as the sampling strategy (simple, stratified, cluster-based, etc.), the size of the population, or even the dimensionality of the data space. To avoid these constraints, we propose a method based on a measure of the information content of the data in terms of Shannon’s entropy. Our idea is to find the optimal sample of size N whose information is as close as possible to the information of the population, subject to several constraints. Finding such a sample represents a hard optimization problem whose feasible space precludes the use of traditional optimization techniques. To solve it, we resort to Genetic Algorithms. We test our method on synthetic datasets, and the results show that it is effective. For completeness, we also apply it to a dataset from a real problem; the results confirm the effectiveness of our proposal and suggest a range of further applications.
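The idea outlined above can be illustrated with a minimal sketch: estimate the Shannon entropy of the population and of a candidate sample from histograms, and use a simple genetic search over index subsets of fixed size N to minimize the entropy gap between the two. Note that this is an assumption-laden toy, not the authors' EGA: the function names (`shannon_entropy`, `find_sample`), the histogram binning, and the crossover/mutation scheme are all hypothetical simplifications for one-dimensional data.

```python
import math
import random
from collections import Counter

def shannon_entropy(data, bins=10):
    """Estimate Shannon entropy of a 1-D dataset via a fixed-width histogram."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range
    counts = Counter(min(int((x - lo) / width), bins - 1) for x in data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_gap(population, sample_idx, bins=10):
    """Objective: |H(population) - H(sample)| for a candidate index subset."""
    sample = [population[i] for i in sample_idx]
    return abs(shannon_entropy(population, bins) - shannon_entropy(sample, bins))

def find_sample(population, n, generations=200, pop_size=30, seed=0):
    """Toy genetic search for a size-n subset minimizing the entropy gap."""
    rng = random.Random(seed)
    idx = list(range(len(population)))
    candidates = [rng.sample(idx, n) for _ in range(pop_size)]
    for _ in range(generations):
        candidates.sort(key=lambda s: entropy_gap(population, s))
        survivors = candidates[: pop_size // 2]  # elitist selection
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = rng.sample(survivors, 2)
            child = list(set(a) | set(b))        # crossover: merge parent indices
            rng.shuffle(child)
            child = child[:n]                    # repair to fixed sample size n
            if rng.random() < 0.2:               # mutation: swap in a fresh index
                pos = rng.randrange(n)
                child[pos] = rng.choice([i for i in idx if i not in child])
            children.append(child)
        candidates = survivors + children
    best = min(candidates, key=lambda s: entropy_gap(population, s))
    return [population[i] for i in best]
```

Keeping the chromosome as a set of row indices (rather than a bitmask over the whole population) makes the fixed-size constraint on N easy to maintain after crossover and mutation.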
Notes
- 1.
To avoid ambiguity, the word “population” and the symbol P refer to the dataset to be sampled, rather than to the set of candidate solutions of the EGA; the latter set is denoted C.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Aldana-Bobadilla, E., Alfaro-Pérez, C. (2015). Finding the Optimal Sample Based on Shannon’s Entropy and Genetic Algorithms. In: Sidorov, G., Galicia-Haro, S. (eds.) Advances in Artificial Intelligence and Soft Computing. MICAI 2015. Lecture Notes in Computer Science, vol. 9413. Springer, Cham. https://doi.org/10.1007/978-3-319-27060-9_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27059-3
Online ISBN: 978-3-319-27060-9