Finding the Optimal Sample Based on Shannon’s Entropy and Genetic Algorithms

  • Conference paper
Advances in Artificial Intelligence and Soft Computing (MICAI 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9413)

Included in the following conference series: Mexican International Conference on Artificial Intelligence (MICAI)

Abstract

A common task in data analysis is to find an appropriate data sample whose properties allow us to infer the parameters of the data population. The most frequent dilemma related to sampling is how to determine the optimal size of the sample. To solve it, there are typical methods based on asymptotic results from the Central Limit Theorem. However, the effectiveness of such methods is bounded by several considerations, such as the sampling strategy (simple, stratified, cluster-based, etc.), the size of the population, or even the dimensionality of the data space. In order to avoid such constraints, we propose a method based on a measure of the information of the data in terms of Shannon’s entropy. Our idea is to find the optimal sample of size N whose information is as similar as possible to the information of the population, subject to several constraints. Finding such a sample is a hard optimization problem whose feasible space disallows the use of traditional optimization techniques. To solve it, we resort to genetic algorithms. We test our method on synthetic datasets; the results show that it is suitable. For completeness, we also apply it to a dataset from a real problem; the results confirm the effectiveness of our proposal and allow us to envision different applications.
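As a rough illustration of the approach described in the abstract (not the authors’ EGA implementation), the following sketch estimates Shannon’s entropy from a fixed-bin histogram and runs a plain genetic algorithm over index subsets so that the sample’s entropy approaches the population’s. All names and parameters (ga_sample, shannon_entropy, sample_size, n_bins, the truncation selection and index-level crossover) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def shannon_entropy(x, bin_edges):
    """Entropy (in bits) of a 1-D dataset, estimated from a fixed-bin histogram."""
    counts, _ = np.histogram(x, bins=bin_edges)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def fitness(idx, data, bin_edges, target):
    """Negative gap between the candidate sample's entropy and the population's."""
    return -abs(shannon_entropy(data[idx], bin_edges) - target)

def ga_sample(data, sample_size=100, n_bins=20, pop_size=40,
              generations=200, mutation_rate=0.05):
    bin_edges = np.histogram_bin_edges(data, bins=n_bins)   # shared bins
    target = shannon_entropy(data, bin_edges)                # population entropy
    # Each candidate solution is a set of row indices defining one sample.
    candidates = [rng.choice(len(data), sample_size, replace=False)
                  for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c, data, bin_edges, target) for c in candidates]
        order = np.argsort(scores)[::-1]
        parents = [candidates[i] for i in order[:pop_size // 2]]  # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(len(parents), 2, replace=False)
            pool = np.union1d(parents[a], parents[b])             # index-level crossover
            child = rng.choice(pool, sample_size, replace=False)
            n_mut = rng.binomial(sample_size, mutation_rate)      # mutate a few indices
            if n_mut:
                keep = rng.choice(child, sample_size - n_mut, replace=False)
                fresh = rng.choice(np.setdiff1d(np.arange(len(data)), keep),
                                   n_mut, replace=False)
                child = np.concatenate([keep, fresh])
            children.append(child)
        candidates = parents + children
    best = max(candidates, key=lambda c: fitness(c, data, bin_edges, target))
    return data[best]

# Usage: draw an entropy-matched sample from a synthetic bimodal population.
data = np.concatenate([rng.normal(0, 1, 5000), rng.normal(5, 2, 5000)])
sample = ga_sample(data)
edges = np.histogram_bin_edges(data, bins=20)
print(shannon_entropy(data, edges), shannon_entropy(sample, edges))
```

Fixing the bin edges on the full dataset keeps the two entropy estimates comparable; any binning rule could be substituted here.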


Notes

  1. To avoid ambiguities, the word population and the term P refer to the dataset to be sampled rather than the set of candidate solutions of the EGA; the latter set is denoted C.


Author information


Corresponding author

Correspondence to Edwin Aldana-Bobadilla.



Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Aldana-Bobadilla, E., Alfaro-Pérez, C. (2015). Finding the Optimal Sample Based on Shannon’s Entropy and Genetic Algorithms. In: Sidorov, G., Galicia-Haro, S. (eds.) Advances in Artificial Intelligence and Soft Computing. MICAI 2015. Lecture Notes in Computer Science, vol 9413. Springer, Cham. https://doi.org/10.1007/978-3-319-27060-9_29

  • DOI: https://doi.org/10.1007/978-3-319-27060-9_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27059-3

  • Online ISBN: 978-3-319-27060-9

  • eBook Packages: Computer Science, Computer Science (R0)
