Skip to main content

Realistic Synthetic Data Generation: The ATEN Framework

  • Conference paper
  • First Online:
Biomedical Engineering Systems and Technologies (BIOSTEC 2018)

Abstract

Getting access to real medical data for research is notoriously difficult. Even when data exist they are usually incomplete and subject to restrictions due to confidentiality and privacy. Synthetic data (SD) are best replacements for real data but must be verifiably realistic. There is little or no investigation into systematically achieving realism in SD. This work investigates this problem, and contributes the ATEN framework, which incorporates three component approaches: (1) THOTH for synthetic data generation (SDG); (2) RA for characterising realism is SD, and (3) HORUS for validating realism in SD. The framework is found promising after its use in generating the realistic synthetic EHR (RS-EHR) for labour and birth. This framework is significant in guaranteeing realism in SDG projects. Future efforts focus on further validation of ATEN in a controlled multi-stream SDG process.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. McGraw-Hill: McGraw-Hill Dictionary of Scientific and Technical Terms, 6th edn. McGraw-Hill, London (2003)

    Google Scholar 

  2. Rubin, D.: Discussion: statistical disclosure limitation. J. Off. Stat. 9, 461–468 (1993)

    Google Scholar 

  3. Alter, H.: Creation of a synthetic data set by linking records of the Canadian survey of consumer finances with the family expenditure survey. Ann. Econ. Soc. Meas. 3(2), 373–397 (1994)

    Google Scholar 

  4. Wolff, E.: Estimates of the 1969 size distribution of household wealth in the US from a synthetic data base Trans.). In: Smith, J. (ed.) Modelling the Distribution and Intergenerational Transmission of Wealth. University of Chicago Press, Chicago (1980)

    Google Scholar 

  5. Green, P.E., Rao, V.R.: Conjoint measurement for quantifying judgmental data. J. Mark. Res. 8(3), 355–363 (1971)

    Google Scholar 

  6. Birkin, M., Clarke, M.: SYNTHESIS – a synthetic spatial information system for urban and regional analysis: methods and examples. Environ. Plan. 20(1), 1645–1671 (1998)

    Google Scholar 

  7. Stedinger, J., Taylor, M.: Synthetic streamflow generation: model verification and validation. Water Resour. Res. 18(4), 909–918 (1982)

    Article  Google Scholar 

  8. Geweke, J., Porter-Hudak, S.: The estimation and application of long memory series models. J. Time Ser. Anal. 4(4), 221–238 (1983)

    Article  MathSciNet  MATH  Google Scholar 

  9. Graham, V.A., Hollands, K., Unny, T.E.: A time series model for Kt with application to global synthetic weather generation. Sol. Energy 40(2), 83–92 (1988)

    Article  Google Scholar 

  10. Delleur, J., Kavvas, M.: Stochastic models for monthly rainfall forecasting and synthetic generation. J. Appl. Meteorol. 17, 1528–1536 (1978)

    Article  Google Scholar 

  11. Barse, E., Kvarnstrom, H., Jonsson, E.: Synthesizing test data for fraud detection systems. Paper presented at the 19th Annual Computer Security Applications Conference (2003)

    Google Scholar 

  12. Houkjaer, K., Torp, K., Wind, R.: Simple and realistic data generation. Paper presented at the VLDB 2006 (2006)

    Google Scholar 

  13. Mouza, C., et al.: Towards an automatic detection of sensitive information in a database. Paper presented at the 2nd International Conference on Advances in Database Knowledge and Database Applications (2010)

    Google Scholar 

  14. Whiting, M., Haack, J., Varley, C.: Creating realistic, scenario-based synthetic data for test and evaluation of information analytics software. Paper presented at the 2008 Workshop on Beyond Time and Errors: Novel Evaluation Methods for Information Visualisation (BELIV 2008) (2008)

    Google Scholar 

  15. Gargiulo, F., Ternes, S., Huet, S., Deffuant, G.: An iterative approach for generating statistically realistic populations of households. PLOS ONE 5(1), e8828 (2010)

    Article  Google Scholar 

  16. Srikanthan, R.M.T.: Stochastic generation of annual, monthly and daily climate data: a review. Hydrol. Earth Syst. Sci. Discuss. 5(4), 653–670 (2001)

    Article  Google Scholar 

  17. Wan, L., Zhu, J., Bertino, L., Wang, H.: Initial ensemble generation and validation for ocean data assimilation using HYCOM in the Pacific. Ocean Dyn. 58, 81 (2008)

    Article  Google Scholar 

  18. Killourhy, K., Maxion, R.: Toward realistic and artefact-free insider-threat data. Paper presented at the 23rd Annual Computer Security Applications Conference (CSAC) (2007)

    Google Scholar 

  19. Sperotto, A., Sadre, R., Van Vliet, F., Pras, A.: A labelled data set for flow-based intrusion detection. Paper presented at the 9th IEEE International Workshop on IP Operations and Management (IPOM 2009) (2009)

    Google Scholar 

  20. Zanero, S.: Flaws and frauds in the evaluation of IDS/IPS technologies. Paper presented at the Forum of Incident Response and Security Teams (FIRST 2007) (2007)

    Google Scholar 

  21. Ascoli, G., Krichmar, J., Nasuto, S., Senft, S.: Generation, description and storage of dendritic morphology data. Philos. Trans. R. Soc. Lond. 365, 1131–1145 (2001)

    Article  Google Scholar 

  22. Bozkurt, M., Harman, M.: Automatically generating realistic test input from web services. Paper presented at the 6th International Symposium on Service Oriented System Engineering (2011)

    Google Scholar 

  23. Drechsler, J., Reiter, J.: An empirical evaluation of easily implemented, non-parametric methods for generating synthetic datasets. Comput. Stat. Data Anal. 55(12), 3232–3243 (2011)

    Article  Google Scholar 

  24. Gymrek, M., McGuire, A., Golan, D., Halperin, E., Erlich, Y.: Identifying personal genomes by surname. Science 339(6117), 321–324 (2013). https://doi.org/10.1126/science.1229566

    Article  Google Scholar 

  25. Ohm, P.: Broken promises of privacy: responding to the surprising failure of anonymisation. UCLA Law Rev. 57, 1701 (2010)

    Google Scholar 

  26. Sweeney, L., Abu, A., Winn, J.: Identifying Participants in the Personal Genome Project by Name. Data Privacy Lab, Harvard University (2013)

    Google Scholar 

  27. Lundin, E., Kvarnström, H., Jonsson, E.: A synthetic fraud data generation methodology. In: Deng, R., Bao, F., Zhou, J., Qing, S. (eds.) ICICS 2002. LNCS, vol. 2513, pp. 265–277. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36159-6_23

    Chapter  Google Scholar 

  28. Stratigopoulos, H., Mir, S., Makris, Y.: Enrichment of limited training sets in machine-learning-based analog/RF test. Paper presented at the DATE 2009 (2009)

    Google Scholar 

  29. Wu, X., Wang, Y., Zheng, Y.: Privacy preserving database application testing. Paper presented at the WPES 2003 (2003)

    Google Scholar 

  30. McLachlan, S., et al.: Learning health systems: the research community awareness challenge. BCS J. Innov. Health Inform. 25(1), 038–040 (2018)

    Article  Google Scholar 

  31. Jaderberg, M., K. Simonyan, A. Vedaldi and A. Zisserman. (2014). Synthetic data and artificial neural networks for natural scene text recognition. arXiv:1406.2227

  32. Penduff, T., Barnier, B., Molines, J., Madec, G.: On the use of current meter data to assess the realism of ocean model simulations. Ocean Model. 11(3), 399–416 (2006)

    Article  Google Scholar 

  33. Putnam, H.: Realism and reason. In: Proceedings and Addresses of the American Philosophical Association, vol. 50, no. 6, pp. 483–498 (1977)

    Article  Google Scholar 

  34. Barlas, Y.: Formal aspects of model validity and validation in system dynamics. Syst. Dyn. Rev. 12(3), 183–210 (1996)

    Article  Google Scholar 

  35. Carley, K.: Validating Computational Models. Carnegie Mellon University, Cambridge (1996)

    Google Scholar 

  36. Brinkhoff, T.: Generating traffic data. IEEE Data Eng. Bull. 26(2), 19–25 (2003)

    Google Scholar 

  37. Giannotti, F., Mazzoni, A., Puntoni, S., Renso, C.: Synthetic generation of cellular network positioning data. Paper presented at the 13th Annual ACM International Workshop on Geographic Information Systems (2005)

    Google Scholar 

  38. Stodden, V.: The scientific method in practice: reproducibility in the computational sciences. SSRN Paper 1550193. MIT Sloan School of Management (2010)

    Google Scholar 

  39. Collins, H.: Changing Order: Replication and Induction in Scientific Practice. University of Chicago Press, Chicago (1992)

    Google Scholar 

  40. Moss, P.: Can there be validity without reliability? Educ. Res. 23(2), 5–12 (1994)

    Article  Google Scholar 

  41. Tsvetovat, M., Carley, K.: Generation of realistic social network datasets for testing of analysis and simulation tools. Technical report 9. DTIC (2005)

    Google Scholar 

  42. Richardson, I., Thomson, M., Infield, D.: A high-resolution domestic building occupancy model for energy demand simulations. Energy Build. 40(8), 1560–1566 (2008)

    Article  Google Scholar 

  43. Domingo-Ferrer, J.: Marginality: a numerical mapping for enhanced exploitation of taxonomic attributes. In: Torra, V., Narukawa, Y., López, B., Villaret, M. (eds.) MDAI 2012. LNCS (LNAI), vol. 7647, pp. 367–381. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34620-0_33

    Chapter  Google Scholar 

  44. Efstratiadis, A., Dialynas, Y., Kozanis, S., Koutsoyiannis, D.: A multivariate stochastic model for the generation of synthetic time series at multiple time scales reproducing long-term persistence. Environ. Model. Softw. 62, 139–152 (2014)

    Article  Google Scholar 

  45. Van den Bulcke, T., et al.: SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinform. 7(1), 43 (2006)

    Article  Google Scholar 

  46. Mateo-Sanz, J.M., Martínez-Ballesté, A., Domingo-Ferrer, J.: Fast generation of accurate synthetic microdata. In: Domingo-Ferrer, J., Torra, V. (eds.) PSD 2004. LNCS, vol. 3050, pp. 298–306. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-25955-8_24

    Chapter  Google Scholar 

  47. Gafurov, T., Usaola, J., Prodanovic, M.: Incorporating spatial correlation into stochastic generation of solar radiation data. Sol. Energy 115, 74–84 (2015)

    Article  Google Scholar 

  48. Brissette, F.P., Khalili, M., Leconte, R.: Efficient stochastic generation of multi-site synthetic precipitation data. J. Hydrol. 345(3), 121–133 (2007)

    Article  Google Scholar 

  49. Gainotti, S., et al.: Improving the informed consent process in international collaborative rare disease research: effective consent for effective research. Eur. J. Hum. Genet. 24, 1248 (2016)

    Article  Google Scholar 

  50. Arifin, S.M.N., Madey, G.R.: Verification, validation, and replication methods for agent-based modeling and simulation: lessons learned the hard way! In: Yilmaz, L. (ed.) Concepts and Methodologies for Modeling and Simulation. SFMA, pp. 217–242. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-15096-3_10

    Chapter  Google Scholar 

  51. Greene, J.C., Caracelli, V., Graham, W.F.: Toward a conceptual framework for mixed-method evaluation designs. Educ. Eval. Policy Anal. 11(3), 255–274 (1989)

    Article  Google Scholar 

  52. McLachlan, S., Dube, K., Gallagher, T., Daley, B., Walonoski, J.: The ATEN framework for creating the realistic synthetic electronic health record. Paper presented at the 11th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2018), Madiera, Portugal (2018)

    Google Scholar 

  53. Lydiard, T.: Overview of the current practice and research initiatives for the verification and validation of KBS. Knowl. Eng. Rev. 7(2), 101–113 (1992)

    Article  Google Scholar 

  54. Ishigami, M., Cumings, J., Zetti, A., Chen, S.: A simple method for the continuous production of carbon nanotubes. Chem. Phys. Lett. 319(5), 457–459 (2000)

    Article  Google Scholar 

  55. Mahmoud, E.: Accuracy in forecasting: a survey. J. Forecast. 3(2), 139–159 (1984)

    Article  Google Scholar 

  56. Nicoletti, I., Migliorati, G., Pagliacci, M., Grignani, F., Riccardi, C.: A rapid and simple method for measuring thymocyte apoptosis by propidium iodide staining and flow cytometry. J. Immunol. Methods 139(2), 271–279 (1991)

    Article  Google Scholar 

  57. Rosevear, A.: Immobilised biocatalysts – a critical review. J. Chem. Technol. Biotechnol. 34(3), 127–150 (1984)

    Article  Google Scholar 

  58. Parnas, D., Clements, P.: A rational design process: how and why to fake it. IEEE Trans. Softw. Eng. 2, 251–257 (1986)

    Article  Google Scholar 

  59. Winkler, W.E.: Masking and re-identification methods for public-use microdata: overview and research problems. In: Domingo-Ferrer, J., Torra, V. (eds.) PSD 2004. LNCS, vol. 3050, pp. 231–246. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-25955-8_18

    Chapter  Google Scholar 

  60. Andoulsi, I., Wilson, P.: Understanding liability in eHealth: towards greater clarity at European Union level. In: George, C., Whitehouse, D., Duquenoy, P. (eds.) eHealth: Legal, ethical and governance challenges, pp. 165–180. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-22474-4_7

    Chapter  Google Scholar 

  61. Mwogi, T., Biondich, P., Grannis, S.: An evaluation of two methods for generating synthetic HL7 segments reflecting real-world health information exchange transactions. Paper presented at the AMIA Annual Symposium Proceedings (2014)

    Google Scholar 

  62. McLachlan, S., Dube, K., Gallagher, T.: Using CareMaps and health statistics for generating the realistic synthetic electronic healthcare record. Paper presented at the International Conference on Healthcare Informatics (ICHI 2016), Chicago, USA (2016)

    Google Scholar 

  63. Cassa, C., Olson, K., Mandl, K.: System to generate semisynthetic data sets of outbreak clusters for evaluation of outbreak-detection performance. Morb. Mortal. Wkly Rep. (MMWR) 53, 231 (2004)

    Google Scholar 

  64. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: Knowledge discovery and data mining: towards a unifying framework. KDD 96, 82–88 (1996)

    Google Scholar 

  65. Fernandez-Arteaga, V., et al.: Association between completed suicide and environmental temperature in a Mexican population, using the KDD approach. Comput. Methods Programs Biomed. 135, 219–224 (2016)

    Article  Google Scholar 

  66. Holzinger, A., Dehmer, M., Jurisica, I.: Knowledge discovery and interactive data mining in Bopinformatics: state-of-the-art, future challenges and research directions. BMC Bioinform. 15(6), I1 (2014)

    Article  Google Scholar 

  67. Mitra, S., Pal, S., Mitra, P.: Data mining in soft computing framework: a survey. IEEE Trans. Neural Netw. 13(1), 3–14 (2002)

    Article  Google Scholar 

  68. Nijssen, G.M., Halpin, T.A.: Conceptual Schema and Relational Database Design: A Fact Oriented Approach. Prentice Hall Inc., Upper Saddle River (1989)

    Google Scholar 

  69. Han, J., Cai, Y., Cercone, N.: Data-driven discovery of quantitative rules in relational databases. IEEE Trans. Knowl. Data Eng. 5(1), 29–40 (1993)

    Article  Google Scholar 

  70. Sanderson, M., Croft, B.: Deriving concept hierarchies from text. Paper presented at the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1999)

    Google Scholar 

  71. Barnes, C.A.: Concepts Hierarchies for Extensible Databases. Naval Postgraduate School, Monterey (1990)

    Google Scholar 

  72. Ganter, B., Willie, R.: Applied lattice theory: formal concept analysis. In: General Latice Theory. Birkhauser, Basel (1997)

    Google Scholar 

  73. Rodriguez-Jiminez, J., Cordero, P., Enciso, M., Rudolph, S.: Concept lattices with negative information: a characterisation theorem. Inf. Sci. 369(51), 51–62 (2016)

    Article  Google Scholar 

  74. Bex, G., Neven, F., Schwentick, T., Tuyls, K.: Inference of concise DTDs from XML data. Paper presented at the 32nd International Conference on Very Large Databases (2006)

    Google Scholar 

  75. Laranjeiro, N., Vieira, M., Madeira, H.: Improving web services robustness. Paper presented at the IEEE International Conference on Web Services ICWS 2009 (2009)

    Google Scholar 

  76. Oreskes, N., Shrader-Frechette, K., Belitz, K.: Verification, validation and confirmation of numerical models in the earth sciences. Science 263(5147), 641–646 (1994)

    Article  Google Scholar 

  77. McLachlan, S.: Realism in synthetic data generation. Master of Philosophy in Science MPhil, Massey University, Palmerston North, New Zealand (2017). Available from database

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Scott McLachlan .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

McLachlan, S., Dube, K., Gallagher, T., Simmonds, J.A., Fenton, N. (2019). Realistic Synthetic Data Generation: The ATEN Framework. In: Cliquet Jr., A., et al. Biomedical Engineering Systems and Technologies. BIOSTEC 2018. Communications in Computer and Information Science, vol 1024. Springer, Cham. https://doi.org/10.1007/978-3-030-29196-9_25

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-29196-9_25

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-29195-2

  • Online ISBN: 978-3-030-29196-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics