Abstract
Getting access to real medical data for research is notoriously difficult. Even when data exist, they are usually incomplete and subject to restrictions due to confidentiality and privacy. Synthetic data (SD) are the best replacement for real data but must be verifiably realistic. There has been little or no investigation into systematically achieving realism in SD. This work investigates this problem and contributes the ATEN framework, which incorporates three component approaches: (1) THOTH for synthetic data generation (SDG); (2) RA for characterising realism in SD; and (3) HORUS for validating realism in SD. The framework was found promising after its use in generating the realistic synthetic EHR (RS-EHR) for labour and birth. This framework is significant for guaranteeing realism in SDG projects. Future efforts will focus on further validation of ATEN in a controlled multi-stream SDG process.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
McLachlan, S., Dube, K., Gallagher, T., Simmonds, J.A., Fenton, N. (2019). Realistic Synthetic Data Generation: The ATEN Framework. In: Cliquet Jr., A., et al. Biomedical Engineering Systems and Technologies. BIOSTEC 2018. Communications in Computer and Information Science, vol 1024. Springer, Cham. https://doi.org/10.1007/978-3-030-29196-9_25
DOI: https://doi.org/10.1007/978-3-030-29196-9_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29195-2
Online ISBN: 978-3-030-29196-9
eBook Packages: Computer Science, Computer Science (R0)