Studies in the use of data mining, prediction algorithms, and a universal exchange and inference language in the analysis of socioeconomic health data

https://doi.org/10.1016/j.compbiomed.2019.103369

Highlights

  • Socioeconomic factors are important determinants of health on the national and global scale.

  • Data mining and prediction algorithms can stand beside classical correlation analysis in the analysis of socioeconomic data.

  • The method used was somewhat unusual, having a basis in Dirac’s quantum mechanics and a Theory of Expected Information.

  • A significant negative association found between population health and equity was surprising, at least to the authors.

  • To explore this, comparison is made between classical Pearson’s correlation and the possibility of Simpson’s paradox.

Abstract

While clinical and biomedical information in digital form has been escalating, it is socioeconomic factors that are important determinants of health on the national and global scale. We show how collective use of data mining and prediction algorithms to analyze socioeconomic population health data can stand beside classical correlation analysis in routine data analysis. The underlying theoretical basis is the Dirac notation and algebra that is a scientific standard but unusual outside of the physical sciences, combined with a theory of expected information first developed for analyzing sparse data but still largely confined to bioinformatics. The latter was important here because the records analyzed (which are for US counties and equivalents, not patients) are very few by contemporary data mining standards. The approach is very unlikely to be familiar to socioeconomic researchers, so the theory and the advantages of our inference nets over the Bayes Net are reviewed here, mostly using socioeconomic examples. While our expertise and focus are in novel analytical methods rather than socioeconomics per se, a significant negative (countertrending) relationship between population health and equity was initially surprising, at least to the present authors. This encouraged deeper exploration, including that of the relationship between our data mining methods and traditional Pearson's correlation. The latter is susceptible to giving wrong conclusions if a phenomenon called Simpson's paradox applies, so this is also investigated. Also discussed is that, even for very few records, associative data mining can still demand significant computational resources due to a combinatorial explosion.
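The interplay between Pearson's correlation and Simpson's paradox mentioned in the abstract can be illustrated with a minimal sketch. The numbers below are invented purely for illustration and are not taken from the county data; the names `pearson`, `groups`, `stratum_a`, and `stratum_b` are hypothetical. The sketch shows how two strata that each trend positively can yield a negative pooled correlation.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical (equity score, health score) pairs for two strata.
# These values are made up to demonstrate the effect, nothing more.
groups = {
    "stratum_a": [(1, 8), (2, 9), (3, 10)],
    "stratum_b": [(6, 1), (7, 2), (8, 3)],
}

# Correlation computed separately within each stratum.
within = {name: pearson(*zip(*points)) for name, points in groups.items()}

# Correlation computed after pooling all records together.
pooled_points = [p for points in groups.values() for p in points]
pooled = pearson(*zip(*pooled_points))

for name, r in within.items():
    print(f"{name}: r = {r:+.3f}")
print(f"pooled: r = {pooled:+.3f}")
```

In this toy case each stratum trends upward while the pooled coefficient is negative, which is why a surprising pooled correlation is worth checking against the strata (here, for example, the urban and rural classes) before it is interpreted.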

Section snippets

Background

The explosive growth in medical data in digital form has primarily been in regard to clinical and biomedical data [1]. However, this has been primarily of a clinical and scientific nature. The ability of individuals to access and afford the fruits of growing medical knowledge is also a strong determinant of health. Data concerning that is essentially socioeconomic health data (SHD). Of course, researchers and organizations have studied data of this general kind for many years. Notably, in

The classical basis and notation

The underpinnings of the basic method (as well as some further aspects by us and others [44], [45], [46], [47], [48], [49], [50] used in the present paper) do not replace “classical” data analysis and knowledge representation. However, we argue that they generalize and extend them. Our starting point is the familiar conditional probability, a purely real-valued scalar on the interval 0 … 1, primarily historically introduced by de Moivre. It is also used as a building block and variable in BNs
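As a reminder of the classical starting point named above, the following minimal sketch estimates conditional probabilities in both directions from co-occurrence counts. It uses plain frequency ratios, which is an assumption made for brevity and is not the authors' expected-information treatment for sparse data; the counts and the function name `conditional_probabilities` are hypothetical.

```python
def conditional_probabilities(n_total, n_a, n_b, n_ab):
    """Estimate P(A), P(B), P(A|B), and P(B|A) from simple counts.

    n_total: number of records
    n_a, n_b: records in which A (respectively B) holds
    n_ab: records in which both A and B hold
    """
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_ab = n_ab / n_total
    return {
        "P(A)": p_a,
        "P(B)": p_b,
        "P(A|B)": p_ab / p_b,  # probability of A given B
        "P(B|A)": p_ab / p_a,  # probability of B given A
    }

# Hypothetical counts for 500 records; illustrative only.
probs = conditional_probabilities(n_total=500, n_a=200, n_b=150, n_ab=90)
for name, value in probs.items():
    print(f"{name} = {value:.3f}")
```

The two directions of conditioning generally differ, and they are tied together by Bayes' rule, P(A|B) P(B) = P(B|A) P(A).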

Data

The original data [3] comprised one record for each of 500 counties or similar regions that are top scorers in population health, additionally split into 4 classes of 100 that represent urban high performing, urban up-and-coming, rural high performing, and rural up-and-coming. Along the above grouping (urban high performing, etc. plus the class Overall) the data comprises only scores for population health with 9 other types of score such as Economy, Infrastructure, and so on each of which
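Even with only ten score types (population health plus nine others), the number of candidate associations grows quickly. The sketch below counts conjunctions of attribute values under the assumption, made here only for illustration, that each score is binned into a fixed number of levels; the function name `count_conjunctions` and the choice of levels are hypothetical.

```python
from math import comb

def count_conjunctions(n_attributes, n_levels, max_order=None):
    """Count distinct conjunctions of attribute=level terms.

    A conjunction of order r picks r of the attributes and one level
    for each, giving comb(n_attributes, r) * n_levels ** r possibilities.
    """
    if max_order is None:
        max_order = n_attributes
    return sum(
        comb(n_attributes, r) * n_levels ** r
        for r in range(1, max_order + 1)
    )

# Ten score types, with hypothetical binning into 2 or 5 levels each.
for levels in (2, 5):
    total = count_conjunctions(10, levels)
    pairs_only = count_conjunctions(10, levels, max_order=2)
    print(f"{levels} levels: {total} conjunctions, {pairs_only} up to pairs")
```

The total equals (n_levels + 1) ** n_attributes - 1, so the count depends on the number of attributes and levels rather than on the number of records, which is consistent with the abstract's remark that few records can still imply heavy computation.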

Preliminary studies and overview of principal statistical results

Although this paper is primarily methodological (and not medical or socioeconomic research), it is instructive to follow, with some discussion, a story for real data in a practical use-case context. A very early step in most studies is to generate a large Q-UEL tag (a collection of summary data for each specified study) called the SSMETHOD (statistical summary) tag. The following tag labeled POPHEALTH-SURVEY-SUMMARY was generated by DiracMiner in its original Perl version [20] which carries the

Comparing and contrasting our main approach with other methods

In this paper we brought together data mining of both structured data (e.g. spreadsheet analysis) and unstructured data (i.e. text analytics, natural language processing), and automated construction of probabilistic inference nets that in this study used probabilities derived from the structured approach. However, structured data mining was the tool most used. How might our particular variation on that method be classified? Most broadly described, the main feature used here was categorical

Primary conclusions

The combined use of tools and modes of use described in this paper appears capable of adding significant value to the analysis of socioeconomic health data. Because the significant negative correlations between scores for equity and population health, economy etc. were unexpected at least to the authors, confirmation by several techniques and measures, including long-established Pearson's correlation, was particularly important for us. So was some consideration of Simpson's paradox that can

References (95)

  • President's Council of Advisors on Science and Technology, Report to the President: Realizing the Full Potential of Health Information Technology to Improve Healthcare for Americans: The Path Forward

  • P.A.M. Dirac

    A new notation for quantum mechanics

    Math. Proc. Camb. Philos. Soc.

    (1939)
  • B. Robson

    The new physician as unwitting quantum mechanic: is adapting Dirac's inference system best practice for personalized medicine, genomics and proteomics?

    J. Proteome Res.

    (2007)
  • B. Robson
    (2009)
  • B. Robson
    (2009)
  • B. Robson

    Towards Automated Reasoning for Drug Discovery and Pharmaceutical Business Intelligence

    (2012)
  • B. Robson

    Towards new tools for pharmacoepidemiology

    Adv. Pharmacoepidemiol. Drug Saf.

    (2013)
  • S. Deckelman et al.
    (2015)
  • B. Robson et al.

    Considerations for a universal exchange language for healthcare

  • B. Robson et al.

    A universal exchange language for healthcare MedInfo ’13

  • B. Robson et al.

    Suggestions for a web based universal exchange and inference language for medicine. Continuity of patient care with PCAST disaggregation

    Comput. Biol. Med.

    (2014)
  • B. Robson

    POPPER, a simple programming language for probabilistic semantic inference in medicine

    Comput. Biol. Med.

    (2014)
  • B. Robson et al.

    Interesting things for computer systems to do: keeping and data mining millions of patient records, guiding patients and physicians, and passing medical licensing exams, Bioinformatics and Biomedicine (BIBM)

  • B. Robson et al.

    Data-mining to build a knowledge representation store for clinical decision support. Studies on curation and validation based on machine performance in multiple choice medical licensing examinations

    Comput. Biol. Med.

    (2015)
  • B. Robson et al.

    Studies of the role of a smart web for precision medicine supported by biobanking, personalized medicine, FTG

    Pers. Med.

    (2016)
  • B. Robson et al.

    Methods and Systems of a Hyperbolic-Dirac-Net-Based Bioingine Platform and Ensemble of Applications

    (2017)
  • J. Pearl

    Probabilistic Reasoning in Intelligent Systems

    (1988)
  • E.R. Harold et al.

    XML in a Nutshell

    (2004)
  • Position statement from the workshop on RDF as a universal healthcare exchange language held at the 2013 Semantic Technology and Business Conference, San Francisco, Yosemite Manifesto on RDF as a universal healthcare exchange language

  • P.A.M. Dirac

    The Principles of Quantum Mechanics

    (1958)
  • J. Bircher et al.

    Applying a complex adaptive system's understanding of health to primary care

    F1000 Res.

    (2016)
  • J.E. Stiglitz

    The rigged equality

    Sci. Am.

    (2018)
  • R.M. Sapolsky

    The health-wealth gap

    Sci. Am.

    (2018)
  • V. Eubanks

    Automating bias (how algorithms designed to alleviate poverty can perpetuate it instead)

    (2018)
  • J.K. Boyce

    The environmental cost of inequality

    (2018)
  • Organisation for Economic Co-operation and Development

    Educational Opportunity for All

    (2017)
  • The Legatum Institute

  • Emile Durkheim

    The Division of Labour in Society

    (1997)
  • J.J. Gerber et al.

    Sociology

    (2010)
  • B. Robson

    Analysis of the code relating sequence to conformation in globular proteins: theory and application of expected information

    Biochem. J.

    (1974)
  • The GOR Method

  • B. Robson

    Clinical and pharmacogenomic data mining: 3. Zeta theory as a general tactic for clinical bioinformatics

    J. Proteome Res.

    (2005)
  • K. Popper

    The Logic of Scientific Discovery

    (1934)

    This paper is provided to the community to promote the more general applications of the thinking of Professor Paul A. M. Dirac in human and animal medicine in accordance with the charter of The Dirac Foundation, to emphasize the advantages and simplicity of the basic form of the Hyperbolic Dirac Net, to encourage its use, and to propose at least some of the principles of the associated Q-UEL, a universal exchange language for medicine, as a basis for a standard for interoperability. These mathematical and engineering principles are used, amongst many others in an integrated way, in the algorithms and internal architectural features of the BioIngine.com, a distributed system developed by Ingine Inc. VA for the mining of, and inference from, Very Big Data for commercial purposes.
