Studies in the use of data mining, prediction algorithms, and a universal exchange and inference language in the analysis of socioeconomic health data

doi:10.1016/j.compbiomed.2019.103369

Computers in Biology and Medicine

Volume 112, September 2019, 103369

https://doi.org/10.1016/j.compbiomed.2019.103369

Highlights

• Socioeconomic factors are important determinants of health on the national and global scale.
• Data mining and prediction algorithms can stand beside classical correlation analysis in the analysis of socioeconomic data.
• The method used is somewhat unusual, having a basis in Dirac’s quantum mechanics and a Theory of Expected Information.
• A significant negative association found between population health and equity was surprising, at least to the authors.
• To explore this, the data mining results are compared with classical Pearson’s correlation, and the possibility of Simpson’s paradox is considered.

Abstract

While clinical and biomedical information in digital form has been escalating, it is socioeconomic factors that are important determinants of health on the national and global scale. We show how the collective use of data mining and prediction algorithms to analyze socioeconomic population health data can stand beside classical correlation analysis in routine data analysis. The underlying theoretical basis is the Dirac notation and algebra, a scientific standard that is nonetheless unusual outside the physical sciences, combined with a theory of expected information first developed for analyzing sparse data but still largely confined to bioinformatics. The latter was important here because the records analyzed (which are for US counties and equivalents, not patients) are very few by contemporary data mining standards. The approach is very unlikely to be familiar to socioeconomic researchers, so the theory and the advantages of our inference nets over the Bayes Net are reviewed here, mostly using socioeconomic examples. While our expertise and focus are in novel analytical methods rather than socioeconomics per se, a significant negative (countertrending) relationship between population health and equity was initially surprising, at least to the present authors. This encouraged deeper exploration, including of the relationship between our data mining methods and traditional Pearson's correlation. The latter is susceptible to giving wrong conclusions if a phenomenon called Simpson's paradox applies, so this is also investigated. Also discussed is the fact that, even for very few records, associative data mining can still demand significant computational resources due to a combinatorial explosion.

Section snippets

Background

The explosive growth in medical data in digital form has primarily concerned clinical and biomedical data [1], that is, data of a clinical and scientific nature. However, the ability of individuals to access and afford the fruits of growing medical knowledge is also a strong determinant of health. Data concerning that ability is essentially socioeconomic health data (SHD). Of course, researchers and organizations have studied data of this general kind for many years. Notably, in

The classical basis and notation

The underpinnings of the basic method (as well as some further aspects by us and others [44–50] used in the present paper) do not replace “classical” data analysis and knowledge representation. However, we argue that they generalize and extend them. Our starting point is the familiar conditional probability, a purely real-valued scalar on the interval 0 … 1, historically introduced primarily by de Moivre. It is also used as a building block and variable in BNs
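As a minimal illustration of the conditional probability that serves as the starting point here, the sketch below estimates P(A | B) by simple counting over categorical records. The records, the field names economy and health, and the function conditional_probability are hypothetical and are not taken from the paper's data or software. The example also shows that P(A | B) and P(B | A) generally differ.

```python
def conditional_probability(records, target, given):
    """Estimate P(target | given) by counting matching records.

    `target` and `given` are (field, value) pairs. This is a plain
    frequency estimate with no correction for sparse data.
    """
    given_field, given_value = given
    target_field, target_value = target
    conditioned = [r for r in records if r[given_field] == given_value]
    if not conditioned:
        raise ValueError("no records satisfy the condition")
    hits = sum(1 for r in conditioned if r[target_field] == target_value)
    return hits / len(conditioned)


# Hypothetical toy records (invented for illustration only).
records = [
    {"economy": "high", "health": "high"},
    {"economy": "high", "health": "high"},
    {"economy": "high", "health": "low"},
    {"economy": "low", "health": "high"},
    {"economy": "low", "health": "high"},
    {"economy": "low", "health": "low"},
    {"economy": "low", "health": "low"},
]

# P(health = high | economy = high)
p_forward = conditional_probability(
    records, ("health", "high"), ("economy", "high")
)
# P(economy = high | health = high)
p_backward = conditional_probability(
    records, ("economy", "high"), ("health", "high")
)
print(p_forward, p_backward)
```

Both values lie on the interval 0 … 1, but they are not interchangeable: conditioning in one direction says nothing, by itself, about the other.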

Data

The original data [3] comprised one record for each of 500 counties or similar regions that are top scorers in population health, additionally split into 4 classes of 100 that represent urban high performing, urban up-and-coming, rural high performing, and rural up-and-coming. Along the above grouping (urban high performing, etc. plus the class Overall) the data comprises only scores for population health with 9 other types of score such as Economy, Infrastructure, and so on each of which
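A small number of records does not imply a small search space for associative data mining. As a rough sketch, assume each of the 10 score types (population health plus 9 others) is binned into a fixed number of levels; this binning is an assumption made here for illustration, not the paper's stated procedure. The number of joint states over all combinations of two or more score types can then be counted as follows. The function name count_joint_states is hypothetical.

```python
from math import comb


def count_joint_states(n_attributes, n_levels, min_size=2):
    """Count joint states over all attribute subsets of size >= min_size.

    A subset of k attributes, each with n_levels possible values,
    has n_levels ** k joint states, and there are C(n_attributes, k)
    such subsets.
    """
    return sum(
        comb(n_attributes, k) * n_levels ** k
        for k in range(min_size, n_attributes + 1)
    )


n_scores = 10  # population health plus 9 other score types, per the text

# The level counts below are illustrative assumptions.
for levels in (2, 3, 5):
    print(levels, count_joint_states(n_scores, levels))
```

The count grows as (1 + n_levels) ** n_attributes, so it is driven by the number of attributes and levels rather than by the number of records.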

Preliminary studies and overview of principal statistical results

Although this paper is primarily methodological (and not medical or socioeconomic research), it is instructive to follow, with some discussion, a story for real data in a practical use-case context. A very early step in most studies is to generate a large Q-UEL tag (a collection of summary data for each specified study) called the SSMETHOD (statistical summary) tag. The following tag labeled POPHEALTH-SURVEY-SUMMARY was generated by DiracMiner in its original Perl version [20] which carries the

Comparing and contrasting our main approach with other methods

In this paper we brought together data mining of both structured data (e.g. spreadsheet analysis) and unstructured data (i.e. text analytics, natural language processing), and the automated construction of probabilistic inference nets, which in this study used probabilities derived from the structured approach. However, structured data mining was the tool most used. How might our particular variation on that method be classified? Most broadly described, the main feature used here was categorical

Primary conclusions

The combined use of tools and modes of use described in this paper appears capable of adding significant value to the analysis of socioeconomic health data. Because the significant negative correlations between scores for equity and population health, economy etc. were unexpected at least to the authors, confirmation by several techniques and measures, including long-established Pearson's correlation, was particularly important for us. So was some consideration of Simpson's paradox that can
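To make the concern about Simpson's paradox concrete, the sketch below computes Pearson's correlation on a small synthetic data set in which the association is positive within each of two groups but negative when the groups are pooled. All numbers, the group names, and the function pearson are invented for illustration; they are not the paper's data and do not reproduce its results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Synthetic (x, y) scores for two hypothetical groups of regions.
groups = {
    "group_a": ([1.0, 2.0, 3.0], [8.0, 9.0, 10.0]),
    "group_b": ([7.0, 8.0, 9.0], [1.0, 2.0, 3.0]),
}

# Correlation within each group separately.
r_within = {name: pearson(xs, ys) for name, (xs, ys) in groups.items()}

# Correlation after pooling both groups into one sample.
pooled_x = [v for xs, _ in groups.values() for v in xs]
pooled_y = [v for _, ys in groups.values() for v in ys]
r_pooled = pearson(pooled_x, pooled_y)

print(r_within, r_pooled)
```

Here the pooled coefficient is negative only because the two groups sit at different overall levels, which is why a pooled correlation should be checked against the within-group correlations before it is interpreted.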

References (95)

I.M. Mullins et al. Data mining and clinical data repositories: insights from a 667,000 patient data set. Comput. Biol. Med. (2006)
B. Robson. Hyperbolic Dirac nets for medical decision support. Theory, methods, and comparison with Bayes nets. Comput. Biol. Med. (2014)
B. Robson. Bidirectional general graphs for inference. Principles and implications for medicine. Comput. Biol. Med. (2019)
B. Robson et al. Suggestions for a web based universal exchange and inference language for medicine. Comput. Biol. Med. (2013)
B. Robson et al. Implementation of a web based universal exchange and inference language for medicine. Sparse data, probabilities and inference in data mining of clinical data repositories. Comput. Biol. Med. (2015)
B. Robson. Studies in using a universal exchange and inference language for evidence based medicine. Semi-automated learning and reasoning for PICO methodology, systematic review, and environmental epidemiology. Comput. Biol. Med. (2016)
B. Robson et al. Studies in the extensively automatic construction of large odds-based inference networks from structured data. Examples from medical, bioinformatics, and health insurance claims data. Comput. Biol. Med. (2018)
B. Robson et al. The Engines of Hippocrates. From the Dawn of Medicine to Medical and Pharmaceutical Informatics (2009)
S.B. Johnson. The Ghost Map: the Story of London's Most Terrifying Epidemic – and How it Changed Science, Cities and the Modern World, Riverhead (2006)
President's Council of Advisors on Science and Technology. Report to the President: Realizing the Full Potential of Health Information Technology to Improve Healthcare for Americans: The Path Forward
P.A.M. Dirac. A new notation for quantum mechanics. Math. Proc. Camb. Philos. Soc. (1939)
B. Robson. The new physician as unwitting quantum mechanic: is adapting Dirac's inference system best practice for personalized medicine, genomics and proteomics? J. Proteome Res. (2007)
B. Robson (2009)
B. Robson (2009)
B. Robson. Towards Automated Reasoning for Drug Discovery and Pharmaceutical Business Intelligence (2012)
B. Robson. Towards new tools for pharmacoepidemiology. Adv. Pharmacoepidemiol. Drug Saf. (2013)
S. Deckelman et al. (2015)
B. Robson et al. Considerations for a universal exchange language for healthcare
B. Robson et al. A universal exchange language for healthcare. MedInfo ’13
B. Robson et al. Suggestions for a web based universal exchange and inference language for medicine. Continuity of patient care with PCAST disaggregation. Comput. Biol. Med. (2014)
B. Robson. POPPER, a simple programming language for probabilistic semantic inference in medicine. Comput. Biol. Med. (2014)
B. Robson et al. Interesting things for computer systems to do: keeping and data mining millions of patient records, guiding patients and physicians, and passing medical licensing exams. Bioinformatics and Biomedicine (BIBM)
B. Robson et al. Data-mining to build a knowledge representation store for clinical decision support. Studies on curation and validation based on machine performance in multiple choice medical licensing examinations. Comput. Biol. Med. (2015)
B. Robson et al. Studies of the role of a smart web for precision medicine supported by biobanking, personalized medicine, FTG. Pers. Med. (2016)
B. Robson et al. Methods and Systems of a Hyperbolic-Dirac-Net-Based Bioingine Platform and Ensemble of Applications (2017)
J. Pearl. Probabilistic Reasoning in Intelligent Systems (1988)
E.R. Harold et al. XML in a Nutshell (2004)
Position statement from the workshop on RDF as a universal healthcare exchange language held at the 2013 Semantic Technology and Business Conference, San Francisco. Yosemite Manifesto on RDF as a universal healthcare exchange language
P.A.M. Dirac. The Principles of Quantum Mechanics (1958)
J. Bircher et al. Applying a complex adaptive system's understanding of health to primary care. F1000 Res. (2016)
J.E. Stiglitz. The rigged equality. Sci. Am. (2018)
R.M. Sapolsky. The health-wealth gap. Sci. Am. (2018)
V. Eubanks. Automating bias (how algorithms designed to alleviate poverty can perpetuate it instead) (2018)
J.K. Boyce. The environmental cost of inequality (2018)
Organization for Economic Collaboration And Development. Educational Opportunity for All (2017)
The Legatum Institute
E. Durkheim. The Division of Labour in Society (1997)
J.J. Gerber et al. Sociology (2010)
B. Robson. Analysis of the code relating sequence to conformation in globular proteins: theory and application of expected information. Biochem. J. (1974)
The GOR Method
B. Robson. Clinical and pharmacogenomic data mining: 3. Zeta theory as a general tactic for clinical bioinformatics. J. Proteome Res. (2005)
K. Popper. The Logic of Scientific Discovery (1934)


☆ This paper is provided to the community to promote the more general applications of the thinking of Professor Paul A. M. Dirac in human and animal medicine in accordance with the charter of The Dirac Foundation, to emphasize the advantages and simplicity of the basic form of the Hyperbolic Dirac Net, to encourage its use, and to propose at least some of the principles of the associated Q-UEL, a universal exchange language for medicine, as a basis for a standard for interoperability. These mathematical and engineering principles are used, amongst many others in an integrated way, in the algorithms and internal architectural features of the BioIngine.com, a distributed system developed by Ingine Inc. VA for the mining of, and inference from, Very Big Data for commercial purposes.
