Elsevier

Neurocomputing

Volume 458, 11 October 2021, Pages 701-713

Heterogeneous neural metric learning for spatio-temporal modeling of infectious diseases with incomplete data

https://doi.org/10.1016/j.neucom.2019.12.145

Abstract

Infectious disease data, which record the numbers of infection cases at different locations and times, are one of the most typical categories of spatio-temporal data and play an important role in infectious disease control and prevention. However, because of insufficient resources and manpower, observations and records of infection cases are inevitably missing at some locations and times, which hampers accurate risk assessment and timely disease control. Imputing the missing data is challenging because infectious disease diffusion can be caused and affected by many risk factors. To address these challenges, we develop a novel machine learning method, Heterogeneous Neural Metric Learning (HNML), that restores the integrity of case reporting data using both the incomplete reported cases and the underlying disease-related risk factors from heterogeneous data sources. We empirically validate the method on a representative infectious disease, malaria, and test it under three common real-life data missing patterns with different missing rates. By incorporating the disease-related risk factors as external resources through HNML, we demonstrate significant accuracy improvement over baseline and state-of-the-art inference methods in predicting unobserved malaria cases from incomplete reporting data. The results suggest that disease-related risk factors provide valuable information about the transmission patterns of infectious diseases and should be taken into account when implementing surveillance.

Introduction

Data with spatial and temporal attributes are common in many real-world applications, such as epidemiology [1], crowd flow [2], and air quality [3]. However, such spatio-temporal data are often incomplete for various reasons, such as the high cost of collection [4], sensor failure [5], or unstable transmission [6]. This incompleteness makes it difficult to monitor and analyze real-world spatio-temporal dynamics.

Infectious disease data, which record the numbers of infection cases at different locations and times, are one of the most typical categories of spatio-temporal data and play an important role in many public health applications, such as hotspot detection [7], infectious disease risk forecasting [8], and transmission network mining [9]. However, because of insufficient resources and manpower, observations and records of infection cases are inevitably missing at some locations and times. Take malaria, one of the most serious infectious diseases, as an example. To eliminate malaria, the WHO has called on countries to establish nation-wide epidemiological intelligence strategies that support effective surveillance for the early detection and prevention of malaria [10], which requires many experienced public health workers. However, human resources are severely insufficient, particularly in remote and poor regions [11], so infection case data are missing at some locations and times. The consequences of incomplete infectious disease data can be serious, because missing data hamper accurate risk assessment and timely disease control.

Moreover, the dynamics of infectious disease diffusion are complex, because diffusion can be caused and affected by many risk factors. This makes it difficult to impute missing infectious disease data from the partially observed case data alone. For instance, the transmission risk for malaria depends on the per capita mosquito density, which is closely related to environmental factors such as temperature and rainfall. Empirical studies have also shown that population mobility patterns (and thus the routes by which malaria is transmitted) are related to the geographical distances between locations and to socioeconomic factors [12], [13].

To efficiently implement the surveillance strategy, officials need to ensure real-time reporting of case data [14]. The persistent problem of incomplete case reporting data and the complexity of disease-related risk factors can be formulated as follows: Given the incomplete case reporting data, how can the heterogeneous data sources for complex disease-related risk factors be used to effectively restore the integrity of case reporting data, i.e., estimate missing values, such that the number of infection cases in unobserved time periods and locations can be accurately inferred?

Often, in order to restore the integrity of the case reporting data, a spatio-temporal modeling strategy would be adopted to accommodate the disease infections across both time and geographical locations by reflecting complex dual-dimensional correlations. There are some classical machine learning methods that capture the spatial, temporal, or spatio-temporal correlations of data, such as Kriging [15] (spatial method), Gaussian process (GP) [16] (temporal method), and K-nearest-neighbor (KNN) imputation [17] (spatio-temporal method). However, these may not be sufficient for restoring the integrity of case reporting data in infectious diseases, as they only consider the target variable itself (in our scenario, this would be the number of reported cases), while ignoring disease-related risk factors, which have been shown to be closely related to the transmission and spread of infectious diseases and should be taken into consideration.
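To make the limitation concrete, the following is a minimal sketch of a nearest-neighbor style imputation that uses only the target variable. It is a simplified illustration, not the exact algorithm of [17]: the function name `knn_impute`, the locations-by-time layout with `None` for missing entries, and the mean-absolute-difference distance are all assumptions made for this example.

```python
from math import inf


def knn_impute(cases, k=2):
    """Fill None entries of a locations-by-time matrix using k similar locations.

    Hypothetical helper for illustration only; it looks at case counts alone
    and ignores any disease-related risk factors.
    """
    n_loc = len(cases)
    n_time = len(cases[0])

    def distance(a, b):
        # Mean absolute difference over time steps observed at both locations.
        shared = [(x, y) for x, y in zip(cases[a], cases[b])
                  if x is not None and y is not None]
        if not shared:
            return inf
        return sum(abs(x - y) for x, y in shared) / len(shared)

    filled = [row[:] for row in cases]
    for i in range(n_loc):
        for t in range(n_time):
            if cases[i][t] is not None:
                continue
            # Candidate neighbours: other locations that were observed at time t.
            candidates = sorted(
                (distance(i, j), j) for j in range(n_loc)
                if j != i and cases[j][t] is not None
            )
            neighbours = [j for d, j in candidates[:k] if d < inf]
            if neighbours:
                # Average the neighbours' observed counts at time t.
                filled[i][t] = sum(cases[j][t] for j in neighbours) / len(neighbours)
    return filled


example = [[1, 2, None], [1, 2, 3], [10, 20, 30]]
filled_k1 = knn_impute(example, k=1)
filled_k2 = knn_impute(example, k=2)
```

Note that an entry stays `None` when no other location was observed at that time step, which is exactly the situation where information from outside the case counts would be needed.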

Several recent machine learning approaches take advantage of external information to help inference [2]. These methods are well-grounded and have performed well across various inference tasks. However, they require complete historical observations of the target variable (here again, the number of reported cases). Infectious disease surveillance data cannot satisfy this prerequisite, especially in hard-to-reach areas. Therefore, these approaches are not directly applicable to our task.

To restore the integrity of case reporting data using both the incomplete reported cases and the underlying disease-related risk factors, and thereby infer the number of infectious disease cases in unobserved locations, we develop a novel machine learning method dubbed Heterogeneous Neural Metric Learning (HNML). Unlike existing spatio-temporal methods for missing data estimation, which only model static spatio-temporal correlations for the target variable, our method recovers missing data using both the target variable and the underlying disease-related risk factors, thus making the estimation more reliable. Compared with other approaches that incorporate external information to help with inference, our method does not make a strong assumption about the completeness of historical data, which makes it more useful in the practical setting under consideration.

We empirically validate the effectiveness of our machine learning solution on a representative infectious disease, malaria. We use the 2005–2009 malaria case reporting data collected from the malaria-endemic China-Myanmar border region. To systematically evaluate the performance of our method, we test it under three data missing patterns (spatial missing, temporal missing, and spatio-temporal missing) resulting from three common surveillance strategies, with missing rates ranging from 10% to 50%. We also compare our method with existing inference methods, including both classical and state-of-the-art methods. The results demonstrate that our method infers the unobserved malaria cases with higher accuracy, indicating its effectiveness in restoring the integrity of incomplete case reporting data.
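As an illustration of what such an evaluation setup might look like, the sketch below generates missing-data masks for the three patterns at a given missing rate. The precise definitions of the patterns are not given in this excerpt, so the interpretation here is an assumption: spatial missing hides whole locations, temporal missing hides whole time steps, and spatio-temporal missing hides individual entries at random. The function name `make_mask` is hypothetical.

```python
import random


def make_mask(n_loc, n_time, rate, pattern, seed=0):
    """Return a locations-by-time matrix of booleans; True marks a hidden entry."""
    rng = random.Random(seed)
    mask = [[False] * n_time for _ in range(n_loc)]
    if pattern == "spatial":
        # Assumed: whole locations go unreported for the entire period.
        for i in rng.sample(range(n_loc), round(rate * n_loc)):
            mask[i] = [True] * n_time
    elif pattern == "temporal":
        # Assumed: whole time steps go unreported at every location.
        for t in rng.sample(range(n_time), round(rate * n_time)):
            for row in mask:
                row[t] = True
    elif pattern == "spatio-temporal":
        # Assumed: individual (location, time) entries are dropped at random.
        cells = [(i, t) for i in range(n_loc) for t in range(n_time)]
        for i, t in rng.sample(cells, round(rate * len(cells))):
            mask[i][t] = True
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    return mask


spatial_mask = make_mask(10, 20, 0.3, "spatial")
temporal_mask = make_mask(10, 20, 0.3, "temporal")
mixed_mask = make_mask(10, 20, 0.3, "spatio-temporal")
```

Hiding entries of fully observed data in this way keeps the ground truth available, so inferred values can be compared against the hidden ones at each missing rate.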

Note that in this paper, we use the word “heterogeneous” to emphasize four unique characteristics of the spatio-temporal modeling of infectious diseases with incomplete data:

  1. Various intrinsic properties. For example, in malaria transmission modeling, the temperature and rainfall datasets describe environmental properties, while the socioeconomic dataset characterizes human activities.

  2. Various roles in shaping epidemiological dynamics. Temperature and rainfall play a role in vector reproduction, which triggers epidemiological transmission within one location, while socioeconomic activity determines inter-location transmission.

  3. Various availabilities. The temperature dataset is generally easy to obtain from satellite remote sensing, which covers a large spatial region at high resolution. However, the socioeconomic dataset is recorded manually and is difficult to collect, especially in remote areas.

  4. Various spatio-temporal resolutions. The temperature dataset has a spatial resolution of 1 kilometer (km) and a daily temporal resolution, whereas the socioeconomic dataset is collected at town-level spatial resolution and annual temporal resolution.

Such distinct characteristics in the heterogeneous data sources make the imputation task quite challenging.
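The resolution mismatch alone already requires a preprocessing decision before sources can be combined. The sketch below shows one simple way to bring fine-grained daily grid readings onto a coarser town-level grid by averaging. It is an assumption for illustration, not the alignment procedure used by HNML: the function name `aggregate_to_town_month`, the record layout, the grid-cell-to-town mapping, and the choice of a monthly target period are all hypothetical.

```python
from collections import defaultdict


def aggregate_to_town_month(daily_records, cell_to_town):
    """Average daily grid-cell readings into one value per (town, month).

    daily_records: iterable of (cell_id, iso_date, value) tuples, where
    iso_date is a "YYYY-MM-DD" string.
    cell_to_town: dict mapping each grid cell id to a town name.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cell_id, iso_date, value in daily_records:
        key = (cell_to_town[cell_id], iso_date[:7])  # "YYYY-MM" prefix
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


records = [
    ("c1", "2005-01-01", 20.0),
    ("c2", "2005-01-02", 22.0),
    ("c3", "2005-02-01", 25.0),
]
towns = {"c1": "A", "c2": "A", "c3": "B"}
monthly = aggregate_to_town_month(records, towns)
```

Averaging discards within-town and within-month variation, which is one reason why combining sources with different resolutions is not a purely mechanical step.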

Our contributions in this paper can be summarized as follows.

  1. First, we develop a machine learning method to conduct data imputation in scenarios featuring a variety of missing data patterns by extracting information from underlying related factors. Unlike existing spatial, temporal, or spatio-temporal methods, our method takes advantage of additional information about underlying related factors and thus outperforms existing methods.

  2. Second, we propose to take underlying disease-related risk factors into account when imputing missing values and inferring unobserved cases of infection using heterogeneous disease-related data sources. Specifically, we incorporate environmental factors (temperature and rainfall), geographical factors (latitude and longitude), and 22 socioeconomic factors collected from a multitude of data sources into our disease transmission model.

  3. Third, we empirically evaluate our method’s performance against ground truth based on the 2005–2009 malaria case reporting data collected from the malaria-endemic China-Myanmar border region, under three common real-life data missing patterns (spatial missing, temporal missing, and spatio-temporal missing) with different missing rates (10%–50%). The results show that our method provides more accurate inferences than existing methods. The results also suggest that by appropriately incorporating underlying disease-related risk factors, extra information on malaria transmission patterns can be obtained and inference accuracy can be enhanced. The results thus provide a scientific foundation for public health authorities to implement real-time surveillance for malaria elimination, and our approach could also be a useful framework for other applications with similar spatio-temporal missing scenarios and heterogeneous data sources.

Section snippets

Related work

By modeling spatial correlations in the data, spatial methods, such as the Kriging [15] and inverse distance weighting [18] methods, map the propagation of information across geographical space, and use the mapping of spatial correlations to recover missing information. Using an entirely different assumption, temporal methods, such as the Gaussian process (GP) [16] and auto regressive moving average (ARMA) [19] methods, ignore the horizontal propagation of information across geographical space,

Proposed method

To restore the integrity of case reporting data using both the incomplete reported cases and the underlying disease-related risk factors to infer the missing numbers of infection cases, we develop a machine learning method called Heterogeneous Neural Metric Learning (HNML). Fig. 1 illustrates the idea behind our method. As shown in Fig. 1(a), given incomplete historical data, HNML integrates different disease-related risk factors from heterogeneous data sources, such as environmental,

Synthetic experiments

We evaluate the performance of our method on a systematically designed synthetic dataset.

An empirical study in Yunnan province

We empirically validate the effectiveness of our machine learning solution using the 2005–2009 malaria case reporting data collected from Yunnan, a malaria-endemic province located in the China-Myanmar border region (as shown in Fig. 3). We choose this setting because malaria is one of the most serious infectious diseases and Yunnan province is a typical region in the phase toward malaria elimination.

Conclusion and future work

In this study, we developed a machine learning method for making inferences based on incomplete data to support decision-making in infectious disease control and prevention. To solve this challenging yet important problem, we presented a machine learning method that incorporates location-specific attributes of underlying disease-related risk factors collected from heterogeneous data sources. To evaluate the performance of our method, we conducted experiments based on a real-world dataset, i.e.,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to acknowledge the funding support from Hong Kong Research Grants Council (RGC/HKBU12201318, RGC/HKBU12201619, RGC/HKBU12202220) for the research work being presented in this article.

References (42)

  • S.K. Greene et al., Daily reportable disease spatiotemporal cluster detection, New York City, New York, USA, 2014–2015, Emerging Infectious Diseases (2016)
  • R. Senanayake et al., Predicting spatio-temporal propagation of seasonal influenza using variational Gaussian process regression
  • B. Shi et al., Inferring Plasmodium vivax transmission networks from tempo-spatial surveillance data, PLoS Neglected Tropical Diseases (2014)
  • World Health Organization, Disease surveillance for malaria elimination: operational manual,...
  • B. Yang, H. Guo, Y. Yang, B. Shi, X. Zhou, J. Liu, Modeling and mining spatiotemporal patterns of infection risk from...
  • L. Fumanelli et al., Inferring the structure of social contacts from demographic data in the analysis of infectious diseases spread, PLoS Computational Biology (2012)
  • F. Simini et al., A universal model for mobility and migration patterns, Nature (2012)
  • J. Wang et al., Gaussian process dynamical models
  • L. Beretta et al., Nearest neighbor imputation algorithms: a critical evaluation, BMC Medical Informatics and Decision Making (2016)
  • S. Banerjee et al., Hierarchical Modeling and Analysis for Spatial Data (2014)
  • R.J.M. Buendia et al., A disease outbreak detection system using autoregressive moving average in time series analysis

    Qi Tan received the B.S. from the School of Computer Science and Engineering, South China University of Technology, in 2014. He is pursuing the Ph.D. degree in the Department of Computer Science, Hong Kong Baptist University. His research interests include spatio-temporal data mining and social network analysis with applications for health informatics.

    Yang Liu received the B.S. and M.S. degrees in Automation from National University of Defense Technology in 2004 and 2007, respectively. He received the Ph.D. degree in Computing from The Hong Kong Polytechnic University in 2011. Between 2011 and 2012, he was a Postdoctoral Research Associate in the Department of Statistics at Yale University. Dr. Liu is currently an Assistant Professor in the Department of Computer Science at Hong Kong Baptist University. His research interests include artificial intelligence, machine learning, as well as their applications in high-dimensional data mining, complex network analysis, and infectious disease modeling.

    Jiming Liu received the MEng and PhD degrees from McGill University. He is currently the chair professor of computer science at Hong Kong Baptist University. His research interests include data analytics, machine learning, complex network analytics, data-driven complex systems modeling, and health informatics. He has served as the editor-in-chief of the Web Intelligence Journal (IOS), and an associate editor of the Big Data and Information Analytics (AIMS), the IEEE Transactions on Knowledge and Data Engineering, the IEEE Transactions on Cybernetics, Neuroscience and Biomedical Engineering (Bentham), and Computational Intelligence (Wiley), among others. He is a fellow of the IEEE.

    Benyun Shi received the B.Sc. degree in Mathematics from Hohai University, Nanjing, China, in 2003, and the M.Phil. and Ph.D. degrees in Computer Science from Hong Kong Baptist University in 2008 and 2012, respectively. He is currently a professor in the School of Computer Science and Technology, Nanjing Tech University. His research interests include Multi-agent Autonomy-Oriented Computing (AOC), Real-world Complex Systems Modeling, and Complex Networks, particularly for Energy Distribution Systems and Infectious Disease Epidemiology.

    Shang Xia received the PhD degree in computer science from Hong Kong Baptist University. He is currently an associate professor in the National Institute of Parasitic Diseases, Chinese CDC. His research interests include computational epidemiology and health informatics.

    Xiao-Nong Zhou obtained his PhD in Biology at University of Copenhagen, Denmark in 1994, following his MSc in Medical Parasitology from Jiangsu Institute of Parasitic Diseases in China. He is a professor and the director of the National Institute of Parasitic Diseases at the Chinese Center for Disease Control and Prevention, based in Shanghai, China. He is a leading expert in the research and control of schistosomiasis and other infectious diseases, with over 30 years’ experience in the field. Professor Zhou established a career in infectious disease research across the fields of ecology, population biology, epidemiology, and malacology.
