Toward semantic data imputation for a dengue dataset

https://doi.org/10.1016/j.knosys.2020.105803

Abstract

Missing data are a major problem affecting data analysis techniques for forecasting. Traditional methods perform poorly when predicting missing values with simple techniques such as the mean and mode. In this paper, we present and discuss a novel method of imputing missing values semantically with the use of an ontology model. We make three new contributions to the field. First, we improve the efficiency of predicting missing numerical data using Particle Swarm Optimization (PSO), applied to the numerical data cleansing problem, with the performance of PSO enhanced by using K-means to help determine the fitness value. Second, we incorporate an ontology with PSO to narrow the search space, so that PSO predicts numerical missing values more accurately while converging on the answer more quickly. Third, we provide a framework that substitutes nominal data lost from the dataset using the relationships among concepts and the reasoning mechanism of the knowledge-based model. The experimental results indicated that the proposed method could estimate missing data more efficiently and with less chance of error than conventional methods, as measured by the root mean square error (RMSE).

Introduction

Two key technologies support decision-making based on collected data: the data warehouse [1] and data mining [2]. Such technologies are widely used in public health and medicine [3], [4] for significant tasks such as forecasting dengue outbreaks [5], [6], [7]. However, accurate forecasts are available only through efficient data preprocessing [8]. When high-quality input data are available, the forecasting results are more accurate and far more useful to decision-makers in reaching well-founded, correct decisions. However, data quality is often compromised and contaminated when importing from several data sources [9] that contain missing values, heterogeneous terms, illegal values, and misspellings. These problems directly affect forecasting performance, as is the case for the dengue epidemic referenced above.

Missing values [10] occur for various reasons: the reluctance of data providers to disclose information, user data entry errors, data corruption in transmission or storage, or the inability of data collectors to access the sample population due to a flood or earthquake disaster isolating them. Where text data are provided, the problem of misspelling arises, as well as synonyms and heterogeneous terms. The lack of standardized terms and terminology and the use of localized acronyms and abbreviations result in the inputs from several sources and resources being ambiguous and misleading. The lack of validation data, such as valid age ranges, results in unreasonable and improbable or impossible values, such as a patient’s age being recorded as “950”. All of these errors influence data analysis and processing. Of the various data problems that can be experienced, the missing data problem is considered to be the most difficult one to overcome, and attempts at providing a data value are usually inaccurate to some extent.
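The illegal-value problem described above, such as a patient's age recorded as "950", can be caught with simple range validation before imputation is attempted. As a minimal illustration (the field names and ranges here are hypothetical, not taken from the paper's dataset):

```python
# Hypothetical valid ranges; real ranges would come from domain experts.
VALID_RANGES = {"age": (0, 120), "avg_temp_c": (-10.0, 60.0)}

def flag_invalid(record: dict) -> list:
    """Return the names of fields whose values fall outside their valid range."""
    bad = []
    for field, (lo, hi) in VALID_RANGES.items():
        value = record.get(field)
        if value is not None and not (lo <= value <= hi):
            bad.append(field)
    return bad

print(flag_invalid({"age": 950, "avg_temp_c": 31.5}))  # ['age']
```

Flagged values can then be treated as missing and passed to an imputation method rather than silently skewing the analysis.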

Missing data values are the main focus of our research reported here. The term “missing data” refers to missing values in a record set, table, or database used in a particular research activity. Nine out of ten research projects have encountered missing data [10], but only 21% have been able to successfully overcome the problem. Given that providing valid substitute values for missing values is an important part of effective modeling, addressing this data loss problem is imperative.

In dengue forecasting, a large amount of missing data significantly compromises the accuracy of the forecast model [11]. The importance of this problem has given rise to several techniques for overcoming missing data, which has also been the subject of research in other disciplines, including statistics and mathematics [12].

One approach to addressing missing data is to delete the records or datasets that contain it; these methods are known as Listwise Data Deletion [13] and Pairwise Data Deletion [14]. A more sophisticated approach is to apply statistical methods to predict the missing data. Software products such as Weka impute missing values with the mean or mode of the data [13], [15], [16], [17]. Donner [18] uses a regression equation to estimate lost values, whereas Andridge and Little [19] use the Hot Deck Imputation method. Other approaches to predicting missing data fall under machine learning: neural networks [20], [21], Genetic Algorithms (GA) [22], [23], the k-Nearest Neighbor (k-NN) method [24], and PSO [25], [26].
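The mean and mode substitutions used by tools such as Weka serve as the conventional baseline against which the proposed method is compared. A minimal sketch of that baseline (illustrative only, not the authors' implementation):

```python
from statistics import mean, mode

def impute_mean(values):
    """Replace None with the mean of the observed numerical values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace None with the most frequent observed nominal value."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]

print(impute_mean([120.0, None, 80.0, 100.0]))        # [120.0, 100.0, 80.0, 100.0]
print(impute_mode(["urban", None, "urban", "rural"]))  # ['urban', 'urban', 'urban', 'rural']
```

Because every gap in a column receives the same fill value, these baselines ignore the relationships among attributes, which is precisely the weakness the semantic approach targets.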

While almost all research has applied these techniques to missing numerical data, they do not handle nominal data well. Our main hypothesis is that an ontology model can be used as a knowledge base to determine appropriate replacement values for both types of data. It is essential to consider the semantics, or relationships among the data, which statistical approaches do not capture well.

The development of an ontology is a promising technique that has gained popularity in several research areas in recent years because of its capability for reasoning and inference over data. We applied an ontology model to address both numerical and nominal missing data in a data warehouse of dengue information. The novelty of this research is threefold. First, we overcome the numerical missing-value problem by exploiting Particle Swarm Optimization, incorporating the K-means technique to enhance the performance of PSO through the fitness-value computation. Second, the ontology model works as a knowledge base that filters out irrelevant information for PSO, so that PSO converges on the missing value faster and predicts numerical missing values more accurately. Third, distinct from existing frameworks, our unified framework can also tackle nominal missing data using the relationships among concepts and the reasoning mechanism of the knowledge-based model.

Section snippets

Literature review

A knowledge-base model has been used to assist the data analytic task in data preprocessing [27], [28], [29], [30], data modeling [31], [32], and data postprocessing [33]. Although ontologies are used in several data analytic processes, this article focuses on only the use of an ontology to assist in the data preprocessing phase, especially the data cleansing process. Given this focus, the following section cites some of the research that is relevant to the data preprocessing phase discussed in

Hypotheses

Machine learning methodology has been used to find optimal solutions for many problems, one of which is undertaking broad searches of extensive data, which is the purpose of PSO. We consider PSO a potentially advantageous technique for estimating missing data, one that can provide a more accurate answer with smaller error.

Optimization methods are more effective in estimating missing data than conventional methods, which has been proven by

Semantic data imputation approach

Every dataset must be cleaned for internal consistency and for further use in various ways. This criterion is especially important for medical datasets. Inspired by the challenges in the missing data problem and the recent advances in ontology research, we propose a novel technique to tackle the missing data problem. Typically, a dataset contains both numerical and nominal data, also known as categorical data. The following discussion is divided into two subsections: (1) a technique for

Experimental design

Epidemiological data from 5760 records of Thai dengue incidence reported over the 6 years between 2010 and 2015 were analyzed. The data were collected monthly from several districts of three provinces in the western region of Thailand, namely, Ratchaburi, Nakhon Pathom and Samut Sakhon. The data of particular interest included the region, province and district, average rainfall, average temperature, and number of patients. Data were collected from several data sources, such as the Parasitology

PSO parameter optimization

The objective of the first experiment was to determine the optimized parameters for obtaining the lowest RMSE value to impute the missing data. Typically, there are three important parameters that are relevant to PSO [66]: the variable inertia weight (w) and two acceleration constants (c1, c2). The inertia weight controls the effect exerted on the current speed by the prior iteration speed. A larger value of w improves the global search capability of the PSO algorithm, and a smaller value of w
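The roles of w, c1, and c2 in the standard PSO velocity update can be sketched with a minimal one-dimensional example, where each particle is a candidate replacement value and the fitness is the squared distance to a cluster centroid (a stand-in for the K-means-based fitness described in the paper; the parameter values and the centroid are placeholders, not the tuned settings reported here):

```python
import random

random.seed(0)
W, C1, C2 = 0.7, 1.5, 1.5  # placeholder inertia weight and acceleration constants

def pso_minimize(fitness, lo, hi, n_particles=20, iters=100):
    """Minimal 1-D PSO: inertia, cognitive, and social terms drive each particle."""
    pos = [random.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                       # each particle's best position so far
    gbest = min(pbest, key=fitness)      # swarm's best position so far
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            vel[i] = (W * vel[i]                       # inertia: keep prior speed
                      + C1 * r1 * (pbest[i] - pos[i])  # cognitive: pull to own best
                      + C2 * r2 * (gbest - pos[i]))    # social: pull to swarm best
            pos[i] += vel[i]
            if fitness(pos[i]) < fitness(pbest[i]):
                pbest[i] = pos[i]
        gbest = min(pbest, key=fitness)
    return gbest

centroid = 28.4  # hypothetical K-means centroid of similar records
best = pso_minimize(lambda x: (x - centroid) ** 2, lo=20.0, hi=40.0)
```

With a larger W, particles retain more of their previous velocity and explore the search range more widely; smaller W speeds local convergence, which is the trade-off the parameter-tuning experiment measures via the RMSE.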

Conclusions

We proposed a novel architecture for data cleansing using the PSO algorithm while incorporating the K-means and ontology model to predict missing data. Conventional methods usually replace missing data with the average of all available data, which results in a high RMSE. Our framework provides multimodal information support. In PSO, the fitness value is calculated by a collective search process. The work presented in this paper is an extension of our previous work. In [71], we reported on our
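The RMSE criterion used throughout the evaluation is straightforward to state; a short sketch of the metric (illustrative, with made-up numbers):

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean square error between ground-truth and imputed values."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

print(rmse([10.0, 12.0], [11.0, 13.0]))  # 1.0
```

A lower RMSE over the held-out (artificially removed) values indicates that an imputation method reproduces the true data more closely.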

CRediT authorship contribution statement

N. Kamkhad: Writing - original draft, Formal analysis, Methodology, Validation. K. Jampachaisri: Writing - original draft. P. Siriyasatien: Data curation. K. Kesorn: Writing - review & editing, Writing - original draft, Formal analysis, Conceptualization, Funding acquisition, Supervision, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported by the Computer Science and Information Technology Department, Science Faculty, Naresuan University of Thailand (Grant no. R2562E027) and the National Research Council of Thailand and Health Systems Research Institute, Thailand (Grant no. 63-017). We also acknowledge the contributions of Mr. Roy I. Morien of the Division of Research Administration at NU and of Elsevier Language Editing Services for their editing and checking of the English grammar and expression in this paper. The

References (71)

  • Wu, X. et al.

    Data mining with big data

    IEEE Trans. Knowl. Data Eng.

    (2014)
  • J.C. Prather, D.F. Lobach, L.K. Goodwin, J.W. Hales, M.L. Hage, W.E. Hammond, Medical data mining: Knowledge discovery...
  • M. Shouman, T. Turner, R. Stocker, Using data mining techniques in heart disease diagnosis and treatment, in:...
  • Promprou, S. et al.

    Forecasting dengue haemorrhagic fever cases in Southern Thailand using ARIMA models

    Dengue Bull.

    (2006)
  • Hii, Y.L. et al.

    Forecast of dengue incidence using temperature and rainfall

    PLOS Negl. Trop. Dis.

    (2012)
  • Kesorn, K. et al.

    Morbidity rate prediction of dengue hemorrhagic fever (DHF) using the support vector machine and the Aedes aegypti infection rate in similar climates and geographical areas

    PLoS One

    (2015)
  • Klausmeier, H.J.

    Educational Psychology

    (1985)
  • D. Cherix, R. Usbeck, A. Both, J. Lehmann, The case of CROCUS: Cluster-based ontology data cleansing, in: Proceedings...
  • Wood, A.M. et al.

    Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals

    Clin. Trials Lond. Engl.

    (2004)
  • D. Dou, H. Wang, H. Liu, Semantic data mining: A survey of ontology-based approaches, in: Proceedings of the 2015 IEEE...
  • Tshilidzi, M.

    Computational Intelligence for Missing Data Imputation, Estimation, and Management: Knowledge Optimization Techniques

    (2009)
  • Marsh, H.W.

    Pairwise deletion for missing data in structural equation models: Nonpositive definite matrices, parameter estimates, goodness of fit, and adjusted sample sizes

    Struct. Equ. Model.

    (1998)
  • X. Feng, S. Wu, Y. Liu, Imputing missing values for mixed numeric and categorical attributes based on incomplete data...
  • Farhangfar, A. et al.

    A novel framework for imputation of missing values in databases

    IEEE Trans. Syst. Man Cybern. A

    (2007)
  • Zhang, Z.

    Missing data imputation: Focusing on single imputation

    Ann. Transl. Med.

    (2016)
  • Donner, A.

    The relative effectiveness of procedures commonly used in multiple regression analysis for dealing with missing values

    Am. Stat.

    (1982)
  • Andridge, R.R. et al.

    A review of hot deck imputation for survey non-response

    Int. Stat. Rev.

    (2010)
  • Rey-del-Castillo, P. et al.

    Fuzzy min–max neural networks for categorical data: Application to missing data imputation

    Neural Comput. Appl.

    (2011)
  • N.A. Setiawan, P.A. Venkatachalam, A.F.M. Hani, Missing data estimation on heart disease using artificial neural...
  • C.T. Tran, M. Zhang, P. Andreae, Multiple imputation for missing data using genetic programming, in: Proceedings of the...
  • C. Leke, B. Twala, T. Marwala, Modeling of missing data prediction: Computational intelligence and optimization...
  • Beirami, M.H.N. et al.

    Predicting missing attribute values using cooperative particle swarm optimization

    J. Basic Appl. Sci. Res.

    (2013)
  • Ni, J. et al.

    A GS-MPSO-WKNN method for missing data imputation in wireless sensor networks monitoring manufacturing conditions

    Trans. Inst. Meas. Control

    (2014)
  • Asif, M. et al.

    Identifying disease genes using machine learning and gene functional similarities, assessed through gene ontology

    PLoS One

    (2018)
  • S. Asadifar, M. Kahani, Semantic association rule mining: A new approach for stock market prediction, in: Proceedings...