Toward semantic data imputation for a dengue dataset
Introduction
Two key technologies support decision-making over collected data: the data warehouse [1] and data mining [2]. These technologies are widely used in public health and medicine [3], [4], for example in forecasting dengue outbreaks [5], [6], [7]. However, accurate forecasts depend on efficient data preprocessing [8]. When high-quality input data are available, the forecasting results are more accurate and far more useful to decision-makers for making well-founded decisions. However, data quality is often compromised and contaminated when data are imported from several sources [9] containing missing values, heterogeneous terms, illegal values, and misspellings. These problems directly affect forecasting performance, as in the dengue-epidemic case referenced above.
Missing values [10] occur for various reasons: the reluctance of data providers to disclose information, user data-entry errors, data corruption in transmission or storage, or the inability of data collectors to reach the sample population because a flood or earthquake has isolated them. Where text data are provided, the problem of misspelling arises, as do synonyms and heterogeneous terms. The lack of standardized terminology and the use of localized acronyms and abbreviations make inputs from several sources ambiguous and misleading. The lack of validation rules, such as valid age ranges, allows unreasonable, improbable, or impossible values, such as a patient's age being recorded as "950". All of these errors affect data analysis and processing. Of the various data problems that can be experienced, missing data are considered the most difficult to overcome, and attempts to supply a replacement value are usually inaccurate to some extent.
Missing data values are the main focus of our research reported here. The term “missing data” refers to missing values in a record set, table, or database used in a particular research activity. Nine out of ten research projects have encountered missing data [10], but only 21% have been able to successfully overcome the problem. Given that providing valid substitute values for missing values is an important part of effective modeling, addressing this data loss problem is imperative.
In dengue forecasting, a large amount of missing data significantly compromises the accuracy of the forecast model [11]. The importance of this problem has given rise to several techniques for overcoming missing data, which has also been the subject of research in other disciplines, including statistics and mathematics [12].
One approach to addressing missing data is to delete records or datasets that have missing data. These methods are known as Listwise Data Deletion [13] and Pairwise Data Deletion [14]. A more sophisticated approach is to apply statistical methods to predict missing data. Software products such as Weka use the mean or mode of the data to impute the missing values [13], [15], [16], [17]. The regression equation is used by Doner [18] to estimate lost values, whereas Andridge and Little [19] use the Hot Deck Imputation method. Other approaches to predicting missing data can be categorized as machine learning: Neural Networks [20], [21], Genetic Algorithms (GA) [22], [23], the k-Nearest Neighbor (k-NN) method [24], and Particle Swarm Optimization (PSO) [25], [26].
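The simplest of these strategies, listwise deletion and mean/mode imputation, can be sketched in a few lines. The table and column names below are hypothetical toy data, not taken from the dengue dataset:

```python
import numpy as np
import pandas as pd

# Toy records with gaps (hypothetical columns, illustrative values only).
df = pd.DataFrame({
    "district": ["A", "B", None, "A"],
    "avg_rainfall": [120.5, np.nan, 98.0, 101.5],
    "patients": [14, 9, np.nan, 22],
})

# Listwise deletion: drop every record that contains any missing value.
listwise = df.dropna()

# Statistical imputation: mean for numerical columns, mode for nominal ones.
imputed = df.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        imputed[col] = imputed[col].fillna(imputed[col].mean())
    else:
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])
```

Listwise deletion discards two of the four records here, which illustrates why deletion is wasteful when missingness is common; mean/mode imputation keeps all records but ignores any relationships among attributes.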
While almost all prior research has applied these techniques to missing numerical data, they do not handle nominal data well. Our main hypothesis is that an ontology model can be used as a knowledge base to determine appropriate replacement values for both types of data. It is essential to consider the semantics, that is, the relationships among the data, which statistical approaches do not capture well.
The development of an ontology is a promising technique that has gained popularity in several research areas in recent years because of its capability for reasoning and inference over data. We applied an ontology model to address both numerical and nominal missing data in a data warehouse of dengue information. The novelty of this research is threefold. First, we overcome the numerical missing value problem by exploiting Particle Swarm Optimization (PSO), incorporating the k-means technique to enhance the performance of PSO through the fitness value computation. Second, the ontology model works as a knowledge base to filter out information irrelevant to PSO, so that PSO converges on the missing value faster and predicts numerical missing values more accurately. Third, distinct from existing frameworks, our unified framework can also tackle nominal missing data using the relationships of the concepts and the reasoning mechanism of the knowledge-based model.
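As a rough illustration of the first contribution, the sketch below imputes one missing numerical value with PSO, scoring each candidate by the distance from the completed record to a k-means centroid. The data, parameter values, and the exact fitness definition are illustrative assumptions, not the authors' implementation (which additionally uses the ontology to pre-filter the records PSO searches over):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative complete records (feature 0: temperature, feature 1: rainfall;
# hypothetical values, not the paper's dataset).
data = np.array([[25.0, 120.0], [26.0, 118.0], [30.0, 95.0], [31.0, 97.0]])
incomplete = np.array([25.5, np.nan])   # feature 1 is missing

# --- k-means (k = 2) on the complete records, plain Lloyd iterations ---
k = 2
centroids = data[rng.choice(len(data), size=k, replace=False)]
for _ in range(20):
    labels = np.argmin(np.linalg.norm(data[:, None] - centroids, axis=2), axis=1)
    centroids = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                          else centroids[j] for j in range(k)])

# Assign the incomplete record to a cluster using its observed feature only;
# a candidate replacement is scored by distance to that cluster's centroid.
cluster = int(np.argmin(np.abs(centroids[:, 0] - incomplete[0])))

def fitness(candidate):
    record = np.array([incomplete[0], candidate])
    return float(np.linalg.norm(record - centroids[cluster]))

# --- standard global-best PSO over the scalar candidate value ---
n, w, c1, c2 = 10, 0.7, 1.5, 1.5
pos = rng.uniform(data[:, 1].min(), data[:, 1].max(), n)
vel = np.zeros(n)
pbest = pos.copy()
pbest_f = np.array([fitness(p) for p in pos])
gbest = pbest[np.argmin(pbest_f)]
for _ in range(50):
    r1, r2 = rng.random(n), rng.random(n)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    f = np.array([fitness(p) for p in pos])
    better = f < pbest_f
    pbest[better], pbest_f[better] = pos[better], f[better]
    gbest = pbest[np.argmin(pbest_f)]

# gbest now holds the imputed rainfall value
```

The k-means step is what makes the fitness landscape informative: the swarm is pulled toward the cluster whose observed attributes most resemble the incomplete record, rather than toward the global mean.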
Section snippets
Literature review
A knowledge-base model has been used to assist data analytic tasks in data preprocessing [27], [28], [29], [30], data modeling [31], [32], and data postprocessing [33]. Although ontologies are used in several data analytic processes, this article focuses only on the use of an ontology to assist the data preprocessing phase, especially the data cleansing process. Given this focus, the following section cites some of the research that is relevant to the data preprocessing phase discussed in …
Hypotheses
Machine learning methods have been used to find optimal solutions for many problems, one of which is conducting broad searches over extensive data, the purpose for which PSO was designed. We consider PSO a potentially advantageous technique for estimating missing data, one that can provide a more accurate answer with a lower tolerance for error.
Optimization methods are more effective in estimating missing data than conventional methods, which has been proven by …
Semantic data imputation approach
Every dataset must be cleaned for internal consistency and for further use in various ways. This criterion is especially important for medical datasets. Inspired by the challenges of the missing data problem and recent advances in ontology research, we propose a novel technique to tackle it. Typically, a dataset contains both numerical and nominal (also known as categorical) data. The following discussion is divided into two subsections: (1) a technique for …
Experimental design
Epidemiological data from 5760 records of Thai dengue incidence reported over the six years from 2010 to 2015 were analyzed. The data were collected monthly from several districts of three provinces in the western region of Thailand, namely Ratchaburi, Nakhon Pathom, and Samut Sakhon. The data of particular interest included the region, province, and district, average rainfall, average temperature, and number of patients. Data were collected from several data sources, such as the Parasitology …
PSO parameter optimization
The objective of the first experiment was to determine the optimized parameters for obtaining the lowest RMSE value when imputing the missing data. Typically, three parameters are important in PSO [66]: the inertia weight (w) and two acceleration constants (c1, c2). The inertia weight controls the effect exerted on the current velocity by the velocity of the prior iteration. A larger value of w improves the global search capability of the PSO algorithm, and a smaller value of w …
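For reference, these parameters appear in the standard PSO update rule, where r1 and r2 are random numbers drawn uniformly from [0, 1] at each iteration:

    v_i ← w·v_i + c1·r1·(pbest_i − x_i) + c2·r2·(gbest − x_i)
    x_i ← x_i + v_i

The inertia term w·v_i carries momentum from the previous iteration, while the c1 and c2 terms pull each particle toward its personal best position pbest_i and the swarm's global best position gbest, respectively.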
Conclusions
We proposed a novel architecture for data cleansing that uses the PSO algorithm while incorporating the k-means and ontology models to predict missing data. Conventional methods usually replace missing data with the average of all available data, which results in a high RMSE. Our framework provides multimodal information support. In PSO, the fitness value is calculated by a collective search process. The work presented in this paper is an extension of our previous work. In [71], we reported on our …
CRediT authorship contribution statement
N. Kamkhad: Writing - original draft, Formal analysis, Methodology, Validation. K. Jampachaisri: Writing - original draft. P. Siriyasatien: Data curation. K. Kesorn: Writing - review & editing, Writing - original draft, Formal analysis, Conceptualization, Funding acquisition, Supervision, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research was supported by the Computer Science and Information Technology Department, Science Faculty, Naresuan University, Thailand (Grant no. R2562E027) and by the National Research Council of Thailand and the Health Systems Research Institute, Thailand (Grant no. 63-017). We also acknowledge Mr. Roy I. Morien of the Division of Research Administration at NU and Elsevier Language Editing Services for editing and checking the English grammar and expression in this paper. The …
References (71)
- et al., Review: A gentle introduction to imputation of missing values, J. Clin. Epidemiol. (2006)
- et al., K-nearest neighbors with mutual information for simultaneous classification and missing data imputation, Neurocomputing (2009)
- et al., Enrichment of association rules through exploitation of ontology properties – Healthcare case study, Procedia Comput. Sci. (2017)
- et al., Ontology knowledge mining based association rules ranking, Procedia Comput. Sci. (2016)
- et al., Dynamic L-RNN recovery of missing data in IoMT applications, Future Gener. Comput. Syst. (2018)
- et al., A multi-modal incompleteness ontology model (MMIO) to enhance information fusion for image retrieval, Inf. Fusion (2014)
- et al., Noisy data elimination using mutual k-nearest neighbor for classification mining, J. Syst. Softw. (2012)
- et al., K-means clustering with outlier removal, Pattern Recognit. Lett. (2017)
- et al., Particle swarm optimization (PSO). A tutorial, Chemometr. Intell. Lab. Syst. (2015)
- et al., Present and future directions in data warehousing, SIGMIS Database (1998)
- Data mining with big data, IEEE Trans. Knowl. Data Eng.
- Forecasting dengue haemorrhagic fever cases in Southern Thailand using ARIMA models, Dengue Bull.
- Forecast of dengue incidence using temperature and rainfall, PLOS Negl. Trop. Dis.
- Morbidity rate prediction of dengue hemorrhagic fever (DHF) using the support vector machine and the Aedes aegypti infection rate in similar climates and geographical areas, PLoS One
- Educational Psychology
- Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals, Clin. Trials Lond. Engl.
- Computational Intelligence for Missing Data Imputation, Estimation, and Management: Knowledge Optimization Techniques
- Pairwise deletion for missing data in structural equation models: Nonpositive definite matrices, parameter estimates, goodness of fit, and adjusted sample sizes, Struct. Equ. Model.
- A novel framework for imputation of missing values in databases, IEEE Trans. Syst. Man Cybern. A
- Missing data imputation: Focusing on single imputation, Ann. Transl. Med.
- The relative effectiveness of procedures commonly used in multiple regression analysis for dealing with missing values, Am. Stat.
- A review of hot deck imputation for survey non-response, Int. Stat. Rev.
- Fuzzy min–max neural networks for categorical data: Application to missing data imputation, Neural Comput. Appl.
- Predicting missing attribute values using cooperative particle swarm optimization, J. Basic Appl. Sci. Res.
- A GS-MPSO-WKNN method for missing data imputation in wireless sensor networks monitoring manufacturing conditions, Trans. Inst. Meas. Control
- Identifying disease genes using machine learning and gene functional similarities, assessed through gene ontology, PLoS One
Cited by (7)
- Nearest neighbor imputation for categorical data by weighting of attributes (2022, Information Sciences)
- Virtual sensor-based imputed graph attention network for anomaly detection of equipment with incomplete data (2022, Journal of Manufacturing Systems). Citation excerpt: "A data imputation algorithm can cope with this problem by estimating and generating missing values based on observed values [30]. Although there are few algorithms dealing with incomplete data in the field of LRE anomaly detection, there are many methods based on incomplete data filling applied in many fields such as the medical domain [31–33], traffic flow [34–36], and multi-view learning [37–39]. In general, imputation methods can be categorized into statistics-based methods and deep learning-based methods."
- A critical review of real-time modelling of flood forecasting in urban drainage systems (2022, Journal of Hydrology). Citation excerpt: "However, the third one seems more efficient despite skewing the existing patterns recognised by original data (Aieb et al., 2019). While there are no clear guidelines for data imputation in the context of UDS's missing data, infilling gaps have been widely used for rainfall prediction or non-urbanised flood forecasting (Aires, 2020; Kamkhad et al., 2020). Specific methods used for infilling missing data include the simple mean value of available data (Anbarasan et al., 2020), data mining techniques such as the K-Nearest Neighbours method (Motta et al., 2021), and empirical regression methods (Kamwaga et al., 2018)."
- Intelligent approach to automated star-schema construction using a knowledge base (2021, Expert Systems with Applications). Citation excerpt: "Future works should include processes to validate data prior to loading it into the DW to ensure error-free input that would otherwise adversely affect the correctness of summarized and aggregated OLAP processes. For example, automatic detection of missing data and the addition of values (Kamkhad et al., 2020) has been recommended to enhance the quality of input data prior to feeding it into a data-mining algorithm. When input data are of high quality, mining techniques can effectively be used to discover hidden knowledge."
- A Novel Algorithm for Imputing the Missing Values in Incomplete Datasets (2022, Research Square)