Neurocomputing

Volume 218, 19 December 2016, Pages 17-25

Brief papers
Probabilistic neural network based categorical data imputation

https://doi.org/10.1016/j.neucom.2016.08.044

Abstract

Real-world datasets contain both numerical and categorical attributes, and missing values are often present in both. Missing data has to be imputed because inferences made from complete data are more accurate and reliable than those made from incomplete data [15]. Moreover, most data mining algorithms cannot work with incomplete datasets. This paper proposes a novel soft computing architecture for categorical data imputation. The proposed technique employs a Probabilistic Neural Network (PNN) preceded by mode imputation for filling in missing categorical data. Its effectiveness is tested on 4 benchmark datasets under a 10-fold cross-validation framework. All the datasets except Mushroom are complete; values removed from them at random are treated as missing. The performance of the proposed technique is compared with that of 3 statistical and 3 machine learning imputation methods. The comparison of the mode+PNN technique with mode, K-Nearest Neighbor (K-NN), Hot Deck (HD), Naive Bayes, Random Forest (RF) and J48 (Decision Tree) imputation demonstrates that the proposed method is effective, especially when the percentage of missing values is high, when records have more than one missing value and when the categorical variables have a large number of categories.

Introduction

Missing data is observed in almost all real-world datasets and forms a serious hurdle for many statistical analyses and data mining techniques. To obtain accurate inferences from data, the data should be complete. Imputation is the substitution of a missing data point, or a missing component of a data point, with the most plausible value. If a dataset contains missing values, they should be imputed before any further analysis is performed on the data. Missing data has two major negative effects [1]: (i) it reduces statistical power; (ii) it may result in biased estimates, i.e., the measures of central tendency and dispersion may be biased.

Also, analyses on complete data yield inferences that are more precise and dependable than those from incomplete data. Statistical and computational intelligence techniques for data mining tasks such as classification, regression, association rule mining and outlier analysis require accurate and complete data. Data imputation is of great use in such applications if the data contains missing values.

Almost all fields are replete with datasets having missing values. For instance, in surveys, data may be missing due to a variety of reasons such as errors in data entry, disclosure restrictions, failure to complete the entire questionnaire, absence of the respondent at the time of the survey, or because a question does not apply to an individual (e.g., questions regarding the years of marriage for a respondent who has never been married) [2]. In the geosciences, data items in observational data sets may be missing altogether, or they may be imprecise in one way or another [3]. Datasets for effort prediction in software project management contain missing values [4], as do geophysical time series datasets [5]. Equipment malfunction, outliers and incorrect data entry contribute to missing values in many practical observations [6]. Due to faults in the data acquisition process, data tend to be missing in environmental research data sets. In automatic speech recognition, speech samples that are corrupted by very high levels of noise are treated as missing data [7]. Datasets for business and financial applications may also contain missing data, and missing data problems are common in health research (e.g., retrospective and prospective studies). Longitudinal studies, which collect data on a set of subjects repeatedly over time, are afflicted by attrition; subjects drop out because they move, suffer side effects from drugs, or for other, often unknown, reasons. In biological research with DNA microarrays, gene data may be missing due to reasons such as a scratch on the slide containing the gene sample or contaminated samples [8].

The standard categorization of missing data mechanisms [9] distinguishes three cases: Missing Completely At Random (MCAR), Missing At Random (MAR) and Not Missing At Random (NMAR).

  • 1. Missing Completely At Random (MCAR). This occurs when the probability of an instance having a missing value on some variable X is independent both of the variable itself and of the values of any other variables in the dataset. Typical examples of MCAR are when the gender or phone number of a customer is missing in a customer database, when a tube containing a blood sample of a study subject is broken by accident, or when a questionnaire of a study subject is accidentally lost. Possible reasons for MCAR include manual data entry procedures, incorrect measurements, equipment error, changes in experimental design, etc.

  • 2. Missing At Random (MAR). This occurs when the probability of an instance having a missing value on some variable X depends on other variables in the database but not on the variable itself. For example, if the income level of a customer is missing, it can be estimated from other variables such as the customer's profession, age and qualification.

  • 3. Not Missing At Random (NMAR). This occurs when the probability of an instance having a missing value on some variable X depends on the variable itself. For instance, when citizens decline to participate in a survey for reasons related to the quantity being surveyed, NMAR occurs.
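The difference between MCAR and MAR can be illustrated by simulating the two mechanisms on a toy categorical column. This is an illustrative sketch only; the function names, variables and rates are hypothetical and not from the paper:

```python
# Sketch: simulating MCAR vs. MAR missingness on a toy categorical column.
# All names and rates here are hypothetical, for illustration only.
import random

def make_mcar(values, rate, rng):
    # MCAR: every entry is dropped with the same probability,
    # independent of the data itself and of any other variable.
    return [None if rng.random() < rate else v for v in values]

def make_mar(values, side_info, rng):
    # MAR: the drop probability depends on ANOTHER observed variable
    # (here, an age group), not on the value being dropped.
    return [None if rng.random() < (0.6 if s == "young" else 0.1) else v
            for v, s in zip(values, side_info)]

rng = random.Random(42)
income = ["low", "high"] * 50      # the variable that goes missing
age = ["young", "old"] * 50        # observed side information
mcar = make_mcar(income, 0.3, rng)
mar = make_mar(income, age, rng)
```

Under MAR, inspecting which rows lost their income value reveals a dependence on the observed age group; under MCAR, no such dependence exists.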

In real-life datasets, both numerical and categorical attributes contain missing values. A large body of literature is available for numerical data imputation. Several techniques based on statistical analysis have been reported for imputing incomplete or missing numerical data [10]. These methods include mean substitution, Hot Deck imputation, regression methods, expectation maximization and multiple imputation. Machine learning based methods include SOM [11], K-Nearest Neighbor [12], the multilayer perceptron [13], fuzzy-neural networks [14] and auto-associative neural network imputation with genetic algorithms [15].

Even though numerous studies have been reported on the imputation of numerical or continuous data, little research has been devoted to categorical data imputation with machine learning, despite the fact that many real-life datasets contain categorical attributes. Categorical data are common in many fields, such as education (e.g., student responses to an exam question with the categories correct and incorrect), marketing (e.g., consumer preference among the leading brands of a product), banking (e.g., type of loan with categories house loan, vehicle loan, educational loan, etc.), the social and biomedical sciences, the behavioral sciences (e.g., type of mental illness, viz., schizophrenia, depression, neurosis), epidemiology and public health, genetics (e.g., type of allele inherited by an offspring) and zoology (e.g., alligators' primary food preference with categories fish, invertebrate, reptile) [16]. The existing methods for categorical data imputation employ techniques that were originally designed for numerical variables. These techniques include listwise deletion, mode imputation, model-based procedures such as factored likelihoods [9] and Bayesian methods [17].
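Mode imputation, the simplest of the baselines above and the first stage of the mode+PNN method proposed here, can be sketched as follows (a minimal illustration; the function name and data are ours, not from the paper):

```python
# Sketch of mode imputation for a categorical attribute: every missing
# entry is replaced by the most frequent observed category.
from collections import Counter

def mode_impute(column):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]

loans = ["house", "vehicle", "house", None, "education", None, "house"]
imputed = mode_impute(loans)  # both None entries become "house"
```

Mode imputation preserves no relationship with the other attributes, which is precisely the weakness the PNN stage is meant to address.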

This paper proposes a novel soft computing imputation technique based on PNN for categorical data. PNN is employed in this study because of its ability to identify complex non-linear relationships between a set of input and output variables. In addition, PNN trains quickly on sparse data sets and is a universal approximator for smooth classification problems [18]. Finally, it should be noted that the paper proposes a soft computing technique only for categorical data imputation; it does not evaluate the impact of imputation on the classification accuracy of a classifier.

The remainder of the paper is organized as follows: A brief review of literature on imputation of missing categorical data is presented in Section 2. Probabilistic Neural Network (PNN) is described briefly in Section 3. The proposed method and experimental setup are described in Section 4. Results and discussions are presented in Section 5, followed by conclusions in Section 6.


Review of categorical data imputation techniques

The methods for handling missing categorical data can be broadly classified into three categories [9]: (1) deletion, (2) modeling the distribution of the missing data and then estimating it based on certain parameters, and (3) imputation. Each of these is discussed below.

Overview of the Probabilistic Neural Network (PNN)

The Probabilistic Neural Network (PNN) was originally proposed by Specht [41]. PNN is a feed-forward neural network with a one-pass training algorithm, used for classification and mapping of data. PNN is an implementation of a statistical algorithm called kernel discriminant analysis, in which the operations are organized into a multilayer feed-forward network with four layers: input layer, pattern layer, summation layer and output layer. It is a pattern classification network based on the
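The four-layer organization described above can be illustrated with a minimal sketch. The paper's implementation is in Java; this Python version, with hypothetical names and a Gaussian kernel, is only illustrative:

```python
# Minimal PNN sketch: input -> pattern layer (one Gaussian kernel unit per
# stored training pattern) -> summation layer (class-wise accumulation) ->
# output layer (argmax over class averages). sigma is the smoothing factor.
import math

def pnn_classify(train_x, train_y, x, sigma=0.5):
    sums, counts = {}, {}
    for xi, yi in zip(train_x, train_y):
        # Pattern layer: Gaussian kernel activation for one stored pattern.
        d2 = sum((a - b) ** 2 for a, b in zip(xi, x))
        act = math.exp(-d2 / (2 * sigma ** 2))
        # Summation layer: accumulate activations class-wise.
        sums[yi] = sums.get(yi, 0.0) + act
        counts[yi] = counts.get(yi, 0) + 1
    # Output layer: the class with the largest average activation wins.
    return max(sums, key=lambda c: sums[c] / counts[c])

X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
y = ["A", "A", "B", "B"]
label = pnn_classify(X, y, (0.05, 0.1))  # query near the "A" cluster
```

Training is a single pass because the pattern layer simply stores the training examples; only the smoothing factor needs tuning, which matches the SF search discussed in the results section.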

Proposed methodology

The architecture of the proposed method, preprocessing steps for dealing with categorical variables, the datasets employed in the experiment and the experimental design are described below.

Results and discussions

We developed the code for PNN in Java in a Windows environment on a laptop with 2 GB RAM. The Weka tool is used for imputation with RF, DT and Naïve Bayes. We measured the performance of the proposed method by the percentage of missing values that are predicted correctly:

Percentage of Correct Predictions (PCP) = 100 × Number of Correct Predictions (NCP) / Total number of predictions
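The PCP metric defined above can be computed directly from the imputed values and the held-out true values (a trivial sketch; the data shown is made up):

```python
# PCP = 100 * (number of imputed values matching the true values) /
#       (total number of imputed values).
def pcp(predicted, actual):
    ncp = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * ncp / len(actual)

score = pcp(["red", "blue", "red", "green"],
            ["red", "blue", "blue", "green"])  # 3 of 4 correct -> 75.0
```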

PNN is governed by a parameter called the smoothing factor (SF). For a dataset with missing values in 'p' attributes, p

Conclusions

We proposed a novel soft computing technique for categorical data imputation. The proposed technique employs PNN preceded by mode imputation. It is tested on four benchmark datasets in the framework of 10-fold cross-validation, and its performance is compared with that of RF, DT, Naïve Bayes, HD, K-NN and mode imputation. The results indicate that the proposed technique yields a better Percentage of Correct Predictions

Mr. Kancherla Jonah Nishanth is currently a Senior Technology Manager at Andhra Bank, Hyderabad. He received his M.Tech. degree in Information Technology from the University of Hyderabad (UoH), Hyderabad, India in 2012. His research interests include data mining, data analytics, data imputation and machine learning. He has two publications to his credit in reputed international journals.

References (41)

  • Z. Geng et al., Bayesian method for learning graphical models with incompletely categorical data, Comput. Stat. Data Anal. (2003)
  • F. Lobato et al., Multi-objective genetic algorithm for missing data imputation, Pattern Recognit. Lett. (2015)
  • D.F. Specht, Probabilistic neural networks, Neural Netw. (1990)
  • P.L. Roth et al., Missing data in multiple item scales: a Monte Carlo analysis of missing data techniques, Organ. Res. Methods (1999)
  • K.J. Nishanth et al., A computational intelligence based online data imputation method: an application for banking, J. Inf. Process. Syst. (2013)
  • D.H. Schoellhamer, Singular spectrum analysis for time series with missing data, Geophys. Res. Lett. (2001)
  • M. Cooke, P. Green, M. Crawford, Handling missing data in speech recognition, in: Third International Conference on...
  • O. Troyanskaya et al., Missing value estimation methods for DNA microarrays, Bioinformatics (2001)
  • R.J.A. Little et al., Statistical Analysis with Missing Data (2002)
  • P.J. García-Laencina et al., Pattern classification with missing data: a review, Neural Comput. Appl. (2010)

Prof. Vadlamani Ravi has been a Professor at the Institute for Development and Research in Banking Technology, Hyderabad since June 2014. He obtained his Ph.D. in the area of Soft Computing from Osmania University, Hyderabad and RWTH Aachen, Germany (2001); an MS (Science and Technology) from BITS, Pilani (1991); and an M.Sc. (Statistics & Operations Research) from IIT, Bombay (1987). At IDRBT, he spearheads the Center of Excellence in Analytics, the first of its kind in India, and evangelizes Analytical CRM and non-CRM analytics by conducting customized training programs for bankers on OCRM & ACRM, Data Warehousing, Data and Text Mining, Big Data Analytics, Fraud Analytics, Risk Analytics, Social Media Analytics, Credit Recovery Analytics and Business Analytics, and by conducting POCs for banks. He has 176 papers to his credit: 74 in refereed international journals, 6 in refereed national journals, 77 in refereed international conferences, 3 in refereed national conferences and 16 invited book chapters. His papers have appeared in Applied Soft Computing, Soft Computing, Asia-Pacific Journal of Operational Research, Decision Support Systems, European Journal of Operational Research, Expert Systems with Applications, Engineering Applications of Artificial Intelligence, Fuzzy Sets and Systems, IEEE Transactions on Fuzzy Systems, IEEE Transactions on Reliability, Information Sciences, Journal of Systems and Software, Knowledge Based Systems, Neurocomputing, IJUFKS, IJCIA, IJAEC, IJDMMM, IJIDS, IJDATS, IJISSS, IJECRM, IJISSC, IJCIR, IJCISIM, IJBIC, JIPS, Computers and Chemical Engineering, Canadian Geotechnical Journal, Biochemical Engineering Journal, Computers in Biology and Medicine, Applied Biochemistry and Biotechnology, Bioinformation, Journal of Services Research, etc.
He also edited a book entitled “Advances in Banking Technology and Management: Impacts of ICT and CRM” (http://www.igi-global.com/reference/details.asp?id=6995), published by IGI Global, USA, 2007, and the Proceedings of the 5th Fuzzy and Neuro Computing Conference, 2016, held at Hyderabad, India. Some of his research papers are listed among the Top 25 Hottest Articles by Elsevier and World Scientific. He has an H-index of 28 and more than 3178 citations for his papers (http://scholar.google.co.in/). His profile was among the Top 10% Most Viewed Profiles on LinkedIn in 2012. He is recognized as a Ph.D. supervisor at the Department of Computer and Information Sciences, University of Hyderabad and the Department of Computer Sciences, Berhampur University, Orissa. He is an invited member of Marquis Who's Who in the World, USA in 2009 and 2015, an invited member of 2000 Outstanding Intellectuals of the 21st Century 2009/2010, published by the International Biographical Center, Cambridge, England, and an invited member of “Top 100 Educators in 2009”, published by the International Biographical Center, Cambridge, England. So far, 3 Ph.D. students have graduated under his supervision and 5 more are currently working towards their Ph.D. He has advised more than 50 M.Tech./MCA/M.Sc. projects and 20 summer interns from various IITs, and currently supervises 3 M.Tech. students. He is on the IT Advisory Committee of Canara Bank for their DWH and CRM project; IT Advisor for Indian Bank for their DWH and CRM project; Principal Consultant for Bank of India for their CRM project; and an Expert Committee Member for IRDA for their Business Analytics and Fraud Analytics projects. He is a referee for 40 international journals of repute.
Moreover, he is a member of the Editorial Review Board of the International Journal of Information Systems in the Service Sector (IGI Global, USA); the International Journal of Data Analysis Techniques and Strategies (Inderscience, Switzerland); the International Journal of Information and Decision Sciences (IJIDS, Inderscience, Switzerland); the International Journal of Strategic Decision Sciences (IJSDS, IGI Global, USA); the International Journal of Information Technology Project Management (IJITPM, IGI Global, USA); and the International Journal of Data Science (IJDS, Inderscience, Switzerland), as well as an Editorial Board Member for the Book Series in Banking, Inderscience, Switzerland. He is on the program committee of several international conferences and has chaired many sessions at international conferences in India and abroad. His research interests include Fuzzy Computing, Neuro Computing, Soft Computing, Evolutionary Computing, Data Mining, Text Mining, Web Mining, Big Data Analytics, Privacy Preserving Data Mining, Global/Multi-Criteria/Combinatorial Optimization, Bankruptcy Prediction, Risk Measurement, Customer Relationship Management (CRM), Fraud Analytics, Sentiment Analysis, Social Media Analytics, Churn Prediction in banks and firms, and Asset Liability Management through Optimization. In a career spanning 28 years, Dr. Ravi has worked in several cross-disciplinary areas such as Financial Engineering, Software Engineering, Reliability Engineering, Chemical Engineering, Environmental Engineering, Chemistry, Medical Entomology, Bioinformatics and Geotechnical Engineering. At IDRBT, he has held various administrative positions, including Coordinator, IDRBT-Industry Relations (2005-06); M.Tech. (IT) Coordinator (2006–2009); Convener, IDRBT Working Group on CRM (2010-11); and Ph.D. Coordinator (2014–2016).
As the convener of the IDRBT Working Group on CRM, he co-authored a Handbook on Holistic CRM and Analytics (http://www.idrbt.ac.in/PDFs/Holistic%20CRM%20Booklet_Nov2011.pdf), which suggests a new framework for CRM, best practices and new organization structures, apart from HR issues, for the Indian banking industry. He has 28 years of research and 15 years of teaching experience. He designed and developed a number of courses in Singapore and India at the M.Tech. level in Soft Computing, Data Warehousing and Data Mining, Fuzzy Computing, Neuro Computing, Quantitative Methods in Finance, Soft Computing in Finance, etc. Further, he designed and developed a number of short courses for Executive Development Programs (EDPs), including two-week courses on CRM for executives, Data Mining, Big Data and its relevance to Banking, Fraud Analytics, etc. He conducted ACRM proofs of concept (POCs) for 14 banks on their real data. He has established research collaborations with the University of Hong Kong, the University of Ghent, Belgium, IISc, Bangalore and IIT Kanpur. He coordinated two international EDPs on ACRM for banking executives at the University of Ghent, Belgium jointly with Prof. Dr. Dirk Van den Poel in 2011 and 2012. As part of academic outreach, he is a guest speaker in IIM Kolkata's PGP program and an invited resource person in various national workshops and faculty development programs on Soft Computing, Data Mining and Big Data funded by AICTE and organized by SRM University, JNTU, UoH and some engineering colleges in India. He is an invited panel member/chair in several national forums, such as the IBA forum on Big Data Analytics in Banking in India. He is a member of the task force set up by NSDC and administered by NASSCOM to suggest curriculum changes to develop data scientists. Further, he contributed to the Roadmap for Big Data in India developed by DST. He is an external examiner for Ph.D.
in Auckland University of Technology, New Zealand and Christ University, India. Further, he is an External Expert to review Research Project Proposals on Analytics and ACRM submitted by the Belgian Academics to Belgian Government for funding. He was the Associate Professor at IDRBT between February 2010-June 2014. Prior to joining IDRBT as Assistant Professor in April 2005, he worked as a Faculty at the Institute of Systems Science (ISS), National University of Singapore (April 2002 - March 2005). At ISS, he was involved in teaching M. Tech. (Knowledge Engineering) and research in Fuzzy Systems, Neural Networks, Soft Computing Systems and Data Mining & Machine Learning. Further, he consulted for Seagate Technologies, Singapore and Knowledge Dynamics Pte. Ltd., Singapore, on data mining projects. Earlier, he worked as Assistant Director (Scientist E1) from 1996–2002 and Scientist C from 1993 to 1996 respectively at the Indian Institute of Chemical Technology (IICT), Hyderabad. He was deputed to RWTH Aachen (Aachen University of Technology) Germany under the DAAD Long Term Fellowship to carry out advanced research during 1997–1999. He earlier worked as Scientist B and Scientist C at the Central Building Research Institute, Roorkee (1988–1993) and was listed as an expert in Soft Computing by TIFAC, Government of India.
