KDE-OCSVM model using Kullback-Leibler divergence to detect anomalies in medical claims

https://doi.org/10.1016/j.eswa.2022.117056

Highlights

  • An anomaly detection method is developed for the records of UEBMI.

  • A combination method is utilized to preprocess the experimental data.

  • A feature selection method involves two aspects, variance and similarity.

  • An extended OCSVM model is established to improve the model performance.

  • Interesting findings provide some guidelines for future research and practical issues.

Abstract

Detecting and preventing unusual claims behavior in medical insurance can reduce the drawdown of a group medical insurance pool. Taking the case of the urban employee basic medical insurance (UEBMI) in China, this paper develops a method to detect unusual medical claim patterns in the UEBMI. We collect public domain records involving gastric malignancy as the experimental sample. A method to preprocess the experimental sample for data learning is provided. We present a feature selection method, involving variance analysis and similarity analysis, to determine the core features. Next, we establish an extended one-class support vector machine (OCSVM) model, the kernel density estimation (KDE)-OCSVM, which exploits the Kullback-Leibler divergence and the KDE method to estimate the parameter v of the OCSVM model, to improve model performance. An experiment and two analyses are performed to validate the proposed method.

Introduction

As the world shifts to an increasingly urban population, governments are keen to provide an affordable national insurance scheme. China is no exception as her citizens are becoming more health conscious, seeking high quality medical services to treat their medical conditions. Unfortunately, access to such high quality medical services invariably costs more, making good medical care out of reach for the retirees and lower wage workers. As such, the policy response has been to provide affordable and accessible universal healthcare for all. In the case of China, it has established a medical insurance system aiming at universal insurance coverage (Huang et al., 2020), to provide an urban employee basic medical insurance (UEBMI) for employees and retirees. According to China’s Statistical Yearbook in 2020 from the Chinese National Bureau of Statistics, 329 million urbanites (242 million workers and 87 million retirees) are covered by this scheme.

The UEBMI fund is currently on a tight budget (see Fig. 1), while the UEBMI membership pool has been increasing. Projections suggest that the UEBMI fund will be in the red in the near future if no interventions are taken (Xie et al., 2020). Two factors drive this situation: population aging and excessive claims on the UEBMI (Zeng et al., 2019, Jiang and Ni, 2020). Population aging has been a serious concern for the managers of the UEBMI. As shown in Table 1 for the period 2010–2020, an aging workforce leads to more retirees. Moreover, current government regulations do not require retirees to make additional contributions to the UEBMI fund as long as the minimum period of payment is met (Li & Tian, 2020). With more retirees living longer, the UEBMI will have to cater to their growing medical needs. Solutions have been offered to mitigate this effect (Qiu and Wu, 2019, Qiu et al., 2020), albeit inadequately.

The UEBMI fund has two portions, a member’s fund and a pooled fund. The former covers the sundry medical costs incurred by the member, e.g., outpatient costs and pharmaceutical prescriptions. The latter covers larger medical expenses, e.g., inpatient costs. Both portions should cover only medical-related expenses, subject to certain conditions. However, some members misuse the member’s fund for non-medical products, while others use the pooled fund to cover non-reimbursable medical service expenses. Both actions deplete the UEBMI fund faster. Hence, there is a need to detect such incidents and take steps to manage them for the UEBMI to be sustainable.

In practice, all claims made on the UEBMI are recorded and uploaded to the National Healthcare Security Administration (NHSA) platform, where the claim records can be viewed. Currently, there appear to be more unusual claim records on the UEBMI fund, though the monetary value of the claims is minor. As such, detecting the abnormal claims is an anomaly detection problem.

This paper analyzes the abnormal claim records, which are referred to as anomalies (Villa-Pérez et al., 2021). As a learning task, anomaly detection can be supervised, semi-supervised, or unsupervised (Olson et al., 2018, Kang et al., 2019, Cappozzo et al., 2020). The conventional classifiers for anomaly detection work well in the presence of at least two well-defined classes, but they may be biased when data irregularities exist, such as imbalanced classes, small disjuncts, skewed class distributions, and missing values (Sonbhadra et al., 2020). In particular, when a class is ill-defined, using such classifiers would provide a biased outcome. The literature labels this as one-class classification (Koch et al., 1995), i.e., the negative samples are much fewer than the positive samples. This sampling distribution makes the construction of the decision boundary complex and challenging. In the field, there exist many one-class classification algorithms, namely, the one-class support vector machine (OCSVM) (Kumar et al., 2021), one-class random forest (Désir et al., 2013), one-class deep neural network (Wu et al., 2020), and others (Tajoddin and Abadi, 2019, Kim et al., 2021, Zhou et al., 2021).

The OCSVM is an extension of the support vector machine to address the issue when data from only one class are available. It has been successfully applied in many areas involving document classification (Bouamra et al., 2018), healthcare (Krishnan et al., 2019), intrusion detection (Binbusayyis & Vaiyapuri, 2021), mobile network management (Dridi et al., 2021), medical diagnostics (Togo et al., 2020), industrial monitoring and damage detection (Dias et al., 2021), Internet of Things (Sheikh & Jilani, 2021), and others (Dong et al., 2017, Feng et al., 2017, Nguyen et al., 2021). Many studies have provided measures to improve the performance of the OCSVM model, namely: 1) Robustness. Zhu et al. (2016) proposed an instance-weighted strategy to dampen the effect of noise. Tian et al. (2018) developed a Ramp-OCSVM algorithm, which exploits the non-convex properties of the Ramp loss function to enhance the generalization performance. 2) Kernel parameter estimation. Xiao et al. (2014) provided two parameter selection methods of the Gaussian kernel, utilizing the information of the farthest and the nearest neighbors of each sample, and detecting the tightness of the decision boundary. Ghafoori et al. (2018) presented a K-nearest neighbor method, which can avoid the normal boundary propensity to skew toward the anomalies if the training set includes anomalies. 3) Feature selection and dimensionality reduction. Khalifa et al. (2016) compared eight feature selection methods on the performance of the OCSVM algorithm. Alam et al. (2020) designed an algorithm to maximize the learning ability about the target class while minimizing the number of training samples.

However, studies on how to determine the parameter v of the OCSVM model are scant. The parameter v is an upper bound on the proportion of training errors and a lower bound on the proportion of support vectors. In the extant studies, there are two ways to determine v. The first is to set v to a small value and then use positive samples as the training set to determine the decision boundary. When the training set contains anomalies, the decision boundary is prone to skew toward the negative samples, yielding a significant decline in performance. The second is to assign the ratio of negative samples to overall samples as v. This second method has a limitation: the experimental outcome may be unsatisfactory when the experimental samples differ from the actual data. Taking anomaly detection of medical records in a city as an example, it is highly likely that the training samples are obtained from several hospitals instead of from all hospitals in the city. When there is a significant difference between the training samples from the selected hospitals and the data distribution of all hospitals in the city, the decision boundary deviates and the detection performance drops.
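The role of v as a bound can be checked empirically. The following is a minimal sketch, assuming scikit-learn's `OneClassSVM` (where v is exposed as the `nu` argument) and synthetic Gaussian data rather than UEBMI records; the helper name `nu_bounds` and the chosen values of `nu` are illustrative only. Because the solver stops at a numerical tolerance, the observed outlier fraction may sit slightly above `nu` in practice.

```python
import numpy as np
from sklearn.svm import OneClassSVM


def nu_bounds(nu, n_samples=500, seed=0):
    """Fit an OCSVM and return (training-outlier fraction, support-vector fraction)."""
    rng = np.random.default_rng(seed)
    # Synthetic 2-D data standing in for a single "normal" class (an assumption).
    X = rng.normal(size=(n_samples, 2))
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X)
    # predict() returns -1 for points placed outside the learned boundary.
    outlier_frac = float(np.mean(model.predict(X) == -1))
    # support_ holds the indices of the support vectors in the training set.
    sv_frac = len(model.support_) / n_samples
    return outlier_frac, sv_frac


for nu in (0.05, 0.10, 0.20):
    outlier_frac, sv_frac = nu_bounds(nu)
    print(f"nu={nu:.2f}  outliers={outlier_frac:.3f}  support vectors={sv_frac:.3f}")
```

Raising `nu` should move both fractions upward, which illustrates why a poorly chosen v shifts the decision boundary.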

This study is motivated by two considerations. First, as claims abuse can threaten the intended progress of the UEBMI, anomaly detection can be used to ferret out such claims and prevent future occurrences. Second, the existing limitations in obtaining v can lead to weak performance of the OCSVM model, prompting a revamp of the OCSVM model.

Our study makes the following contributions. First, as our study collects a raft of raw medical data from local hospitals in China, we develop a data structurization process of data cleansing, data vectorization, and data normalization to preprocess the semi-structured and unstructured data for downstream use, yielding over a million valid medical records. Second, to reduce feature dimensionality, we build a feature selection method involving variance analysis and similarity analysis. Third, we form a kernel density estimation (KDE)-OCSVM model that obtains v from the experimental samples: the KDE method estimates the probability distribution of the overall data from the sample data, and the Kullback-Leibler divergence is used to give the value of 1-v (the ratio of positive to overall samples in the data). Fourth, we set up an anomaly detection experiment on the medical records of gastric malignancy from hospitals in a province in China, followed by a sensitivity analysis and a comparison analysis to better explain why some records are grouped under the misclassified categories and how to discover the misclassified records.
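The two building blocks named in the third contribution, KDE and the Kullback-Leibler divergence, can be sketched in isolation. The snippet below is only an illustration under stated assumptions: it uses SciPy's `gaussian_kde` on one-dimensional synthetic samples, approximates the divergence on a shared grid, and does not reproduce the paper's rule for turning the divergence into a value of v. The function name `kl_divergence_kde`, the grid size, and the small constant `eps` are hypothetical choices.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kl_divergence_kde(sample_p, sample_q, grid_size=512):
    """Approximate D_KL(P || Q) for 1-D samples via Gaussian KDEs on a shared grid."""
    p_kde = gaussian_kde(sample_p)
    q_kde = gaussian_kde(sample_q)
    lo = min(sample_p.min(), sample_q.min())
    hi = max(sample_p.max(), sample_q.max())
    grid = np.linspace(lo, hi, grid_size)
    dx = grid[1] - grid[0]
    eps = 1e-12  # guards against log(0); an arbitrary illustrative constant
    p = p_kde(grid) + eps
    q = q_kde(grid) + eps
    # Renormalise so both densities integrate to 1 over the grid.
    p /= p.sum() * dx
    q /= q.sum() * dx
    return float(np.sum(p * np.log(p / q)) * dx)


rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=1000)       # stands in for the collected sample
population = rng.normal(0.3, 1.2, size=1000)   # stands in for the overall data
print(kl_divergence_kde(sample, population))
```

A divergence near zero suggests the sample resembles the reference distribution; larger values signal the kind of sample-versus-population mismatch that the introduction identifies as a weakness of fixing v from the sample ratio alone.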

The rest of this study is organized as follows. Section 2 introduces the data collection and preprocessing. The feature selection method and the KDE-OCSVM model are developed in Section 3. Section 4 contains the experiment, followed by sensitivity and comparison analyses, and a discussion. Section 5 concludes.

Section snippets

Data source and description

The UEBMI is a public medical insurance plan in China, meant to reimburse the medical expenses of the insured. The contributions to the UEBMI funds are made by employees, firms, and government, and it is managed by the NHSA. All the medical expense transaction records are held at this institution.

In this study, the two-year data from January 1, 2018 to December 31, 2019 are extracted from the NHSA branch in Hunan, containing the diagnosis and treatment information of the members who suffered

Methodology

Anomaly detection on the UEBMI data is critical and complex. The UEBMI data are imbalanced, i.e., the proportion of normal claims data is much higher than that of the abnormal claims data, and the majority of the insurance data are unlabeled, meaning that it is not completely clear which claim records are normal and which are not. At the same time, the UEBMI data possess high dimensionality, necessitating feature selection. When the features are used in the anomaly

Experimental process

After preprocessing the UEBMI data, the feature selection module is used to handle the original features. First, the sample data are integrated via the associated features. The data come from three tables, which contain the membership information, the treatment expenditure information, and the records of the medical service and medication usage. The table of the membership information is parent to the table on the treatment expenditure information, cascading to the table of the records of

Conclusion

While more people have enjoyed the benefits of UEBMI, there are also challenges and issues. Currently, the UEBMI faces a serious problem, because some people acquire benefits through misuse and antisocial behavior. Hence, there are some abnormal and misclassified records in the records of UEBMI, and this study seeks to detect such records to stem the abuse.

In the records of UEBMI in several hospitals during a year, we select some records of gastric malignancy as the data source of our

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors thank the editors and anonymous reviewers for their helpful comments and suggestions. This work was supported by the Natural Science Foundation of Hunan Province (No. 2020JJ4121) and the China Scholarship Council (No. 202006370246).

1 ORCID: 0000-0001-7167-5062.
2 ORCID: 0000-0001-8671-3930.
3 ORCID: 0000-0001-5142-5277.
4 ORCID: 0000-0001-7668-4881.
5 ORCID: 0000-0002-3620-7658.
6 ORCID: 0000-0002-0825-9194.
7 ORCID: 0000-0002-1767-1391.
