Abstract
Patients in rural India are often unable to enquire about their health using appropriate disease-related keywords in a query. Lack of domain knowledge prevents the patients from refining the query through the well-known feedback mechanism. Moreover, due to the scarcity of doctors in rural India, the health assistants who run the health centers do not have enough knowledge to treat patients based on an imprecise query. In this paper, we propose an autonomous provisional disease diagnosis system that classifies the query after expanding it using the semantics of the domain knowledge. First, we apply the spatial-distribution-based nearest neighbor spacing distribution (NNSD) to the disease-related medical document corpus (MDC) to find the relevant terms, mostly symptoms, with respect to different diseases. We frame a symptom vocabulary (SV) from the unique terms present in the different diseases, which are known a priori. Each query is expanded into a bag of symptoms (BoS) using a 5-gram collocation model, with the log-likelihood ratio (LLR) measuring the association between the query and the terms in the MDC. The terms in the BoS may not exactly match the symptoms in the SV but are contextually similar. We propose a novel approach to determine which symptoms in the SV are nearest in context to the corresponding terms in the BoS. The feature vector, which is sparse in nature, is obtained by encoding the SV with respect to (w.r.t.) each BoS. We apply a sparse representation based classifier (SRC) to classify the query into a particular disease. The proposed nearest neighbor spacing distribution based sparse representation classifier (NNSD-SRC) shows promising performance on the MDC dataset, and validation of the results by doctors shows negligible error.
1 Introduction
The goal of query classification is to identify the category label, known a priori, that best represents the domain of the keywords submitted in a query. However, the performance of a query-based classifier largely depends on the keywords submitted by the users, which often do not express the underlying information being searched for. Such keywords are called noise terms; they cannot unambiguously represent the actual context of the query, resulting in classification errors. Selecting more appropriate keywords, which represent the context of the query, enhances the performance of the classifier.
In most cases, the feedback mechanism [1] performs well when the user can modify the query based on the suggestions provided by the search engine. However, there is no scope for query refinement using feedback when the user lacks domain knowledge, a scenario that is very common in the rural healthcare sector of India. Healthcare services in remote villages face a real challenge due to the scarcity of doctors. Generally, health assistants manage the rural health centers, but they lack the expertise to refine a patient’s query containing noise terms. In [2], a query classification system was proposed for diagnosing the disease at the primary level by processing the imprecise query keywords with the help of an experts’ knowledge base. The aim of this paper is therefore to develop an autonomous provisional disease diagnosis system using statistical and computational methods, which can effectively monitor the health of rural people.
It has been observed that relevant words follow a correlated spatial distribution, while irrelevant words are distributed at random in the document. There is therefore a marked difference between the patterns of occurrence of relevant and non-relevant terms in the document. A spatial distribution based method has been proposed for obtaining the symptom-related terms from the disease-related document corpus. In the level-statistics analysis of quantum disordered systems, the “energy level” of a word within an “energy spectrum” is taken as the spatial distribution of the word, and the relevant words are those whose energy levels attract each other [3]. In this paper, we propose a nearest-neighbour spacing distribution (NNSD) based approach to obtain the symptoms w.r.t. the disease-classes, which are known a priori. A symptom vocabulary (SV) is constructed from the unique symptoms present in the disease-classes [4]. We build a disease-symptom matrix (DSM) whose dimensions are the number of diseases and the number of symptoms in the SV; each element of the matrix denotes the tf-score [5] of the respective symptom, computed over the disease-related MDC. The DSM is built by extracting knowledge from the MDC and is sparse in nature.
After knowledge extraction, the imprecise query submitted by the patient is expanded using the terms that have a strong association with the query keyword. Measuring the association requires a suitable adaptive technique that represents the context of the query more precisely. In this paper, the query is expanded using a 5-gram collocation model, with the log-likelihood ratio (LLR) employed to measure the association [6]. For a query keyword, five co-occurring terms are taken as the expanded query, called the bag-of-symptoms (BoS). A term in the BoS might not exactly match the terms in the SV, though it may be semantically or contextually similar. Here, we propose a novel approach for finding the most similar term in the SV w.r.t. each term in the BoS using a distributional similarity measure. Finally, the SV is encoded with the tf values of the terms in the BoS, and the result is taken as the feature vector (FV). Since the terms in a BoS are very specific, the FV is sparse; it is used as the test pattern for predicting the disease. We use the sparse DSM and the sparse test pattern to predict the disease of a patient by applying the Sparse Representation based Classifier (SRC) [7]. The proposed system is described in Fig. 1.
This paper is divided into four sections. Section 2 describes the methodology. Results are summarized in Sect. 3 and conclusions are drawn in Sect. 4.
2 Methodology
In this paper, we propose an autonomous provisional disease identification system based on the patient’s query keyword, which is often imprecise and contains noise terms. The first contribution of the paper is knowledge extraction by analyzing pre-defined disease-related document corpora collected from different medical sources.
2.1 Disease-Class Generation
Here, we utilize the NNSD of words (or symptoms) over the documents to find the symptoms relevant to a disease. The spacing distribution P(d) of a word w is obtained as the normalized histogram of the set of distances or spacings \( (d_{1}, d_{2}, \ldots, d_{m}) \) between consecutive occurrences of w in the documents, where m is the number of times the word w occurs in the document [4]. It has been observed that a non-relevant word like “and” is placed at random in the document, whereas a relevant word like “angina” appears in the “heart-disease” related document following the spatial distribution P(d). Therefore, the level of attraction of relevant words is higher than that of irrelevant words. The relevance of a word is defined using the parameter ρ, where \( \rho = \frac{\sigma }{{\overline{d} }} \), \( \overline{d} \) is the average distance and \( \sigma = \sqrt {\overline{d^{2}} - \overline{d}^{2} } \) is the standard deviation of the distribution P(d). The ρ values of different words are used to compare their distributional similarity. When the words are uncorrelated, they follow a Poisson distribution.
The relevant words follow a correlated spatial distribution and form a group w.r.t. a disease based on ρ. In this paper, we obtain a group of relevant words as the symptoms of each disease class by applying NNSD to each word in the document. From Fig. 2 it is evident that the relevant words “heart”, “angina” and “Palpitation” follow a similar type of distribution with different means and standard deviations, while the non-relevant terms “called” and “high” follow a random distribution.
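As a minimal sketch of how ρ could be computed, the following Python function collects the spacings between consecutive occurrences of a word in a token list and returns σ divided by the mean spacing. The function name `spacing_relevance`, the use of token indices as the distance, and the toy token list are assumptions made for illustration; they are not taken from the paper.

```python
from math import sqrt


def spacing_relevance(tokens, word):
    """Return rho = sigma / mean of the spacings between consecutive
    occurrences of `word`, or None if there are fewer than two spacings."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    spacings = [b - a for a, b in zip(positions, positions[1:])]
    if len(spacings) < 2:
        return None
    mean = sum(spacings) / len(spacings)
    mean_sq = sum(d * d for d in spacings) / len(spacings)
    # Guard against tiny negative values caused by rounding.
    sigma = sqrt(max(mean_sq - mean * mean, 0.0))
    return sigma / mean


# Hypothetical toy document: "angina" is clustered, "and" is evenly spaced.
tokens = ["filler"] * 30
for i in (0, 1, 2, 20):
    tokens[i] = "angina"
for i in (4, 9, 14, 19, 24):
    tokens[i] = "and"

rho_cluster = spacing_relevance(tokens, "angina")
rho_even = spacing_relevance(tokens, "and")
```

In this toy example the clustered word yields a larger ρ than the evenly spaced one; on a real corpus the values would have to be compared across all candidate words.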
The symptoms of the different disease classes are thus obtained, and a Symptom Vocabulary (SV) is built with k unique symptoms (n << k) present in the n different disease-classes. A disease-symptom matrix \( DSM_{n \times k} \) is built as the measurement space, each element of which is calculated using Eq. (1),
where \( f_{w,D} \) is the count of the term w in D [5].
The DSM is a sparse matrix, as most of the symptoms are unique to a disease, and it is used to classify the query keyword submitted by the patient.
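The construction of the DSM can be sketched as below. The sketch assumes that the tf score is simply the raw count of a symptom in the disease's documents; the exact weighting of Eq. (1) may differ. The function name `build_dsm` and the toy corpus are hypothetical.

```python
from collections import Counter


def build_dsm(disease_docs, vocabulary):
    """disease_docs maps a disease name to the list of tokens of its documents;
    vocabulary is the ordered list of SV symptoms.
    Returns (diseases, matrix), where matrix[i][j] is the raw count of
    symptom j in the documents of disease i (assumed tf score)."""
    diseases = sorted(disease_docs)
    matrix = []
    for disease in diseases:
        counts = Counter(disease_docs[disease])
        matrix.append([counts[symptom] for symptom in vocabulary])
    return diseases, matrix


# Hypothetical toy corpus with two disease classes.
docs = {
    "heart-disease": "angina palpitation angina fatigue".split(),
    "diabetes": "thirst fatigue thirst thirst".split(),
}
sv = ["angina", "palpitation", "thirst"]
diseases, dsm = build_dsm(docs, sv)
```

Each row has non-zero entries only for the symptoms that occur in that disease's documents, which is what makes the matrix sparse when most symptoms are specific to one disease.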
2.2 Query Expansion Model
In the proposed query expansion model, the query of a patient is expanded using 5-gram collocation by consulting the same MDC. We find the terms co-occurring with the query keyword using LLR as the association measure. It has been observed that beyond 5-grams, the co-occurring terms are redundant [8]. The expanded query consists of five co-occurring terms and is defined as the bag-of-symptoms (BoS). The BoSs are not unique: there may be multiple BoSs for each keyword due to its associations with different words throughout the document. From the multiple BoSs, the one with the highest LLR score is chosen as the expanded query. For example, if a patient enquires about “heart” related problems, the keyword “heart” is expanded and the top-scored BoS (heartbeat angina heart disease nausea) is taken as the expanded query.
Each BoS is used to generate the feature vector (FV) by comparing each term of the BoS with the symptoms in the SV based on the ρ value. The symptom in the SV that is closest to a term of the BoS is encoded with the tf score [5] of that term. If multiple terms of a BoS are mapped to the same symptom of the SV, the highest tf score is used to encode that symptom. The remaining elements of the SV are set to zero, so the FV is sparse in nature.
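The LLR association measure can be illustrated with a short Python sketch. It uses the standard log-likelihood ratio over a 2×2 contingency table of 5-token windows and ranks individual terms by their association with the query keyword. This is a simplification: the paper scores whole BoSs, whereas this sketch scores single terms, and the names `llr`, `expand_query` and `toy_tokens` are hypothetical.

```python
from math import log


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of window counts."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    total = 0.0
    for k, i, j in ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1)):
        if k > 0:  # a zero cell contributes nothing (0 * log 0 is taken as 0)
            total += k * log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def expand_query(tokens, query, window=5, top=5):
    """Return up to `top` terms most strongly associated with `query`."""
    windows = [set(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    n = len(windows)
    with_query = [w for w in windows if query in w]
    c_query = len(with_query)
    scores = {}
    for term in set(tokens) - {query}:
        c_term = sum(1 for w in windows if term in w)
        c_both = sum(1 for w in with_query if term in w)
        # Keep only terms that co-occur more often than chance would predict.
        if c_both * n <= c_query * c_term:
            continue
        scores[term] = llr(c_both, c_query - c_both,
                           c_term - c_both, n - c_query - c_term + c_both)
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]


# Hypothetical toy text: "angina" and "nausea" appear next to "heart".
toy_tokens = (["heart", "angina", "nausea"] + ["rest"] * 7) * 2
expanded = expand_query(toy_tokens, "heart")
```

Here "rest" occurs in every window, so it carries no information about "heart" and is dropped, while "angina" ranks above "nausea" because it sits closer to the keyword.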
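The encoding step described above can be sketched as follows. The sketch assumes that "closest" means the smallest absolute difference in ρ, and that a BoS term already present in the SV maps to itself. The function name `encode_feature_vector` and all ρ and tf values are made up for illustration.

```python
def encode_feature_vector(bos_tf, bos_rho, sv_rho, vocabulary):
    """Encode a BoS as a sparse feature vector over the SV.

    bos_tf:     {BoS term: tf score}
    bos_rho:    {BoS term: rho}, needed only for terms missing from the SV
    sv_rho:     {SV symptom: rho}
    vocabulary: ordered list of SV symptoms
    """
    fv = {symptom: 0.0 for symptom in vocabulary}
    for term, tf in bos_tf.items():
        if term in fv:
            target = term
        else:
            # Map the term to the SV symptom with the nearest rho value.
            target = min(vocabulary, key=lambda s: abs(sv_rho[s] - bos_rho[term]))
        # When several terms map to one symptom, keep the highest tf score.
        fv[target] = max(fv[target], tf)
    return [fv[symptom] for symptom in vocabulary]


# Hypothetical values; "coronary" is absent from the SV and maps to "heart".
vocabulary = ["fatigue", "heart", "nausea", "palpitation", "thirst"]
sv_rho = {"fatigue": 1.2, "heart": 1.8, "nausea": 1.4, "palpitation": 1.6, "thirst": 2.5}
bos_tf = {"fatigue": 1.0, "coronary": 2.0, "palpitation": 1.0, "heart": 3.0, "nausea": 1.0}
bos_rho = {"coronary": 1.75}
fv = encode_feature_vector(bos_tf, bos_rho, sv_rho, vocabulary)
```

With these values both "coronary" and "heart" map to the symptom "heart", which keeps the larger tf score, and "thirst" stays at zero.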
2.3 Sparse Representation Based Classification of Query
The FV is represented by vector y, which is sparse and we apply SRC to classify the query by reconstruction using Eq. (2).
where W is the coefficient vector, which is sparse because not all disease-classes contribute to reconstructing the query sample y.
The sparsest solution can be obtained by solving the following optimization problem, given in Eq. (3),
where \( \left\| \cdot \right\|_{0} \) is the \( L_{0} \)-norm, which counts the number of non-zero entries in the coefficient vector. Minimizing the \( L_{0} \)-norm directly is a combinatorial problem; when the solution is sufficiently sparse, it can be replaced by \( L_{1} \)-norm minimization, which is solved in polynomial time by standard linear programming algorithms [8]. After the sparsest solution, say \( \widehat{\mathbf{W}}_{1} \), is obtained, the SRC [7] is performed in the following way.
For each disease-class i, let \( \partial_{i} : {\mathbb{R}}^{S} \to {\mathbb{R}}^{S} \) be the characteristic function that selects the coefficients associated with the i-th class. Using only the coefficients associated with the i-th class, the given test sample y is reconstructed as \( \left( {\mathbf{y}}_{{\mathbf{new}}}^{i} \right)^{T} = DSM^{T} \, \partial_{i} \left( {\widehat{\mathbf{W}}_{1} } \right) \), where \( {\mathbf{y}}_{{\mathbf{new}}}^{i} \) is called the prototype of class i with respect to the sample y. Equation (4) calculates the residual distance between the actual sample and its prototype of class i,
The SRC decision rule: If \( r_{m} \left( {\mathbf{y}} \right) = \min_{i} r_{i} \left( {\mathbf{y}} \right), \) y is assigned to the class m [9].
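For reference, the standard SRC formulation can be written in the notation of this section as follows. This is a sketch: the forms below are assumed from the usual SRC literature and from the reconstruction formula above, and may differ in detail from the numbered equations referred to in the text. The first line is the reconstruction model, the second the \( L_{0} \) problem, the third its \( L_{1} \) relaxation, and the last the class-wise residual.

\[ \mathbf{y} = DSM^{T} \, \mathbf{W} \]

\[ \widehat{\mathbf{W}}_{0} = \arg\min_{\mathbf{W}} \left\| \mathbf{W} \right\|_{0} \quad \text{subject to} \quad \mathbf{y} = DSM^{T} \, \mathbf{W} \]

\[ \widehat{\mathbf{W}}_{1} = \arg\min_{\mathbf{W}} \left\| \mathbf{W} \right\|_{1} \quad \text{subject to} \quad \mathbf{y} = DSM^{T} \, \mathbf{W} \]

\[ r_{i}(\mathbf{y}) = \left\| \mathbf{y} - \mathbf{y}_{\mathbf{new}}^{i} \right\|_{2} \]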
Example
Step 1. The BoS corresponding to the patient’s keyword ‘Angina’ is [Fatigue Coronary Palpitation Heart Nausea]^T.
Step 2. Encode the expanded query as the test pattern y using the SV (1 × 70). The term “Coronary” is not present in the SV, so it is replaced with the most similar symptom, “Heart”, by comparing ρ values. The FV y is given as follows:
y (1×70) = [0, 0, …, 0, 1.27, …, 0, 0, …, 0, 0.3, …, 0, 0, …, 0, 0.3, …, 0, 0, …, 0, 0.9, …, 0, 0]^T
Step 3. Considering y as the encoded test sample and the DSM (4 × 70) as the training set, obtain the sparse coding vector W (4 × 1) using Eq. (3):
W (4×1) = [0.12 0.03 –0.04 –0.005]^T
Step 4. Reconstruct y as \( {\mathbf{y}}_{{\mathbf{new}}}^{i} \) for each disease class i, using the non-zero coefficient \( W_{i} \) of the i-th disease class.
Step 5. The residual distance for each class i is computed using Eq. (4):
r = [1.88 1.97 2.03 2]^T
Step 6. The minimum residual distance is 1.88, corresponding to i = 1. Therefore, the query is classified into the disease-class “Heart-Disease”.
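Steps 4 to 6 of the example can be sketched in Python as below. The sketch assumes one DSM row per disease class, so that selecting the coefficients of class i amounts to keeping the single coefficient w[i], and it assumes the coefficient vector has already been produced by a sparse solver (not shown). The function name `src_classify` and the toy numbers are hypothetical and much smaller than the 4 × 70 case above.

```python
from math import sqrt


def src_classify(y, dsm, w, labels):
    """Assign y to the class whose prototype gives the smallest residual.

    y:      feature vector (length k)
    dsm:    one row of length k per disease class
    w:      one coefficient per disease class, from a sparse solver
    labels: disease-class names, in the same order as the rows of dsm
    """
    residuals = []
    for i, row in enumerate(dsm):
        # Prototype of class i: reconstruction using only coefficient w[i].
        prototype = [w[i] * value for value in row]
        residuals.append(sqrt(sum((a - b) ** 2 for a, b in zip(y, prototype))))
    best = min(range(len(residuals)), key=residuals.__getitem__)
    return labels[best], residuals


# Hypothetical toy data with two classes and four symptoms.
toy_dsm = [[2.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 3.0, 1.0]]
toy_y = [1.0, 0.5, 0.0, 0.0]
toy_w = [0.5, 0.0]
label, residuals = src_classify(toy_y, toy_dsm, toy_w, ["Heart-Disease", "Diabetes"])
```

In the toy data the first class reconstructs the sample exactly, so its residual is zero and the query is assigned to that class.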
3 Results and Discussions
In our experiment, a medical document corpus (MDC) is prepared by consulting several medical websites (webmd.com, mayoclinic.org, healthcare.com) and the medical literature [10]. It contains 260 documents divided into four sub-corpora representing the diseases “Heart-disease”, “Diabetes”, “Diarrhea” and “Lung-disease”.
The NNSD-SRC method is applied to the four sub-corpora to extract the relevant terms, which are symptoms; the resulting SV has dimension 70. We sample 200 patient queries from a rural health kiosk over a span of one week and classify them using 10-fold cross-validation. As given in Table 1, the NNSD-SRC method achieves higher accuracy and a lower rate of misclassification than the other classifiers, together with high precision and recall. The ROC curves for the different classifiers are given in Fig. 3, where NNSD-SRC shows the best performance.
4 Conclusions
We propose an NNSD-SRC based provisional disease diagnosis method that minimizes the experts’ involvement. The patient’s query is expanded moderately using the 5-gram collocation approach. For classification of the query, a sparse representation based classifier (SRC) is employed, which exploits the sparsity of the feature vector and of the DSM. The SRC based classifier outperforms the other classifiers, showing improvement in accuracy and sensitivity on different data sets. In this work, we prepare a benchmark data set, the MDC, of medical documents related to “Heart-disease”, “Diabetes”, “Diarrhea” and “Lung-disease”, which has been verified by experts. The performance of the system is satisfactory, and it can be used in rural healthcare in India, where the scarcity of doctors is a real challenge.
References
1. Carpineto, C., Romano, G.: A survey of automatic query expansion in information retrieval. ACM Comput. Surv. (CSUR) 44(1), 1 (2012)
2. Sil, J., Bhattacharya, I.: Patient classification based on expanded query using 5-gram collocation and binary tree. In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 1–10. IEEE (2015)
3. Mehta, M.L.: Random Matrices, vol. 142. Academic Press, Amsterdam (2004)
4. Carpena, P., Bernaola-Galván, P., Hackenberg, M., Coronado, A.V., Oliver, J.L.: Level statistics of words: Finding keywords in literary texts and symbolic sequences. Phys. Rev. E 79(3), 035102 (2009)
5. Ramos, J.: Using TF-IDF to determine word relevance in document queries. In: Proceedings of the First Instructional Conference on Machine Learning (2003)
6. Pauls, A., Klein, D.: Faster and smaller N-gram language models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 258–267. Association for Computational Linguistics (2011)
7. Yang, J., Chu, D., Zhang, L., Xu, Y., Yang, J.: Sparse representation classifier steered discriminative projection with applications to face recognition. IEEE Trans. Neural Netw. Learn. Syst. 24(7), 1023–1035 (2013)
8. Donoho, D.L., Tsaig, Y.: Fast solution of ℓ1-norm minimization problems when the solution may be sparse. IEEE Trans. Inf. Theory 54(11), 4789–4812 (2008)
9. Bhattacharya, I., Sil, J.: Query classification using LDA topic model and sparse representation based classifier. In: Proceedings of the 3rd IKDD Conference on Data Science, 2016, p. 24. ACM, March 2016
10. Harrison’s Principles of Internal Medicine, vol. 2. McGraw-Hill Medical, New York (2008)
Acknowledgement
This research was supported by grants from Information Technology Research Academy (ITRA), under the Department of Electronics and Information Technology (DeitY), Government of India.
© 2017 Springer International Publishing AG
Bhattacharya, I., Sil, J. (2017). Spatial Distribution Based Provisional Disease Diagnosis in Remote Healthcare. In: Shankar, B., Ghosh, K., Mandal, D., Ray, S., Zhang, D., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2017. Lecture Notes in Computer Science(), vol 10597. Springer, Cham. https://doi.org/10.1007/978-3-319-69900-4_76
DOI: https://doi.org/10.1007/978-3-319-69900-4_76
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69899-1
Online ISBN: 978-3-319-69900-4