Abstract
This paper investigates patient risk prediction in the context of active learning with relative similarities. Active learning has been extensively studied and successfully applied to real problems. Active learning methods typically pose absolute questions. In a medical application where the goal is to predict a patient's risk of developing a certain disease from Electronic Health Records (EHR), absolute questions take the form of “Will this patient suffer from Alzheimer’s later in his/her life?” or “Are these two patients similar or not?”. Because they demand extensive domain knowledge, such absolute questions are usually difficult to answer, even for experienced medical experts. In addition, active learning methods built on absolute questions tend to be less stable, since incorrect answers occur often and can be detrimental to the risk prediction model. In this paper, we instead focus on designing relative questions that domain experts can answer easily. The proposed relative queries take the form of “Is patient A or patient B more similar to patient C?”, which medical experts can answer with more confidence. These questions elicit relative rather than absolute information, and in some cases can even be answered by non-experts. We propose an interactive patient risk prediction method that actively queries medical experts about the relative similarity of patients. We evaluate our method on both benchmark and real clinical datasets and make several interesting discoveries, including that querying relative similarities is effective for patient risk prediction and can sometimes yield better prediction accuracy than asking absolute questions.
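To make the query format concrete, the following is a minimal sketch of a relative similarity query of the form “Is patient A or patient B more similar to patient C?”. It is an illustration under stated assumptions, not the method proposed in the paper: the function names, the Euclidean distance, the smallest-margin selection heuristic, and the label-based simulated expert are all hypothetical choices made for this example.

```python
import numpy as np


def margin(X, a, b, c):
    """Gap between d(a, c) and d(b, c) under Euclidean distance (an assumption)."""
    d_ac = np.linalg.norm(X[a] - X[c])
    d_bc = np.linalg.norm(X[b] - X[c])
    return abs(d_ac - d_bc)


def most_ambiguous_triplet(X, triplets):
    """Pick the triplet (a, b, c) whose answer is least clear from current distances.

    Hypothetical heuristic: the smaller the margin, the more informative the
    expert's answer is assumed to be.
    """
    return min(triplets, key=lambda t: margin(X, *t))


def simulated_expert(labels, a, b, c):
    """Stand-in for a medical expert answering 'Is A or B more similar to C?'.

    Returns "A" or "B" when exactly one of the two shares C's label,
    and None when the simulated expert cannot decide.
    """
    same_a = labels[a] == labels[c]
    same_b = labels[b] == labels[c]
    if same_a and not same_b:
        return "A"
    if same_b and not same_a:
        return "B"
    return None


# Toy feature vectors for four patients (illustrative values only).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.4, 0.0], [5.0, 5.0]])
labels = [0, 1, 0, 1]
triplets = [(0, 1, 2), (0, 3, 2), (1, 3, 2)]

query = most_ambiguous_triplet(X, triplets)
answer = simulated_expert(labels, *query)
print(query, answer)
```

In an interactive loop, the returned answer would be added as a relative constraint and the similarity model updated before the next query is chosen; that update step is omitted here because the abstract does not specify it.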
Additional information
Responsible editors: Fei Wang, Gregor Stiglic, Ian Davidson and Zoran Obradovic.
Cite this article
Qian, B., Wang, X., Cao, N. et al. A relative similarity based method for interactive patient risk prediction. Data Min Knowl Disc 29, 1070–1093 (2015). https://doi.org/10.1007/s10618-014-0379-5
DOI: https://doi.org/10.1007/s10618-014-0379-5