Abstract
The abundant availability of health-care data calls for effective analysis methods to help medical experts gain a better understanding of their patients and their health. The focus of existing work has been largely on prediction. In this paper, we introduce Core, a framework for cohort “representation” and “exploration.” Our contributions are twofold. First, we formalize cohort representation as the problem of aggregating the trajectories of its patients. This problem is challenging because cohorts often consist of hundreds of patients who underwent medical actions of various types at different points in time. We prove that producing a representative cohort trajectory is NP-complete by a reduction from the multiple sequence alignment problem. We propose a heuristic that extends the Needleman–Wunsch algorithm for sequence matching to handle temporal sequences. To further improve the efficiency of cohort representation, we introduce “trajectory families” and “stratified sampling.” Second, we formalize cohort exploration as the problem of finding a set of cohorts that are similar to a cohort of interest and that maximize entropy. This problem is challenging because the potential number of similar cohorts is huge. We prove NP-completeness by a reduction from the maximum edge subgraph problem. To address this complexity, we develop a multi-staged approach based on limiting the search space to “contrast cohorts.” To speed up the computation of cohort similarity, we use “event sets,” which are inspired by the double dictionary encoding proposed for keyword search. We evaluate the usefulness and efficiency of Core in an extensive set of qualitative and quantitative experiments on two real health-care datasets. In a user study with medical experts, we show that Core reduces time-to-insight from hours to seconds and helps them find better insights than baseline approaches. We also show that the obtained cohort representations offer the right trade-off between quality and performance. We study the benefits of trajectory families and stratified sampling for cohort representation and show their applicability to large and heterogeneous cohorts. Finally, we show the benefit of event sets for cohort exploration in providing interactive performance.
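For readers unfamiliar with the alignment step mentioned above, the following is a minimal sketch of the classic Needleman–Wunsch global alignment applied to two sequences of event labels. It is not Core's heuristic: the temporal extension described in the abstract is not reproduced here, and the function name, the scoring values (`match`, `mismatch`, `gap`) and the event labels are assumptions chosen only for illustration.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def needleman_wunsch(a: List[str], b: List[str],
                     match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> Tuple[int, List[Pair]]:
    """Classic global alignment of two event sequences (illustrative scoring)."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # align both
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((a[i - 1], None))
            i -= 1
        else:
            aligned.append((None, b[j - 1]))
            j -= 1
    aligned.reverse()
    return score[n][m], aligned


# Hypothetical trajectories: event labels are made up for the example.
best, pairs = needleman_wunsch(["visit", "cpap", "visit"], ["visit", "visit"])
```

Here `None` marks a gap, that is, an event present in one trajectory with no counterpart in the other. A temporal variant would presumably also account for the time at which each event occurred when scoring a pair, which this sketch ignores.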
Notes
Continuous Positive Airway Pressure.
Throughout the paper, we use the shorter term “age” for “age category,” and “life” for “life status.”
Throughout the paper, the dot notation represents the invocation of a function on its right-hand side for the object (e.g., a patient) on its left-hand side.
Acknowledgements
Funding was provided by CDP LIFE (Grant No. C7H-ID16-PR4-LIFELIG).
About this article
Cite this article
Omidvar-Tehrani, B., Amer-Yahia, S. & Lakshmanan, L.V.S. Cohort analytics: efficiency and applicability. The VLDB Journal 29, 1527–1550 (2020). https://doi.org/10.1007/s00778-020-00625-6
DOI: https://doi.org/10.1007/s00778-020-00625-6