1 Introduction

Disease co-occurrence, in which two or more diseases occur within one patient [1], is a popular topic in public health studies. It sometimes represents comorbidity or multimorbidity and can suggest interactions between different risk factors such as diagnoses, treatments and procedures [1, 2]. Data mining and machine learning techniques are widely applied in the public health domain to discover disease co-occurrences. For example, statistical methods can be used to measure the association between two different diagnoses [3], and structure learning models such as Bayesian networks are used to analyze interactions in disease co-occurrence patterns [2]. Disease co-occurrences can also be identified by finding the diseases that co-occur most frequently using Apriori-like algorithms [4]. Patterns and features discovered from comorbidities could provide a foundation for creating predictive models [5]. For example, comorbidities are informative features in predicting the readmission risk of certain diseases [1].

Although disease co-occurrence is essential in studying correlations among different diseases, it cannot suggest temporal trends of diagnoses because it carries no information on the order in which diseases occur. It therefore cannot reveal disease progression. Sequential data mining, which considers the order of data elements, has been used to detect temporal trends of various diseases. For example, windowing, episode rules and inductive logic programming are used to extract frequent sequential patterns of cardiovascular diseases [6]. Aggregate values and time intervals from health records are used as features to cluster patients into different cohorts [7]. Wavelet functions can help to analyze time series in healthcare data of patients with diabetes [8]. However, most of these methods are value-based: they use values from laboratory tests or other healthcare records to generate results. In our study, we adopt a sequence mining method that uses diagnosis codes (class labels) to study disease progression from patients’ diagnosis histories.

Recently, open data initiatives from governments have collected and made available large amounts of healthcare data, providing a unique opportunity to study disease comorbidities and sequential patterns. Such data are attractive to researchers working on public health studies because of their completeness and inexpensive nature [9, 10]. They are extensively used in healthcare research, such as prevention and detection of diseases, studying comorbidity and mortality, and advancing interventions, therapies and treatments [9]. They can also be combined with other data sources to serve different purposes, such as studying disease patterns and improving healthcare quality among different cohorts, for instance by predicting asthma-related emergency department visits [11] and analyzing temporal patterns of in-hospital falls among elderly patients [12].

As part of New York State’s open data initiative, the Statewide Planning and Research Cooperative System (SPARCS) has collected patient-level discharge records from hospitals for over 35 years; these records contain patients’ diagnosis, procedure and demographic information [13]. SPARCS is now widely used in public health studies in New York State [14, 15], for example to study correlations between various factors and outcomes of patients who suffer from different diseases [16,17,18,19] and associations among patient characteristics, diseases and treatments [16, 20]. SPARCS has also been used to discover temporal or spatial patterns of emergency department visits before, during and after Hurricane Sandy [21, 22]. Researchers can benefit from SPARCS data by leveraging the long patient-level diagnosis histories, for example to conduct population-based studies [23] and to assess the completeness of disease reporting [24]. Patient-level longitudinal data can also be combined with other data sources such as drug exposure profiles and genetics data to study patterns in different cohorts [5].

The objective of this study is to find association rules (i.e., co-occurrences of diseases) and frequent sequence patterns from diagnosis histories of cancer patients in New York State using SPARCS data. Association rules among multiple diseases could imply comorbidities, while sequence patterns of diseases could indicate disease progression. We extract all discharge records of patients with at least one cancer-related diagnosis code, and convert the diagnosis codes of the ninth and tenth revisions of the International Classification of Diseases (ICD-9 and ICD-10) to single-level Clinical Classifications Software (CCS) diagnosis categories. The CCS cancer categories are used as disease labels in our work. We use the Apriori algorithm for association rule learning to find potential comorbidities using multiple diagnoses from individual visits, and the cSPADE algorithm for frequent sequence mining to identify frequent disease sequence patterns from the full discharge histories of patients in each cohort. We perform the studies both with primary diagnoses only and with all diagnoses (including secondary ones) to generate different patterns. We present the results for several common cancer types, and we believe that the results will provide essential data and knowledge for clinical researchers to further investigate comorbidities and disease progression for improving the management of multiple diseases.

Table 1. Cancer-related CCS diagnosis categories and descriptions.

2 Methods

Using data mining and machine learning methods to study patients’ profiles can help researchers to study comorbidities and disease progression [5]. Our objective is to conduct a patient-level longitudinal study using SPARCS data to discover frequent disease co-occurrence and sequence patterns. We first convert ICD-9 and ICD-10 diagnosis codes to CCS diagnosis categories, and then use Apriori and cSPADE algorithms to identify patterns using these high-level categories. We only focus on histories of patients who have at least one of the cancer-related CCS diagnosis categories (Table 1).

2.1 Data Sources

We use SPARCS data and obtain histories of 21,466,868 patients from 97,849,071 discharge records during 2011–2015. Discharge records with all four kinds of claim types (i.e. inpatient, outpatient, ambulatory surgery and emergency department) are used to get a full history of each patient. Table 2 shows patient characteristics of our experiment data.

Table 2. Statistics of patient characteristics for selected cancer types.

There are 25 data elements used to record ICD diagnosis codes of each hospital visit in SPARCS. The first diagnosis code is the primary diagnosis code, which represents the main reason for a patient’s hospital visit; the rest are secondary diagnosis codes, which represent conditions that coexist during that hospital visit. All ICD-9 and ICD-10 diagnosis codes are converted to their corresponding single-level CCS diagnosis categories, i.e. primary diagnosis categories and secondary diagnosis categories. These high-level diagnosis categories are used to represent disease diagnoses to reduce dimensionality in data mining. We study only patients with cancer diagnosis categories. For each cancer category, patients who have at least one discharge record containing the cancer-related diagnosis category are selected into the cohort. In total, 1,565,237 cancer patients and 18,208,830 discharge records from their histories are used in this study. Each patient’s discharge records are grouped together using an encrypted unique patient identifier in SPARCS. Due to the length limit of this paper, we present results for seven types of cancer with high incidence rates, consistent with the statistics of the American Cancer Society [25].
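The conversion and cohort selection described above can be illustrated with a minimal Python sketch. The lookup table `ICD_TO_CCS`, the record layout and the function names are hypothetical and not taken from SPARCS; the few mapping entries are illustrative only and should be checked against the official CCS crosswalk.

```python
# Hypothetical, tiny ICD-to-CCS lookup (illustrative entries only; the real
# single-level CCS crosswalk covers all ICD-9 and ICD-10 diagnosis codes).
ICD_TO_CCS = {
    "162.9": 19,    # ICD-9 code mapped to "Cancer of bronchus; lung"
    "C34.90": 19,   # ICD-10 code mapped to the same CCS category
    "I10": 98,      # essential hypertension
    "427.31": 106,  # cardiac dysrhythmias
}

def to_ccs(icd_codes):
    """Map one record's ICD codes to CCS categories, keeping their order
    (the first code is the primary diagnosis). Unmapped codes are skipped."""
    return [ICD_TO_CCS[code] for code in icd_codes if code in ICD_TO_CCS]

def select_cohort(records, target_ccs):
    """Return the ids of patients with at least one discharge record that
    contains the target CCS category.

    records: iterable of (patient_id, list_of_icd_codes) pairs (assumed layout).
    """
    cohort = set()
    for patient_id, icd_codes in records:
        if target_ccs in to_ccs(icd_codes):
            cohort.add(patient_id)
    return cohort
```

Once a patient is in the cohort, all of that patient's discharge records are kept, not only the records that mention the cancer category.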

For each patient, discharge records are ordered by admission date such that all CCS diagnosis categories on the same admission date form an element, and all elements are ordered to constitute a sequence (Fig. 1). Discharge records containing AIDS/HIV or abortion diagnoses are deleted from our experiment data because their admission dates are redacted and we cannot determine their positions in a sequence. An example of the diagnosis sequence of a patient in the lung and bronchus cancer cohort is shown in Fig. 1. CCS diagnosis category descriptions reported on the same admission date are listed in brackets and form an element. The corresponding CCS category labels are given in the parentheses following the descriptions. Admission dates are marked on top of each corresponding element. The primary diagnosis category of each element is underlined. The CCS category that represents the targeted cancer (i.e., lung and bronchus cancer) is highlighted in bold.

Fig. 1. Diagnoses sequence of a patient with lung and bronchus cancer.
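The sequence construction described in this section can be sketched as follows. This is a minimal illustration, assuming each record is a `(patient_id, admission_date, ccs_categories)` tuple with ISO-formatted date strings; the function name and record layout are hypothetical.

```python
from collections import defaultdict

def build_sequences(records):
    """Group records by patient and admission date, then order by date.

    records: iterable of (patient_id, admission_date, ccs_categories), where
    admission_date is an ISO string such as "2012-03-01" (assumed format, so
    that plain string sorting matches chronological order).
    Returns {patient_id: [element, ...]}, each element being a frozenset of
    the CCS categories reported on one admission date.
    """
    by_patient = defaultdict(lambda: defaultdict(set))
    for patient_id, admission_date, categories in records:
        # Categories from all records on the same date merge into one element.
        by_patient[patient_id][admission_date].update(categories)
    return {
        patient_id: [frozenset(by_date[d]) for d in sorted(by_date)]
        for patient_id, by_date in by_patient.items()
    }
```

Records whose admission dates are redacted would have to be dropped before this step, since they cannot be placed in the ordering.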

2.2 Apriori Algorithm: Identifying Disease Co-occurrence Patterns

Association rule learning is a rule-based machine learning approach that is often used to identify co-occurrences or temporal patterns between diseases in the clinical domain [4]. In this study, we adopt the Apriori algorithm [26] to identify disease co-occurrence patterns within each cohort. Only elements that contain the targeted cancer CCS diagnosis category are selected, and both primary and secondary diagnosis categories are used in our experiment. For instance, for the sequence illustrated in Fig. 1, only the elements in which the targeted cancer CCS diagnosis category is highlighted in bold are used.

The Apriori algorithm discovers frequent disease co-occurrences by comparing their supports with a user-specified minimum support threshold. In Fig. 1, for example, if the support of the pattern “{Cancer of bronchus; lung (19), Other lower respiratory disease (133)}” is 15%, it means that 15% of the elements in this cohort contain this disease co-occurrence pattern. If the minimum support threshold is greater than 15%, this pattern will not be identified. However, if the minimum support threshold is set smaller than 15%, the pattern will be detected.
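The support-based search described above can be sketched in a few lines of Python. This is a simplified, unoptimized illustration of the Apriori idea and not the implementation used in the study; the function names are hypothetical, and treating a support equal to the threshold as frequent is an assumption.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions (elements) that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Return {frozenset_of_categories: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    level = [frozenset([item]) for item in {i for t in transactions for i in t}]
    frequent = {}
    k = 1
    while level:
        # Keep only candidates whose support reaches the threshold
        # (assumption: support == min_support counts as frequent).
        current = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = [
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        ]
    return frequent
```

With the elements of a cohort as transactions, a call such as `apriori(elements, 0.15)` would retain a co-occurrence pattern only if at least 15% of the elements contain it.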

2.3 cSPADE Algorithm: Discovering Frequent Sequence Patterns

Because ICD diagnosis codes are the only data elements available in SPARCS that contain patient-level disease information, we use a frequent sequence mining technique [27] to find frequent disease sequence patterns in different cohorts. Since diagnosis codes are strictly ordered in sequences, the results might reveal disease progression. We use the cSPADE algorithm [27] to discover these patterns. We experiment on complete patient sequences with two settings: one uses only primary diagnosis categories, and the other uses both primary and secondary diagnosis categories. Figure 1 is an example of a complete patient sequence consisting of both primary and secondary diagnosis categories. The length of a sequence is the total number of its elements. There are 10 elements in the sequence in Fig. 1, thus it is a length-10 sequence.

The cSPADE algorithm also works by comparing the support of a sequence pattern with the minimum support threshold. Multiple occurrences of a pattern in the same sequence are counted only once. For example, the length-2 sequence pattern “{Cancer of bronchus; lung (19), Other lower respiratory disease (133)} \(\rightarrow \) {Cardiac dysrhythmias (106)}” appears twice in Fig. 1, but it is counted only once for this sequence when calculating the support of the pattern. If the support of this sequence pattern is 15%, it means that 15% of the sequences in the targeted cohort contain this pattern. If the minimum support threshold is smaller than 15%, this sequence pattern is selected; otherwise the pattern is pruned from the search results.
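The support definition above, in which a pattern is counted at most once per patient sequence, can be illustrated with a brute-force Python sketch. It does not reproduce cSPADE itself, which enumerates candidate patterns far more efficiently; it only checks the support of one given pattern, and the function names are hypothetical.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence as an ordered subsequence.

    Both arguments are lists of sets. Each pattern element must be a subset
    of some sequence element, and the matched elements must appear in order.
    Greedy left-to-right matching is sufficient for this check.
    """
    pos = 0
    for wanted in pattern:
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later element
    return True

def sequence_support(pattern, sequences):
    """Fraction of sequences containing pattern; repeated occurrences within
    one sequence contribute only once."""
    pattern = [frozenset(e) for e in pattern]
    hits = sum(contains([frozenset(e) for e in s], pattern) for s in sequences)
    return hits / len(sequences)
```

For a patient whose history contains the pattern twice, `contains` still returns a single `True`, so the patient adds one to the numerator.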

3 Results

We present the top five frequent disease co-occurrence and sequence patterns in each cohort, ranked by support. Results that are not meaningful are filtered out when refining the experiment results: patterns containing identical diagnosis categories, CCS diagnosis categories that represent nonspecific disease groups or serve administrative purposes, patterns of length one, and patterns irrelevant to the targeted cancers. We choose to present length-2 disease sequences in our experiment results, because longer disease sequence patterns obtained in our experiments usually contain repeated diagnosis categories that represent follow-up visits rather than disease progression.
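The refinement step for sequence patterns can be sketched as follows. This is only an illustration under stated assumptions: the function name, the pattern representation (a tuple of frozensets) and the `excluded` set of nonspecific or administrative CCS categories are hypothetical, and the exact filtering rules used in the study may differ.

```python
def refine(patterns, target, excluded=frozenset(), top_n=5):
    """Filter mined sequence patterns and return the top_n by support.

    patterns: {pattern: support}, where pattern is a tuple of frozensets of
    CCS categories (one frozenset per element).
    target: CCS category of the cohort's cancer.
    excluded: categories treated as nonspecific or administrative (assumed).
    """
    kept = []
    for pattern, sup in patterns.items():
        categories = [c for element in pattern for c in element]
        if len(pattern) != 2:
            continue  # keep length-2 sequences only
        if len(set(categories)) < len(categories):
            continue  # a category repeats, e.g. a follow-up visit
        if target not in categories:
            continue  # pattern is irrelevant to the targeted cancer
        if set(excluded) & set(categories):
            continue  # contains a nonspecific or administrative category
        kept.append((pattern, sup))
    return sorted(kept, key=lambda item: item[1], reverse=True)[:top_n]
```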

Frequent disease co-occurrence patterns are presented in Table 3, and the results are generated using both primary and secondary diagnosis categories. Table 4 presents frequent disease sequence patterns discovered using only primary diagnosis categories. Table 5 demonstrates frequent disease sequence patterns identified using both primary and secondary diagnosis categories.

Table 3. Frequent disease co-occurrences for selected cancers, using both primary and secondary diagnosis categories.
Table 4. Frequent sequence patterns for selected cancers, using primary diagnosis categories only.
Table 5. Frequent sequence patterns for selected cancers, using both primary and secondary diagnosis categories.

4 Discussion

4.1 Common CCS Categories in Different Cohorts

We can learn from Tables 3 and 5 that essential hypertension is the most frequent CCS diagnosis category among all results of either frequent disease co-occurrence or sequence patterns. However, essential hypertension appears in only three sequences in Table 4. This might be because of the difference between primary diagnosis codes and secondary diagnosis codes in SPARCS data. Results in Tables 3 and 5 are generated using both primary and secondary diagnosis categories, but patterns in Table 4 are discovered using primary diagnosis categories only. Since primary diagnosis codes usually represent one major reason for a hospital visit and secondary diagnosis codes imply conditions that coexist during this visit, a combination of primary and secondary diagnosis codes usually contains richer diagnosis information. Perhaps cancers are more likely to be diagnosed in the elderly and essential hypertension tends to be prevalent among older people, so patients with cancer diagnoses often also have essential hypertension. Combining primary and secondary diagnosis codes can help us easily detect this pattern. Disorders of lipid metabolism is another diagnosis category that is frequent in both Tables 3 and 5 but absent from Table 4. The underlying explanation might be similar.

4.2 Disparities Between Primary and Secondary Diagnosis Codes

Tables 4 and 5 both present frequent disease sequence patterns among different cohorts: Table 4 shows the results produced using primary diagnosis categories only, and Table 5 shows the results using both primary and secondary diagnosis categories. Frequent disease sequence patterns for the same cohorts in these two tables are quite different. Disparities between Tables 4 and 5 could imply that primary and secondary diagnosis codes are not equally useful in finding potentially meaningful disease sequence patterns. Since primary diagnosis codes usually represent the main reason for a hospital visit, these codes are supposed to be good indicators of a patient’s condition at admission. Secondary diagnosis codes, however, simply represent conditions that coexist in the same hospital visit, so they might not accurately represent the patient’s condition responsible for that visit. Thus, secondary diagnosis codes could be less meaningful information in this study. This is supported by comparing the results in Tables 4 and 5.

For patients with lung and bronchus cancer in Table 4, the most frequent sequences mainly consist of respiratory system diseases, such as pneumonia and chronic obstructive pulmonary disease and bronchiectasis. But there is no respiratory system disease in the top five frequent disease sequence patterns for the same patient cohort in Table 5. Another typical cohort is patients with liver and intrahepatic bile duct cancer. We can learn from Table 4 that patients in this cohort are sometimes diagnosed with hepatitis or biliary tract disease. However, such patterns do not appear in Table 5. For patients with Non-Hodgkin’s lymphoma, the frequent sequence patterns shown in Tables 4 and 5 are also quite different. Only the results in Table 4 capture the existence of lymphadenitis and disease of white blood cells. Also, the most frequent disease sequence patterns for this cohort in Table 4 all consist of immune system diseases.

4.3 Frequent Disease Co-occurrence Patterns Versus Frequent Disease Sequence Patterns

One major difference between disease sequence and co-occurrence patterns is that the order of diagnoses is taken into consideration in a disease sequence, while disease co-occurrences simply represent different diagnoses that occur simultaneously. A disease sequence pattern can therefore be a potential indicator of disease progression. Since the order of two different diagnosis categories is the major factor to consider when tracking disease progression, we retain a frequent disease sequence pattern in the results even if its elements appear in reversed order in another top frequent disease sequence pattern.

For instance, sequence patterns “{Rectum and anus cancer} \(\rightarrow \) {Colon cancer}” and “{Colon cancer} \(\rightarrow \) {Rectum and anus cancer}” are both kept in Table 4. The former has support 0.1323, which is slightly greater than that of the latter (0.1206). Perhaps this is because rectum and anus cancer is more likely to develop into colon cancer, whereas fewer patients who suffer from colon cancer eventually have rectum and anus cancer. There could be causal relationships between the two diseases, or perhaps it is simply a result of the different mechanisms of these two types of cancers.

Another typical pattern is found in disease sequences containing essential hypertension. In Table 5, for example, sequence pattern “{Essential hypertension} \(\rightarrow \) {Pancreas cancer}” has support 0.5440, which is higher than the reversed sequence “{Pancreas cancer} \(\rightarrow \) {Essential hypertension}” with support 0.4156. It is evident that all the sequences where essential hypertension is at the first position have higher supports than their reversed sequences. It is an interesting phenomenon that perhaps implies the progression of pancreas cancer. However, we cannot obtain any information on disease progression from disease co-occurrence patterns. For example, Table 3 shows that pattern “{Pancreas cancer, Essential hypertension}” has the highest support among patients with pancreas cancer. It simply suggests that these two diagnoses co-occur frequently, but no information on the order in which they occur is available.
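The comparison between a sequence pattern and its reversed counterpart can be made explicit with a short Python sketch. The function name is hypothetical, and the four support values are the ones quoted above from Tables 4 and 5 (two different experimental settings, placed in one dictionary only for illustration). The ratio is a descriptive summary and does not by itself establish progression or causality.

```python
def dominant_directions(patterns):
    """For each length-2 pattern whose reverse is also present, report the
    direction with the higher support.

    patterns: {(first, second): support}
    Returns a list of (pattern, support, reversed_support, ratio) tuples.
    Pairs with exactly equal supports are not reported.
    """
    rows = []
    for (first, second), sup in patterns.items():
        rev = patterns.get((second, first))
        if rev is not None and sup > rev:
            rows.append(((first, second), sup, rev, sup / rev))
    return rows

# Supports quoted in the text (Table 5 and Table 4, respectively).
supports = {
    ("Essential hypertension", "Pancreas cancer"): 0.5440,
    ("Pancreas cancer", "Essential hypertension"): 0.4156,
    ("Rectum and anus cancer", "Colon cancer"): 0.1323,
    ("Colon cancer", "Rectum and anus cancer"): 0.1206,
}

for pattern, sup, rev, ratio in dominant_directions(supports):
    print(f"{pattern[0]} -> {pattern[1]}: {sup:.4f} vs {rev:.4f} (ratio {ratio:.2f})")
```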

4.4 Validation of Results

Many public health studies use data from only one or a few hospitals collected in a short period of time [3, 4, 10]. However, SPARCS has been collecting more representative and comprehensive data for over 35 years, as all Article 28 facilities (i.e. hospitals, nursing homes, and diagnostic treatment centers) certified for inpatient care and all facilities providing ambulatory surgery services in New York State are required to submit inpatient or outpatient data to SPARCS [13]. We therefore have a large-scale dataset with longer patient histories that could help generate potentially meaningful results.

For disease co-occurrences (Table 3), patients with lung and bronchus cancer usually have chronic obstructive pulmonary disease and bronchiectasis observed at the same time. Since these are both respiratory system diseases, they are reasonably correlated with each other. The same applies to patients with pancreas cancer. Patients in this cohort are at risk of suffering from diabetes, as pancreas cancer and diabetes are clinically correlated [28]. Moreover, patients with liver and intrahepatic bile duct cancer may also be diagnosed with hepatitis at the same time, because these two diseases are also associated with each other [28].

As for disease sequences (Table 4), many patients have pancreatic disorders (not diabetes) or biliary tract disease before being diagnosed with pancreas cancer. This might be a typical disease progression pattern in clinical studies and could help domain experts to identify pancreas cancer in the early stages. Another representative result is about Non-Hodgkin’s lymphoma, because the result sequences usually consist of immune system diseases. The top sequence patterns suggest that lymphadenitis is likely to happen before Non-Hodgkin’s lymphoma and disease of white blood cells is usually diagnosed after Non-Hodgkin’s lymphoma.

Although secondary diagnosis codes could be redundant information on patient conditions, they are also able to produce some potentially interesting and meaningful patterns on disease progression when combined with primary diagnosis codes. For example, prostate cancer is more likely to be diagnosed after genitourinary symptoms and ill-defined conditions are identified, and breast cancer usually happens after nonmalignant breast conditions (Table 5). Since these two patterns have comparatively higher supports than other sequence patterns in the same cohorts, they could be typical patterns in clinical studies.

5 Conclusion

We employ association rule learning (Apriori algorithm) and frequent sequence mining (cSPADE algorithm) to identify frequent disease co-occurrence and sequence patterns among cancer patients using SPARCS data. Different types of diagnosis codes are utilized in our experiments. Seven cohorts of cancers with high incidence rates are selected to present the results. Our results suggest that the methods adopted can generate potentially interesting and clinically meaningful disease co-occurrence and sequence patterns. These patterns might imply comorbidities and disease progression. However, due to the limited information that diagnosis codes in SPARCS can convey, our results contain some redundant or less meaningful patterns irrelevant to the targeted cancers. Since SPARCS is designed to serve the administrative purpose of monitoring and improving the quality of hospital services and data reporting, we believe our study could not only help to improve the quality of healthcare provided to cancer patients, but also shed light on research using diagnosis codes in SPARCS.

Since high-level diagnosis categories contain richer but less specific diagnosis information than diagnosis codes, we can use low-level ICD-9 and ICD-10 diagnosis codes in our future research to see if more specific and useful patterns can be extracted. We can also experiment on a cohort with one particular disease to narrow down the scope of our study and gain deeper insight into that specific cohort.