Keywords

1 Introduction

Cancer is a set of complex genetic diseases whose pathogenesis is not well understood. Initiation and progression of these diseases depend on the misregulation of key genes called cancer/tumor genes. Gene misregulation occurs through different mechanisms including the gain and losses of DNA chromosome fragments (e.g. [11, 18, 20, 24]). These events are commonly termed DNA copy number aberrations (CNAs) and are routinely detected in the laboratory through comparative genomic hybridization (CGH) arrays, single nucleotide polymorphism (SNP) arrays and sequencing (e.g. [1214, 17, 22, 36, 47]). However not all detected CNAS are relevant for tumor initiation and/or progression. It is currently believed that CNAs that contain tumor genes are those that are relevant for tumor progression. These CNAs are called drivers while those which appear to have no biological implications are called passengers. Determining which CNAs are driving tumor progression and which ones are just passengers remains an open problem. Certain CNAs expand over large fragments of the genome and are sometimes termed Amplicons. These regions are important because contain multiple tumor genes and the presence or absence of certain CNAs within an amplicon has been associated with patient’s prognosis (e.g. [23, 41]). Examples include 9p in breast cancer, colon and glioblastoma tumors and lymphomas [5, 19], 11q in head and neck, breast, oral and liver tumors (reviewed in [46]) and 17q in ERBB2/HER2/NEU (ERBB2+, thereafter) positive breast cancer [4]. The detailed structure of amplicons is complex and difficult to investigate using traditional statistical methods since some amplifications appear to occur simultaneously, hence they are not significant as independent CNAs, and have synergistic effects [1, 28, 43]. In this work we will call co-occurring CNAs those that occur simultaneously independently of their functional effects. One potential approach to study the structure of an amplicon and identify potential co-occurring CNAs is to encode combinations of CNAs as a single predictor variable and perform association studies between these new predictor variables and phenotypes of interest.

Here we extend our previously reported supervised approach, termed Topological Analysis of array CGH (TAaCGH), to study the structure of an amplicon. In TAaCGH, we associate a point cloud to each CGH profile (or section of a CGH profile) through a sliding window algorithm [15], build a Vietoris-Rips (VR) simplicial complex [31] and perform an association study between the topological properties of the VR complex and the chosen phenotype. The difference between TAaCGH and other current association studies is that TAaCGH uses the topological properties of the point cloud, instead of the probes, as predictor variables. The advantage of using topological properties as predictors is that they can encode relationships between probes. In previous works we showed that using the rank of the zero homology group (\(\beta _0\)) as a predictor variable in association studies of breast cancer is comparable to other statistical methods [3]. Here we hypothesize that performing association between the rank of the first homology group \(\beta _1\) and a specific phenotype helps analyze the underlying structure of amplicons. This hypothesis is based on recent analytical and numerical results that shows that \(\beta _1\) encodes for periodic patterns [34] and by our own observations that show that neighboring (not-necessarily periodic) regions of amplifications are mirrored by \(\beta _1\) [10, 38].

To test our hypothesis and to illustrate our methodology we analyze the amplicon on 17q in ERBB2/HER2/NEU (ERBB2+, thereafter) positive breast cancer samples. ERBB2+ breast cancer is an aggressive form of the disease that comprises 25 % of all breast tumors diagnosed (reviewed in [35]). The ERBB2 gene is located in the region of the genome labeled as cytoband 17q12 (where 17 is the chromosome arm, q denotes the long arm of the chromosome and 12 denotes a specific band that can be detected by chromosome staining). Misregulation of ERBB2 in ERBB2+ tumors commonly occurs through copy number gains of 17q12. In many patients, this amplification is accompanied by gains of other regions in the same chromosome arm. This includes amplifications of 17q21.2 that encompasses the Top2A gene [32], chromosome regions 17q21.1, 17q22 [27] and \(17q21.33-q25.1\) which is predictive of early recurrence [9] and contains TANC2 (17q23) and PPM1D genes [29, 37], two independent co-amplified regions have also been reported in 17q23 [4, 39].

To test whether TAaCGH can detect these events, we analyzed two independently published data sets [13, 20]. We first confirmed the presence of the amplicon in 17q in both data sets using \(\beta _0\), we then identified specific regions within this arm using \(\beta _1\) analysis. This study revealed two regions of significance delimited by 17q12 and \(17q12-17q21.33\). To further localize the regions of the genome that contributed to the significance of \(\beta _1\) we calculated the generators of the first homology group and the correspondence between the probes and the generators. Statistical analysis quantifying the over-representations of genomic regions in the generators allowed us to further subdivide the region \(17q12-17q21.33\). A first amplification was detected in between the neighboring regions \(17q21.2-17q21.31\) (extending from base pairs 40,884,763-41,826,877) and the region 17q21.33 (from base-pairs 46,603,678-49,075570). Using the UCSC genome browser we observed that the first region contains the keratin cluster (e.g. [30]) and the second contains, among others the HOXB cluster (see [8] for a review). Both of these clusters have been previously reported in breast cancer studies. Whether their functionality is synergistic in some patients remains to be determined.

2 Data Sets and Methods

2.1 CGH Data

CNAs are defined as gains or losses of genome fragments and can be detected using microarray technologies. Through Comparative Genomic Hybridization (CGH), DNA probes (i.e., fragments of DNA sequences) are spotted on a platform. Tumor DNA, labeled with Cy3, and control DNA, labeled with Cy5, are co-hybridized in a 1:1 ratio. The intensity of the hybridized samples is captured and transformed into a red-green ratio value called the \(\log _2\) ratio. Since the physical position of each probe is known, these \(\log _2\) ratios can be mapped to the original genome producing a CGH profile (Fig. 1). In traditional statistical approaches each CGH profile is normalized and segmented, and significant copy number aberrations are then identified [6, 33, 45].

Fig. 1.
figure 1

A CGH profile for chromosome arm 17q. The x-axis indicates the genomic position and the y-axis the \(\log _2\) ratio of the intensity of the tumor and control samples co-hybridized to the same array.

2.2 Simulation Data Set

We simulated single and co-occurring aberrations. A detailed description of the simulation methods for a single aberration can be found in [3, 25, 26]. In brief, each simulation consisted of 200 profiles, 100 in the control set and 100 in the test set. Each simulated profile contained 100 aCGH probes. The value of the copy number along the profile was determined by three parameters: the mean value of the aberration \(\mu \), the length of the aberration \(\lambda \), and the standard deviation associated with noise \(\sigma \). Probes outside the aberration and in the control set had \(\mu =0\), whereas for those probes inside the aberration was \(\mu = 0.6\) or 1. Aberration length \(\lambda \) was equal to 5 and 10 probes. Noise was implemented by drawing samples from a Gaussian distribution of mean 0 and standard deviation \(\sigma \) of values 0.2, 0.6 or 1. The control set for single aberrations was made of profiles without aberrations (i.e. only noise).

Co-occurring aberrations were represented by two aberrations of different lengths. In the first aberration \(\mu = 0.6\) or 1 and in the second \(\mu =1\). The control set was made of profiles with no aberrations or with only one aberration.

2.3 Horlings Data Set

This dataset analyzed was published by Horlings and colleagues [20] and was obtained from the supplementary data [21]. Measurements of copy number variations were performed on microarrays containing 3.5 k BAC, PAC-derived DNA segments covering the entire genome with a spacing average of 1 Mb. Each BAC clone was spotted and triplicated on every slide (Code Link Activated Slides, Amersham Biosciences). Our own preprocessing of the data can be found in [3]. This study contained 14 ERBB2+ patients determined by clinical diagnosis. The control set consisted of the patients belonging to the remaining subtypes.

2.4 Climent Data Set

This data set was used as a validation set. In [13] genome-wide measurements of copy number variations were performed by array CGH (UCSF Hum. Array 2.0) with an average spacing between probes of 1Mb. The study contained 180 patients diagnosed with a stage I/II lymph node-negative breast cancer. The data set was downloaded from the GEO data base with accession number GSE6448. Arrays were preprocessed by averaging/removing probes as follows: 18 clones mapping to chromosome Y or missing genomic location information were removed, 80 probes mapping to identical genomic regions were averaged and represented as single values, 179 probes missing entries for 30 % or more patients were removed, and missing values were imputed using the lowess regression method in the aCGH package for R [16]. This resulted in 2,168 unique clones from the original 2,445 printed in the array. We classified as ERBB2+ tumors the subset of 9 patients that showed a copy number change \({>}{1}\) (in log scale) at the clone DMPC-HFF#1-61H8 which contains the ERBB2 gene.

2.5 Multidimensional Analysis of CGH Profiles Using Computational Algebraic Topology

We previously reported a new method to analyze CGH data called topological analysis of array CGH (TAaCGH) [3, 15]. Our method uses a sliding window algorithm that associates a point cloud to a given CGH profile (or section of a CGH profile). The dimension of the point cloud is determined by the size of the sliding window. In this study and based on our previous work [3] we considered windows of size \(n= 2\). TAaCGH assigns a \(\beta _0\) curve to each CGH profile, computes the average \(\langle \beta _0\rangle \) curve for each population of patients (test and control) and performs statistical analysis to determine differences between them (see below). Here we extended TAaCGH by incorporating a similar analysis using \(\langle \beta _1\rangle \) curves. We used the program JavaPlex to perform the calculation of \(\beta _1\) and its generators [40]. As in the case of \(\beta _0\), we generated the function \(\beta _1(\epsilon )\) for each patient. In this case \(\epsilon \) took values between 0 and the value at which \(\beta _0\)= 1.Given the \(\beta _1(\epsilon )\) for each patient, we computed the average \(\langle \beta _1\rangle \) for the ERBB2 set and the control set (consisting of the reminder of the patients) and test for statistically significant differences between the two \(\langle \beta _1\rangle \) curves.

2.6 Testing for Statistical Differences

To test for statistically significant differences between \(\langle \beta _i \rangle \) curves associated to different patient groups, we assumed the null hypothesis that \(\langle \beta _i\rangle \) curves for a sample of patients was independent of the cancer subtype. We quantified deviations from the null distribution by the statistic \(S_{exp}\), which was defined as the sum of the squares of the differences between the average \(\langle \beta _i\rangle \) curves across all radii, i.e.

$$\begin{aligned} S_{exp,i} = \sum (a_{ij} - b_{ij})^2 \quad \text {for} \quad j= 1,\ldots ,N \end{aligned}$$

where \(a_{ij}\) and \(b_{ij}\) are the \(\langle \beta _i(j)\rangle \) value for each population under study and for the value of the filtration parameter \(\epsilon =j\).

2.7 Finding Co-Occurring Aberrations

In order to determine the regions of the genome that contributed to the first homology we found the CGH probes that were mapped to each of the vertices of the generators. First, generators for each patient and value of the filtration coefficient were calculated using JavaPlex [40]. Second, the probes of the CGH profile that mapped to the vertices of the generators were identified. Third, since generators were not necessary minimal and, due to the noise of the data, some generators mapped to different areas of the genome we determined a CNA by measuring the concentration of the probes. Regions with higher concentration of probes than the control set were called CNAs.

2.8 Software for Visualization of Generators

We created an exploratory tool using Shiny app to visualize the generators in the point cloud together with their corresponding probes in the CGH profile. The app highlights the probes and generators as the values of the filtration coefficient changes. The software allows to visualize the dispersion of the probes associated with the probes through the CGH profile. An example is shown in Fig. 5. The software is available from the authors upon request.

3 Results

3.1 Computer Simulations

To better interpret our results we performed computer simulations. Since the analysis of \(\beta _0\) has been performed elsewhere [3, 15], we focused on simulations concerning the detection of CNAs using \(\beta _1\). Figure 2 shows an example of two simulated profiles, one with no aberrations as control (Fig. 2) and a second one with two co-occurring aberrations (Fig. 2B). In both Fig. 2A and B, the x-axis represents the position along the chromosome and the y-axis the \(log_2\) ratio of the copy number values. The \(\langle \beta _1\rangle \) curves (Fig. 2C) obtained from the curves above help understand the growth and disappearance of the first homology. In the case of no amplification (red), the \(\langle \beta _1\rangle \) curve starts at \(\langle \beta _1\rangle =0\), since for very small values of \(\epsilon \) there is no 1-dimensional homology. \(\langle \beta _1\rangle \) rapidly increases due to the structure of the noise until it reaches a maximum after which it decays to 0. The graph for \(\langle \beta _1\rangle \) is different when two aberrations are present (blue). For small values of the filtration parameter the graph behaves similarly to the graph without aberrations, however in this case the graph shows more than one local maximum and a lower \(log_2\) ratio of copy number values at the first maximum.

Fig. 2.
figure 2

Examples of simulated aberration profiles and \(\langle \beta _1\rangle \) curve. (A) shows a control profile with no aberrations with \(\sigma \)=0.2. (B) shows a profile with two aberrations with parameters \(\lambda =10\) and 5, \(\mu =0.6\) and 1 and \(\sigma =0.2\) for both. The blue dashed lines represent two standard deviations. The bottom graph shows in red the \(\langle \beta _1\rangle \) for the control group with no aberrations and in blue the \(\langle \beta _1\rangle \) curve for a pair of aberrations with \(\lambda =10\), \(\mu =0.6\) and 1 and \(\sigma =0.2\) (Color figure online).

We tested our method by performing a sensitivity and specificity analysis in three different simulation experiments. Each experiment consisted of 200 profiles (100 tests and 100 controls) and all possible combinations of parameters were considered. A successful identification of an aberration was scored when the obtained P-value was less than 0.05 after correcting by FDR. First we considered the case of one single amplification (test set) taking as control set a population with no aberrations. In this case sensitivity was 87.5 %. In the second experiment we used profiles with two amplifications as a test set and no amplifications as the control set. In this experiment we got average sensitivity of 95 %. In the third experiment we compared double amplifications with single (as control). Results showed 82.5 % in sensitivity. Specificity was measured by comparing two control data sets resulting in 97.5 %. Our method has bigger chances to fail when the length of the aberration is small (5 or less) and \(\mu =\sigma \).

\(\varvec{\beta _0}\) Significance of 17q

As discussed elsewhere [3, 15], \(\langle \beta _0\rangle \) curves can detect chromosome aberrations. Since we are interested in the entire amplicon in 17q, we applied TAaCGH to full chromosome arms. The chromosome arm 17q was significant in both data sets. In the Horlings data set we found significance on \(\langle \beta _0\rangle \) curves when comparing chromosome arm 1q (P-value = 0.021) and 17q (P-value = 0.004). The graph for chromosome 1q however showed that the control curve was above the test set indicating that the control set (ERBB-) had more CNAs that the test set (ERBB+). Therefore was not relevant in this study. In our validation data set, we found only 17q to be significant with a corresponding P-value after FDR correction of 0.0037. Figure 3 shows examples of \(\langle \beta _0\rangle \) curves for both chromosomes. Since \(\beta _0\) is the number of connected components of the simplicial complex, \(\langle \beta _0\rangle \) curves start at the value of the number of probes in each chromosome arm for \(\epsilon =0\) and gradually decays with increasing \(\epsilon \) until a single connected component remains. All blue curves shown in Fig. 3 represent the ERBB2+ population and all red curves represent the ERBB2- population. Results shown in Fig. 3A and B include \(\langle \beta _0\rangle \) curves associated to 17q for the Climent and Horlings data sets respectively; Fig. 3C shows \(\langle \beta _0\rangle \) curves associated to 1q and Fig. 3D \(\langle \beta _0\rangle \) curves associated to the negative control 19q. Chromosome arm 17q showed, as expected, a higher number of chromosome aberrations in the ERBB2+ patients than in the ERBB2- patients.

\(\varvec{\beta _1}\) Significance of 17q

Next, we analyzed the significance of \(\beta _1\) in chromosome arm 17q. We considered two approaches. First we tested for \(\beta _1\) significance of the entire chromosome arm 17q and then for overlapping sections of the chromosome arm. We found important to use both approaches since co-occurring CNAs may be local or spread over the entire arm. Analysis using the whole arm showed 17q to be significant in the Climent data set (with a P-value of 0.040), but not in the Horlings data set (P-value 0.172). Figure 4 shows the corresponding \(\langle \beta _1\rangle \) curves for both studies suggesting that any amplicon structure, if present, would be local.

Fig. 3.
figure 3

Examples of \(\langle \beta _0\rangle \) curves in dimension 20. Blue indicates the ERBB2+ population and red the ERBB2-. (A) Arm 17q arm in Climent; (B) Arm 17q in Horlings, (C) Arm 1q in Horlings and (D) Arm 19q in Horlings (Color figure online).

Table 1. Chromosome Sections. Correspondence between sections, cytobands and base pairs range for each of the sections used to analyze chromosome 17q.

Following our previous work [3] we subdivided chromosome arm 17q in the Horlings data set into 6 sections, which corresponded to 5 sections in the Climent data set. Each section containing 20 CGH probes with 10 overlapping probes. Results are shown in Table 1. Column 1 shows the section analyzed; columns 2 and 5 the cytogenetic band, columns 3 and 6 the location in base pairs, and columns 4 and 7 the p-values [7]. Both data sets showed some significant sections. In the Horlings data set, Sects. 2 and 3 significant after correction for multiple testing (column 4). In the Climent data set all sections except Sect. 4 were significant (column 7). Based on the reproducibility of these results we concluded that sections containing cytobands 17q12 to 17q21.33 had co-occurring CNAs and are therefore good candidates for uncovering the underlying structure of the amplicon.

Fig. 4.
figure 4

\(\langle \beta _1\rangle \) Significance of 17q in the climent and the horlings data sets. (A)The figure shows the \(\langle \beta _1\rangle \) curves for ERBB2+ (blue) and ERBB2-(red) in the Climent data set (significant). (B) Here we show the \(\langle \beta _1\rangle \) curves for both categories for the Horlings data set (non-significant) (Color figure online).

To further identify the regions within 17q12 and \(17q21.31-17q21.33\) we identified the generators of the first homology group for each patient and mapped the probes to the vertices of the corresponding generators. Before we discuss the statistical results we highlight some interesting properties of the generators: (1) probes that made up the generators may be distributed throughout the entire arm or localized in a specific region (2) unlike \(\beta _0\) generators do not necessarily detect the global maximum in the profile but different regions that contribute to several local maxima (3) neighboring maxima or even sections of the same maximum are detected at different values of the filtration parameter. Figure 5 shows the profile of a patient for 17q and the point cloud. Probes in blue are those that were mapped to the generators at two different filtration coefficient values. The corresponding 2D point cloud (with edges included) and with the vertices in each cycle highlighted in blue are also shown.

Fig. 5.
figure 5

Correspondence between CGH probes and generators. Different values of the filtration parameter detects different generators which corresponds to different probes in the genome. Panel A shows the profile of one patient and its associated point cloud. The probes highlighted in blue correspond to the vertices of the single generator, also in blue. The filtration coefficient was \(\epsilon =0.78\). Panel B shows the same patient and point cloud for a different value of the filtration coefficient \(\epsilon =0.83\)

These inherent variability of the generators and the noise of the data motivated us to use a statistical approach. As detailed in the methods sections for each patient and value of the filtration parameter we computed the cycles and the probes that defined those cycles. The frequency at which a probe was mapped to a particular region of the genome is represented by a histogram (see Fig. 6). The top graphs show the histograms for the Horlings data set and the bottom ones the histograms for the Climent data set. The histograms on the left are the control and the ones on the right correspond to the ERBB2+. The most remarkable feature is the difference between the control and the ERBB2 data sets. While the control show no significant concentration of the probes that belong to cycles the ERBB2+ clearly show three regions of interest. 17q12 has a significant concentration of cycle elements and corresponds to the position of the gene ERBB2. Two regions extend beyond the position of ERBB2 The first one is in the boundary between 17q21.2 and 17q21.31. The Horlings data set suggests that the region of interest is more localized in 17q21.31 while the Climent data set suggest a region contained in 17q21.2. The last region is located at 17q21.33 and is common to both studies.

Fig. 6.
figure 6

Comparison of ERBB2- (left) and ERBB2+ (right) patients at the generator level. The top histograms correspond to the Horlings data set and the bottom to the Climent data set. Each bar in the histogram represents a probe. Its height represents the cumulative presence of that probe on the generators of the first homology group divided by the number of patients. The cumulative presence is calculated by counting the number of cycles in which the probe is part of the generator for each value of the filtration parameter (multiplied by the number of generators if they were more than one).

Since our simulations show that the first homology group can also identify single amplifications one may argue that the found amplifications correspond to single independent events. To address this problem we analyzed the distribution of the cycles-forming-probes. Figure 7 show some examples of the distribution of cycles in the genome for specific patients. Each plate corresponds to one patient, the x-axis is the position along the genome and the y-axis the “life” of the cycle. Each color represents a different cycle. If the amplifications were independent events one would expect to see single colors concentrated at specific regions. However we see cycles dispersed over the entire profile indicating the presence of co-occurring CNAs.

Fig. 7.
figure 7

Distribution of cycles in CGH profiles. Each plate corresponds to the CGH profile of a patient and how the vertices of the cycles are mapped back to the profile. Different colors indicate different cycles and do not represent the same cycle in each plate. The height of the bars represent the life of the cycle (Color figure online).

4 Discussion

Copy number measurements provide an unparalleled opportunity to identify the underlying mechanisms of cancer. Previous efforts in analyzing copy number data have mainly focused on the identification of single, independent chromosome copy number aberrations. These approaches however are known to be deficient in the identification of co-occurring copy number changes since there is a large number of combinations of probes that one needs to interrogate. In this study, we have presented a methodology that helps circumvent the search for simultaneously occurring CNAs by encoding copy number data as topological objects. In particular we have used the rank of the first homology group to perform this association. To test this hypothesis, we searched for co-occurring aberrations in ERBB2+ breast cancer patients. Our results show \(\beta _1\) significance in chromosome cytobands that extend from 17q12 to 17q21.33. By identifying the probes that form the generators and measuring their concentration along the CGH profiles we were able to further narrow this significant region to three amplifications. The first is 17q12 which contains the ERBB2 gene. The second and the third have also been reported in ERBB2+ patients. The second amplification is in the boundary between 17q21.2 and 17q21.31 and according to our estimation is delimited by the Top2A and BRCA1 genes (base pairs \(40,884763-41,826,877\)). This region encompasses the type I keratin gene cluster. Finally we identified 17q21.33 (base pairs \(47,400,368-49,075570\)) a large region that contain multiple tumor associated genes including the HOXB cluster [42], Prohibitin [44] and amplification of this region has been associated with poor prognosis [41]. Unfortunately at this point, due to the small sample size, we cannot determine how common these co-occurring CNAs are in the general population of ERBB2+ patients or whether they form subtypes within the ERBB2+ subtype. Nevertheless the fact that these regions are significant in two independent data sets is encouraging. It is therefore our immediate plan to scale up this study on larger data sets.

Our work presents also new tools for the topological analysis of time series. We and others [34] independently introduced the concept of using the sliding window algorithm to analyze time series. In our previous work we noted that: (1) the overall shape of the point cloud already provides information of the data [2, 3, 15], (2) The point cloud can be seen as the reconstruction set of the dynamical system induced by the sliding window algorithm [2], (3) the zero homology group identifies large step increments between consecutive measurements [15]. Our contributions in this work is the development of algorithms that (1) detect the single and co-occuring maxima in the data in non-necessarily periodic signals using the first homology group (2) Identify local maxima by computing the concentration of the pre-images (by the sliding window algorithm) of the vertices that form the cycles. It is our belief that the use of topological methods for the analysis of signals using simple construction techniques, such as the commonly used sliding window algorithm, can provide new insights in the analysis of time series.