1 Introduction

Time delays between images of strongly lensed distant variable sources can serve as a valuable tool for cosmography, provided that the delays between the image fluxes can be accurately measured (e.g. [8, 16]). A number of methods have been developed to this end, including the dispersion spectra method [3] and the kernel-based method with variable width (K-V) [4, 5]. Actively studied strongly lensed quasars with time-delay measurements include RXJ1131−1231 [21], B1608+656 [6, 8] and Q0957+561 (e.g. [9]). The available data usually take the form of daily flux measurements, from which longer delays (days to months) can be estimated, and current methods in astrophysics are rooted solely in this scenario. However, for shorter delays (e.g. hours), daily measurements are insufficient and one needs to work with the individual arrival times of photons.

A Poisson process can be applied as a model for photon streams [15]. To resolve the delay in gravitationally lensed photon streams, one can use standard kernel based estimation of the inhomogeneous Poisson process rate function on each photon stream individually and then time-shift the rate function estimates so that their overlap is maximized. Another, more principled alternative is to impose that the source of the delayed photon streams is the same, so that we simply observe different realizations of a single inhomogeneous Poisson process, gravitationally delayed in time. We study whether, compared with the standard kernel based baseline, such a principled approach can bring benefits in terms of more stable (lower-variance) estimation.

Normally, delay estimation would be performed on streams of photons from a given energy band and the results then unified across a multitude of energy bands. Here, the baseline and principled delay estimation methods are compared in a controlled experimental setting using synthetic photon fluxes with a known imposed delay, generated from a variety of inhomogeneous processes assumed to come from a single energy band. To the best of our knowledge this is the first systematic study that addresses the problem of delay estimation on lensed photon streams. We do not perform experiments on real data, since no large real photon streams from known delayed systems with short time delays are available. Nevertheless, this study serves as a proof of concept and can be readily used once appropriate lensed photon streams become available.

2 Kernel Based Delay Estimation in Lensed Photon Streams

For the sake of simplicity we will deal with the case of two lensed photon streams A and B from the same source. All techniques presented in this paper can be easily generalized to multiple streams. We assume that the observed photon streams can be accounted for by a Poisson process (e.g. [18]). In the non-homogeneous Poisson process (NHPP) the mean rate function \(\lambda (s)\) varies over time s. Given a series of arrival times \(s_1, s_2, ..., s_S\) over an interval [0, T], the rate function is commonly estimated by imposing a (Gaussian) kernel of width r on top of each arrival time \(s_i\),

$$\begin{aligned} K_g(s; s_i,r) = \exp \left\{ - \frac{(s - s_i)^2}{2r^2} \right\} . \end{aligned}$$
(1)

The rate function estimate (up to scaling) is then [12–14]

$$\begin{aligned} \hat{\lambda }(s)=\sum ^S_{i=1} {K_g(s; s_i,r)}. \end{aligned}$$
(2)

We will refer to this method as Kernel Rate Estimation (KRE).
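As an illustration, a minimal NumPy sketch of KRE follows; the function names and the vectorized kernel evaluation are our own, not code from the paper or any existing package.

```python
import numpy as np

def gaussian_kernel(s, centers, r):
    """Eq. (1): Gaussian kernels of width r centered at `centers`,
    evaluated at times s. Returns a (len(s), len(centers)) matrix."""
    return np.exp(-(np.asarray(s)[:, None] - np.asarray(centers)[None, :]) ** 2
                  / (2.0 * r ** 2))

def kre_rate(s, arrivals, r):
    """Eq. (2): unnormalized kernel rate estimate -- one kernel per arrival time."""
    return gaussian_kernel(s, arrivals, r).sum(axis=1)
```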

Suppose that we observe two lensed photon streams \(\{s^A_i\}_{i=1}^{S^A}\) and \(\{s^B_i\}_{i=1}^{S^B}\) from the same source. On each stream we produce a kernel based estimate of the rate function, \(\hat{\lambda }^A(s)\) and \(\hat{\lambda }^B(s)\). Given a suggested time delay \(\varDelta \), the closeness of the rate estimates (under the delay \(\varDelta \)) can be quantified e.g. through the mean square difference evaluated on a regular grid of time stamps \(\{z_j\}_{j=1}^Z \) in the relevant time interval,

$$\begin{aligned} d_2(\hat{\lambda }^A,\hat{\lambda }^B; \varDelta ) = \frac{1}{Z} \sum ^{Z}_{j=1} {(\hat{\lambda }^A(z_j) -\hat{\lambda }^B(z_j+\varDelta ))}^2. \end{aligned}$$
(3)

The delay is then estimated by minimizing \(d_2(\hat{\lambda }^A,\hat{\lambda }^B; \varDelta )\) with respect to \(\varDelta \) (e.g. via gradient descent). In the following section we introduce a delay estimation variant based on the innovation process corresponding to the underlying Poisson process.
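A sketch of the resulting baseline estimator, reusing the helpers above; we substitute an exhaustive search over caller-supplied candidate delays for gradient descent, and the grid settings are illustrative only.

```python
def kre_delay(arr_A, arr_B, r, T, trial_delays, n_grid=400):
    """Estimate the delay by minimizing d2 of Eq. (3) over candidate delays."""
    z = np.linspace(0.0, T, n_grid)                  # time stamps {z_j}
    lam_A = kre_rate(z, arr_A, r)
    costs = []
    for delta in trial_delays:
        lam_B = kre_rate(z + delta, arr_B, r)        # stream B under delay Delta
        costs.append(np.mean((lam_A - lam_B) ** 2))  # d2 of Eq. (3)
    return trial_delays[int(np.argmin(costs))]
```

The trial grid (e.g. values from the interval [10, 40] used in Sect. 5) is supplied by the caller.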

3 Innovation Process Based Estimation (IPE)

Recall that if event counts can be modeled by a Poisson distribution with mean rate \(\lambda \), then the inter-arrival times are exponentially distributed with mean \(\lambda ^{-1}\). We denote the differences between consecutive arrival times by \(d^A=\{d^A_i\}_{i=1}^{D^A}\) and \(d^B=\{d^B_i\}_{i=1}^{D^B}\), where \(d^A_i = s^A_{i+1}-s^A_i \) and \(d^B_i = s^B_{i+1}-s^B_i\), respectively. Our goal is to find a probabilistic model that maximizes the likelihood \(P(d ^A,d ^B |\lambda ^A(s),\lambda ^B(s))\),

$$\begin{aligned} P(d^A,d^B |\lambda ^A(s),\lambda ^B(s)) = \prod _{i=1}^{D^A}P(d ^A_i |\lambda ^A(s_i;\varvec{w})) \prod _{i=1}^{D^B}P(d^B_i |\lambda ^B(s_i;\varvec{w}, \varDelta )), \end{aligned}$$
(4)

where \(P(d|\lambda ) = \lambda e^{-\lambda d}\). We impose a kernel based model on the common rate function underlying both streams (expressed for stream A):

$$\begin{aligned} \lambda ^A(s)= \sum \limits ^{J}_{j=1} w_j K_g(s; c_j,r_o) = \varvec{w}^\intercal K_g(s; {\varvec{c}}, r_o), \end{aligned}$$
(5)

with kernels of width \(r_o\) centered at \(c_j\), \(j=1,2,\dots ,J\), and the J free parameters \(w_j\) collected in the vector \(\varvec{w}\). \(K_g(s; {\varvec{c}}, r_o)\) is the vector of kernel evaluations \(K_g(s; c_j,r_o)\) at all centers \({\varvec{c}} = (c_1, c_2, ..., c_J)\). The rate function of stream B is a time-delayed (by \(\varDelta \)) version of that of stream A:

$$\begin{aligned} \lambda ^B(s)= \sum \limits ^{J}_{j=1} w_j K_g(s; c_j-\varDelta ,r_o) = \varvec{w}^\intercal K_g(s; {\varvec{c}-\varDelta }, r_o). \end{aligned}$$
(6)

We thus obtain

$$\begin{aligned} P(d^A,d^B |\lambda ^A(s),\lambda ^B(s)) =\prod _{i=1}^{D^A} \lambda ^A(s_i)e^{-\lambda ^A(s_i) d^A_i} \prod _{i=1}^{D^B}\lambda ^B(s_i)e^{-\lambda ^B(s_i) d^B_i}, \end{aligned}$$
(7)

leading to the error functional (negative log likelihood),

$$\begin{aligned} E=-\sum \limits ^{D^A}_{i=1}(\log \lambda ^A(s_i)-\lambda ^A(s_i)d^A_i)-\sum \limits ^{D^B}_{i=1}(\log \lambda ^B(s_i)-\lambda ^B(s_i)d^B_i). \end{aligned}$$
(8)

We minimize E w.r.t. the two parameters \((\varvec{w}, \varDelta )\) via gradient descent. To that end we plug (5) and (6) into (8),

$$\begin{aligned} \begin{aligned} E&=-\sum \limits ^{D^A}_{i=1}\bigg (\log \sum \limits ^{J}_{j=1} w_jK_g(s^A_i;c_j,r_o)-d^A_i\sum \limits ^{J}_{j=1} w_jK_g(s^A_i;c_j,r_o)\bigg )\\&-\sum \limits ^{D^B}_{i=1}\bigg (\log \sum \limits ^{J}_{j=1} w_jK_g(s^B_i;c_j-\varDelta ,r_o)-d^B_i\sum \limits ^{J}_{j=1} w_jK_g(s^B_i;c_j-\varDelta ,r_o)\bigg ), \end{aligned} \end{aligned}$$
(9)

leading to,

$$\begin{aligned} \frac{\partial E}{\partial \varvec{w}} =&-\sum \limits ^{D^A}_{i=1}\bigg (\frac{ K_g(s^A_i;\varvec{c},r_o)}{\varvec{w}^\intercal K_g(s^A_i; \varvec{c}, r_o)}-d^A_i K_g(s^A_i;\varvec{c},r_o)\bigg ) \nonumber \\ &-\sum \limits ^{D^B}_{i=1}\bigg (\frac{ K_g(s^B_i;\varvec{c} - \varDelta \cdot \varvec{1} ,r_o)}{\varvec{w}^\intercal K_g(s^B_i; \varvec{c}- \varDelta \cdot \varvec{1},r_o)}-d^B_i K_g(s^B_i;\varvec{c}- \varDelta \cdot \varvec{1} ,r_o)\bigg ), \end{aligned}$$
(10)

where \(\varvec{1}\) is a vector of 1’s and

$$\begin{aligned} \begin{aligned} \frac{\partial E}{\partial \varDelta }&=-\sum \limits ^{D^B}_{i=1} \Bigg (\frac{1}{\sum \limits ^{J}_{j=1} w_j \exp \{\frac{-(s^B_i-(c_j-\varDelta ))^2}{2r_o^2}\}}\sum \limits ^{J}_{j=1} w_j \exp \{\frac{-(s^B_i-(c_j-\varDelta ))^2}{2r_o^2}\}\frac{-2(s^B_i-(c_j-\varDelta ))}{2r_o^2} \\&-d^B_i\sum \limits ^{J}_{j=1} w_j \exp \{\frac{-(s^B_i-(c_j-\varDelta ))^2}{2r_o^2}\}\frac{-2(s^B_i-(c_j-\varDelta ))}{2r_o^2}\Bigg ). \end{aligned} \end{aligned}$$
(11)
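For concreteness, a sketch of the error functional (9) and the gradients (10) and (11), reusing the `gaussian_kernel` helper from the sketch in Sect. 2; this is our illustration of the quantities entering the gradient descent, not the authors' code.

```python
def ipe_neg_log_lik(w, delta, arr_A, d_A, arr_B, d_B, centers, r_o):
    """Eq. (9): negative log-likelihood of the inter-arrival times.
    arr_X[:-1] are the times s_i at which the rate is evaluated."""
    lam_A = gaussian_kernel(arr_A[:-1], centers, r_o) @ w
    lam_B = gaussian_kernel(arr_B[:-1], centers - delta, r_o) @ w
    return -(np.sum(np.log(lam_A) - lam_A * d_A)
             + np.sum(np.log(lam_B) - lam_B * d_B))

def ipe_gradients(w, delta, arr_A, d_A, arr_B, d_B, centers, r_o):
    """Eqs. (10) and (11): analytic gradients of E w.r.t. w and Delta."""
    K_A = gaussian_kernel(arr_A[:-1], centers, r_o)          # (D_A, J)
    K_B = gaussian_kernel(arr_B[:-1], centers - delta, r_o)  # (D_B, J)
    lam_A, lam_B = K_A @ w, K_B @ w
    # Eq. (10): gradient w.r.t. the weight vector
    g_w = -((K_A / lam_A[:, None] - d_A[:, None] * K_A).sum(axis=0)
            + (K_B / lam_B[:, None] - d_B[:, None] * K_B).sum(axis=0))
    # Eq. (11): gradient w.r.t. the delay; u = s_i - (c_j - Delta)
    u = arr_B[:-1, None] - (centers[None, :] - delta)
    dlam = (K_B * (-u / r_o ** 2)) @ w    # d(lambda^B)/d(Delta) at each s_i
    g_delta = -np.sum(dlam / lam_B - d_B * dlam)
    return g_w, g_delta
```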

4 Parameter Initialization

The Gaussian kernels have two parameters that need to be determined: the kernel centers \(\{c_j\}^{J}_{j=1}\) and the kernel width r. As explained above, in KRE the kernels are centered at each photon's arrival time, whereas in IPE the centers \(c_j\) are distributed uniformly across the time period [0, T].

The kernel width determines the degree of smoothing of the underlying rate function. For KRE, we select the width by minimizing the mean integrated square error (MISE), following [19]. For IPE, the kernel width \(r_o\) is optimized using the cross-validation method proposed in [4, 10]. The algorithm partitions the data into 10 blocks of equal length \(\mathcal {L}\). The i-th validation set \(\varvec{\mathcal {V}}_i\), \(i = 1,2,\dots ,\mathcal {L}\), is obtained by collecting the i-th element of each block; the rest of the data forms the training set. We then fit our models on the training set and use the validation set \(\varvec{\mathcal {V}}_i\) to calculate the cost function E over a range of candidate width values \(r_o\in (L_{r_o},U_{r_o})\). This procedure is repeated for each of the \(\mathcal {L}\) validation sets, and the chosen \(r_o\) is the one yielding the smallest average cost E across the folds.
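The fold construction can be sketched as follows (a minimal illustration of the interleaved scheme described above; the model fitting and the evaluation of E for each candidate \(r_o\) are omitted):

```python
def interleaved_folds(n_points, n_blocks=10):
    """Split indices 0..n_points-1 into `n_blocks` contiguous blocks of equal
    length L, and build validation set V_i from the i-th element of every
    block. Returns L pairs (train_indices, validation_indices)."""
    L = n_points // n_blocks               # block length (remainder dropped)
    idx = np.arange(n_blocks * L).reshape(n_blocks, L)
    folds = []
    for i in range(L):
        val = idx[:, i]                    # i-th element of each block
        train = np.setdiff1d(idx.ravel(), val)
        folds.append((train, val))
    return folds
```

For each candidate width, the model is fitted on the training indices and E is evaluated on the validation indices; the average over the \(\mathcal {L}\) folds scores the candidate.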

The IPE weight vector \(\varvec{w}\) is initialized using the rate function estimates readily provided by the KRE model. However, the rate functions obtained by KRE on streams A and B need to be scaled so as to represent the underlying rate of the non-homogeneous Poisson process. Given the KRE-estimated rate functions on streams A and B, \(\hat{\lambda }^A(s)\) and \(\hat{\lambda }^B(s)\), respectively, the overall KRE rate function is their average

$$\begin{aligned} \hat{\lambda }(s)=\frac{\hat{\lambda }^A(s)+\hat{\lambda }^B(s)}{2}. \end{aligned}$$
(12)

The scaling factor \(\vartheta \) is obtained by imposing the rate function \(\lambda (s)=\vartheta \hat{\lambda }(s)\) and minimizing (8) with respect to \(\vartheta \). Denoting \(\hat{\lambda }(s^A_i)d^A_i\) and \(\hat{\lambda }(s^B_i) d^B_i\) by \(q^A_i\) and \(q^B_i\), respectively, the minimum is obtained at

$$\begin{aligned} \vartheta =\frac{D^A+D^B}{ \sum \limits ^{D^A}_{i=1}q^A_i+\sum \limits ^{D^B}_{i=1}q^B_i}. \end{aligned}$$
(13)
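To see this, substitute \(\lambda (s)=\vartheta \hat{\lambda }(s)\) into (8); since \(\log (\vartheta \hat{\lambda }(s_i)) = \log \vartheta + \log \hat{\lambda }(s_i)\), setting the derivative with respect to \(\vartheta \) to zero gives

$$\begin{aligned} \frac{\partial E}{\partial \vartheta } = -\sum \limits ^{D^A}_{i=1}\Big (\frac{1}{\vartheta } - q^A_i\Big ) -\sum \limits ^{D^B}_{i=1}\Big (\frac{1}{\vartheta } - q^B_i\Big ) = 0 \quad \Longrightarrow \quad \frac{D^A+D^B}{\vartheta } = \sum \limits ^{D^A}_{i=1}q^A_i+\sum \limits ^{D^B}_{i=1}q^B_i, \end{aligned}$$

which rearranges to (13).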

The IPE weights can then be set to match the rate function \(\lambda (s)\) by imposing a regular grid \((s_1, s_2, ..., s_N)\) on [0, T], evaluating the scaled rate values on the grid, \(\varvec{{x}} = (\lambda (s_1),\lambda (s_2),\ldots ,\lambda (s_N) )^\intercal \) with \(\lambda (s)=\vartheta \hat{\lambda }(s)\), and solving

$$\begin{aligned} \varvec{w} =\varvec{{K}^{\intercal +}}\varvec{{x}}, \end{aligned}$$
(14)

where \(\mathbf {K}\) is the \(J\times N\) matrix

$$\begin{aligned} \mathbf {K}=\left[ K_g(s_1; \varvec{c},r_o), K_g(s_2; \varvec{c},r_o),\cdots K_g(s_{N}; \varvec{c},r_o)\right] . \end{aligned}$$
(15)
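In code, using the scaled KRE rate evaluated on the grid (a sketch; `np.linalg.pinv` computes the Moore–Penrose pseudoinverse appearing in (14)):

```python
def init_weights(grid_times, rate_values, centers, r_o):
    """Eq. (14): w = pinv(K^T) x, where row n of K^T is K_g(s_n; c, r_o)."""
    K_T = gaussian_kernel(grid_times, centers, r_o)   # (N, J) matrix K^T
    return np.linalg.pinv(K_T) @ rate_values          # least-squares weights w
```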

5 Experiments

To test and compare the methodologies suggested above, we performed controlled experiments on synthetic data generated from non-homogeneous Poisson processes. From each non-homogeneous Poisson process we generated two series of arrival times, A and B; series B was then time-shifted by a known delay \(\varDelta \).

The rate functions defining non-homogeneous Poisson processes were obtained by superimposing G Gaussian functions of fixed width \(r_g\) positioned on a regular grid \(\{ c_g \}_{g=1}^G\) in [0, T],

$$\begin{aligned} \lambda (s) = \sum ^G_{g=1} w_g \cdot K_g(s; c_g, r_g), \end{aligned}$$
(16)

where \(w_g\in \mathbb {R}\) are mixing weights generated randomly from a uniform distribution on \( [L_w,U_w]\). The kernel widths were set to a multiple of the kernel separation (the distance between two consecutive kernel centers) \(d_g\), \(r_g = \alpha _g \cdot d_g\). We used \(T=400\), \(G=80\), \(\alpha _g=3\), \(L_w =-1\) and \(U_w=1\). The synthetic rate functions were then rescaled to the interval [0, 2]. Given a rate function \(\lambda (s)\), the arrival times were generated using the thinning technique [2, 7, 11, 17, 20]. An example of a test rate function and the corresponding photon stream is shown in Fig. 1.
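A minimal sketch of the thinning (Lewis–Shedler) sampler we have in mind; `lam_max` must upper-bound the rate function on [0, T]:

```python
rng = np.random.default_rng(seed=0)

def sample_nhpp_thinning(rate_fn, T, lam_max):
    """Draw candidate events from a homogeneous Poisson process of rate
    lam_max, then keep a candidate at time t with probability
    rate_fn(t) / lam_max."""
    arrivals, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam_max)   # next candidate arrival
        if t > T:
            break
        if rng.random() < rate_fn(t) / lam_max:
            arrivals.append(t)                # accept the candidate
    return np.array(arrivals)
```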

Fig. 1. An example of a test rate function and the corresponding photon stream.

Table 1. Statistical analysis of delay estimates. The results for each method and each imposed delay \(\varDelta \in \{20, 22, 25, 28\}\) are averaged over 100 test rate functions. The time delay trial values were taken from the interval [10, 40] with increment of 10.

Using this process, we generated two photon streams from the same rate function: \(\{s_{i}^{A}\}\) and \(\{s_{j}^{B}\}\), \(i=1,2,\cdots ,S^{A}\) and \(j=1,2,\cdots ,S^{B}\). To create a pair of time-shifted streams, \(s^{B}\) was shifted in time by a delay \(\varDelta >0\):

$$\begin{aligned} s_{i}^{B}\leftarrow s_{i}^{B}+\varDelta ,\quad \forall i=1,2,\cdots ,S^{B}. \end{aligned}$$
(17)

To prepare the streams for the experiments, we cut the two streams so that they share the same start and end points in time.
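Putting the generation steps together (a sketch; the trimming convention of keeping only the overlap \([\varDelta , T]\) of the two observation windows is our assumption, as the paper does not spell out the exact cut):

```python
def make_lensed_pair(rate_fn, T, lam_max, delta):
    """Two realizations of the same rate function; B delayed by `delta`
    (Eq. (17)), both streams cut to a common observation window."""
    s_A = sample_nhpp_thinning(rate_fn, T, lam_max)
    s_B = sample_nhpp_thinning(rate_fn, T, lam_max) + delta   # Eq. (17)
    lo, hi = delta, T             # overlap of [0, T] and [delta, T + delta]
    return s_A[(s_A >= lo) & (s_A <= hi)], s_B[(s_B >= lo) & (s_B <= hi)]
```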

We performed controlled experiments in which 100 test rate functions were generated as described above. For each test rate function we imposed four delay values \(\varDelta \in \{20, 22, 25, 28\}\), resulting in 400 individual experiments. The time delay trial values were taken from the interval [10, 40] with increment of 10. For each model and each imposed delay \(\varDelta \in \{20, 22, 25, 28\}\), we report the mean \(\mu \) and standard deviation \(\sigma \) of the maximum-likelihood delay estimates across the set of 100 test rate functions. We also report the mean absolute error (MAE) of the delay estimates and the 95% credible interval (CI). Summary results are presented in Table 1.

6 Conclusion

In this paper, we tested whether a more principled treatment of delay estimation in lensed photon streams, compared with the standard kernel estimation method, can yield more accurate (less biased) and/or more stable (lower-variance) estimates. In particular, we formulated a baseline method (KRE) based on kernel estimation of the rate function of an inhomogeneous Poisson process; the delay estimate is refined by gradient descent on the error functional with respect to the delay parameter.

The more principled delay estimation relied on imposing a single latent inhomogeneous Poisson process underlying the lensed photon streams. The rate function model was formulated as a linear combination of nonlinear basis functions, making the nonlinear model linear in the mixing parameters. We tested this idea in the Innovation Process Based Estimation (IPE) method.

Somewhat surprisingly, the overall emerging picture is that the theoretically more principled method does not bring much practical benefit in terms of the bias/variance of the delay estimation. This is in contrast to our previous findings on daily flux data [1, 4, 5]. It appears that, because the underlying latent rate function is represented only implicitly through the streams of arrival times, the stabilizing effect of the single unified intensity function that proved so useful in the case of daily flux data [1, 4, 5] is weakened. Indeed, in that case, knowing the amount of observational noise and observing noisy flux levels gave much better clues as to what the common source variability could be, thus stabilizing the delay estimation. Nevertheless, we propose that a study of this kind is useful and necessary for future developments of alternative methods for delay estimation in lensed photon streams.