1 Introduction

The Viterbi algorithm [36, 37], proposed by Andrew J. Viterbi in 1967, is a dynamic programming algorithm that finds the most probable sequence of hidden states, called the “Viterbi path”, from a given sequence of observed events in the context of a hidden Markov model (HMM).

Motivation. The Viterbi algorithm has numerous real-world applications. Although it was originally used for speech recognition and in CDMA technology, in the last 25 years it has been heavily used in computational biology and bioinformatics: for finding coding and non-coding regions of an unlabeled string of DNA nucleotides (i.e., gene finding) [3], for predicting protein-coding regions in genome sequences, for modeling families of related DNA or protein sequences, for predicting secondary structure elements in proteins [24], and for detecting CpG islands [17], promoters [29], and conserved elements [30]. Apart from computational biology, the Viterbi algorithm is used in TDMA systems for GSM [15], television sets [28], satellite and space communication [21], magnetic recording systems [23], parsing of context-free grammars [22], and part-of-speech tagging [16]. Therefore, improving the performance of the Viterbi algorithm is likely to have an impact in these areas as well.

When the input data becomes too large to fit into a cache, between two algorithms that perform the same set of CPU operations, the one that is more cache-efficient, i.e., causes fewer block transfers (or I/O operations) between adjacent levels of the cache hierarchy, is likely to run faster. Though there has been much effort and success in parallelizing the Viterbi algorithm, there is little work on designing cache-efficient Viterbi algorithms that are also cache-oblivious [20], i.e., independent of cache parameters such as cache sizes and block sizes. Similarly, a processor-oblivious [12] algorithm does not use the number of processors in its algorithmic description. A cache- and processor-oblivious algorithm is more likely to be portable across machines. To the best of our knowledge, we present the first provably cache-efficient cache-oblivious parallel Viterbi algorithm.

We use the dynamic multithreading model [14] and the ideal-cache model [20] to measure parallelism and serial cache complexity, respectively.

Related Work. Several efficient cache- and processor-oblivious recursive divide-and-conquer algorithms for solving dynamic programs (DPs) have been developed [2, 4, 7–11, 13, 31, 33, 34]. However, the approach used in those papers assumes that the set and sequence of DP cell updates to be performed do not depend on the data values in the DP table, which is not true for the Viterbi DP.

One can use auto-parallelizers to parallelize sequential Viterbi programs. Fisher and Ghuloum [19] present a method in which loop-body instances are represented in closed form using function composition; reduction is then applied for parallelization. Chin et al. [5, 6] use second-order generalization and induction derivation to generate divide-and-conquer parallel programs. None of these methods exploits parallelism across stages, and the generated parallel programs are not cache-efficient.

The parallel Viterbi algorithm [18] used for homology search in HMMER uses SSE2 instructions and reduces L1 cache misses. Though the phrase “cache-oblivious” appears in the title of the paper, the presented algorithm is not oblivious to the cache parameters, as it uses loop tiling with the tile size determined by the size of the L1 data cache. Also, the algorithm works only for three states, and it is not clear how the method behaves for an arbitrarily large number of states as in the case of a general Viterbi algorithm.

The EasyPDP system [32] parallelizes the Viterbi algorithm and also reduces cache misses. However, it requires the user to specify loop tile sizes, making it cache-aware. Moreover, the reduction in cache misses is not significant.

The Viterbi algorithm is inherently sequential across stages, which constrains parallelism along the time dimension. The parallel Viterbi algorithm presented in [26, 27], based on rank convergence, is the first to exploit parallelism across stages. However, that algorithm is processor-aware and not cache-efficient.

Our Contributions. Our major contributions are: (1) an efficient cache-oblivious parallel multi-instance Viterbi algorithm (Sect. 3), (2) an efficient cache-oblivious parallel single-instance Viterbi algorithm (Sect. 5) based on our multi-instance algorithm (Sect. 3) and Maleki et al.’s rank convergence algorithm (Sect. 4), and (3) experimental results (Sect. 6) comparing our algorithms with Maleki et al.’s algorithms on modern multicore platforms.

2 Cache-inefficient Viterbi Algorithm

In this section, we formally describe the Viterbi dynamic program (DP), and describe a simple cache-inefficient Viterbi algorithm based on divide-and-conquer.

Fig. 1. Iterative and recursive Viterbi algorithms.

Formal Specification. The Viterbi DP is described as follows. We are given an observation space \(O = \{ o_1, o_2, \ldots , o_m \}\), a state space \(S = \{ s_1, s_2, \ldots , s_n \}\), a sequence of observations \(Y = \{ y_1, y_2, \ldots , y_t \}\), a transition matrix A of size \(n \times n\), where A[i, j] is the probability of transitioning from \(s_i\) to \(s_j\), an emission matrix B of size \(n \times m\), where B[i, j] is the probability of observing \(o_j\) at \(s_i\), and an initial probability vector (or initial solution vector) I, where I[i] is the probability that \(x_1 = s_i\). Let \(X = \{x_1, x_2, \ldots , x_t\}\) be a sequence of hidden states that generates \(Y = \{ y_1, y_2, \ldots , y_t \}\). Then the matrices P and \(P^{\prime }\) of size \(n \times t\), where P[i, j] is the probability of the most likely path of getting to state \(s_i\) at observation \(o_j\), and \(P^{\prime }[i, j]\) stores the hidden state of the most likely path (i.e., the Viterbi path), are computed as follows. \(P[i, j] = I[i] \cdot B[i, y_1]\) and \(P^{\prime }[i, j] = 0\) when \(j = 1\). Otherwise (i.e., when \(j > 1\)):

$$P[i, j] = {\mathrm {max}}_{{k \in [1, n]}} (P[k, j - 1] \cdot A[k, i] \cdot B[i, y_j]),$$
$${\mathrm {and}}~ P^{\prime }[i, j] = {\mathrm {arg}}\,{\mathrm {max}}_{k \in [1, n]} (P[k, j-1] \cdot A[k, i] \cdot B[i, y_j]).$$
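The recurrence translates directly into code; the following minimal serial C++ sketch (ours, for illustration) makes the data dependencies explicit. It uses raw probabilities exactly as in the recurrence; a production implementation would use log-probabilities, as done in Sect. 4, to avoid numerical underflow.

```cpp
#include <vector>

// Minimal serial Viterbi sketch following the recurrence above (0-based).
// A[k][i]: probability of transitioning from s_k to s_i        (n x n)
// B[i][o]: probability of observing o at state s_i             (n x m)
// I[i]   : initial probability of state s_i
// y[j]   : observation index at time step j
// Fills P (path probabilities) and Pp (argmax predecessors), both n x t.
void viterbi(const std::vector<std::vector<double>>& A,
             const std::vector<std::vector<double>>& B,
             const std::vector<double>& I, const std::vector<int>& y,
             std::vector<std::vector<double>>& P,
             std::vector<std::vector<int>>& Pp) {
    const int n = A.size(), t = y.size();
    P.assign(n, std::vector<double>(t, 0.0));
    Pp.assign(n, std::vector<int>(t, 0));
    for (int i = 0; i < n; ++i) P[i][0] = I[i] * B[i][y[0]];  // base case
    for (int j = 1; j < t; ++j)        // stages are inherently sequential
        for (int i = 0; i < n; ++i) {  // cells within a stage are independent
            double best = -1.0; int arg = 0;
            for (int k = 0; k < n; ++k) {
                double v = P[k][j - 1] * A[k][i] * B[i][y[j]];
                if (v > best) { best = v; arg = k; }
            }
            P[i][j] = best; Pp[i][j] = arg;
        }
}
```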

Cache-inefficient Algorithm. An iterative parallel Viterbi algorithm and a recursive divide-and-conquer-based parallel Viterbi algorithm are given in Fig. 1. As per the Viterbi recurrence, each cell (i, j) of matrix P depends on all cells of P in column \(j-1\), all cells of A in column i, and the cell \((i, y_j)\) of B. The function \(\mathscr {A}_{vit}\) fills the jth column of P (denoted by X) using the \((j-1)\)th column (denoted by U) in a divide-and-conquer fashion. To compute each column of P, the entire matrix A has to be read; hence the recursive algorithm is cache-inefficient. In both algorithms the stages are computed sequentially, but all cells within each stage (or time step) are computed in parallel.

Complexity Analysis. The serial cache complexity of the iterative algorithm is computed as \(\sum _{j=1}^{t} \sum _{i=1}^{n} {\mathcal O}\left( {n/B}\right) = {\mathcal O}\left( {n^2 t / B}\right) \), and that of the divide-and-conquer algorithm is computed as follows. Let \(Q_{\mathscr {A}}(n)\) denote the serial cache complexity of \(\mathscr {A}_{vit}\) on a matrix of size \(n \times n\). Then \(Q_{\mathscr {A}}(n) = {\mathcal O}\left( {{n^2 / B} + n}\right) \) if \(n^2 \le {\gamma _{\mathscr {A}}} M\), and \(4Q_{\mathscr {A}}\left( {n / 2}\right) + {\mathcal O}\left( {1}\right) \), otherwise; where, \({\gamma _{\mathscr {A}}}\) is a suitable constant. Solving, \(Q_{\mathscr {A}}(n) = {\mathcal O}\left( {n^2 / B + n}\right) \). Thus, the serial cache complexity of the recursive algorithm is \({\mathcal O}\left( {n^2 t / B + nt}\right) \) when \(n^2\) is too large to fit in cache.

Both the iterative and recursive algorithms have spatial locality, but they do not have any temporal locality. Hence, these algorithms are not cache-efficient.

The span (i.e., runtime on a machine with an unbounded number of processors) of the iterative algorithm is \({\varTheta }\left( {nt}\right) \), as there are t time steps and it takes n time steps to fully update a cell of P. The span of the recursive algorithm is computed as follows. Let \(T_{\mathscr {A}}(n)\) denote the span of \(\mathscr {A}_{vit}\) on a matrix of size \(n \times n\). Then \(T_{\mathscr {A}}(n) = {\varTheta }\left( {1}\right) \) if \(n = 1\), and \(2 T_{\mathscr {A}}\left( {n / 2}\right) + {\varTheta }\left( {1}\right) \), otherwise. Solving, \(T_{\mathscr {A}}(n) = {\varTheta }\left( {n}\right) \), which implies that the span of the recursive algorithm is \({\varTheta }\left( {nt}\right) \).

3 Cache-efficient Multi-instance Viterbi

In this section, we present a novel cache-efficient cache-oblivious Viterbi algorithm for multiple instances of the problem.

It is easy to see that a standard recursive divide-and-conquer algorithm has no temporal locality because to compute each column of P (\({\varTheta }\left( {n^2}\right) \) work), we have to scan the entire matrix A (\({\varTheta }\left( {n^2}\right) \) space). We can exploit temporal cache locality by solving multiple instances of the problem simultaneously. The existing method that uses multiple instances [25] is cache-inefficient.

Fig. 2. Cache-efficient multi-instance Viterbi algorithm.

Two problems that have the same transition matrix A and emission matrix B are termed two instances of the same problem. The spoken-word recognition problem can be considered an example of a multi-instance Viterbi problem. The core idea of the algorithm comes from the fact that by scanning the transition matrix A only once, a particular column of matrix P can be computed for all q instances of the problem.

Consider Fig. 2. In the function \(\mathscr {A}_{vit}\)(X, U, V, W), the matrix U is an \(n \times q\) matrix obtained by concatenating the \((j-1)\)th columns of the q matrices \(P_1, P_2, \ldots , P_q\), where \(P_i\) is the most-likely-path probability matrix of problem instance i. The algorithm computes X, which is a concatenation of the jth columns of the q problem instances. Each problem instance i has a different observation vector \(Y_i = \{ y_{i1}, y_{i2}, \ldots , y_{it} \}\). Matrix W is a concatenation of the columns \(B[y_{1,j}], B[y_{2,j}],\ldots ,B[y_{q,j}]\). We use \(X_T,X_B,X_L,\) and \(X_R\) to represent the top half, bottom half, left half, and right half of X, respectively. Executing the divide-and-conquer algorithm once computes the second column of all matrices \(P_1\) to \(P_q\); executing it again computes the third column; continuing this way, after \(t-1\) executions the last column of every problem instance is filled. Note that W needs to be reconstructed for each time step (or observation step).
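The following C++ sketch (ours, for illustration; not the paper’s exact pseudocode) captures the core of this kernel, written with additions of log-probabilities so that the inner update matches the recurrence of Sect. 4. Predecessor tracking is omitted, the base-case threshold BASE is an arbitrary small constant (so the recursion remains independent of cache parameters), and the recursion always splits all three dimensions, whereas the algorithm of Fig. 2 splits only the larger of n and q when they differ.

```cpp
#include <vector>
#include <algorithm>

using Mat = std::vector<std::vector<double>>;

// Recursive multi-instance kernel (log-space), a sketch of the idea in
// Fig. 2: accumulate X[i][l] = max_k (U[k][l] + A[k][i]) for states
// i in [i0,i1), predecessors k in [k0,k1), and instances l in [l0,l1).
// X must be initialized to -infinity; the emission term W[i][l] is added
// by the caller after the recursion finishes.
void vit_mi_rec(Mat& X, const Mat& U, const Mat& A,
                int i0, int i1, int k0, int k1, int l0, int l1) {
    const int BASE = 32;  // small constant base case, independent of cache size
    if (i1 - i0 <= BASE || k1 - k0 <= BASE || l1 - l0 <= BASE) {
        for (int i = i0; i < i1; ++i)
            for (int l = l0; l < l1; ++l)
                for (int k = k0; k < k1; ++k)
                    X[i][l] = std::max(X[i][l], U[k][l] + A[k][i]);
        return;
    }
    int im = (i0 + i1) / 2, km = (k0 + k1) / 2, lm = (l0 + l1) / 2;
    // The four (i, l) quadrants of X are independent and can run in parallel;
    // the two k-halves update the same quadrant and must run sequentially.
    for (int ih = 0; ih < 2; ++ih)
        for (int lh = 0; lh < 2; ++lh) {
            int ia = ih ? im : i0, ib = ih ? i1 : im;
            int la = lh ? lm : l0, lb = lh ? l1 : lm;
            vit_mi_rec(X, U, A, ia, ib, k0, km, la, lb);
            vit_mi_rec(X, U, A, ia, ib, km, k1, la, lb);
        }
}
```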

Complexity Analysis. The serial cache complexity of the algorithm in Fig. 2 is computed as follows. Let \(Q_{\mathscr {A}}(n,q)\) denote the serial cache complexity of \(\mathscr {A}_{vit}\) on a matrix of size \(n \times q\), and let n and q be powers of two. Then \(Q_{\mathscr {A}}(n,q) = {\mathcal O}\left( {n^2/B + n}\right) \) when \(n^2 + nq \le {\gamma _{\mathscr {A}}} M\); \(Q_{\mathscr {A}}(n,q) = 8Q_{\mathscr {A}}\left( n/2, q/2\right) + {\mathcal O}\left( {1}\right) \) when \(n = q\); \(Q_{\mathscr {A}}(n,q) = 2Q_{\mathscr {A}}\left( n, q/2\right) + {\mathcal O}\left( {1}\right) \) when \(n < q\); and \(Q_{\mathscr {A}}(n,q) = 4Q_{\mathscr {A}}\left( n/2, q\right) + {\mathcal O}\left( {1}\right) \) when \(n > q\); where, \({\gamma _{\mathscr {A}}}\) is a suitable constant. Solving, the cache complexity of the algorithm for t timesteps is \(t \times Q_{\mathscr {A}}(n,q) = {\mathcal O}\left( {n^2qt / (B \sqrt{M}) + n^2qt / M + n(n+q)t/B + t}\right) \). As the algorithm exploits temporal locality, it is cache-efficient. The span of the algorithm remains \({\varTheta }\left( {nt}\right) \).
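As a quick sanity check of the two leading terms (our derivation): the recursion bottoms out in \({\varTheta }\left( {n^2q/M^{3/2}}\right) \) leaf subproblems of side \({\varTheta }\left( {\sqrt{M}}\right) \), each incurring \({\mathcal O}\left( {M/B + \sqrt{M}}\right) \) cache misses, so per time step

$$Q_{\mathscr {A}}(n,q) = {\mathcal O}\left( \frac{n^2 q}{M^{3/2}} \cdot \left( \frac{M}{B} + \sqrt{M} \right) \right) = {\mathcal O}\left( \frac{n^2 q}{B \sqrt{M}} + \frac{n^2 q}{M} \right);$$

the remaining terms of the stated bound account for reading the inputs once subproblems fit in cache.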

4 Viterbi Algorithm Using Rank Convergence

We briefly describe and improve Maleki et al.’s Viterbi algorithm [26] below.

Preliminaries. We rewrite the Viterbi recurrence using log-probabilities (i.e., logarithms of all probabilities) as follows so that we can replace multiplications with additions: \(P[i, j] = I[i] + B[i, y_1]\) if \(j = 1\), and \(P[i, j] = \max _{k \in [1, n]} (P[k, j-1] + A[k, i] + B[i, y_j])\) if \(j > 1\).

We rewrite the recurrence above as \(s[t-1] = s[0] \odot A_1 \odot A_2 \odot \cdots \odot A_{t-1}\), where s[j] is the jth solution vector (or column vector P[.., j]) of matrix P, the \(n \times n\) matrix \(A_i\) is a suitable combination of A and B, and \(\odot \) is a matrix product operation defined between two matrices \(R_{n \times n}\) and \(S_{n \times n}\) as \((R \odot S)[i, j] = \max _{k \in [1,n]} (R[i,k] + S[k,j])\).
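For concreteness, a direct (cache-inefficient) C++ sketch of \(\odot \) follows; a cache-efficient implementation would use the same recursive blocking as \(\mathscr {A}_{vit}\).

```cpp
#include <vector>
#include <algorithm>
#include <limits>

using Mat = std::vector<std::vector<double>>;

// (R ⊙ S)[i][j] = max_k (R[i][k] + S[k][j]) — the max-plus matrix product.
Mat maxplus(const Mat& R, const Mat& S) {
    const int n = R.size();
    Mat T(n, std::vector<double>(n, -std::numeric_limits<double>::infinity()));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                T[i][j] = std::max(T[i][j], R[i][k] + S[k][j]);
    return T;
}
```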

Fig. 3. Processor-aware parallel Viterbi algorithm using rank convergence, as given in the Maleki et al. paper [26].

The rank of a matrix \(A_{m \times n}\) is r if r is the smallest number such that A can be written as the product (under \(\odot \)) of two matrices \(C_{m \times r}\) and \(R_{r \times n}\). Vectors \(v_1\) and \(v_2\) are parallel provided they differ by a constant offset. For example, \(\langle 1, 2, 3, 4 \rangle \) and \( \langle 5,6,7,8 \rangle \) are two parallel vectors (they differ by the constant offset 4).
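Testing whether two log-space vectors are parallel thus reduces to checking that their componentwise difference is constant. A small C++ sketch (ours) follows; the tolerance eps is an illustrative knob for floating-point comparison.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// True iff v1[i] - v2[i] equals the same constant for all i (within eps).
bool are_parallel(const std::vector<double>& v1,
                  const std::vector<double>& v2, double eps = 1e-9) {
    const double offset = v1[0] - v2[0];
    for (std::size_t i = 1; i < v1.size(); ++i)
        if (std::abs((v1[i] - v2[i]) - offset) > eps) return false;
    return true;
}
```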

Original Algorithm. The algorithm, shown in Fig. 3, consists of two phases: (i) a parallel forward phase, and (ii) a fix up phase. In the forward phase, the t stages are divided into p segments, where p is the number of processors, each segment having \(\left\lceil t/p \right\rceil \) stages (except possibly the last segment). The stages in the ith segment are \(l_i\) to \(r_i\). The initial solution vector of the entire problem is the initial vector of the first segment, and it is known. The initial solution vectors of all other segments are initialized to non-zero random values. A sequential Viterbi algorithm is then run in all segments in parallel. A stage i is said to converge if the computed solution vector s[i] is parallel to the actual solution vector \(s_i\). A segment i is said to converge if rank\((A_{l_i} \odot A_{l_i + 1} \odot \cdots \odot A_{j})\) is 1 for some \(j \in [l_i, r_i - 1]\).

In the fix up phase, a sequential Viterbi algorithm is executed in all segments simultaneously. The solution vectors computed in the different segments (except the first) might be wrong, but they eventually become parallel to the actual solution vectors if rank convergence occurs. If rank convergence occurs in every segment, then the solution vectors at every stage are parallel to the actual solution vectors. Otherwise, the fix up phase is run repeatedly until rank convergence occurs everywhere. In the worst case, which rarely happens in practice, the fix up phase has to be executed a total of \(p-1\) times for rank convergence to happen.

Improved Algorithm. The algorithm described above is processor-aware, and we make it processor-oblivious as follows.

Fig. 4. Processor-oblivious parallel Viterbi algorithm using rank convergence.

We choose a suitable segment size c (say, 256) that is feasibly large, and then use a parallel for loop to solve the resulting t/c segments simultaneously. Unlike Maleki et al.’s algorithm, we need to make sure that the segments are non-overlapping at their boundaries, and we adjust the fix up phase accordingly, as shown in Fig. 4.

Here is how the algorithm works. Let the initial segment size be c (i.e., c consecutive time steps); for convenience we choose c to be a power of two. We divide the t time steps into t/c independent segments, each of size c. As in Maleki et al.’s algorithm, the first solution vectors of all segments except the first are initialized to non-zero valid random probability values. Then, in the forward phase, we run the serial Viterbi algorithm on all segments simultaneously. At the end of the forward phase, the solution vectors up to the \(c^{th}\) column (i.e., all columns in the first segment) have correct log-likelihood values. The other segments have values computed from the initially chosen random values, which may or may not be parallel to the expected values.

In the fix up phase, we start fixing from the second segment, as in the original Maleki et al. algorithm. However, in each fix up phase we work on alternate segments, always leaving out the first segment of the prior fix up phase. After each fix up phase, the size of each segment under consideration doubles, and the number of segments halves with respect to the previous phase. At the end of each fix up phase, we check whether the computed solution vectors are parallel to those from the forward phase; if the answer is ‘yes’ for all segments under consideration, the program terminates. Otherwise, the next fix up phase is run. In the worst case, the fix up phase is executed \(\lambda \) \(\in [1,\log (t/c)]\) times, after which all results are guaranteed to be correct, since by that time the result from the original input has propagated to the end. In this worst case, the program behaves like a serial Viterbi algorithm with a constant-factor overhead.
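The control flow of this doubling schedule can be sketched in C++ as follows (our illustration; run_segment and converged are stand-ins for the serial Viterbi pass over a segment in Fig. 4 and the per-stage parallelism test above, and the boundary indexing is a simplification):

```cpp
#include <algorithm>

// Illustrative stand-ins for the primitives of Fig. 4: run the serial
// Viterbi recurrence over stages [lo, hi) starting from the solution
// vector at stage lo - 1, and test (e.g., via are_parallel above) whether
// the recomputed solution vectors are parallel to the forward-phase ones.
void run_segment(int /*lo*/, int /*hi*/) { /* serial Viterbi over the range */ }
bool converged(int /*lo*/, int /*hi*/)   { /* per-stage parallelism test */ return true; }

// Doubling fix up schedule: in the phase with segment size s, recompute the
// second half of every merged segment of size 2s; segment sizes double and
// segment counts halve from phase to phase.
void fix_up(int t, int c) {
    for (int s = c; s < t; s *= 2) {
        bool all_ok = true;
        for (int lo = s; lo < t; lo += 2 * s) {  // a parallel for in practice
            int hi = std::min(lo + s, t);
            run_segment(lo, hi);
            all_ok = all_ok && converged(lo, hi);
        }
        if (all_ok) return;  // all recomputed vectors parallel: done
    }
}
```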

Complexity Analysis. Let \(T^{F}_{1}(n, t),\) \(Q^{F}_{1}(n,t),\) \(T^{F}_{\infty }(n,t)\), and \(S^{F}(t)\) denote the work, serial cache complexity, span, and number of steps for convergence, respectively, of algorithm \(F \in \{ O, I\}\), where O represents the original rank convergence algorithm and I denotes our modified algorithm. Let f(t) be the number of segments in algorithm O; note that for Maleki et al.’s original algorithm \(f(t) = p\). Let the number of times the fix up phase is executed in O and I be \(\lambda _{O}\) and \(\lambda _{I}\), respectively. Then \(\lambda _{O} \in [1, f(t)]\) and \(\lambda _{I} \in [1,\log {(t/c)}]\).

Work. \(T^O_{1}(n, t) = {\varTheta }\left( {n^2t \cdot \lambda _{O}}\right) \), and \(T^I_{1}(n,t)={\varTheta }\left( {n^2 t \cdot \lambda _{I}}\right) \). In the worst case, \(T^{O}_{1}(n,t)\) is \({\varTheta }\left( {n^2t \cdot f(t)}\right) \), and \(T^I_{1}(n,t)\) is \({\varTheta }\left( {n^2t \cdot \log {t}}\right) \).

Serial Cache Complexity. As there is no temporal locality, \(Q^{O}_{1}(n,t) = {\mathcal O}\left( {T^O_{1}(n, t) / B}\right) \) and \(Q^{I}_{1}(n,t) = {\mathcal O}\left( {T^I_{1}(n, t) / B}\right) \), when \(n^2\) does not fit into the cache.

Span. \(T^O_{\infty }(n, t) = {\varTheta }\left( {n (t / f(t)) \cdot \lambda _{O}}\right) \), as the number of stages in each segment is \({\varTheta }\left( {t / f(t)}\right) \), and the span of executing each stage is \({\varTheta }\left( {n}\right) \). In the worst case, \(T^{O}_{\infty }(n,t)\) is \({\varTheta }\left( {nt}\right) \). \(T^I_{\infty }(n, t)\) is computed as follows. In the ith fix up phase, the number of stages in each segment is \(2^i\). Hence, the total number of stages executed serially over the \(\lambda _{I}\) fix up iterations is \({\varTheta }\left( {\sum _{i=\log c}^{(\log c)+\lambda _{I}} 2^i}\right) = {\varTheta }\left( {c \cdot 2^{\lambda _{I}}}\right) = {\varTheta }\left( {2^{\lambda _{I}}}\right) \), since c is a constant. Then \(T^I_{\infty }(n,t)={\varTheta }\left( {n 2^{\lambda _{I}}}\right) \). In the worst case, \(T^I_{\infty }(n,t)\) is \({\varTheta }\left( {nt}\right) \).

Steps for Convergence. Let the rank of the matrix \(A_1 \odot A_2 \odot \cdots \odot A_t\) be k. For the original algorithm, \((S^O(t)-1) \times (t / f(t)) < k \le S^O(t) \times (t / f(t))\), which implies \(S^O(t) = \lceil k f(t) / t \rceil \). Similarly, for the improved algorithm, \(2^{S^I(t) - 1 + \log c} < k \le 2^{S^I(t)+ \log c}\), which implies \(S^I(t) = \lceil \log (k/c) \rceil \).
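The last implication can be verified in one line: rewriting the inequality gives

$$c \cdot 2^{S^I(t)-1} < k \le c \cdot 2^{S^I(t)} \iff S^I(t) - 1 < \log (k/c) \le S^I(t),$$

and hence \(S^I(t) = \lceil \log (k/c) \rceil \), all logarithms being base 2.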

5 Cache-efficient Viterbi Algorithm

In this section, we present an efficient cache- and processor-oblivious parallel Viterbi algorithm based on recursive divide-and-conquer, as shown in Fig. 5. The algorithm is derived by combining ideas from the cache-efficient multi-instance Viterbi algorithm (see Sect. 3) and the improved parallel Viterbi algorithm based on rank convergence (see Sect. 4).

Recall that the multi-instance Viterbi algorithm works on the \(i^{th}\) solution vectors, s[i], of different instances of the problem and generates the \((i+1)^{th}\) solution vectors, \(s[i+1]\), of those instances cache-efficiently. To develop a cache-efficient Viterbi algorithm, in the forward phase we divide the t time steps into t/c independent segments, each of size c, as we did in the improved parallel Viterbi algorithm using rank convergence shown in Fig. 4. As before, we choose c to be a power of two. Since the segments are independent, we can treat them as different instances of the same Viterbi problem. Therefore, we can use the cache-efficient multi-instance Viterbi algorithm to solve these t/c instances simultaneously. Again, the first solution vectors of all segments except the first are initialized to non-zero valid random probability values.

The fix up phase is similar to that of the Viterbi-Rank-Improved algorithm (see Figs. 4 and 5), except that we now use the cache-efficient multi-instance Viterbi algorithm to compute the next solution vector of all segments at once, instead of using the Viterbi algorithm to compute an entire segment independently. As before, we start fixing from the second segment, since the first segment is already fixed after the forward phase.

Fig. 5. An efficient cache- and processor-oblivious parallel Viterbi algorithm using rank convergence. Viterbi-MI refers to \( \textsc {Viterbi-Multi-Instance-D \& C}\) of Sect. 3.

In each fix up phase, we work on alternate segments, always leaving out the first segment of the prior fix up phase (already fixed by this time). After each fix up phase, the size of each segment under consideration doubles, and the number of segments halves with respect to the previous phase. In each step, we use the multi-instance Viterbi algorithm to compute the \((i+1)\)st solution vectors from the ith solution vectors for all segments at once. At the end of each fix up phase, we check whether the computed solution vectors are parallel to those found in the forward phase; if that is true for all segments under consideration, the program terminates. Otherwise, the next fix up phase is run. In the worst case, the fix up phase is executed \(\lambda \) \(\in [1,\log (t/c)]\) times, after which all results are guaranteed to be correct.

Complexity Analysis. Let \(T_1(n,t), Q_1(n,t),\) and \(T_{\infty }(n,t)\) be the work, serial cache complexity, and span of the cache-efficient Viterbi algorithm, respectively. Let \(\lambda \in [1, \log (t/c)]\) be the number of times the fix up phase is executed.

\(T_1(n,t) = {\varTheta }\left( {n^2 t \cdot \lambda }\right) \). In the worst case, \(T_1(n,t) = {\varTheta }\left( {n^2 t \cdot \log {t}}\right) \). As in Sect. 4, \(T_{\infty }(n,t) = {\varTheta }\left( {n 2^{\lambda }}\right) \). Finally, \(Q_1(n,t) = {\mathcal O}\left( {\sum _{i = \log c}^{(\log c) + \lambda } \left( Q_{\mathscr {A}}\left( n, t/2^i \right) \cdot 2^i \right) }\right) \) \(= {\mathcal O}\left( {n^2t \lambda / (B \sqrt{M}) + n^2t \lambda / M + n (n 2^{\lambda } + t \lambda ) / B + 2^{\lambda }}\right) \). If \(n^2, t = {\varOmega }\left( {\sqrt{M}}\right) \) and convergence happens after \(\lambda = {\mathcal O}\left( {1}\right) \) iterations of the fix up phase, \(Q_1(n,t)\) reduces to \({\mathcal O}\left( {n^2t \lambda / (B \sqrt{M}) + n^2t \lambda / M}\right) \), which further reduces to \({\mathcal O}\left( {n^2t \lambda / (B \sqrt{M})}\right) \) when the cache is tall (i.e., \(M = {\varOmega }\left( {B^2}\right) \)).

6 Experimental Results

This section presents our implementation details and performance results.

Fig. 6. Running time and L3 misses of our cache-efficient multi-instance Viterbi algorithm along with the multi-instance iterative Viterbi algorithm.

We used a dual-socket 16-core (\(=2 \times 8\)-core) 2 GHz Intel Sandy Bridge machine to run all experiments presented in the paper. Each core of this machine was connected to a 32 KB private L1 cache and a 256 KB private L2 cache. All the cores in a socket shared a 20 MB L3 cache, and the machine had 32 GB of RAM shared by all cores. We used PAPI 5.2 [1] to count L3 cache misses (event PAPI_L3_TCM) and likwid [35] (specifically, likwid-perfctr) to measure the energy and power consumption of the programs. The matrices A and B and the initial vector I were initialized to random probabilities. We used log-probabilities in all implementations and hence used additions instead of multiplications in the Viterbi recurrence. All matrices were stored in column-major order. We performed two sets of experiments, described below, to compare our cache-efficient algorithms with the iterative algorithm and the fastest known (Maleki et al.’s) Viterbi algorithm.
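A minimal sketch of reading the PAPI_L3_TCM counter through PAPI’s standard API looks as follows (our illustration, not our exact measurement harness; error handling elided):

```cpp
#include <papi.h>
#include <cstdio>

// Count total L3 cache misses (PAPI_L3_TCM) around a region of interest.
int main() {
    int event_set = PAPI_NULL;
    long long misses = 0;
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_L3_TCM);
    PAPI_start(event_set);
    // ... run the Viterbi implementation being measured ...
    PAPI_stop(event_set, &misses);
    std::printf("L3 total cache misses: %lld\n", misses);
    return 0;
}
```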

Cache-efficient Multi-instance Viterbi Algorithm. We compared our cache-efficient multi-instance recursive Viterbi algorithm with the multi-instance iterative Viterbi algorithm. Both algorithms were optimized and parallelized. To construct matrix \(W_{n \times q}\) (we chose q to be n in this case), instead of copying all the relevant columns of B, only the pointers to the respective columns were used. Wherever possible, pointer swapping was used to interchange previous solution vector (or matrix) and current solution vector (or matrix).

The running times and L3 cache misses of the two algorithms are plotted in Fig. 6. The number of states n, which is also the number of instances, was varied from 32 to 4096. Note that in the cache-efficient multi-instance Viterbi algorithm, the number of states need not equal the number of instances. The observation-space size m was fixed at 32, and the number of time steps t was kept the same as n (hence the overall complexity is \(O(n^4)\)). The recursive algorithm ran faster than the iterative algorithm in most cases as the number of instances increased; when n was 4096, our recursive algorithm ran around 2.26 times faster than the iterative algorithm.

Cache-efficient Viterbi Algorithm. We compared our cache-efficient parallel Viterbi algorithm with Maleki et al.’s parallel Viterbi algorithm. Both implementations were optimized and parallelized, and the reported statistics are averages of 4 independent runs. In all our experiments, the number of processors p was set to 16. The plots in Fig. 7 show the running time and L3 cache misses of the two algorithms for \(n = 4096\).

Fig. 7. Running time, L3 misses, and energy/power consumption of our cache-efficient Viterbi algorithm along with the existing algorithms.

When \(n=4096\), we varied t from \(2^{12}\) to \(2^{18}\) and kept m fixed at 32. Our algorithm ran faster and incurred significantly fewer L3 misses than Maleki et al.’s algorithm throughout. For \(t = 2^{18}\), our algorithm ran 33 % faster and incurred a factor of 6 fewer L3 misses. The better cache performance led to lower DRAM energy consumption.

Energy Consumption. We ran experiments to analyze the energy consumption (averaged over three runs) of our cache-efficient recursive algorithm and Maleki et al.’s algorithm. Our algorithm consumed considerably less DRAM energy than the other algorithm.

We used the likwid-perfctr tool to measure CPU energy, Power Plane 0 (PP0) energy, DRAM energy, and DRAM power consumption during the execution of the programs. The energy measurements were end-to-end, i.e., they included all costs during the entire program execution. Note that the DRAM energy consumption of a program is somewhat related to its L3 cache misses, as each L3 cache miss results in a DRAM access. Similarly, since CPU energy gives the energy consumed by the entire package (all cores, on-chip caches, registers, and their interconnections), it is related to a program’s running time. PP0 energy is essentially a subset of CPU energy, since it captures the energy consumed by only the cores and their private caches.

For \(n = 2048\), t was increased from \(2^{11}\) to \(2^{14}\) while keeping m fixed at 32. Figure 7 shows that both the DRAM energy and the DRAM power consumption of our algorithm were significantly lower because of the reduced L3 cache misses. When \(t = 16384\), Maleki et al.’s algorithm consumed 60 % more DRAM energy and 30 % more DRAM power than ours.