This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Access to this work was provided by the University of Maryland, Baltimore County (UMBC) ScholarWorks@UMBC digital repository on the Maryland Shared Open Access (MD-SOAR) platform.

## Please provide feedback

Please support the ScholarWorks@UMBC repository by emailing scholarworksgroup@umbc.edu and telling us what having access to this work means to you and why it's important to you. Thank you.

# Blocking Optimization Strategies for Sparse Tensor Computation <br> Jee Choii, Xing Liu ${ }^{1}$, Shaden Smith ${ }^{2}$, and Tyler Simon ${ }^{3}$ <br> ${ }^{1}$ IBM T. J. Watson Research, ${ }^{2}$ University of Minnesota, 3 University of Maryland Baltimore County <br> SIAM Annual Meeting <br> July 12 ${ }^{\text {th }}, 2017$ 

## Tensors are multi-dimensional arrays

- CANDECOMP/Parafac (CP) decomposition creates a set of factor matrices



## The take-away from this presentation

- There is lack of clear understanding about performance bottlenecks in sparse tensor decomposition
- Using various blocking techniques mitigate these bottlenecks
- Our optimizations demonstrate significant speedup on synthetic and real-world data for both shared-memory and distributed implementations


## Fix every other factor matrix and solve for the remaining one

```
procedure CP-ALS ( }X,\textrm{R}
    repeat
    C= X (3)
    normalize columns of C to length 1
B = X X (2)
normalize columns of }B\mathrm{ to length 1
A = X X(1)
store column norms of A in \lambda and normalize to 1
    until max iteration reached or error less than \epsilon
end procedure
```


## Calculating MTTKRP is the primary bottleneck

```
procedure CP-ALS ( }X,\textrm{R}
    repeat
```


## MTTKRP

```
    C= X (3)
    normalize columns of C to length 1
    B = X X (2)
    normalize columns of B to length 1
    A = X X(1)
    store column norms of A in \lambda and normalize to 1
    unti1 max iteration reached or error less than \epsilon
end procedure
```


## Calculating MTTKRP is the primary bottleneck

```
procedure CP-ALS ( }X,\textrm{R}
    repeat
```

Matricized Tensor TKRP


```
normatize columns of C to length 1
B = X (2)
normalize columns of B to length 1
A = X X(1)
store column norms of A in \lambda and normalize to 1
until max iteration reached or error less than \epsilon
end procedure
```


## Calculating MTTKRP is the primary bottleneck

```
procedure CP-ALS ( }X,\textrm{R}
    repeat
    Matricized Tensor Times KRP
    C= X (3)
    normatize columns of C to length 1
    B = X X (2)
    normalize columns of B to length 1
    A = X X(1)
    store column norms of A in \lambda and normalize to 1
    until max iteration reached or error less than \epsilon
end procedure
```


## Calculating MTTKRP is the primary bottleneck

```
procedure CP-ALS ( }X,\textrm{R}
    repeat
                        Matricized Tensor Times Khatri-Rao Product
    C=X(B O A)(B\topB * A
    normalize columns of C to length 1
    B = X X (2)
    normalize columns of B to length 1
    A = X X(1)
    store column norms of A in \lambda and normalize to 1
    until max iteration reached or error less than \epsilon
end procedure
```


## Calculating MTTKRP is the primary bottleneck

```
procedure CP-ALS (X, R)
    repeat
        > 90% total execution time
        C= X (3)
        normalize columns of C to length 1
        B = X X (2) (C O A) (C'C * ATA)
        normalize columns of B to length 1
        A = X X(1)
        store column norms of A in \lambda and normalize to 1
    until max iteration reached or error less than \epsilon
end procedure
```


## Problem is formulated as matrix operations

```
procedure CP-ALS ( }X,\textrm{R}
    repeat
        C= X (3)
        normalize columns of C to length 1
        B = X X (2)
        normalize columns of B to length 1
```

$X \in \mathbb{R}^{I \times J \times K}$

$$
A \in \mathbb{R}^{I \times R}
$$

$$
B \in \mathbb{R}^{J \times R}
$$

$$
C \in \mathbb{R}^{K \times R}
$$

```
A = X X(1)
    store column norms of A in \lambda and normalize to 1
    until max iteration reached or error less than \epsilon
end procedure
```


## Directly computing MTTKRP is very expensive

- For a $1000 \times 1000 \times 1000$ tensor with rank 100...
- $X_{(3)}$ is a $\mathbf{1 , 0 0 0 \times 1 , 0 0 0 , 0 0 0}$ matrix, and
- (B $\odot A)$ is a $\mathbf{1 , 0 0 0 , 0 0 0 \times 1 0 0}$ matrix
- Direct computation is expensive


## But not necessary

- For a $1000 \times 1000 \times 1000$ tensor with rank 100...
- $X_{(3)}$ is a $\mathbf{1}, 000 \times \mathbf{1 , 0 0 0 , 0 0 0}$ matrix, and
- (B ○ $)$ is a $\mathbf{1 , 0 0 0 , 0 0 0 \times 1 0 0}$ matrix
- Direct computation is expensive
- Not necessary for sparse tensors.


## MTTKRP expressed as matrix operations



## MTTKRP simplified



## MTTKRP simplified

Load 1 row each


## MTTKRP simplified

Load 1 row each


## MTTKRP simplified

Load 1 row each


## MTTKRP simplified

Load 1 row each


Can it be done more efficiently?
Load 1 row each


## In 3D space...



## Reduce computing by processing at fiber granularity



## Reduce computing by processing at fiber granularity



## Reduce computing by processing at fiber granularity



## First load rows from B (mode-2 matrix)



## Scale the rows by non-zero values



## Accumulate them to a temporary buffer



## Load the "common" row from C (mode-3 matrix)



## Hadamard product with buffer



## Accumulate to destination matrix $\left(\mathrm{A}^{\prime}\right)$



## This is called compressed sparse fiber (CSF)



## Claimed Savings by others

- Naïve COO kernel

$$
\begin{aligned}
& m=\# \text { of non-zeros } \\
& P=\# \text { of non-empty fibers } \\
& R=\text { rank }
\end{aligned}
$$

- Regular: 3 * $\mathbf{m}$ * $\mathbf{R}$ flops (2mR for initial product + scale, mR for accumulation)
- CSF
- $\mathbf{2 R}(\mathbf{m}+\mathbf{P})$ flops, $P$ is \# of non-empty fibers
- typically $p \lll m$
- DFacTo
- Formulates MTTKRP as SpMV
- Each column is computed independently via 2 SpMV
- $2 \mathbf{R}(\mathbf{m}+\mathbf{P})$ flops
- GigaTensor
- MapReduce
- Increased parallelism vs. more flops
- 5 mR flops


## Does this make sense?

- Sparse computations are memory bandwidth-bound
- SPLATT tries cache blocking through expensive hypergraph partitioning (without much success)


## Roofline model visualized (for an old Intel Nehalem CPU)



## Commonly used scientific kernels



## Roofline model applied to MTTKRP

- Sparse computations are memory bandwidth-bound
- Let's calculate the \# of flops and \# of bytes and compare
- Flops: W = 2R(m + P)
- Bytes: $\mathrm{Q}=\mathbf{2 m}$ (value + mode- 2 index) $+\mathbf{2 P}$ (mode-3 index + mode-3 pointer)
$+(\mathbf{1 - a}) \mathbf{R m}$ (mode-2 factor) + (1-a)RP (mode-3 factor)
- Arithmetic Intensity
- Ratio of work to communication I = W/Q
- I = W / ( O * 8 Bytes ) $=\mathbf{R} /(\mathbf{8}+\mathbf{4 R ( 1 - a )})$


## Arithmetic intensity of MTTKRP with rank = 32



## Arithmetic intensity vs. rank for various cache hit rates

Arithmetic Intensity 1000


## Arithmetic intensity vs. system balance (on the latest CPU)

## Arithmetic

 Intensity 1000

## Our initial conclusion from a theoretical point of view

- On recent systems, MTTKRP is likely memory-bound
- Even with a perfect cache hit rate, MTTKRP should be memory-bound on lower ranks
- If we fail to get good cache re-use, MTTKRP will most likely be memory bound for any rank


## A pressure point analysis reveals the bottleneck

- Pressure point analysis
- Probe potential bottlenecks by creating and eliminating instructions/data access
- If we suspect that \# of registers is the bottleneck, try increasing/decreasing their usage to see if the exec. time changes.
- Inline assembly to prevent dead code elimination (DCE)


## A pressure point analysis reveals the bottleneck

| Time | Pressure point |
| :--- | :--- |
| 2.6 | Baseline $(2 R(m+P)$ flops $)$ |
|  |  |
|  |  |
|  |  |
|  |  |

## Using COO instead of CSF only increases exec. time by $<2 \%$

| Time | Pressure point |
| :--- | :--- |
| 2.6 | Baseline $(2 R(m+P)$ flops $)$ |
| 2.64 | Move flops to inner loop $(3 * m * R$ flops $)$ |
|  |  |

Increasing flops only changes time by < $2 \%$

Removing access to C (accessed once per fiber): exec. time down by $7 \%$

| Time | Pressure point |
| :--- | :--- |
| 2.6 | Baseline $(2 R(m+P)$ flops $)$ |
| 2.64 | Move flops to inner loop ( $3 * m * R$ flops) |
| 2.43 | Access to $C$ removed |
|  |  |

Removing per-fiber access to matrix C has a bigger impact than increasing flops

## Suspicion confirmed: Memory access to B is the bottleneck

| Time | Pressure point |
| :--- | :--- |
| 2.6 | Baseline $(2 R(m+P)$ flops $)$ |
| 2.64 | Move flops to inner loop ( $3 * m * R$ flops) |
| 2.43 | Access to $C$ removed |
| 1.81 | Access to B limited to L1 cache |

## Completely removing it give us an extra $6 \%$ - why?

| Time | Pressure point |
| :--- | :--- |
| 2.6 | Baseline $(2 R(m+P)$ flops $)$ |
| 2.64 | Move flops to inner loop ( $3 * m * R$ flops) |
| 2.43 | Access to $C$ removed |
| 1.81 | Access to B limited to L1 cache |
| 1.63 | Access to B removed completely |

Eliminating it completely gives us an extra 6\% boost

## Conclusions from our empirical analysis

- Flops aren't the issue
- Bottlenecks

1. Data access to B
2. Load instructions

## Cache/register blocking should help alleviate these bottlenecks

- Flops aren't the issue
- Bottlenecks

1. Data access to $B \rightarrow$ cache blocking
2. Load instructions $\rightarrow$ register blocking

## Our baseline implementation

```
procedure mttkrp ( }X\in\mp@subsup{\textrm{R}}{}{[\times]\timesK},\textrm{R}
1: \ulcornerfor i \leftarrow 0 to I do // for each row
2: [for j \leftarrow i_ptr[i] to i_ptr[i+1] do // for each fiber
3: [for k \leftarrow p_ptr[j] to p_ptr[j+1] do // for each nz in fiber
4: for r \leftarrow < to R do // go through entire rank
5: [ buffer[r] += vals[k] * B[j_index[k]][r] // buffer
6: [for r \leftarrow 0 to R do
7: L[[ A[i][r] += buffer[r] * C[k_index[j]][r] // accumulate
end procedure
```


## Our baseline implementation

## Replace buffers with registers

```
procedure mttkrp ( }X\in\mp@subsup{R}{}{[\]\\timesK}, R
1: ffor i \leftarrow 0 to I do // for each row
2: [for j \leftarrow i_ptr[i] to i_ptr[i+1] do // for each fiber
3: [for r \leftarrow0 to R do in 16 increments
4: [for k \leftarrow p_ptr[j] to p_ptr[j+1] do // for each nz in fiber
5: [ registers += vals[k] * B[j_index[k]][r] // buffer
6: L[L A[i][r] += registers * C[k_index[j]][r] // accumulate
end procedure
```


## Replace buffers with registers

```
procedure mttkrp ( }X\in\mp@subsup{R}{}{[\]\\timesK}, R
1: ffor i \leftarrow 0 to I do // for each row
2: [for j \leftarrow i_ptr[i] to i_ptr[i+1] do // for each fiber
3: [for r \leftarrow0 to R do in 16 increments
4: [for k \leftarrow p_ptr[j] to p_ptr[j+1] do // for each nz in fiber
5: [ registers += vals[k] * B[j_index[k]][r] // buffer
6: L[[ A[i][r] += registers c[k_index[j]][r] // accumulate
end procedure
2 LD instructions
```

We use n-D blocking (intuitive) and rank blocking (less intuitive)

- Multi-dimensional blocking
- Rank blocking


## We use n-D blocking (intuitive) and rank blocking (less intuitive)

- Multi-dimensional blocking
- 3D blocking - maximize re-use of both matrix $B$ and $C$
- Rank blocking



## We use n-D blocking (intuitive) and rank blocking (less intuitive)

- Multi-dimensional blocking
- 3D blocking - maximize re-use of both matrix $B$ and $C$
- Rank blocking



## We use n-D blocking (intuitive) and rank blocking (less intuitive)

- Multi-dimensional blocking
- 3D blocking - maximize re-use of both matrix $B$ and $C$
- Rank blocking
- Agnostic to tensor sparsity
- Very little change to the code required



## Rank blocking visualized...



## Rank blocking visualized...



## Performance Summary

| Data set | Dimensions | nnz | Sparsity | \# fibers | Speedup |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Poisson1 | $256 \times 256 \times 256$ | 1.5 M | $8.8 \mathrm{e}-2$ | 54 K | $3.1 \times$ |
| Poisson2 | $2 \mathrm{~K} \times 16 \mathrm{~K} \times 2 \mathrm{~K}$ | 121 M | $1.9 \mathrm{e}-3$ | 2.5 M | $2.5 \times$ |
| Poisson3 | $2 \mathrm{~K} \times 16 \mathrm{~K} \times 2 \mathrm{~K}$ | 6.4 M | $1.0 \mathrm{e}-4$ | 830 K | $2.0 \times$ |
| Netflix | $480 \mathrm{~K} \times 18 \mathrm{~K} \times 80$ | 80 M | $1.2 \mathrm{e}-4$ | 5 M | $2.1 \times$ |
| NELL-2 | $12 \mathrm{~K} \times 9 \mathrm{~K} \times 29 \mathrm{~K}$ | 77 M | $2.4 \mathrm{e}-5$ | 21 M | $2.2 \times$ |

## Register blocking yields large speedups for small data sets

$\square$ Baseline ■ Register


## Poisson 2 - sparsity = 1.9e-3

$\square$ SPLATT ■MB $\square$ RankB $■ M B+$ RankB


## Poisson 3 - sparsity = 1.0e-4



Netflix - sparsity $=1.2 \mathrm{e}-4$


NELL - sparsity $=2.4 \mathrm{e}-5$
$\square$ SPLATT $\square M B \square$ RankB $\square M B+$ RankB


## Distributed rank blocking shows better scalability

| Nodes | NELL2 |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
|  | SPLATT | 3D grid | 3D time | 4D grid | 4D time |
| 1 | 1.028 | $1 \times 1 \times 2$ | 0.718 | $1 \times 1 \times 1 \times 2$ | 0.826 |
| 2 | 0.54 | $1 \times 1 \times 4$ | 0.367 | $1 \times 1 \times 1 \times 4$ | 0.423 |
| 4 | 0.286 | $2 \times 1 \times 4$ | 0.208 | $1 \times 1 \times 1 \times 8$ | 0.217 |
| 8 | 0.138 | $2 \times 2 \times 4$ | 0.107 | $1 \times 1 \times 1 \times 16$ | 0.124 |
| 16 | 0.087 | $2 \times 2 \times 8$ | 0.058 | $1 \times 1 \times 2 \times 16$ | 0.065 |
| 32 | 0.056 | $4 \times 2 \times 8$ | 0.043 | $1 \times 1 \times 4 \times 16$ | 0.034 |
| 64 | 0.03 | $4 \times 4 \times 8$ | 0.028 | $2 \times 1 \times 4 \times 16$ | 0.022 |


| Netflix |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| SPLATT | 3D grid | 3D time | 4D grid | 4D time |
| 3.025 | $2 \times 1 \times 1$ | 1.554 | $1 \times 1 \times 1 \times 2$ | 1.447 |
| 1.158 | $4 \times 1 \times 1$ | 0.727 | $1 \times 1 \times 1 \times 4$ | 0.720 |
| 0.519 | $8 \times 1 \times 4$ | 0.403 | $1 \times 1 \times 1 \times 8$ | 0.401 |
| 0.256 | $16 \times 1 \times 1$ | 0.194 | $1 \times 1 \times 1 \times 16$ | 0.190 |
| 0.113 | $32 \times 1 \times 1$ | 0.103 | $1 \times 1 \times 2 \times 16$ | 0.100 |
| 0.083 | $31 \times 2 \times 1$ | 0.056 | $1 \times 1 \times 4 \times 16$ | 0.055 |
| 0.048 | $64 \times 2 \times 1$ | 0.037 | $2 \times 1 \times 4 \times 16$ | 0.030 |

## The take-away from the section

- There was a lack of clear understanding about performance bottlenecks in tensor decomposition
- We show that the key computation is LD and memory-bound
- Using various blocking techniques mitigate these bottlenecks
- Our optimizations demonstrate significant speedup over synthetic and real-world data for both shared-memory and distributed implementations
- We use 3D and rank blocking strategies to achieve up to $3.2 \times$ speedup on real world-data and 2.0x on synthetic


## Future Work

- Extending this work to do performance modeling
- Correlate tiling/blocking size to cache hit rate
- Take advantage of block structures
- Fiber/slice/cube/etc. permutation - new storage formats for tensors (a la SpMV)


## Q \& A

I am currently on the academic job market! Please email me at jee@gatech.edu or visit http://jeewhanchoi.com for my application materials

