

# Introduction to Compute-in-Memory

#### Laura Fick, Dave Fick

**Mythic** 

## Outline of Talk

- Introduction to Compute-In-Memory
  - What memory?
  - When is it useful?
  - When is it not useful?
- Case Studies:
  - SSD compute-in-memory for list intersection
  - Mythic's analog compute-in-memory for GPS correlation
  - Mythic's analog compute-in-memory for neural networks
- Conclusion

## Compute-in-Memory in a Nutshell

Compute-in-memory is a <u>broad range</u> of techniques with a single underlying principle:

Sometimes it is better\* to create additional compute at the memory than move data from the memory to the main compute.

\* = faster and/or more efficient

This is particularly important for today's data processing applications, and post-Moore design.



- "Compute-in-Memory" is relative to an existing system that has both compute and memory.
- Let's consider a standard memory system
  - There are progressively larger caches (including DRAM)
  - Memory closer to CPU is faster & smaller
  - Main memory is in SSD or HDD



- Typically, CPU works on L1 Cache
- "Compute-in-Memory" could mean...
  - Compute at L2/L3
  - Compute in the DRAM chips
  - Compute in the SSD
  - Compute in an accelerator that contains memory

#### Memory Systems are Built for Data Access Patterns

- The existing memory structure has built-in assumptions:
  - Temporal Locality
    - If at one point a particular memory location is referenced, then it is likely that the same location will be referenced again in the near future.
  - Spatial Locality
    - If a particular storage location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future.
  - Probability, not Certainty
    - We do not know exactly what data will be needed next
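The locality assumptions above can be illustrated with a toy cache model. This is a minimal sketch, not a model of any real processor: the cache size, line size, address range, and the function name `hit_rate` are all hypothetical choices made for illustration.

```python
import random
from collections import OrderedDict


def hit_rate(addresses, num_lines=64, line_size=16):
    """Fraction of accesses served by a tiny fully associative LRU cache."""
    cache = OrderedDict()  # keys are cache-line tags, ordered by recency
    hits = 0
    for addr in addresses:
        tag = addr // line_size  # neighbouring addresses share one line
        if tag in cache:
            hits += 1
            cache.move_to_end(tag)  # mark as most recently used
        else:
            cache[tag] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)


rng = random.Random(0)
n = 100_000
sequential = list(range(n))                               # strong spatial locality
scattered = [rng.randrange(1_000_000) for _ in range(n)]  # little locality

print(f"sequential: {hit_rate(sequential):.4f}")
print(f"scattered:  {hit_rate(scattered):.4f}")
```

The sequential stream misses once per line and then hits on the rest of that line, while the scattered stream almost never finds its data in the cache. The second kind of pattern is the one the hierarchy is not built for.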





#### Compute-in-Memory Captures More Difficult Access Patterns

Some applications have access patterns that do not "play nice" with the traditional memory hierarchy



Compute-in-Memory systems can target these applications.

2019 Custom Integrated Circuits Conference - Education Sessions

## **Other Reasons for Compute-in-Memory**

- Deterministic Data Patterns
  - In some cases, we know the exact data access pattern to be performed, so hierarchical memory systems do not provide a benefit and are inefficient.
- ASIC Capabilities Further Minimize Data Movement
  - In other cases, we can build in ASIC capabilities that take advantage of known data patterns.
- Analog Computation
  - In extreme cases, analog computation can be added to achieve most minimal data movement.

## Compute-in-Memory is Not Always Useful

- General Purpose Computing
  - You need a specific application whose access pattern you can take advantage of.
- Low Application Importance
  - If this application is not >90% of the system time or power, then you will not be able to get a 10x improvement.
- Small Working Sets
  - Compute-in-memory often requires relatively large working sets to make sense.
  - Applications that fit in L1 cache are hard to improve.
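The 90% rule of thumb above follows from Amdahl's law. The sketch below is a generic illustration; the function name `overall_speedup` is chosen here and is not from the talk.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: system speedup when `fraction` of the time is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Even an infinitely fast accelerator is capped by the part it does not touch.
for fraction in (0.5, 0.9, 0.99):
    limit = overall_speedup(fraction, float("inf"))
    print(f"{fraction:.0%} of runtime accelerated -> at most {limit:.1f}x overall")
```

With only 50% of the runtime covered, the ceiling is 2x no matter how good the compute-in-memory block is. A 10x system-level gain needs at least 90% coverage.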



# SSD In-Storage Computing for List Intersection

J. Wang, D. Park, Y.S. Kee, Y. Papakonstantinou, S. Swanson UC San Diego, Samsung

Data Management on New Hardware, 2016



- Typically, CPU works on L1 Cache
- "Compute-in-Memory" could mean...
  - Compute at L2/L3
  - Compute in the DRAM chips
  - Compute in the SSD!
  - Compute in an accelerator that contains memory

## **Case Study: List Intersection**

- List intersection finds the common elements in a set of data
- Intersection is prominent in search engines and analytics queries (lots of data)
- Speed of list intersection is dependent on many parameters: algorithm, list length and correlation
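As a concrete reference point, here is a minimal sketch of one common intersection algorithm, a merge over two sorted lists. It is a generic textbook version, not necessarily the algorithm evaluated in the cited work; the function name and the example document IDs are hypothetical.

```python
def intersect_sorted(a, b):
    """Return the elements common to two sorted lists of unique integers."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, skip it
        else:
            j += 1  # b[j] cannot appear later in a, skip it
    return common


docs_with_compute = [2, 5, 9, 14, 21, 30]
docs_with_memory = [5, 7, 14, 22, 30, 41]
print(intersect_sorted(docs_with_compute, docs_with_memory))  # [5, 14, 30]
```

Every entry of both lists is read once and very little arithmetic is done per entry, so the run time is dominated by loading data rather than by computing on it.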

Are the data usage requirements amenable to compute-in-memory?

## SSD Architecture

- Modern SSDs are typically architected with a higher (2-4x) internal bandwidth than the host interface bandwidth
- The SSD controller can be used as an off-load engine to execute some programs



### Smart SSD Architecture



## **Smart SSD Host Interaction**

- Host system sends only metadata to the Smart SSD (addresses, lengths of lists)
- Load lists are stored on the Smart SSD
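A rough way to see the benefit is to count the bytes that cross the host interface for one query. The model below is a sketch under stated assumptions: the 16-byte metadata size, the list lengths, the entry size, and the assumption that the Smart SSD returns the result entries to the host are all hypothetical.

```python
def host_interface_bytes(list_lengths, entry_bytes, result_entries,
                         in_storage, metadata_bytes_per_list=16):
    """Bytes crossing the host interface for one intersection query."""
    result_bytes = result_entries * entry_bytes
    if in_storage:
        # Host sends only addresses and lengths; the SSD returns the result.
        return metadata_bytes_per_list * len(list_lengths) + result_bytes
    # Conventional path: every list entry is shipped to the host CPU.
    return sum(list_lengths) * entry_bytes


lengths = [1_000_000, 2_000_000]
conventional = host_interface_bytes(lengths, 8, 10_000, in_storage=False)
smart = host_interface_bytes(lengths, 8, 10_000, in_storage=True)
print(f"conventional: {conventional:,} bytes")
print(f"in-storage:   {smart:,} bytes")
```

The in-storage path still has to read both lists from flash, but it does so over the faster internal channels instead of the host link.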



## Effects on Smart SSD Performance: Entry Size

- Larger entry size improves both execution time and energy compared to regular SSD implementation
- Above an entry size of 256, data loading dominates



## Remember: Uniform workload is better!

- Compute-in-memory is better for uniform data workload applications (wide access pattern)
- Consistently accessing a large set of data



#### However: Larger amount of data is better too

- Workload looks similar, except for the amount of data
- Compute-in-memory benefits applications that have memory loads as the primary bottleneck





# A 36.8 2b-TOPS/W Self Calibrating GPS Accelerator Implemented Using Analog Calculation in 65nm LP CMOS

Skylar Skrzyniarz<sup>1,2</sup>, Laura Fick<sup>2</sup>, Jinal Shah<sup>2</sup>, Yejoong Kim<sup>2</sup>, Dennis Sylvester<sup>2</sup>, David Blaauw<sup>2</sup>, David Fick<sup>1</sup>, Michael B. Henry<sup>1</sup>

> <sup>1</sup>Mythic (fka Isocline), Austin, TX <sup>2</sup>University of Michigan, Ann Arbor, MI

## GPS is a 4-Dimensional Search

- Need to find X, Y, Z, and T
- Calculated by acquiring time offset from 4+ satellites
- GPS is a CDMA signal
  - Received below thermal noise floor




## **Executing GPS Acquisition: The Problem**

- Time-domain search
- Timing offset (T) is calculated
  - Done through correlation
    - 1000s of 2-bit vector multiplies
    - 1000s of results to accumulate
- Pattern is very long
  - Reject noise
  - Increase gain
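The correlation search can be sketched in a few lines. This is an illustrative toy, not the accelerator's algorithm: a random ±1 sequence stands in for the real spreading code, and the code length, noise level, 2-bit level mapping (-3, -1, +1, +3), and quantizer threshold are all assumptions made here.

```python
import random

rng = random.Random(1)
N = 1023            # code length (illustrative choice)
TRUE_OFFSET = 357   # delay we pretend the receiver does not know

code = [rng.choice((-1, 1)) for _ in range(N)]


def quantize_2bit(x, threshold=2.0):
    """Map a sample to one of four levels: -3, -1, +1, +3."""
    magnitude = 3 if abs(x) > threshold else 1
    return magnitude if x >= 0 else -magnitude


# Received signal: delayed code buried in noise twice as large as the signal.
received = [
    quantize_2bit(code[(i - TRUE_OFFSET) % N] + rng.gauss(0.0, 2.0))
    for i in range(N)
]


def correlate(offset):
    """One matched-filter output: N multiplies followed by one accumulation."""
    return sum(received[i] * code[(i - offset) % N] for i in range(N))


scores = [correlate(offset) for offset in range(N)]
best = max(range(N), key=scores.__getitem__)
print("estimated offset:", best)
```

Each candidate offset costs N multiplies and N accumulations, and every offset has to be tried. The long pattern is what lets the peak stand out even though each individual sample is mostly noise.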



## **Executing GPS Acquisition: The Problem**

- Requires many operations
  - (10-350)×10<sup>9</sup> for civilian
  - ~35×10<sup>12</sup> for military
- Energy intensive
  - 19-665 mJ per satellite
- Time consuming
  - 10-350 ms per satellite



# Analog Calculation as the Solution


| Analog Calculation   | Charge                                                       | Conductance                                                     | Current                                                         |
|----------------------|--------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|
| Example Uses         | Multiply                                                     | Absolute-value-of-difference                                    | Summation ($I_3 = I_1 + I_2$)                                   |
| <u>Advantages</u>    | <ul><li>Energy efficient</li><li>Result is a voltage</li></ul> | <ul><li>Energy efficient</li><li>Area efficient</li></ul>       | <ul><li>Can use many inputs</li><li>Process tolerant</li></ul>  |
| <u>Disadvantages</u> | <ul><li>Area</li><li>Variation</li></ul>                     | <ul><li>Special device (flash)</li><li>Variation</li></ul>      | <ul><li>Lower energy efficiency</li></ul>                       |

Kramer, A.H

## Analog Calculation: Adding 4096 2-Bit Numbers

- Digital:
  - 8168 full adders
  - 15 stage tree



- Analog:
  - Current-mode summation



## Analog Calculation: Application Advantages

- Main challenge of analog calculation is added noise
  - Noise  $\downarrow$  as input terms  $\uparrow$
  - GPS has many input terms
- Correlation result has narrow dynamic range
  - Zero mean result
  - 48.5-to-51.5% vs. 0-to-100%
  - High resolution
    - 0.5% change in current per LSB
  - Circuits heavily optimized for this small range
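One way to see why noise falls as the number of input terms grows is the following idealized argument. It assumes, which the slides do not state, that each of the $N$ summed currents has nominal full-scale magnitude $I_0$ and carries an independent zero-mean error $\epsilon_k$ with standard deviation $\sigma$:

$$
I_{\text{sum}} = \sum_{k=1}^{N} \left(I_k + \epsilon_k\right), \qquad
\sigma_{\text{sum}} = \sqrt{N}\,\sigma, \qquad
\frac{\sigma_{\text{sum}}}{N I_0} = \frac{1}{\sqrt{N}} \cdot \frac{\sigma}{I_0}
$$

Under these assumptions the error relative to full scale shrinks as $1/\sqrt{N}$. For $N = 4096$ that is a factor of 64 smaller than the relative error of a single term.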










## **Die Photo**

- TSMC65LP
- All biases generated on-chip
- 0.325 mm<sup>2</sup>



## Results: Implementation vs. Ideal

- Digital simulation is ideal
- Single pass through matched filter
- Measured output overlaid on digital
  - 170 MHz
  - 25°C
- 4096 correlation calculations



## **Results: Analog Computation**

- Signal has inherent noise
  - RF front-end
- Analog computation noise:
  - 10× lower
  - Dominated by current source variation



## **Results: Comparison**

- 340-27,000× performance increase
- 67× energy efficiency increase
- Scalable for application

|                      | This Work                            | MITRE       | JSSC'05                             | ISCAS'11                              |
|----------------------|--------------------------------------|-------------|-------------------------------------|---------------------------------------|
| Process              | 65 nm                                | 180 nm      | 350 nm                              | 130 nm                                |
| Supply voltage (V)   | 1.20 (Analog)<br>1.15 (Digital)      | 1.8         | 2.0                                 | 1.0                                   |
| Frequency (MHz)      | 170                                  | 20.46       | 8                                   | 0.2                                   |
| Power (mW)           | 18.9*                                | 1,900       | 2                                   | 0.0004                                |
| Throughput (TOPS)    | 0.70*                                | 1.05        | 2.05E-3                             | 2.56E-5                               |
| TOPS/W               | 36.8                                 | 0.55        | 1.05                                | 64.0                                  |
| TOPS/mm<sup>2</sup>  | 2.154*                               | 0.0119      | 0.0038                              | 0.000197                              |
| Correlator length    | 4,096*                               | 51,150      | 256                                 | 128                                   |
| Input format         | 2-bit                                | 2-bit       | Analog                              | Analog                                |
| Area (mm<sup>2</sup>) | 0.325*                              | 88.0        | 0.54                                | 0.13                                  |
| Implementation       | Digital storage/<br>switched current | All digital | Analog storage/<br>switched current | Analog storage/<br>switched capacitor |

\*The design could be tiled to proportionally scale these metrics.

# Conclusion

- Implemented a 36.8 2b-TOPS/W matched filter for GPS application
- Uses analog calculation to achieve improvement in:
  - Energy
    - 67× gain in energy efficiency compared to all-digital implementation
  - Performance
    - 340-27,000× higher performance than previous analog implementations
  - Area
- Analog calculation has negligible noise contributions



# Analog Computation in Flash Memory for Datacenter-scale AI Inference in a Small Chip

Dave Fick, CTO/Founder; Mike Henry, CEO/Founder

HotChips 2018

### **DNNs are Largely Multiply-Accumulate**

#### Primary DNN Calculation is Input Vector \* Weight Matrix = Output Vector







#### Intermediate data accesses are amortized **64-1024x** since they are used in many MAC operations



Weight data may need to be stored in *DRAM*, and it does not get the same amortization as the intermediate data
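The difference in amortization can be made concrete by counting memory reads in a plain matrix-vector product at batch size 1. This is a minimal sketch; the function name, the 64x256 layer shape, and the counting convention (an input value is held in a register once read) are illustrative assumptions.

```python
def matvec_with_counts(x, weights):
    """Compute y = x * W and count MACs and memory reads."""
    rows, cols = len(weights), len(weights[0])
    y = [0] * cols
    counts = {"macs": 0, "input_reads": 0, "weight_reads": 0}
    for i in range(rows):
        xi = x[i]                        # one read of an input value...
        counts["input_reads"] += 1
        for j in range(cols):
            y[j] += xi * weights[i][j]   # ...feeds `cols` MACs
            counts["weight_reads"] += 1  # each weight serves a single MAC
            counts["macs"] += 1
    return y, counts


x = [1] * 64
w = [[1] * 256 for _ in range(64)]
y, counts = matvec_with_counts(x, w)
print(counts)
print("MACs per input read: ", counts["macs"] // counts["input_reads"])
print("MACs per weight read:", counts["macs"] // counts["weight_reads"])
```

In this example each input read is shared by 256 MACs, while every weight read pays for exactly one MAC, which is why weight storage sets the energy cost.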

# **DNN Processing is All About Weight Memory**

| Network                | Weights | MACs  | @ 30 FPS |
|------------------------|---------|-------|----------|
| AlexNet <sup>1</sup>   | 61 M    | 725 M | 22 B     |
| ResNet-18 <sup>1</sup> | 11 M    | 1.8 B | 54 B     |
| ResNet-50 <sup>1</sup> | 23 M    | 3.5 B | 105 B    |
| VGG-19 <sup>1</sup>    | 144 M   | 22 B  | 660 B    |
| OpenPose <sup>2</sup>  | 46 M    | 180 B | 5400 B   |

<sup>1</sup>: 224x224 resolution

<sup>2</sup>: 656x368 resolution

Hard to fit this in an Edge solution

- 10+M parameters to store
- 20+B memory accesses
- How do we achieve...
  - High Energy Efficiency
  - High Performance
  - "Edge" Power Budget (e.g., 5W)
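The "@ 30 FPS" column is the per-frame MAC count multiplied by the frame rate. The short sketch below reproduces it from the per-frame figures in the table; the dictionary and function names are chosen here for illustration.

```python
MACS_PER_FRAME = {
    "AlexNet": 725e6,
    "ResNet-18": 1.8e9,
    "ResNet-50": 3.5e9,
    "VGG-19": 22e9,
    "OpenPose": 180e9,
}
FPS = 30


def macs_per_second(network, fps=FPS):
    """Sustained MAC rate needed to run `network` at `fps` frames per second."""
    return MACS_PER_FRAME[network] * fps


for name in MACS_PER_FRAME:
    print(f"{name:10s} {macs_per_second(name) / 1e9:7.1f} B MAC/s")
```

Without weight re-use, each of those MACs also implies fetching a weight, which is where the tens of billions of memory accesses per second come from.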

#### Common Techniques for Reducing Weight Energy Consumption

#### Weight Re-use

- Focus on CNN
  - Re-use weights for multiple windows
  - Can build specialized structures
  - 😕 Not all problems map to CNN well
- Focus on Large Batch
  - Re-use weights for multiple inputs
  - 😕 Edge is often batch=1
  - 😕 Increases latency

#### Weight Reduction

- Shrink the Model
  - Use a smaller network that can fit on-chip (e.g., SqueezeNet)
  - Possibly reduced capability
- Compress the Model
  - Use sparsity to eliminate up to 99% of the parameters
  - Use literal compression
  - Possibly reduced capability
- Reduce Weight Precision
  - 32b Floating Point => 2-8b Integer
  - Possibly reduced capability

## Key Question: Use DRAM or Not?

**Benefits of DRAM**

- Can fit arbitrarily large models
- Not as much SRAM needed on chip

**Drawbacks of DRAM**

- Huge energy cost for reading weights
- Limited bandwidth getting to weight data
- Variable energy efficiency & performance depending on application

# Common NN Accelerator Design Points

|             | Enterprise<br>With DRAM | Enterprise<br>No-DRAM | Edge<br>With DRAM | Edge<br>No-DRAM |
|-------------|-------------------------|-----------------------|-------------------|-----------------|
| SRAM        | <50 MB                  | 100+ MB               | < 5 MB            | < 5 MB          |
| DRAM        | 8+ GB                   | -                     | 4-8 GB            | -               |
| Power       | 70+ W                   | 70+ W                 | 3-5 W             | 1-3 W           |
| Sparsity    | Light                   | Light                 | Moderate          | Heavy           |
| Precision   | 32f / 16f / 8i          | 32f / 16f / 8i        | 8i                | 1-8i            |
| Accuracy    | Great                   | Great                 | Moderate          | Poor            |
| Performance | High                    | High                  | Very Low          | Very Low        |
| Efficiency  | 25 pJ/MAC               | 2 pJ/MAC              | 10 pJ/MAC         | 5 pJ/MAC        |


### **Mythic is Fundamentally Different**

|             | Enterprise<br>With DRAM | Enterprise<br>No-DRAM | Edge<br>With DRAM | Edge<br>No-DRAM | Mythic<br>NVM |
|-------------|-------------------------|-----------------------|-------------------|-----------------|---------------|
| SRAM        | <50 MB                  | 100+ MB               | < 5 MB            | < 5 MB          | < 5 MB        |
| DRAM        | 8+ GB                   | -                     | 4-8 GB            | -               | -             |
| Power       | 70+ W                   | 70+ W                 | 3-5 W             | 1-3 W           | 1-5 W         |
| Sparsity    | Light                   | Light                 | Moderate          | Heavy           | None          |
| Precision   | 32f / 16f / 8i          | 32f / 16f / 8i        | 8i                | 1-8i            | 1-8i          |
| Accuracy    | Great                   | Great                 | Moderate          | Poor            | Great         |
| Performance | High                    | High                  | Very Low          | Very Low        | High          |
| Efficiency  | 25 pJ/MAC               | 2 pJ/MAC              | 10 pJ/MAC         | 5 pJ/MAC        | 0.5 pJ/MAC    |
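To put the efficiency row in perspective, energy per MAC can be turned into power for a given workload. The sketch below uses the pJ/MAC figures from the table and the 105 B MAC/s figure for ResNet-50 at 30 FPS quoted earlier. It is only an arithmetic illustration: it does not check whether a given design point could actually sustain that throughput, and the names are chosen here.

```python
PJ_PER_MAC = {
    "Enterprise with DRAM": 25.0,
    "Enterprise no-DRAM": 2.0,
    "Edge with DRAM": 10.0,
    "Edge no-DRAM": 5.0,
    "Mythic NVM": 0.5,
}

WORKLOAD = 105e9  # MAC/s, ResNet-50 at 30 FPS


def compute_power_watts(macs_per_second, pj_per_mac):
    """Power spent on MACs alone: (MAC/s) x (J/MAC)."""
    return macs_per_second * pj_per_mac * 1e-12


for name, pj in PJ_PER_MAC.items():
    print(f"{name:22s} {compute_power_watts(WORKLOAD, pj):6.3f} W")
```

At 25 pJ/MAC this one network already costs a few watts, which is most of a 5 W edge budget, while at 0.5 pJ/MAC it costs a few tens of milliwatts.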


## Mythic is Fundamentally Different

Also, Mythic does this in a 40nm process, compared to 7/10/16nm.

#### Mythic is a PCIe Accelerator



Figure labels: Inference Model (specified via TensorFlow, Caffe2, or others), Applications, Interfaces, Mythic Driver

#### We Also Support Multiple IPUs



Figure labels: Inference Model (specified via TensorFlow, Caffe2, or others), Operating System, Applications, Interfaces, Mythic Driver

# Mythic's New Architecture Merges Enterprise and Edge

- Mythic introduces the Matrix Multiplying Memory
  - Never read weights
- This effectively makes weight memory access *energy-free* (only pay for MAC)
- And eliminates the need for...
  - Batch > 1
  - CNN Focus
  - Sparsity or Compression
  - Nerfed DNN Models



Made possible with Mixed-Signal Computing on embedded flash

## **Revisiting Matrix Multiply**

#### Primary DNN Calculation is Input Vector \* Weight Matrix = Output Vector



## Analog Circuits Give us the MAC We Need

Flash transistors can be modeled as **variable resistors** representing the weight

The V=IR current equation will achieve the math we need:

> - Inputs (X) = DAC
> - Weights (R) = Flash transistors
> - Outputs (Y) = ADC Outputs

The ADCs convert current to digital codes, and provide the non-linearity needed for DNN
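The idea can be sketched with an idealized numerical model. This is not Mythic's circuit: the function name, the arbitrary units, the unsigned conductances, and the choice to model the ADC as a clip-and-round stage are assumptions for illustration. Each cell contributes a current I = V × G (Ohm's law with G = 1/R), and the currents on a shared column wire add up.

```python
def analog_matvec(inputs, conductances, adc_bits=8, full_scale=None):
    """Idealized crossbar: voltages in, summed column currents out, then ADC."""
    cols = len(conductances[0])
    # Ohm's law per cell (I = V * G); currents on a shared column wire add up.
    currents = [
        sum(v * row[j] for v, row in zip(inputs, conductances))
        for j in range(cols)
    ]
    if full_scale is None:
        full_scale = max(max(currents), 1e-12)
    levels = 2 ** adc_bits - 1
    # The ADC clips to [0, full_scale] and rounds to the nearest code.
    codes = [
        round(min(max(i, 0.0), full_scale) / full_scale * levels)
        for i in currents
    ]
    return currents, codes


voltages = [1.0, 0.5, 0.0]                    # DAC outputs, arbitrary units
cells = [[0.5, 2.0], [1.0, 4.0], [3.0, 1.0]]  # programmed conductances
currents, codes = analog_matvec(voltages, cells, adc_bits=8, full_scale=4.0)
print(currents)  # [1.0, 4.0]
print(codes)     # [64, 255]
```

The weights are never read out as data in this picture. They only shape the current that flows, and the clipping in the ADC stage is one way a converter can double as a non-linearity.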



# DACs & ADCs Give Us a Flexible Architecture

We have a **digital** top-level architecture:

- Interconnect
- Intermediate data storage
- Programmability (XLA/ONNX => Mythic IPU)



# To Simplify we use Digital Approximation

To improve time-to-market, we have left the Input DAC as a future endeavor

We achieve the same result through digital approximation
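The slides do not describe how the digital approximation works. Purely as an illustration, and not as Mythic's method, one widely used way to avoid a multi-level input DAC is bit-serial input: apply one bit of every input at a time, so each row is simply on or off, then combine the partial sums digitally with shifts. The function name and the unsigned 8-bit input format are assumptions.

```python
def bit_serial_matvec(inputs, weights, input_bits=8):
    """Compute x * W by applying one input bit-plane at a time."""
    cols = len(weights[0])
    totals = [0] * cols
    for bit in range(input_bits):
        # Each row is driven fully on or off, so no multi-level DAC is needed.
        plane = [(x >> bit) & 1 for x in inputs]
        partial = [
            sum(p * row[j] for p, row in zip(plane, weights))
            for j in range(cols)
        ]
        # Digital shift-and-add restores the weight of this bit position.
        for j in range(cols):
            totals[j] += partial[j] << bit
    return totals


x = [3, 200, 17]               # unsigned 8-bit inputs
w = [[1, 2], [3, 4], [5, 6]]
print(bit_serial_matvec(x, w))  # [688, 908]
```

The result matches a direct multiply, at the cost of one pass through the array per input bit. That trade of time for circuit simplicity is typical of this family of techniques.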

<u>Silver lining:</u> we have future improvements available






### We Account For All Energy Consumed

- Numbers are for a typical application, e.g. ResNet-50
  - Batch size = 1
  - We are relatively application-agnostic (especially compared to DRAM-based systems)
- 8b analog compute accounts for about half of our energy
  - We can also run lower precision
  - Control, storage, and PCIe account for the other half







#### Mythic Mixed-Signal Computing

#### Single Tile





Made possible with Mixed-Signal Computing on embedded flash



## **System Overview**

#### **Initial Product**

- 50M weight capacity
- PCIe 2.1 x4
- Basic Control Processor

#### **Envisioned Customizations (Gen 1)**

- Up to 250M weight capacity
- PCIe 2.1 x16
- USB 3.0/2.0
- Direct Audio/Video Interfaces
- Enhanced Control Processor (e.g., ARM)

#### Intelligence Processing Unit (IPU)






#### Wrapping Up

### What is Possible with Compute-in-Memory?

- >10x improvement in energy efficiency
- >10x improvement in performance
- Application specific benefits
  - Not every algorithm can benefit from CiM!
  - Some benefit more than others

### **Compute-in-Memory Considerations**

- What does the working set look like?
  - Is it "wide"?
  - Is it "large"?
- How important is this algorithm to our system?
  - Does it use up 90% of something?
- How predictable are our data patterns?
  - Can we reduce data movement somehow?