Quentin Dariol's PhD defense - 27.11.2023

# Early Timing and Energy Prediction and Optimization of Artificial Neural Networks on Multi-Core Platforms

#### Evaluation committee:

- Dr. Kim GRÜTTNER
- Prof. Dr. Matthias JUNG
- Dr. Angeliki KRITIKAKOU
- Prof. Dr. Frédéric PÉTROT
- Prof. Dr. Gregor SCHIELE

#### Guest:

Dr. Domenik HELMS

#### Reviewers before defense:

- Prof. Dr. Matthias JUNG
- Dr. Angeliki KRITIKAKOU

#### Thesis work supervised by:

- Prof. Dr. Sébastien PILLEMENT
- Dr. Sébastien LE NOURS



## **Context – Artificial Neural Networks (NNs)**



Rise of interest in AI algorithms, especially NNs.



Sources:

Maslej, N.; et al., "The AI Index 2023 Annual Report", Stanford Institute for Human-Centered Artificial Intelligence (HAI), 2022

Symbols from: https://www.flaticon.com/

## Context – NNs on edge devices





=> Metrics that matter at the edge
=> Need evaluation flow to find optimized mappings

**Source:** Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. *An Analysis of Deep Neural Network Models for Practical Applications*. 2017. arXiv: 1605.07678.

#### Proposition and presentation outline





- I. Fundamentals & hypothesis
- II. Timing prediction flow
- III. Power and energy analysis flow
- IV. Design Space Exploration (DSE) flow
- V. Conclusion & Prospects

# I. Fundamentals & hypothesis — Challenges of NN deployment on multi-core platforms





#### Other aspects:

- Use of power management
- Platform size (number of cores, memory)
- Different NN workloads => no "one-size-fits-all" solution

#### I. Fundamentals & hypothesis – Related work









#### Rapid prototyping

- Implementation and measurement on prototyping platform
- Highest accuracy
- Slow (need to deploy NN and measure)
- Limited architectural search

#### Analytical models

- Mathematical formula
- Fast to execute
- Lower accuracy than rapid prototyping
- Limited ability to describe complex phenomena

#### Simulation

- Virtual platform described in HDL
- Slower than analytical models, faster than rapid prototyping
- High accuracy

- [Galanis2020] Galanis I. et al. "Inference and Energy Efficient Design of Deep Neural Networks for Embedded Devices", IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2020
- [Tsimpourlas2018] Tsimpourlas F. et al. "A Design Space Exploration Framework for Convolutional Neural Networks Implemented on Edge Devices", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCADICS), 2018
- [VelascoMontero2020] Velasco Montero D. et al. "PreVIous: A Methodology for Prediction of Visual Inference Performance on IoT Devices", IEEE Journal of Internet of Things, 2020
- [Guo2023] Guo X. et al. "Automated Exploration and Implementation of Distributed CNN Inference at the Edge", IEEE Journal of Internet of Things, 2023
- [Osterwind2022] Osterwind A. et al. "Hardware Execution Time Prediction for Neural Network Layers", IoT, Edge, and Mobile for Embedded Machine Learning (ITEM), 2022
- [Venieris2019] Venieris, S. and Bouganis, C.-S. "fpgaConvNet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs", IEEE Transactions on Neural Networks and Learning Systems, 2019
- [Parashar2019] Parashar, A. et al. "Timeloop: A Systematic Approach to DNN Accelerator Evaluation", ISPASS 2019
- [Garbay2021] Garbay, T. et al. "CNN Inference Costs Estimation on Microcontrollers: the EST Primitive-based Model", IEEE International Conference on Electronics, Circuits, and Systems (ICECS), 2021
- [Lee2022] Lee, J. et al. "Implication of Optimizing NPU Dataflows on Neural Architecture Search for Mobile Devices" ACM Transactions on Design Automation of Electronic Systems (TODAES), 2022
- [Sombatsiri2019] Sombatsiri, S. et al. "A Design Space Exploration Method of SoC Architecture for CNN-based AI Platform", Synthesis And System Integration of Mixed Information technologies (SASIMI), 2019

| Work                 | HW target | Evaluation<br>speed | Accuracy<br>Timing | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 |             | Inter-layer<br>parallelism | Intra-layer<br>parallelism | Power<br>management | HW<br>dimensions |
|----------------------|-----------|---------------------|--------------------|---------------------------------------|-------------|----------------------------|----------------------------|---------------------|------------------|
| [Galanis2020]        | GPU       | ×                   | <b>*</b>           | <b>~ ~ ~</b>                          | <b>*</b>    | <b>~</b>                   | <del>~</del>               | ×                   | ×                |
| [Tsimpourlas2018]    | VPU       | ×                   | <b>*</b>           | <b>*</b>                              | <b>&gt;</b> | <b>&gt;</b>                | <b>~</b>                   | ×                   | ×                |
| [VelascoMontero2020] | Multicore | <b>**</b>           | <b>&gt;</b>        | 0                                     | ×           | <b>&gt;</b>                | <b>~</b>                   | ×                   | ×                |
| [Guo2023]            | All       | <b>*</b>            | <b>&gt;</b>        | <b>&gt;</b>                           | ×           | <b>~</b>                   | <b>~</b>                   | ×                   | ×                |
| [Osterwind2022]      | VPU       | <b>*</b>            | ×                  | 0                                     | ×           | <b>~</b>                   | <b>~</b>                   | ×                   | ×                |
| [Venieris2019]       | FPGA      | <b>**</b>           | <b>&gt;</b>        | 0                                     | ×           | <b>&gt;</b>                | <b>&gt;</b>                | ×                   | ×                |
| [Parashar2019]       | FPGA, GPU | <b>*</b>            | <b>~</b>           | <b>~</b>                              | ×           | <b>✓</b>                   | <b>✓</b>                   | ×                   | <b>~</b>         |
| [Garbay2023]         | MCU       | <b>~ ~ ~</b>        | <b>~</b>           | <b>~</b>                              | ×           | <b>~</b>                   | ×                          | <b>~</b>            | ×                |
| [Lee2022]            | NPUs      | P                   | <b>&gt;</b>        | 0                                     | <b>eee</b>  | <b>&gt;</b>                | <b>&gt;</b>                | ×                   | <b>✓</b>         |
| [Sombatsiri2019]     | SoC       | <b>X</b> (~100s)    | <b>&gt;</b>        | 0                                     | <b>&gt;</b> | <b>&gt;</b>                | <b>~</b>                   | ×                   | <b>~</b>         |
| THIS WORK            | Multicore | <b>✓</b>            | <b>~</b>           | <b>~</b>                              | <b>~</b>    | <b>~</b>                   | <b>~</b>                   | <b>~</b>            | <b>~</b>         |

#### Evaluation speed:

- ✓✓: ~1ms
- ✓: <60s
- ✗: >60s

#### Accuracy:



#### Other criteria:



# I. Fundamentals & hypothesis – Research challenges to address





# I. Fundamentals & hypothesis – Model of Computation (MoC), Model of Architecture (MoA), mapping





- SDF: Synchronous DataFlow
- Strict separation computation/communication
  - Actors,
  - Channels,
  - Tokens.
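The actor/channel/token vocabulary above can be illustrated with a minimal sketch. It is not the SDF implementation used in the thesis: the class names, the rates and the two-actor graph are hypothetical. It only shows the standard SDF firing rule, where an actor may fire once every input channel holds at least its consumption rate in tokens.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Channel between two actors, tracking how many tokens it holds."""
    produce_rate: int   # tokens written per firing of the producer
    consume_rate: int   # tokens read per firing of the consumer
    tokens: int = 0     # tokens currently stored


@dataclass
class Actor:
    """Computation node; communication happens only through channels."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

    def can_fire(self) -> bool:
        # SDF firing rule: every input must hold enough tokens.
        return all(ch.tokens >= ch.consume_rate for ch in self.inputs)

    def fire(self) -> None:
        if not self.can_fire():
            raise RuntimeError(f"{self.name} is not ready to fire")
        for ch in self.inputs:
            ch.tokens -= ch.consume_rate
        for ch in self.outputs:
            ch.tokens += ch.produce_rate


# Hypothetical graph: layer1 produces 2 tokens per firing, layer2 needs 4.
link = Channel(produce_rate=2, consume_rate=4)
layer1 = Actor("layer1", outputs=[link])
layer2 = Actor("layer2", inputs=[link])

layer1.fire()
ready_after_one = layer2.can_fire()   # only 2 tokens available
layer1.fire()
ready_after_two = layer2.can_fire()   # 4 tokens available
layer2.fire()                         # consumes the 4 tokens
```

Because rates are fixed, the number of firings needed before `layer2` can run is known statically, which is what makes SDF graphs convenient for early timing analysis.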



- MoA: Model of Architecture
- Two versions:
  - Without power management: polling
  - With power management: interrupt + clock gating

# I. Fundamentals & hypothesis – Model of Computation (MoC), Model of Architecture (MoA), mapping





1. Artificial Neural Network (NN)
2. Clustering (description using a dataflow-oriented MoC)
3. Mapping
4. Performance/power analysis






#### II. Timing modeling flow - Overview





- [1] Vu, H.-D. "Fast and Accurate Performance Models for Probabilistic Timing Analysis of SDFGs on MPSoCs", PhD thesis, Université de Nantes, 2021
- [2] Schlaak, C.; Fakih, M. & Stemmer, R. "Power and Execution Time Measurement Methodology for SDF Applications on FPGA-based MPSoCs", International Workshop on High Performance Energy Efficient Embedded Systems (HIP3ES), 2017
- [3] Stemmer, R.; Vu, H.-D.; Le Nours, S.; Grüttner, K.; Pillement, S. & Nebel, W. "A Measurement-Based Message-Level Timing Prediction Approach for Data-Dependent SDFGs on Tile-Based Heterogeneous MPSoCs", Applied Sciences, 2021

## II. Timing modeling flow – Computation time model



#### NN described in SDF






## II. Timing modeling flow – Model calibration



# Evolution of cluster execution time while varying N (fixed M=784)



# Evolution of cluster execution time while varying M (fixed N=1)



# Predicted actor computation time in cycles based on M and N



$$D_{dense}(M, N) = M \cdot N \cdot D_{\Sigma} + M \cdot D_{\varphi} + D_{setup}$$
$$D_{\Sigma} = 47^{*}, D_{\varphi} = 146^{*} \text{ and } D_{setup} = 39^{*}$$

Estimated: $D_{\Sigma} = 30$, $D_{\varphi} = 61$ and $D_{setup} = 6$

\*: value in processor cycles
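As a quick numerical check of the formula above, the following sketch evaluates $D_{dense}(M, N)$ directly. The function name is hypothetical. The default coefficients are the first set given on the slide (47, 146, 39), and the set labeled "Estimated" is passed explicitly for comparison. The evaluation point M=784, N=1 is one of the configurations named in the calibration plots.

```python
def dense_actor_cycles(m: int, n: int,
                       d_sigma: int = 47,
                       d_phi: int = 146,
                       d_setup: int = 39) -> int:
    """Evaluate D_dense(M, N) = M*N*D_sigma + M*D_phi + D_setup, in cycles."""
    return m * n * d_sigma + m * d_phi + d_setup


# First coefficient set from the slide.
cycles_first_set = dense_actor_cycles(m=784, n=1)

# Coefficient set labeled "Estimated" on the slide.
cycles_estimated_set = dense_actor_cycles(m=784, n=1,
                                          d_sigma=30, d_phi=61, d_setup=6)

print(cycles_first_set)      # 151351
print(cycles_estimated_set)  # 71350
```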

# III. Timing modeling flow – Simulation described in SystemC







#### III. Timing modeling flow – Experimental results, test cases







- 2 Evaluation speed: ~20s.
- 3 NN different workloads ✓
  - MLP1: 0.83%, MLP2: 0.31%, MLP3: 0.62%, CNN1: 0.43%.
- 4 Communication procedure (polling or interrupt) ✓
- 5 Number of cores used ✓
- 6 NN clustering complexity ✓







#### 8 Comparison with analytical model:

- Error up to 30% on multi-core scenarios.
- Very high evaluation speed (~1ms)








# IV. Power modeling flow - Overview





## IV. Power modeling flow - Proposed model



$$P(t) = P_{\text{static}} + P_{\text{comp}}(t) + P_{\text{comm}}(t)$$

#### Without power management △



#### With power management ▲



if at least one tile is reading, writing or polling on shared memory at time t otherwise

$$P_{\blacktriangle, \text{ comm}}(t) = P_{\blacktriangle, \text{ rw}}(t) + P_{\blacktriangle, \text{ cg}}(t)$$
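To show how the decomposition $P(t) = P_{\text{static}} + P_{\text{comp}}(t) + P_{\text{comm}}(t)$ leads to an energy figure, here is a minimal sketch that integrates power over time. It assumes power is piecewise constant over short segments, and all numeric values are hypothetical rather than calibrated values from the thesis. It does not reproduce the distinction between the platform versions with and without power management.

```python
def total_power(p_static: float, p_comp: float, p_comm: float) -> float:
    """P(t) = P_static + P_comp(t) + P_comm(t), all in watts."""
    return p_static + p_comp + p_comm


def energy(p_static: float, segments) -> float:
    """Integrate power over piecewise-constant segments.

    segments: iterable of (duration_s, p_comp_w, p_comm_w) tuples.
    Returns the energy in joules.
    """
    return sum(total_power(p_static, p_comp, p_comm) * duration
               for duration, p_comp, p_comm in segments)


# Hypothetical trace: computing, then communicating, then idle.
trace = [
    (2.0, 0.5, 0.0),    # computation only
    (1.0, 0.0, 0.25),   # shared-memory communication only
    (1.0, 0.0, 0.0),    # idle: static power only
]
joules = energy(p_static=1.0, segments=trace)
print(joules)  # 5.25
```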

#### IV. Power modeling flow – Power model calibration









## IV. Power modeling flow – Power model calibration





One tile at a time enabled







## IV. Power modeling flow – Experiments





- 1 Overall accuracy: >93% on 54 mappings.
- 2 Evaluation speed: ~20s.
- 3 Different NN workloads: average prediction error between 1.8% and 3% for the 4 NNs
- 4 Use of power management: average error is 2.11% without, 3.92% with ✓
- 5 Number of cores used and communication rates
- 6 Analytical model:
  - Maximum error: ~20%
  - Evaluation time: ~1ms





#### IV. Power modeling flow – Experiments



The flow is used to jointly evaluate and optimize multi-core platform architectures and NN deployments under power and energy constraints.

#### Multi-core platform versions:









#### Single-core platform versions:



| Platform    | Static power consumption only | Static + dynamic |
|-------------|-------------------------------|------------------|
| Multi-core  | < 5%                          | ~ 5%             |
| Single-core | < 5%                          | > 10%            |






# V. Design Space Exploration (DSE) - Overview





#### 2 phases:

- Phase 1: Fast exploration using best-case, purely analytical models
- Phase 2: Slower but accurate evaluation of the most relevant mappings using simulation.
- Branch & Bound enhanced clustering and mapping search
- Possibility to perform several iterations of the flow in order to consider additional branches.
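The two-phase idea can be sketched as follows. This is a toy illustration, not the thesis' DSE flow: the cost functions, the workloads and the `keep` parameter are hypothetical, the "accurate" evaluator is a cheap stand-in for simulation, and the Branch & Bound pruning is replaced here by a plain enumeration of mappings.

```python
from itertools import product


def two_phase_search(candidates, fast_estimate, accurate_evaluate, keep):
    """Phase 1: rank all candidates with a fast estimate and keep the best.
    Phase 2: re-evaluate only the shortlist with the slower evaluator."""
    shortlist = sorted(candidates, key=fast_estimate)[:keep]
    scored = [(accurate_evaluate(mapping), mapping) for mapping in shortlist]
    scored.sort(key=lambda pair: pair[0])
    return scored


# Hypothetical problem: 4 clusters (cost in cycles) mapped onto 2 tiles.
WORKLOADS = [40, 30, 20, 10]
N_TILES = 2


def fast_estimate(mapping):
    """Best-case latency: most loaded tile, communication ignored."""
    loads = [0] * N_TILES
    for cycles, tile in zip(WORKLOADS, mapping):
        loads[tile] += cycles
    return max(loads)


def accurate_evaluate(mapping):
    """Stand-in for simulation: adds a fixed penalty each time two
    consecutive clusters sit on different tiles."""
    crossings = sum(1 for a, b in zip(mapping, mapping[1:]) if a != b)
    return fast_estimate(mapping) + 5 * crossings


all_mappings = list(product(range(N_TILES), repeat=len(WORKLOADS)))
results = two_phase_search(all_mappings, fast_estimate,
                           accurate_evaluate, keep=4)
best_cost, best_mapping = results[0]
print(best_cost, best_mapping)  # 60 (0, 1, 1, 0)
```

Only 4 of the 16 mappings reach the second phase here. The trade-off is that a mapping discarded by the optimistic first-phase estimate is never re-examined unless another iteration of the flow considers additional branches.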

# V. DSE – Branch & Bound enhanced clustering search

















# V. DSE – Branch & Bound enhanced mapping search







Figure labels: Branch#1; Tile#1; L1C1, L1C2, L1C3, L2, L3.








#### V. DSE – Example use of the DSE flow







Figure: candidate solutions found for MLP1, with a zoom on the 50 best. Axis: latency (cycles).

 The flow indicates when power management is worth using to enhance timing and energy.

#### V. DSE – Evaluation of the DSE flow



Comparison of Branch & Bound-enhanced and exhaustive clustering search:



=> The optimal clustering identified by exhaustive search is always found.

These clusterings have a better score than 99% of the other clusterings found with exhaustive search.

#### V. DSE – Evaluation of the DSE flow



- Comparison of Branch & Bound-enhanced and exhaustive mapping search
  - Similar observations. However, the optimal mapping is not guaranteed to be found.



- Use of pure analytical models for pruning vs simulation
  - => Similar results are obtained with the analytical models and with simulation.






## VI. Conclusion – Research challenges





- 1. How can timing and energy properties of streaming NN deployments on multi-core platforms be evaluated quickly yet accurately, early in the design phase?
  - ➤ Use a hybrid modeling flow: simulation, analytical models, measurements.
- 2. Is a model-based approach more relevant than rapid prototyping?
  - ➤ Yes. It is 6 times faster with high accuracy, and it does not need the NN to be trained.
- 3. Is a model-based approach suited for early, fast and confident Design Space Exploration (DSE) of streaming NN deployments on multi-core platforms?
  - ➤ Yes, we demonstrated it with our DSE approach.

#### **VI. Conclusion – Limitations**





- The prediction error (standard deviation) on power and energy rises with the communication rate per tile, reaching up to 7% at a rate of 70%.
- On single-core platforms with large private memory allocations (1024 kB, 2048 kB), the power and energy modeling error exceeds 10%.
- The analytical models used for the DSE flow could be improved.

# VI. Conclusion – Perspectives





- Extend the flow to support Neural Architecture Search (NAS) [1]
- Offer modeling and exploration of external memory accesses (necessary for larger NNs)

[1] Elsken, T.; Metzen, J. H. & Hutter, F. "Neural Architecture Search: A Survey", *Journal of Machine Learning Research (JMLR)*, **2019** 






# APPENDICES

# **Appendix – Prototype platform**





#### I. Fundamentals & hypothesis – Considered NNs





| NN name | Number of layers | Data-set   | Accuracy |
|---------|------------------|------------|----------|
| MLP1    | 2                | MNIST [11] | 85%      |
| MLP2    | 3                | MNIST [11] | 89%      |
| MLP3    | 3                | GTSRB [90] | 20%      |
| CNN1    | 4                | MNIST [11] | 77%      |
| CNN2    | 7                | MNIST [11] | N.A.     |



#### **Context – Internet of Things (IoT)**



Figure axis: from more centralized to less centralized.



#### **Inspiration for figure:**

- Gunathilake, N. A.; Buchanan, W. J. & Asif, R. "Next Generation Lightweight Cryptography for Smart IoT Devices: Implementation, Challenges and Applications", 2019 IEEE 5th World Forum on Internet of Things (WF-IoT), 2019
- ur Rehman, M. H.; Yaqoob, I.; Salah, K.; Imran, M.; Jayaraman, P. P. & Perera, C. "The role of big data analytics in industrial Internet of Things", Future Generation Computer Systems, 2019

# Appendix – Private memory model for tile sizing



| Tile private memory sections (in order)           | Actual content                                                           | Memory size model (bytes)    |
|---------------------------------------------------|--------------------------------------------------------------------------|------------------------------|
| .vectors                                          | SW/HW exceptions management,<br>reset, etc.                              | 128                          |
| .text                                             | Instructions                                                             | $8192 + 512 \cdot N_{actor}$ |
| .init, .fini, .ctors,<br>.dtors, .rodata, .sdata2 | /                                                                        | Marginal, neglected          |
| .data                                             | Initialized global variables:<br>Weights and input image                 | A                            |
| .sdata, .sbss                                     | /                                                                        | Marginal, neglected          |
| .bss                                              | Uninitialized global variables                                           | 256                          |
| .heap                                             | Dynamically allocated space                                              | 2048                         |
| .stack                                            | Local variables used inside functions,<br>SDF token_buffers and channels | B                            |

$$\mathbf{A}: B_{l=0} + \sum_{a=1}^{A} \lambda_a W_a$$

with

$$W_a = \begin{cases} 4 \cdot N_{\text{neuron},a} \cdot (N_{\text{inputs}} + 1) & \text{if } a \text{ is an actor from a dense layer} \\ 4 \cdot K_a \cdot (F_{\text{h}} \cdot F_{\text{w}} + N_{\text{inputs}}) & \text{if } a \text{ is an actor from a convolution layer} \end{cases}$$

and

$$\lambda_a = \begin{cases} 1 & \text{if actor } a \text{ is mapped on the considered tile} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathbf{B}: 4 \cdot \left(2 \cdot N_{\text{channels}} + \sum_{l=0}^{L-1} B_l\right)$$
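The memory size model above can be evaluated with a short sketch. Several points are assumptions rather than statements from the slides: the total is taken as the sum of the non-neglected sections, $N_{actor}$ is read as the number of actors mapped on the considered tile, and the $B_l$ values (not defined on the slide) are simply passed in as a list. All class and function names, and the example network, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TileActor:
    kind: str            # "dense" or "conv"
    on_tile: bool        # lambda_a: mapped on the considered tile?
    n_inputs: int        # N_inputs
    n_neurons: int = 0   # N_neuron,a (dense actors only)
    n_kernels: int = 0   # K_a (convolution actors only)
    f_h: int = 0         # F_h (convolution actors only)
    f_w: int = 0         # F_w (convolution actors only)


def weight_bytes(a: TileActor) -> int:
    """W_a from the slide."""
    if a.kind == "dense":
        return 4 * a.n_neurons * (a.n_inputs + 1)
    if a.kind == "conv":
        return 4 * a.n_kernels * (a.f_h * a.f_w + a.n_inputs)
    raise ValueError(f"unknown actor kind: {a.kind}")


def data_section_bytes(actors, b) -> int:
    """Term A: B_{l=0} plus the weights of actors mapped on the tile."""
    return b[0] + sum(weight_bytes(a) for a in actors if a.on_tile)


def stack_section_bytes(n_channels: int, b) -> int:
    """Term B: 4 * (2 * N_channels + sum of B_l for l = 0..L-1)."""
    return 4 * (2 * n_channels + sum(b))


def private_memory_bytes(actors, b, n_channels: int) -> int:
    """Assumed total: sum of the non-neglected sections of the table."""
    n_actor = sum(1 for a in actors if a.on_tile)  # assumption, see lead-in
    vectors, bss, heap = 128, 256, 2048
    text = 8192 + 512 * n_actor
    return (vectors + text + data_section_bytes(actors, b)
            + bss + heap + stack_section_bytes(n_channels, b))


# Hypothetical example: two dense actors, only the first on this tile.
actors = [
    TileActor("dense", on_tile=True, n_inputs=784, n_neurons=10),
    TileActor("dense", on_tile=False, n_inputs=10, n_neurons=10),
]
b_values = [784, 10, 10]   # hypothetical B_0, B_1, B_2
total = private_memory_bytes(actors, b_values, n_channels=2)
print(total)  # 46552
```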

#### **Appendix** – Communication time model



: Simulation synchronization event



$$D_{RW}(n_T) = \underbrace{t_{init,RW} + t_p}_{\text{Check token availability}} \underbrace{+t_{pre,RW} + t_{RW} \cdot n_T + t_{RWl} \cdot (n_T - 1)}_{\text{Buffer access}} \underbrace{+t_{post,RW} + t_w}_{\text{Token status update}}$$

| Communication procedure | $t_r$ | $t_p$ | $t_w$ | $t_{rl}$ | $t_{wl}$ | $t_{pl}$ | $t_{r_{loop}}$ | $t_{w_{loop}}$ | $t_{p_{loop}}$ | $t_{pr_r}$ | $t_{po_r}$ | $t_{pr_w}$ | $t_{po_w}$ | $t_{init_r}$ | $t_{init_w}$ |
|-------------------------|-------|-------|-------|----------|----------|----------|----------------|----------------|----------------|------------|------------|------------|------------|--------------|--------------|
| Polling                 | 8     | 8     | 5     | 14       | 13       | 7        | 22             | 18             | 15             | 15         | 11         | 15         | 9          | 15           | 16           |
| Interrupt               | 8     | 0     | 5     | 14       | 13       | 0        | 22             | 18             | 0              | 15         | 11         | 15         | 9          | 348          | 349          |

\*: All delays in processor cycles
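As an illustration, the sketch below evaluates $D_{RW}(n_T)$ for a read operation with the delays from the table. The mapping between the generic symbols of the formula and the table columns is an assumption made for this example: $t_{init,RW} \to t_{init_r}$, $t_{pre,RW} \to t_{pr_r}$, $t_{RW} \to t_r$, $t_{RWl} \to t_{rl}$ and $t_{post,RW} \to t_{po_r}$. The function and dictionary names are hypothetical.

```python
# Delays in processor cycles, copied from the table (read-related columns).
DELAYS = {
    "polling": {"t_r": 8, "t_p": 8, "t_w": 5, "t_rl": 14,
                "t_pr_r": 15, "t_po_r": 11, "t_init_r": 15},
    "interrupt": {"t_r": 8, "t_p": 0, "t_w": 5, "t_rl": 14,
                  "t_pr_r": 15, "t_po_r": 11, "t_init_r": 348},
}


def read_cycles(n_tokens: int, procedure: str) -> int:
    """D_RW(n_T) for a read, under the symbol mapping assumed above."""
    d = DELAYS[procedure]
    check_availability = d["t_init_r"] + d["t_p"]
    buffer_access = (d["t_pr_r"] + d["t_r"] * n_tokens
                     + d["t_rl"] * (n_tokens - 1))
    status_update = d["t_po_r"] + d["t_w"]
    return check_availability + buffer_access + status_update


polling_read = read_cycles(4, "polling")
interrupt_read = read_cycles(4, "interrupt")
print(polling_read)    # 128
print(interrupt_read)  # 453
```

Under this reading, the interrupt-based procedure costs more cycles per access because of its larger initialization delay.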