

# AUTONN-SW EXECUTIVE SUMMARY

Prepared by: AUTONN-SW Team<br>
Approved by: N. Pérez<br>
Authorized by: N. Pérez<br>
Code: AUTONN-SW-GMV-ESR-0001<br>
Version: 1.0<br>
Date: 23/01/2024<br>
Internal code: GMV 20725/24 V1/24





# DOCUMENT STATUS SHEET











## 1 INTRODUCTION

### 1.1 PURPOSE

This document is the Executive Summary of the AUTONN-SW project; it summarizes the project's main findings.

### 1.2 SCOPE

This Executive Summary covers all of the activity carried out under the AUTONN-SW contract.

### 1.3 DEFINITIONS AND ACRONYMS

Acronyms used in this document and needing a definition are included in the following table:

Table 1-1 Acronyms





## 2 REFERENCES

### 2.1 APPLICABLE DOCUMENTS

The following documents, of the exact issue shown, form part of this document to the extent specified herein. Applicable documents are those referenced in the Contract or approved by the Approval Authority. They are referenced in this document in the form [AD.x]:

#### Table 2-1 Applicable Documents



### 2.2 REFERENCE DOCUMENTS

The following documents, although not part of this document, amplify or clarify its contents. Reference documents are those not applicable and referenced within this document. They are referenced in this document in the form [RD.x]:

#### Table 2-2 Reference Documents





## 3 PROJECT SUMMARY

Convolutional Neural Networks (CNNs) are increasingly used to solve a wide variety of problems on the ground and are also suitable for space problems, e.g. vision-based tasks for rovers. Embedding a CNN into operational on-board software (OBSW) or hardware is challenging because CNNs are composed of hundreds of operations and require a large amount of memory that must be used efficiently.

Open Neural Network Exchange (ONNX) is an open-source artificial intelligence ecosystem of technology companies and research organizations that establishes open standards for representing machine learning algorithms and software tools. Most popular machine learning frameworks are able to export the definition of a neural network to ONNX.

Within the AUTONN-SW project, a proof of concept has been developed of a potential future tool (CAELUM) that autocodes CNNs defined in ONNX format and integrates the generated code into an OBSW, or into hardware using HDL.

CAELUM is a promising tool that will ease the introduction of Convolutional Neural Networks in operational environments where, up to now, it was not possible due to limited resources.

Its proposed architecture is flexible, adapting the generated outputs to a wide range of CNN models.

The C generator tool has been developed in Python with the following objectives:

- Read an ONNX file
- Process each neuron of the model, storing the relevant information to generate the source code
- Extract the operators' additional inputs and store them in a binary file
- Perform a memory budget analysis to evaluate the feasibility of implementing the network in an operational environment
- Generate the network in C code

It extracts information from an ONNX file (by means of the ONNX Python package provided by the official ONNX repository) and generates an associated C source file and header file, as well as a memory evolution report and a binary file.
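The memory budget analysis listed above can be illustrated with a minimal sketch. It does not use the ONNX package; the `Node` records, tensor names, shapes, the 4-byte element size, and the 64 KiB limit are all hypothetical stand-ins for what the tool would read from a real model. It also assumes that, while a node executes, its inputs and its output must be resident at the same time.

```python
from dataclasses import dataclass
from math import prod

BYTES_PER_ELEMENT = 4  # assumption: 32-bit float tensors


@dataclass
class Node:
    """Hypothetical, simplified stand-in for one operator of the model."""
    name: str
    inputs: list              # names of activation tensors consumed
    output: str               # name of the activation tensor produced
    weight_shape: tuple = ()  # empty tuple means the operator has no weights


def tensor_bytes(shape):
    """Size in bytes of a dense tensor with the given shape."""
    return prod(shape) * BYTES_PER_ELEMENT if shape else 0


def memory_budget(nodes, activation_shapes, available_bytes):
    """Return (weight_bytes, peak_activation_bytes, fits)."""
    weight_bytes = sum(tensor_bytes(n.weight_shape) for n in nodes)
    peak = 0
    for n in nodes:
        # Inputs and output of the running node are live together.
        live = n.inputs + [n.output]
        needed = sum(tensor_bytes(activation_shapes[t]) for t in live)
        peak = max(peak, needed)
    return weight_bytes, peak, weight_bytes + peak <= available_bytes


shapes = {"image": (1, 28, 28), "conv1": (4, 26, 26), "relu1": (4, 26, 26)}
model = [
    Node("Conv", ["image"], "conv1", weight_shape=(4, 1, 3, 3)),
    Node("Relu", ["conv1"], "relu1"),
]
weights, peak, fits = memory_budget(model, shapes, available_bytes=64 * 1024)
print(weights, peak, fits)
```

A report of this kind, evaluated node by node, is one plausible form for the memory evolution output mentioned above.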

The inputs and outputs of the onnxtoC tool are shown in Figure 3-1:



Figure 3-1. CAELUM tool structure

One important achievement of this tool is the memory optimization included in the autogenerated CNN file. This optimization is based on a pool of variables that are statically allocated in the memory available to the neural network. The variables of this pool store the intermediate results of the network. The tool can determine when a variable is no longer needed; its value can then be overwritten by another one, so no extra variables need to be defined. An example of this iterative process is shown in the following diagram:





#### Figure 3-2: Iterative use of the pool variables for memory optimization
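The reuse of pool variables described above can be sketched as a small liveness analysis. This is an illustrative reconstruction, not CAELUM's actual algorithm: the `(output, inputs)` step list, the tensor names, and the `assign_pool_slots` helper are hypothetical, and every pool variable is assumed to be large enough to hold any tensor.

```python
def assign_pool_slots(steps):
    """Map each produced tensor to a pool slot index, reusing freed slots.

    `steps` is an ordered list of (output_name, [input_names]).
    Tensors never produced by a step (e.g. the network input) are
    assumed to live outside the pool.
    """
    # Index of the last step that reads each tensor.
    last_use = {}
    for i, (_, inputs) in enumerate(steps):
        for name in inputs:
            last_use[name] = i

    slot_of = {}   # tensor name -> pool slot index
    free = []      # slots whose contents are no longer needed
    n_slots = 0
    for i, (out, inputs) in enumerate(steps):
        # Take a free slot if one exists, otherwise grow the pool.
        if free:
            slot = free.pop()
        else:
            slot = n_slots
            n_slots += 1
        slot_of[out] = slot
        # After this step, release inputs that are never read again.
        for name in set(inputs):
            if name in slot_of and last_use[name] == i:
                free.append(slot_of[name])
    return slot_of, n_slots


steps = [
    ("a", ["x"]),        # a = op(x)
    ("b", ["a"]),        # b = op(a); a is dead afterwards
    ("c", ["b"]),        # c = op(b); reuses the slot of a
    ("d", ["b", "c"]),   # d = op(b, c)
]
slots, pool_size = assign_pool_slots(steps)
print(slots, pool_size)
```

In this sketch the output slot is chosen before the inputs are released, so an operator never writes over a tensor it is still reading. Here four intermediate tensors fit in three pool variables.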

Finally, the network operators have been implemented in standalone .c/.h files, which allows their reuse across different networks: an operator's implementation is independent of the network in use, and the only things that change are the order in which the network-specific code calls the operators and the inputs they are called with. An example of the source code produced by the CAELUM tool is provided in the following figure:





### 3.2 GENERATION OF HDL

Using as input to the Bambu tool the operators' source code developed for the C code autogeneration, a hybrid hardware/software architecture is obtained that implements single-kernel accelerators for the relevant operations, while the remaining operators are implemented in SW. This approach generates several digital circuits, one per accelerated kernel. It is much more easily scalable, since the kernels take up far less FPGA real estate than the full network, and they can be reused across different layers of the same type and across different models using the same layer. If many different models use the same type of kernel, it is not necessary to reconfigure […] the FPGA to increase parallelism if Bambu is unable to optimize enough. Further, it is much easier to optimize these kernels than the full network, since in the case of the full network Bambu will optimize each kernel in combination with all other layers.

The architecture is presented in the following figure:



The platform is a dual-core ARM Cortex-A9 CPU, included in the Zynq-7000 SoC from AMD. The SoC holds a UART controller and driver for input and output, as well as an SD card controller for permanent data storage. The SD card stores the network weights and the input images used during the validation tests.

The SoC also features an embedded FPGA, which is used to implement three kernels. Each kernel has a control interface and a high-bandwidth data interface. The control interface is based on the AXI4-Lite protocol, for which a dedicated interface was written in VHDL. The data interface is based on the AXI4 full Master protocol.

The control interface is a set of memory-mapped registers to configure and control the kernel. Using the registers, the application software can:

- store the output data in.

Through the high-bandwidth data interface, the kernel gathers both input data and input weights for the corresponding layer.

Additional activities have been carried out to identify improvements in the C source code developed to optimize the kernels. The main modifications performed were:

1. Hard-coded tensor dimensions: In the original version, the convolution operator takes all involved tensor dimensions as function arguments. Since they are variable parameters, it is not possible for the HLS tool to synthesize highly parallel code. Hard-coding the dimensions removes this limitation.
2. No `if` in the inner loop: In the original version, since the tensor dimensions are variable, checks are made to verify that the computation loop does not go out of bounds. Once the tensor sizes are hard-coded, these checks can be removed, which matters because branching in the inner loop prevents parallelization.
3. Loop unrolling: Once the previous optimizations are made, it is possible to perform loop unrolling, i.e. telling the HLS tool to take the inner loop and run its iterations in parallel.
4. Load-Compute-Store model: It is usually more efficient to first load all input data into the accelerator's local memory, so that the data can later be operated on in parallel without the latency of memory accesses. In this optimization, all input data is first read into local memory (which in a software scenario would be stored on the stack), and all output data is initially stored in local memory as well and only sent to main memory when the whole computation is finished.
5. Integer precision: In machine learning and on FPGAs it is very common to avoid floating-point arithmetic through quantization, sacrificing some arithmetic precision for lower inference latency and/or model size. The HLS tool fully supports floating-point arithmetic, but performance would benefit greatly from using a quantizer to change to integer or fixed-point arithmetic.
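The integer-precision idea in the last item can be illustrated with a minimal sketch of symmetric 8-bit quantization applied to a dot product, the core operation of a convolution. The scheme, the helper names, and the sample values are assumptions chosen for illustration; the source does not state which quantizer, if any, was evaluated.

```python
def quantize(values, n_bits=8):
    """Symmetric quantization: map floats to signed integers plus a scale."""
    q_max = 2 ** (n_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs else 1.0
    # Round to the nearest integer and clamp to the representable range.
    q = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return q, scale


def int_dot(q_a, q_b):
    """Dot product using integer arithmetic only."""
    return sum(a * b for a, b in zip(q_a, q_b))


weights = [0.5, -1.0, 0.25]
inputs = [2.0, 1.0, -4.0]

q_w, s_w = quantize(weights)
q_x, s_x = quantize(inputs)

exact = sum(w * x for w, x in zip(weights, inputs))
approx = int_dot(q_w, q_x) * s_w * s_x    # single float rescale at the end
print(exact, approx)
```

All multiply-accumulate work happens on small integers, and only one floating-point rescale is applied per output, which is the property that makes this style of arithmetic attractive for an FPGA kernel. The price is a small approximation error relative to the floating-point result.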

After all the previous modifications, the hardware accelerator for convolutional layers became 5.42 times faster than the original version. This would make the accelerator performance equivalent to the LEON3-FT processor, or half the performance of the ARM Cortex A9.

Finally, the tool has been preliminarily integrated within the TASTE environment by developing an export/import process that allows the CAELUM component to be reused across different TASTE MBSE models.

The flow of the model is detailed in Figure 3-5:



#### Figure 3-5: TASTE execution flow.

The TASTE model has been developed in the Interface View, using components and functions implemented in C++, SDL and PySide to build an interactive, functional model with simulation capability. In the Interface View, a main component named CAELUM has been defined to facilitate the reuse and integration of this component model in other MBSE TASTE models without configuration effort.





Figure 3-6: Caelum components and functions allocation

The main component contains two subcomponents, one for the hardware implementation and one for the software implementation. It also contains a function named GUI that implements a user interface to control and modify settings during simulation. The hardware and software components each contain two functions: the first contains an SDL implementation that describes the subcomponent's process, and the second contains a C++ implementation that executes the tool developed during this activity.

GUI Function: This function has been developed in PySide by selecting GUI as the language in the implementation options. A .ui file defines the graphical user interface. The main objective of this function is to send and receive information through the connections defined in the Interface View. The GUI output signals are defined in TASTE as telecommands and the GUI input signals as telemetry. The telecommand signals carry the settings selected by the user, and the telemetry signals carry the information required to update the table according to those settings; this is especially necessary in the hardware execution, where the user can execute an operation.





Figure 3-7: GUI Component Execution

#### Hardware component

- SDL Hardware: This function receives the settings and associated parameters and checks the status of each parameter to decide whether to process it, depending on the user's previous selection. It is defined by the state behaviour diagram in Figure 3-8.



Figure 3-8: Hardware SDL diagram

- Hardware Analysis SH: This function receives the settings selected by the user and executes the hardware operations assigned to the selection. Once execution ends, it emits a signal to update the table with the obtained results. The execution progress can be checked at any time through the information messages in the terminal.



#### Software component

- SDL Function: This function receives the software settings and associated parameters and checks the validity of each parameter to be set in the configuration.ini file. It is defined by the state behaviour diagram in Figure 3-9.



Figure 3-9: Software SDL diagram

- Python Execution Function: This function receives the ONNX setting and the output path and completes the configuration template for each execution. Once the template is complete, it executes the Python script that performs the required ONNX-to-C operations. When the execution is finished, this function emits a signal to the GUI function to update the table with the obtained result. The execution progress can be checked at any time through the information messages in the terminal.
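The template-completion step described above could look like the following sketch, which uses Python's standard `configparser` module. The `[caelum]` section name, the `onnx_file` and `output_path` keys, and the example paths are hypothetical; the source only states that a configuration template is filled in with the ONNX setting and the output path before the script runs.

```python
import configparser
import io

# Hypothetical template: section and key names are assumptions.
TEMPLATE = """\
[caelum]
onnx_file =
output_path =
"""


def complete_template(template_text, onnx_file, output_path):
    """Return the configuration text with the two settings filled in."""
    config = configparser.ConfigParser()
    config.read_string(template_text)
    config["caelum"]["onnx_file"] = onnx_file
    config["caelum"]["output_path"] = output_path
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


text = complete_template(TEMPLATE, "models/net.onnx", "build/generated")
print(text)
```

In the actual flow the completed file would be written to disk and the Python script launched afterwards; that step is omitted here because the script's name and arguments are not given in the source.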




END OF DOCUMENT