## It Takes Two - Balancing Data Movement and Compute in a Radar Application for Maximum Performance on a Vector DSP



# Agenda

#### Introduction Sensor fusion applications and processing requirements



## RADAR, LiDAR and Vision Have Complementary Strengths





## Generic Architecture for FMCW Radar



Example RADAR case

4 channels, 50 frames/s, 256 chirps/frame, 512 samples/chirp

Range FFTs: $4 \times 50 \times 256 = 51{,}200$ complex FFTs of length 512 per second

Velocity FFTs: $4 \times 50 \times 512 = 102{,}400$ complex FFTs of length 256 per second

Size of radar cube with 32b+32b complex data is 4.2 MB

Bandwidth requirement for a single data stream is $50 \times 4.2 = 210$ MB/s
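The figures for this example case can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers stated on the slide; the variable names are illustrative, and the 8 bytes per sample follows from the 32b+32b complex format.

```python
# Worked arithmetic for the example radar case (all inputs taken from the slide).
channels = 4
frames_per_s = 50
chirps_per_frame = 256
samples_per_chirp = 512
bytes_per_sample = 8  # 32-bit real part + 32-bit imaginary part

# One range FFT (length 512) per chirp, per channel, per frame.
range_ffts_per_s = channels * frames_per_s * chirps_per_frame

# One velocity FFT (length 256) per range bin, per channel, per frame.
velocity_ffts_per_s = channels * frames_per_s * samples_per_chirp

# Radar cube: channels x chirps x samples, each sample a complex value.
cube_bytes = channels * chirps_per_frame * samples_per_chirp * bytes_per_sample

# One full cube must be moved per frame for a single data stream.
bandwidth_bytes_per_s = frames_per_s * cube_bytes

print(f"range FFTs/s:    {range_ffts_per_s}")
print(f"velocity FFTs/s: {velocity_ffts_per_s}")
print(f"cube size:       {cube_bytes / 1e6:.1f} MB")
print(f"bandwidth:       {bandwidth_bytes_per_s / 1e6:.0f} MB/s")
```

The cube works out to 4,194,304 bytes, which the slide rounds to 4.2 MB.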



# Agenda

Introduction Sensor fusion applications and processing requirements

#### Data movement and data processing on a vector DSP Problem definition illustrated with a Radar case

Techniques to balance data movement and compute To achieve maximum performance on a vector DSP

Conclusion



### Synopsys ARC VPX DSP IP Family



# Example DMA Scheme for Hiding DMA Latency

Double-buffering with four buffers in VCCM



Double buffer scheme with four buffers in VCCM (B1, B2, B3, B4)

Two input buffers (B1, B2) and two output buffers (B3, B4)

In each processing step:

Two buffers are used for processing, one for input and one for output

The DMA transfers data from/to the other two buffers
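The four-buffer scheme can be sketched as a small software simulation. This is a minimal model under stated assumptions, not the VPX DMA API: `process` is a hypothetical stand-in kernel, the Python lists stand in for VCCM buffers and external memory, and the DMA and compute phases run one after the other here, whereas on hardware they overlap.

```python
def process(tile):
    """Hypothetical stand-in for the DSP kernel: doubles every sample."""
    return [2 * x for x in tile]


def run_four_buffer(tiles):
    """Simulate ping-pong processing with two input and two output buffers."""
    in_bufs = [None, None]   # B1, B2
    out_bufs = [None, None]  # B3, B4
    results = []             # models external memory written by DMA
    n = len(tiles)
    if n == 0:
        return results

    in_bufs[0] = list(tiles[0])  # prologue: DMA-in of the first tile

    for k in range(n):
        cur, other = k % 2, (k + 1) % 2

        # DMA side: touches only the pair of buffers not used for processing.
        if k >= 1:
            results.append(out_bufs[other])      # DMA-out result of tile k-1
        if k + 1 < n:
            in_bufs[other] = list(tiles[k + 1])  # DMA-in tile k+1

        # Compute side: one input buffer and one output buffer.
        out_bufs[cur] = process(in_bufs[cur])

    results.append(out_bufs[(n - 1) % 2])  # epilogue: DMA-out of the last result
    return results


print(run_four_buffer([[1, 2], [3, 4], [5, 6]]))
```

Because the DMA and the core never touch the same buffer within a step, the transfer latency is hidden whenever a transfer takes no longer than the processing of one tile.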



# Example DMA Scheme for Hiding DMA Latency

Double-buffering with three buffers in VCCM



Double buffer scheme with three buffers in VCCM (B1, B2, B3)

In each processing step:

Two buffers are used for processing, one for input and one for output

The DMA transfers data from/to the third buffer

The input buffer of the previous step is used as the output buffer for the current step

The benefit of the 3-buffer scheme is that it requires less VCCM buffer space than the 4-buffer scheme
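The buffer rotation in the three-buffer scheme can be sketched as follows. This is an illustrative simulation, not vendor code: `process` is a hypothetical kernel, and the sketch assumes that within one step the DMA first drains the previous result from the third buffer and then refills that buffer with the next input.

```python
def process(tile):
    """Hypothetical stand-in for the DSP kernel: adds one to every sample."""
    return [x + 1 for x in tile]


def run_three_buffer(tiles):
    """Simulate the 3-buffer rotation; returns (results, per-step buffer roles)."""
    bufs = [None, None, None]     # B1, B2, B3
    i_in, i_out, i_dma = 0, 1, 2  # buffer index for each role in step 0
    results = []                  # models external memory written by DMA
    schedule = []                 # (input, output, dma) indices per step
    n = len(tiles)
    if n == 0:
        return results, schedule

    bufs[i_in] = list(tiles[0])  # prologue: DMA-in of the first tile

    for k in range(n):
        # DMA side: uses only the third buffer, drain first, then refill.
        if k >= 1:
            results.append(bufs[i_dma])        # DMA-out result of tile k-1
        if k + 1 < n:
            bufs[i_dma] = list(tiles[k + 1])   # DMA-in tile k+1

        # Compute side: reads the input buffer, writes the output buffer.
        bufs[i_out] = process(bufs[i_in])
        schedule.append((i_in, i_out, i_dma))

        # Rotate roles: the DMA buffer holds the next input, the consumed
        # input buffer becomes the next output, the output goes to the DMA.
        i_in, i_out, i_dma = i_dma, i_in, i_out

    results.append(bufs[i_dma])  # epilogue: DMA-out of the last result
    return results, schedule


results, schedule = run_three_buffer([[1], [2], [3], [4]])
print(results)
print(schedule)
```

For equally sized buffers, three buffers instead of four amounts to 25% less VCCM space, at the cost of serializing the DMA-out and DMA-in on the shared third buffer.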



# Smart VCCM Arbitration to Avoid Processor Stalls

#### With high-bandwidth access to VCCM



Processor core and DMA/DMI compete for access to VCCM: stalls for LD/ST of the processor core must be minimized, while DMA/DMI needs a throughput guarantee for access to VCCM.

Unified VCCM allows flexible buffer allocation: buffers can be anywhere in VCCM, with no artificial boundaries.

Unified VCCM is arbitrated on a per-cycle basis: in each cycle, either the core or DMA/DMI gets access to the full VCCM, and the core takes priority as long as DMA/DMI is not under-served.

This allows the full memory bandwidth to be utilized: a double-wide memory access every cycle, by either the core or DMA/DMI, with a programmable ratio.

The smart arbiter allows easy programming of bandwidth guarantees, employing a simple accounting scheme.
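The slides do not spell out the arbiter's accounting scheme, so the following is only a hypothetical sketch of how per-cycle arbitration with a programmable bandwidth guarantee might behave. The credit counter, the function name `arbitrate`, and the parameters `dma_num` and `dma_den` are assumptions made for illustration and do not describe the actual VPX hardware.

```python
def arbitrate(core_req, dma_req, dma_num=1, dma_den=4):
    """Grant VCCM to 'core' or 'dma' each cycle (hypothetical credit scheme).

    DMA is guaranteed roughly dma_num out of every dma_den cycles while it
    has a request pending; otherwise the core takes priority.
    """
    credit = 0
    grants = []
    for core_wants, dma_wants in zip(core_req, dma_req):
        if dma_wants:
            credit += dma_num  # DMA accrues credit while it waits

        dma_under_served = credit >= dma_den
        if dma_wants and (dma_under_served or not core_wants):
            grants.append("dma")
            credit = max(0, credit - dma_den)  # pay for the granted cycle
        elif core_wants:
            grants.append("core")              # core wins by default
        else:
            grants.append(None)                # idle cycle
    return grants


# Both requesters active every cycle: DMA gets one cycle in four.
print(arbitrate([True] * 8, [True] * 8))
```

In this model the core is stalled only on cycles where the DMA has fallen behind its guaranteed share, and the DMA picks up any cycle the core leaves unused.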



