FFT optimizations and performance assessment targeted towards satellite and airborne radar processing

1st Maron Schlemon  
Microwaves and Radar Institute  
German Aerospace Center (DLR)  
Oberpfaffenhofen, Germany  
maron.schlemon@dlr.de

2nd Jamin Naghmouchi  
Next Generation Computing  
Innovationsgesellschaft TU Braunschweig mbH  
Braunschweig, Germany  
j.naghmouchi@itubs.de

Abstract—Following the re-invention of the FFT algorithm by Cooley and Tukey in 1965, a lot of effort has been invested into optimization of this algorithm and all its variations.

In this paper, we discuss its use and optimization for current and future radar applications, and give a brief survey on implementations that have claimed relatively high advantages in terms of performance over existing solutions.

Correspondingly, we present an in-depth analysis of state-of-the-art solutions and our own implementation that will allow the reader to evaluate the performance improvements on a fair basis. Therefore, we discuss the development of a high-performance Fast Fourier Transform (FFT) using an enhanced Radix-4 decimation in frequency (DIF) algorithm, compare it against the Fastest Fourier Transform in the West (FFTW) auto-tuned library as well as other solutions and frameworks.

Index Terms—FFT, Radix-4 DIF, High Performance Algorithms, Parallel Computing, Embedded Computing, SAR, Chirp Compression

I. INTRODUCTION

Over the last decades the number of earth-observation satellites in orbit has been constantly growing, as well as the spacecraft payload complexity and computational demand continuously have increased. Today, satellites provide close to full earth coverage and produce a significant amount of data that needs to be downlinked to Earth for processing. Modern synthetic aperture radar (SAR) systems are continuously developing towards higher spatial resolution [1]. Hence, downlink constraints combined with the constantly growing data throughput of missions require faster data handling, processing, and transfer. Therefore, the demand for onboard generation of final image products and/or compression techniques keeps increasing. Similar facts are valid for airborne systems, whereas those additionally might have to store large data takes onboard. Present on-board processing solutions show constraints regarding computational performance, size, and transfer speeds.

Thus, we have investigated the acceleration of processing routines for onboard processing systems for different algorithms such as the (extended-) omega-k algorithm (EOK), for which the FFT turned out to be one of the limiting factors, since it is used multiple times on large amounts of data. The extended-omega-k algorithm is used for radar focusing with/without integrated motion compensation (moco). More generally, the FFT is used in radar chirp compression algorithms, where the compression kernel consists of three subroutines: range FFT, multiplication by a reference function (range focusing) and finally an inverse FFT.

![Fig. 1. EOK Algorithm [1]](image)

The FFT is well known as the fast/efficient way of calculating the Discrete Fourier Transform (DFT), an essential tool for spectrum analysis and for facilitating the computation of discrete convolution and correlation. It is widely employed e.g. in filtering algorithms, polynomial multiplication, fast algorithms for discrete sine and cosine transforms (used e.g. in JPEG or MPEG encoding), solving difference equations, and other numeric applications.
Due to its widespread use, many derivative algorithms of the FFT such as the Radix-N, Split-Radix, Mixed-Radix, and prime factor FFTs have been developed ever since [2]–[7], [15]. However, this paper focuses on the improvement of the Radix-4 DIF algorithm for the reason of computational efficiency in the presence of vector instructions, as it will be described in Sections II and III. Existing frameworks are described in Section IV and the performance of this implementation is then compared in depth against existing frameworks in Section V.

II. THE ENHANCED RADIX-4 ALGORITHM

Like all fast FFT algorithms, the Radix-4 reduces the computational complexity of an $N$-point DFT by decomposing the original data. Many papers have shown that depending on the data size, the Radix-4 compared to other Radix-N or Mixed-Radix implementations is more efficient. [9]-[11] [15]. The derivation of the Radix-4 algorithm can be found in [2]. Our optimizations include

- the separation of the input data into its real and imaginary parts to make vectorization easier and to reduce unnecessary overhead,
- loop resolution in order to split the Radix-4 algorithm into its stages and individual optimization of each stage,
- restructuring the butterfly in order to reduce the computational effort by reusing already calculated data,
- pre-calculation and pre-vectorization of the twiddle factors and providing them to the Radix kernel and finally,
- pre-identification of elements that must be swapped in the reordering stage in order to eliminate unnecessary swaps.

We explore these optimizations in the following sections. The matrix representation in (1a) illustrates the calculations of a Radix-4 kernel.

$$
\begin{bmatrix}
X^F(k) \\
X^F(k + \frac{N}{4}) \\
X^F(k + \frac{3N}{4}) \\
X^F(k + \frac{N}{2})
\end{bmatrix}
= 
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & -j & -1 & j \\
1 & -1 & 1 & -1 \\
1 & j & -1 & -j
\end{bmatrix}
\begin{bmatrix}
A^F(k) \\
B^F(k) \\
C^F(k) \\
D^F(k)
\end{bmatrix}, \quad (1)
$$

$$
A^F(k) = \sum_{n=0}^{N-1} x(n) W_N^{kn}, \quad (1a)
$$

$$
B^F(k) = W_N^k \sum_{n=0}^{N-1} x(n+1) W_N^{kn}, \quad (1b)
$$

$$
C^F(k) = W_N^{2k} \sum_{n=0}^{\frac{N}{4}-1} x(n+2) W_N^{kn}, \quad (1c)
$$

$$
D^F(k) = W_N^{3k} \sum_{n=0}^{\frac{N}{4}-1} x(n+3) W_N^{kn}, \quad (1d)
$$

$$
W_N^{kn} = \exp(-j \frac{2\pi}{N} kn). \quad (2)
$$

$A^F(k), B^F(k), C^F(k)$ and $D^F(k)$ are the $N/4$ point DFTs, where the twiddle factors are given by $W_N^{kn}, x(n)$ is a sampled sequence of the data (signal), and $X^F$ is the $k$-th DFT coefficient.

A. Stages of the Radix-4

In an iterative process the initial data size $N$ is quartered. The number of stages is implicitly given by $n$, each with its own range $R$ where

$$
R_{\text{Stage}} = N_{\text{Stage}}/4. \quad (3)
$$

The range defines the iteration limit of each stage as a result of the decomposition of the data, which is shown in Table I for a 4096-point Radix-4 FFT. The vector sizes are discussed in detail in Section III. Since the Radix-4 output is in bit-reversed order [2], a reordering step is required. The VEC-FFT uses a scalar function that iterates through the data array and calculates the bit-reversal in order to swap the corresponding values. Section II-C describes optimization techniques for increasing the reordering efficiency.

<table>
<thead>
<tr>
<th>Stage</th>
<th>Elements $N$</th>
<th>Range $R$</th>
<th>Vector Size (Single Precision)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>4096</td>
<td>1024</td>
<td>8 ymm (AVX)</td>
</tr>
<tr>
<td>2</td>
<td>1024</td>
<td>256</td>
<td>8 ymm (AVX)</td>
</tr>
<tr>
<td>3</td>
<td>256</td>
<td>64</td>
<td>8 ymm (AVX)</td>
</tr>
<tr>
<td>4</td>
<td>64</td>
<td>16</td>
<td>8 ymm (AVX)</td>
</tr>
<tr>
<td>5</td>
<td>16</td>
<td>4</td>
<td>4 xmm (SSE)</td>
</tr>
<tr>
<td>6</td>
<td>4</td>
<td>1 (4)</td>
<td>4 xmm (special)</td>
</tr>
<tr>
<td>Reordering</td>
<td>-</td>
<td>2016</td>
<td>scalar reorder</td>
</tr>
</tbody>
</table>

B. Optimizations of the Radix-4 Kernel

Setting $k = 0$ in 1a allows for a simple view of the optimizations implemented in the proposed VEC-FFT. The first optimization technique is to split the complex data into its real and imaginary parts, which results into two butterflies. The real- and imaginary-valued butterflies have (respectively) the following forms

$$
\begin{bmatrix}
X^F_{\text{re}}(0) \\
X^F_{\text{im}}(0) \\
X^F_{\text{re}}(\frac{N}{4}) \\
X^F_{\text{im}}(\frac{N}{4})
\end{bmatrix}
= 
\begin{bmatrix}
A^F_{\text{re}} & B^F_{\text{re}} & C^F_{\text{re}} & D^F_{\text{re}} \\
A^F_{\text{im}} & B^F_{\text{im}} & C^F_{\text{im}} & D^F_{\text{im}} \\
A^F_{\text{re}} & -B^F_{\text{re}} & C^F_{\text{re}} & -D^F_{\text{re}} \\
A^F_{\text{im}} & -B^F_{\text{im}} & C^F_{\text{im}} & -D^F_{\text{im}}
\end{bmatrix}
$$

$$
\begin{bmatrix}
X^F_{\text{re}}(0) \\
X^F_{\text{im}}(0) \\
X^F_{\text{re}}(\frac{N}{4}) \\
X^F_{\text{im}}(\frac{N}{4})
\end{bmatrix}
= 
\begin{bmatrix}
A^F_{\text{im}} & B^F_{\text{im}} & C^F_{\text{im}} & D^F_{\text{im}} \\
A^F_{\text{re}} & -B^F_{\text{re}} & C^F_{\text{re}} & -D^F_{\text{re}} \\
A^F_{\text{im}} & B^F_{\text{im}} & C^F_{\text{im}} & -D^F_{\text{im}} \\
A^F_{\text{re}} & -B^F_{\text{re}} & C^F_{\text{re}} & D^F_{\text{re}}
\end{bmatrix}
$$

Having the data in split-complex format simplifies vectorization using SIMD instructions because the data can be directly loaded and processed. Using the standard format given in (1a) requires an extraction of the real- and imaginary-valued vectors, causing an avoidable but relatively
small instruction overhead.

The second technique consists of rearranging the butterflies and reusing pre-calculated values from look-up-tables. For example, the sum \( A^F_{re} + C^F_{re} \) can be reused instead of calculating it twice, as shown in (4).

\[
\begin{bmatrix}
A^F_{re} & B^F_{re} & C^F_{re} & D^F_{re} \\
A^F_{re} & B^F_{im} & -C^F_{re} & -D^F_{im} \\
A^F_{re} & -B^F_{re} & C^F_{re} & -D^F_{re} \\
A^F_{re} & -B^F_{im} & -C^F_{re} & D^F_{im}
\end{bmatrix}
\]

The result is an efficient structure of the Radix-4 kernel in (1a).

\[
X^F_{re}(0) = \begin{bmatrix}
\text{sum}^F_{re}(AC) + \text{sum}^F_{re}(BD) \\
\text{sub}^F_{im}(AC) + \text{sub}^F_{im}(BD) \\
\text{sum}^F_{im}(AC) - \text{sum}^F_{im}(BD) \\
\text{sub}^F_{re}(AC) - \text{sub}^F_{re}(BD)
\end{bmatrix}
\]

This analogously applies to the imaginary-valued butterfly:

\[
X^F_{im}(0) = \begin{bmatrix}
\text{sum}^F_{re}(AC) + \text{sum}^F_{im}(BD) \\
\text{sub}^F_{re}(AC) - \text{sub}^F_{im}(BD) \\
\text{sum}^F_{im}(AC) - \text{sum}^F_{im}(BD) \\
\text{sub}^F_{re}(AC) + \text{sub}^F_{im}(BD)
\end{bmatrix}
\]

\[D. \text{ Pre-Calculation of Twiddle Factors}\]

In a similar manner, the twiddle factors are calculated in advance in order to make them available for execution. Due to their symmetry properties, one can consider reusing previously calculated twiddle factors as suggested in [8]. If \( N = 8 \) is assumed, then

\[
W_8^{(1)} = W_8^{(1+N/2)} = \exp \left( \frac{2\pi}{8} \cdot 1 \right) = \exp \left( \frac{2\pi}{8} \cdot 5 \right) = 0.707106 - j0.707106
\]

Equivalent to (4) and (5), it is more efficient to split the twiddle factors into their real and imaginary valued parts. In the next section, we discuss the vectorization of the enhanced Radix-4 algorithm.

\[III. \text{ Vectorization using Intel’s Vector Intrinsics}\]

Single instruction multiple data (SIMD) instructions are implemented on Intel’s general purpose processors in order to enhance performance. Only a single instruction is fetched and applied to a set of data (vectors). This approach is suitable for the split-complex Radix-4 Butterflies in (4) and (5). The following sections give an overview about the instructions used by the proposed VEC-FFT.

\[A. \text{ Streaming SIMD Extensions (SSE)}\]

The Streaming SIMD Extensions use 128 bit (xmm) registers and process, e.g., four packed single precision (sp) or two double precision (dp) floating point values. Listing 1 shows the SSE instructions used by VEC-FFT. The first operation loads four contiguous sp floating point values into an xmm vector, while the second stores the register data back to the memory. The last three instructions are arithmetic vector addition, vector subtraction and element-wise vector multiplication of the vectors a and b [12], [13].

```c
void _m128 _mm_load_ps (float const * mem_addr) ;
void _mm_store_ps (float * mem_addr , _m128 a) ;
void _m128 _mm_add_ps ( _m128 a , _m128 b) ;
void _m128 _mm_sub_ps ( _m128 a , _m128 b) ;
void _m128 _mm_mul_ps ( _m128 a , _m128 b) ;
```

Listing 1. Streaming SIMD Extensions (SSE): Loading 4 single precision values into the xmm (128 bit) registers

---

**TABLE II**

<table>
<thead>
<tr>
<th>Element</th>
<th>Original order</th>
<th>Bit-reversed order</th>
<th>Swap</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>000</td>
<td>000</td>
<td>No</td>
</tr>
<tr>
<td>1</td>
<td>001</td>
<td>100</td>
<td>Yes</td>
</tr>
<tr>
<td>2</td>
<td>010</td>
<td>010</td>
<td>No</td>
</tr>
<tr>
<td>3</td>
<td>011</td>
<td>110</td>
<td>Yes</td>
</tr>
<tr>
<td>4</td>
<td>100</td>
<td>001</td>
<td>Yes</td>
</tr>
<tr>
<td>5</td>
<td>101</td>
<td>101</td>
<td>No</td>
</tr>
<tr>
<td>6</td>
<td>110</td>
<td>011</td>
<td>Yes</td>
</tr>
<tr>
<td>7</td>
<td>111</td>
<td>111</td>
<td>No</td>
</tr>
</tbody>
</table>

**Redefinition of procedure for an 8-point FFT**
B. Advanced Vector Extensions (AVX)

Advanced Vector Extensions use larger registers or vectors of 256 bit (ymm). They can process, e.g., eight packed single or four packed double floating point values. ymm vectors are preferred because of their ability to process more data simultaneously. Listing 2 shows the AVX instructions used by VEC-FFT.

Listing 2. Advanced Vector Extensions: Loading 8 single precision values into the ymm (256 bit) registers

```c
#include "cpuconfig.h"

// perform Advanced Vector Extensions (AVX)

void __m256_mm256_store_ps (float * mem_addr, __m256 a)
void __m256_mm256_store_ps (float * mem_addr, __m256 a)
void __m256_mm256_store_ps (float * mem_addr, __m256 a)
void __m256_mm256_store_ps (float * mem_addr, __m256 a)
void __m256_mm256_store_ps (float * mem_addr, __m256 a)
void __m256_mm256_store_ps (float * mem_addr, __m256 a)
void __m256_mm256_store_ps (float * mem_addr, __m256 a)
void __m256_mm256_store_ps (float * mem_addr, __m256 a)

Listing 3. Fused Multiply Add (FMA) instructions used by VEC-FFT

The right column of Table I shows the vectorization structure along the stages of the VEC-FFT. The first four stages are efficient regarding the usage of the larger ymm vectors. At stage five a breakdown to the smaller xmm registers is required, since the range size is four. Particularly grave are stage six and the reordering stage, as they cannot be easily vectorized. The reason is that the introduced load and store operations are only applicable on contiguous data in memory. However, stage six executes arithmetic operations on adjacent values. A solution to this problem is described in [15], [16], which suggests generating the correct vectors through a vectorized (4,4)-xmm transpose. The VEC-FFT utilizes a comparable technique to build the vectors, however, instead of transposing, Intel’s gather and set operations are implemented. As shown in Listing 4, they allow, by specifying the data elements, a non-contiguous or stride data access at the cost of being less efficient than the regular load/store operations introduced in Listing 1 and 2 [13], [14]. Therefore stage six has a special xmm structure while the reordering stage remains scalar.

Listing 4. Set instruction for loading non-contiguous data

```c
void __m128_mm_set_ps (float e3, ..., float e0)
```

C. Fused Multiply Add (FMA)

Fused Multiply Add (FMA) instructions are particularly efficient since they execute two operations simultaneously. For example, the __m256_fmadd_ps() instruction multiplies packed sp floating-point elements of vectors a and b and adds the intermediate result to the packed elements in vector c [13], [14]. Listing 3 shows the FMA instructions used by VEC-FFT.

```c
#include "cpuconfig.h"

// perform Fused Multiply Add (FMA)

Listing 5. VEC-FFT Process

V. Test

We divided our tests into two sections. First, we compared our implementations with other solutions. Subsequently, we focused on an extensive test against FFTW, since it is widely used and well known to the science community [17]. Initially, we tested the VEC-FFT against FFTW’s version 3.3.4 for a 4096- and 16384-point FFT. However, since most studies in the field of high performance FFTs have mainly focused on testing against the scalar version of FFTW [18]–[20], this paper compares both, FFTW’s scalar as well as the SIMD implementation of version 3.3.8 with VEC-FFT. For this purpose, various tests that cover 64- to 16384-point single thread 1D FFTs were conducted. For the FFTW comparison, the performance was measured using the Read Time-Stamp Counter (RDTSC) instruction in Listing 6 which returns the number of CPU clock cycles since last reset (start). For each size, 10000 measurements were taken and statistically evaluated in order to eliminate the impact of the operating system and its processes. We have ascertained that 1000 or 10000 repetition are enough for a good statistical evaluation.

The tests were executed on an Intel processing units that offer AVX, AVX2 and FMA as shown in Table III. The program was compiled and tested several times using different compiler and optimization flags:

- -march=native -O2 -mfma -std=c99
- -march=native -O3 -mfma -std=c99
We discuss the best solutions in the following sections.

### TABLE III
**TEST SYSTEM**

<table>
<thead>
<tr>
<th>Processor</th>
<th>Name</th>
<th>Intel Core i7 8700</th>
</tr>
</thead>
<tbody>
<tr>
<td>Code Name</td>
<td>Coffee Lake</td>
<td></td>
</tr>
<tr>
<td>Max TDP</td>
<td>65 W</td>
<td></td>
</tr>
<tr>
<td>Cores</td>
<td>6</td>
<td></td>
</tr>
<tr>
<td>Hyper Threading</td>
<td>Off</td>
<td></td>
</tr>
<tr>
<td>L1 D cache:</td>
<td>32 KB</td>
<td></td>
</tr>
<tr>
<td>L1 I cache:</td>
<td>32 KB</td>
<td></td>
</tr>
<tr>
<td>L2 cache:</td>
<td>256 KB</td>
<td></td>
</tr>
<tr>
<td>L3 cache:</td>
<td>12288 KB</td>
<td></td>
</tr>
<tr>
<td>Base frequency</td>
<td>3.2 GHz</td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Memory</th>
<th>Type</th>
<th>DDR4</th>
</tr>
</thead>
<tbody>
<tr>
<td>Capacity</td>
<td>64 (2x32) GB</td>
<td></td>
</tr>
<tr>
<td>Operation</td>
<td>Dual Channel</td>
<td></td>
</tr>
<tr>
<td>Frequency</td>
<td>2333 Mhz</td>
<td></td>
</tr>
<tr>
<td>CL</td>
<td>15 Clocks</td>
<td></td>
</tr>
<tr>
<td>tRCD</td>
<td>15 Clocks</td>
<td></td>
</tr>
<tr>
<td>tRP</td>
<td>15 Clocks</td>
<td></td>
</tr>
<tr>
<td>tRAS</td>
<td>36 Clocks</td>
<td></td>
</tr>
</tbody>
</table>

| Operating System | Linux |
| Distribution     | Ubuntu 18.04 LTS |
| Version          | 18.04 LTS (64 bit) |

### A. Test Against Other Solutions

We have performed a total of three different tests. If the reference mentioned measured the performance in execution time, we also carried out a time measurement in order to be able to compare the results. The execution times of the proposed VEC-FFT represent the median of a measurement of 1000 repetitions. Since these tests depend on the test system, the results are only a superficial assessment.

1) **A 65536-point FFT test**: In our first test, we compared a 65536-point FFT with [21].

<table>
<thead>
<tr>
<th>Reference</th>
<th>VEC-FFT</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.1181</td>
<td>0.0002</td>
</tr>
</tbody>
</table>

2) **Test of multiple FFT sizes**: In this test, we tested our VEC-FFT against [22].

<table>
<thead>
<tr>
<th>Size</th>
<th>Reference</th>
<th>VEC-FFT</th>
</tr>
</thead>
<tbody>
<tr>
<td>256</td>
<td>0.0025</td>
<td>10−7</td>
</tr>
<tr>
<td>1024</td>
<td>0.0031</td>
<td>10−6</td>
</tr>
<tr>
<td>4096</td>
<td>0.0077</td>
<td>5×10−6</td>
</tr>
<tr>
<td>16384</td>
<td>0.0197</td>
<td>3×10−5</td>
</tr>
</tbody>
</table>

3) **A 256-point FFT test**: In this test, we tested our VEC-FFT against [15] by measuring the clock cycles for 256-point FFT. The result is shown in Table VI. Other comparisons can be considered with [18], [19].

### B. Initial Test Against FFTW’s Version 3.3.4

The configuration for the FFTW library version 3.3.4 is as follows:

- **Precision**:
  - --enable-float

- **Performance mode**:
  - FFTW offers four different performance modes depending on the planning effort. In this benchmark the FFTW_EXHAUSTIVE mode has been chosen to measure FFTW’s peak performance. Since the planning time increases dramatically with the data size, FFTW offers the opportunity to save and import the generated plan to eliminate the planning effort. This tool is called “wisdom” and can be built as shown in Ls. 7.

```
/* Generate and export FFTW plan */
fftw_plan pl;
pl = fftw_plan_dft_1d(num, inputarray, inputarray, FFTW_FORWARD, FFTW_EXHAUSTIVE);
fftw_export_wisdom_to_filename(fftwplan);
fftw_execute(pl);

/* For a new calculation the plan has to be imported only */
fftw_plan pl;
fftw_import_wisdom_from_filename(fftwplan);
pl = fftw_plan_dft_1d(num, inputarray, inputarray, FFTW_FORWARD, FFTW_EXHAUSTIVE);
fftw_execute(pl);
```

Listing 7. FFTW’s WISDOM mechanism

Next, the runtime in clock cycles of the VEC-FFT is compared to that of the FFTW for 4096-point and 16384-point FFTs, in Figs. 2 and 4, respectively. The plots show the cumulative distribution of the runtime. It can be seen that the VEC-FFT outperforms the FFTW by a factor of up to 3.42.

In order to guarantee similar outputs, the discrepancy in the FFT output between the two libraries in terms of magnitude and phase is addressed in Figs. 3 and 5 (again for 4096-point and 16384-point FFTs, respectively). It can be seen that the difference between the two outputs is negligible (in the order of 10−5), meaning that the two operations are interchangeable in the processing chain.
C. Test Against FFTW’s Latest Version 3.3.8

For FFTW version 3.3.8 we additionally used the following compilation flags in order to perform a SIMD test:

- enable-sse, -enable-sse2, -enable-avx, -enable-avx2,
- enable-avx-128-fma, -enable-fma

A total of five tests were performed: our first compares the VEC-FFT with FFTW’s scalar implementation of the FFT, whereas the second examines FFTW’s SIMD version. The third explores VEC-FFT without reordering stage in order to measure its impact. The fourth evaluates Intel’s set instructions which were used in the last stage of the Radix-4 algorithm in order to generate the vectors for the FFT kernel, as it is described in section III and Table I. Finally, the effect of the breakdown from the larger 256 bit ymm to the 128 bit xmm registers of stage 5 are analyzed.

Tables VII–IX show the performance results for each of the FFT (VEC-FFT, FFTW-scalar and FFTW-SIMD) the tests where MIN, MEDIAN, MEAN and SD represent the minimum, median, mean value and the standard deviation of the runtime in number of clock cycles, respectively.

1) First test: comparison to FFTW-scalar: Comparing Table VII with VIII, the first test shows that the proposed VEC-FFT calculates the FFT up to four times faster than FFTW’s scalar version for the given FFT sizes. For a 4096-point FFT, the difference in performance decreases significantly by a factor of two. The reason behind this variation is the increase in the number of swaps which makes the reordering function more important. To measure its impact, a comparison between Table VII and Table X is considered. The following equations calculate the reordering effort $E$ for the 4096- and 16384-point FFT:

\[
E^\text{Median}_{\text{reordering}}(4096) = E^\text{Median}_{FFT} - E^\text{Median}_{FFTNR} 
\]

\[
= 25442 - 17343 
\]

\[
= 8099 \text{ clock cycles} 
\]

\[
E^\text{Median}_{\text{reordering}}(16384) = E^\text{Median}_{FFT} - E^\text{Median}_{FFTNR} 
\]

\[
= 229619 - 90200 
\]

\[
= 139419 \text{ clock cycles} 
\]

While for a 4096-point FFT the reordering effort with 8000 clock cycles is about 1/3, it takes with approx. 140000 clock cycles 2/3 for a 16384-point FFT of the overall performance. This effect can also be seen in Fig. 6, where the FFT sizes are shown on the x-axis and the corresponding effort in clock cycles on the logarithmically scaled y-axis.

2) Second test: comparison to SIMD-FFTW: In turn, the second test compares the performance of VEC-FFT (Table VII) and SIMD-FFTW (Table IX). The result is that, up to a 4096-point FFT, the SIMD version of FFTW is about 30% more efficient than VEC-FFT. Especially for a 16384-point FFT, the difference in performance is a factor of two. As
already mentioned, from this size on, the slow reordering function of the VEC-FFT has a strong impact explaining the gap to SIMD-FFTW.

3) Third test: VEC-FFT without reordering stage: Hence, the third test was conducted in order to compare the VEC-FFT without the reordering stage (VEC-FFT NR) with SIMD-FFTW. Considering Tables IX and X, it can be seen that VEC-FFT NR improved being now about 10% faster than SIMD-FFTW. Accordingly, Fig. 6 shows that the curve of VEC-FFT NR lies below SIMD-FFTW’s curve.

4) Fourth test: impact of the breakdown to the smaller xmm registers: Table XI shows the result of the fourth test for a 4096-point FFT. The median performance for the first four ymm stages is around 9000 clock cycles, which translates into an average of 2250 clock cycles. The breakdown to the xmm register in stage five has a negative impact on the performance, causing an additional 1250 clock cycles.

5) Fifth test: impact of using Intel’s set instructions: Furthermore, Table XI shows that the set instructions linked to the non-contiguous data access in stage six have a stronger impact on the performance, raising the total number of required clock cycles to 4754.

In order to estimate the stability, the standard deviation can be considered. Stability can be understood as the ability of reproducing the median runtime performance. In this respect, the VEC-FFTW and FFTW show equivalent figures.

VI. CONCLUSION AND FUTURE WORK

This paper shows that performance optimization of FFT algorithms has not yet come to an end, although it is getting more advanced and competitive. Still, for embedded/onboard software computing routines almost every cycle matters due to very limited resources. Especially, for spaceborne or airborne platforms with complex payloads the demands for fast processing routines in software is constantly growing. These demands do not only concern the FFT but also other signal processing routines that depend on the scientific use case and are subject to changes during missions.

We have accelerated FFT processing using an optimized Radix-4 Algorithm that can performance-wise be compared against automatically tuned libraries such as FFTW. In order to achieve such high-performance goals, we have fully parallelized the FFT algorithm and computed the most efficient memory mapping for look-up tables for each iteration stage separately. Compared to FFTW version 3.3.4 the execution time could be improved by a factor of 1.83 to 3.42 through VEC-FFT, when neglecting side-effects of the operating system. Using the example of a 4096-point FFT, it can be observed that within the first stages, which allow for more efficient vectorization, the utilization of vector instructions is...
at 96.7%, which proves that for these no memory bottleneck is present. However, the last two stages as well as the reordering of the Radix-4 algorithm show less efficiency. Therefore, our future work will concern the optimization of these less efficient stages by employing a mix of Radix algorithms similar to FFTW version 3.3.8.

REFERENCES


