CUDA-Based Particle Swarm Optimization in Reflectarray Antenna Synthesis

The synthesis of electrically large, high-performance reflectarray antennas can be computationally very demanding, both from the analysis and from the optimization points of view. It therefore requires the combined use of numerical and hardware strategies to control the computational complexity and provide the needed acceleration. Recently, we have set up a multi-stage approach in which the first stage employs global optimization with a rough, computationally convenient model of the radiation, while the subsequent stages employ local optimization on gradually refined radiation models. The purpose of this paper is to show how reflectarray antenna synthesis can benefit from parallel computing on Graphics Processing Units (GPUs) using the CUDA language. In particular, parallel computing is adopted along two lines. First, the presented approach accelerates the Particle Swarm Optimization procedure exploited in the first stage. Second, it accelerates the computation of the field radiated by the reflectarray using a GPU-implemented Non-Uniform FFT routine which is used by all the stages. The numerical results show how the first stage of the optimization process is crucial to achieving, at an acceptable computational cost, a good starting point.

INDEX TERMS Reflectarray antenna synthesis, CUDA, parallel programming, Graphics Processing Units (GPUs), Particle Swarm Optimization.


I. INTRODUCTION
Microstrip reflectarrays (RAs) have attracted much interest in recent years [1], [2], and several new application fields have been opened to RAs, such as THz frequencies [3], 5G [4] and smart antennas for indoor applications [5]. The design of high-performance microstrip reflectarrays requires the ability to exploit all the degrees of freedom (DoFs) of the antenna, such as the positions, the orientations and the characteristics of the scattering elements, or the shape of the reflecting surface. Managing a large number of DoFs requires a proper synthesis strategy, trading off effectiveness and efficiency through numerical algorithms that guarantee reliability, keep the computational complexity low and exploit high-performance computing hardware. To guarantee reliability and accuracy, a multi-stage approach has been devised [6]-[8]. The idea is to employ a rough modeling of the structure to enable robustness against suboptimal solutions at the early stages, and then to refine the model from stage to stage. The success of the whole procedure rests on the capability of the early stages to mitigate the false-solution issue. To this end, the unknowns of the problem are given proper representations, enabling a progressive enlargement of the parameter space as well as a modulation of the computational complexity of the approach [6]-[8]. At the beginning of the whole procedure, a few properly chosen polynomials are employed for the command phases and the element positions (Zernike polynomials for the phases and Lagrange polynomials for the positions in this paper), and their number is progressively increased [6]-[8]. Furthermore, a global optimizer should be adopted in the early stages. However, choosing the global optimization algorithm is not a simple task, and a wide discussion is contained in [9].
The No Free Lunch Theorems [10] provide insight into choosing the strategy leading to a successful algorithm for the problem at hand. They explore the relationship between an efficient optimization algorithm and the problem that it is asked to solve. These results state that a general-purpose algorithm that can be efficiently used for all optimization problems does not exist: on average, the performances of any two optimization algorithms are the same across all possible optimization problems [10]. Explicitly or implicitly inserting some a priori information on the structure of the problem at hand should improve the performance. It is suggested that an algorithm can be "aligned" to the structure of the problem through an implicit tuning procedure due to training and/or to years of research [10]. On the other side, the high dimensionality of a problem is the main obstacle to effective global optimization. The amount of computing time required to get a reliable solution becomes inordinate as the number of variables increases [11], [12]. In particular, the dependence of the performance of a global optimization algorithm on the dimension of the search space can be rigorously established thanks to the Nemirovsky and Yudin theorem [12], which proves the exponential dependence of the number of minimization steps on the dimensionality of the problem. Accordingly, and once again, the number of unknowns of the problem must be modulated during the design by using a proper representation of the unknowns enabling its progressive enlargement [6]-[8]. Then, global optimization can be exploited in the starting stage, while local tools can be employed in the subsequent ones. In this paper, we use a Particle Swarm Optimization (PSO) approach [13]. PSO, indeed, is a simple but powerful optimization algorithm which searches for the optimum of a function following rules inspired by the behavior of flocks of birds looking for food.
PSO has recently gained more and more popularity due to its robustness, effectiveness, and simplicity. Furthermore, PSO enables a very efficient implementation, since it maps onto a highly parallelizable computing pattern and is amenable to acceleration by Graphics Processing Unit (GPU) computing. We show how PSO can benefit from this technology. In particular, we provide an implementation of PSO in the CUDA language, the NVIDIA Compute Unified Device Architecture, which extends C++ to enable a large class of users to program GPUs for numerical computing applications. CUDA has indeed already proven very promising for accelerating antenna analysis [14], [15] and synthesis [8]. We show how the computing time can be kept at a very convenient duration for the application at hand. The presented approach is further accelerated by the use of Non-Uniform FFTs (NUFFTs) [16] for the evaluation of the field radiated by the RA [14]. The authors were the first to employ NUFFT algorithms for the calculation of the field radiated by aperiodic arrays and RAs [14], and presented an improved version of [16] in [17]. The paper is organized as follows. In Section II, the approach is sketched. In particular, the radiative model is described along with its calculation by a NUFFT. Furthermore, the NUFFT algorithm is briefly described along with its GPU implementation. Section III is devoted to discussing the GPU-based PSO implementation. In Section IV, the results are presented. Finally, in Section V, the conclusions are drawn.

II. THE APPROACH
A simplified layout of the multi-stage synthesis approach is illustrated in Fig. 1. We refer to [6]-[8] for a complete description of the approach. In the initial stages, an approximate radiative model, namely the Phase-Only (PO) model, is adopted, leading to the Phase-Only Synthesis (POS) stages. For a POS, the unknowns to be determined are the element positions as well as the command phases. Then, an accurate model exploiting all the DoFs refines the results of the POS stages and leads to the Accurate Synthesis (AS). For the AS, the unknowns become the internal DoFs of each element. The element positions can be further refined or, for the sake of simplicity, the result of the POS, in terms of element positions, can be kept fixed. In this paper, attention is focused on the POS only. In order to limit the number of unknowns and make the use of a global optimizer affordable to generate a good initial guess, in stage 1) a few polynomials are exploited to represent the command phases and the element positions. In stage 2), the number of unknowns is progressively enlarged by increasing the order of the involved polynomials, and a local optimizer is considered. Stage 3) exploits an impulsive representation of the unknowns, involving all the effective DoFs available for the command phases. The details common to the three synthesis stages are now in order. The purpose of each stage is to find the unknown parameters, which change from stage to stage, in order to satisfy the coverage requirements.
The unknown parameters, say R for the command phases and P for the element positions, are obtained by minimizing a proper objective functional Φ given by

Φ(R, P) = ||A_co(R, P) − P_Uco[A_co(R, P)]||² + ||A_cr(R, P) − P_Ucr[A_cr(R, P)]||²,   (1)

where (A_co, A_cr) is the relevant radiation operator, returning the co-polar and cross-polar components E_co and E_cr of the far field, P_Uco and P_Ucr are the metric projectors onto the sets U_co and U_cr that contain all the power patterns satisfying the specifications for the co-polar and cross-polar components [18], and ||·|| is a properly chosen norm. The sets U_co and U_cr are defined by mask functions (m_co, M_co) and (m_cr, M_cr), respectively, determining lower and upper bounds for A_co and A_cr, respectively [18]. The metric projectors P_Uco(A_co) and P_Ucr(A_cr) are defined as [18]

P_U(A) = sqrt(M) e^{j arg(A)}   where |A|² > M,
P_U(A) = A                      where m ≤ |A|² ≤ M,   (3)
P_U(A) = sqrt(m) e^{j arg(A)}   where |A|² < m.

For the cases of interest here, Φ will refer only to the co-polar term, and only the A_co operator (which will change depending on the stage) will be relevant to our purposes. In this paper, for the sake of brevity, only some implementation details of the POS stages are discussed.
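For illustration, the projector and the co-polar term of Φ can be sketched in a few lines of NumPy (a minimal sketch on a discretized pattern, assuming power-pattern masks; the function names are our own and this is not the code of [18]):

```python
import numpy as np

def project_onto_masks(A, m, M):
    """Metric projector P_U of eq. (3): clip the power pattern |A|^2 between
    the lower mask m and the upper mask M, preserving the phase of A."""
    power = np.abs(A) ** 2
    phase = np.exp(1j * np.angle(A))
    return np.sqrt(np.clip(power, m, M)) * phase

def phi(A_co, m_co, M_co):
    """Co-polar term of the objective functional: squared distance of the
    pattern from its projection onto the mask set U_co."""
    P = project_onto_masks(A_co, m_co, M_co)
    return float(np.sum(np.abs(A_co - P) ** 2))
```

By construction, the functional vanishes if and only if the sampled power pattern lies between the two masks everywhere.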

A. THE RADIATIVE MODEL AND THE NUFFT
We refer to a flat aperiodic RA, made up of N elements, illuminated by a feed located at the origin of the Oxyz reference system, as shown in Fig. 2. The feed is assumed to have the typical cos^{m_f}(θ_n) pattern, where θ_n is the angle under which the feed sees the n-th element, having coordinates (x_n, y_n, z_0) and located at a distance r_n from the feed. In this way, the 2D aperiodic array factor F, evaluated on a regular spectral grid (u_k, v_l) = (kΔu, lΔv), with k = −N_1/2, ..., N_1/2 − 1 and l = −N_2/2, ..., N_2/2 − 1, is

F(u_k, v_l) = Σ_{n=1}^{N} cos^{m_f}(θ_n) (e^{−jβr_n}/r_n) e^{jϕ_n} e^{jβ(x_n u_k + y_n v_l)},   (4)

the ϕ_n's being the element control phases, λ being the wavelength and β = 2π/λ the wavenumber. Apart from an unessential conjugation, F has the same expression as a 2D Non-Uniform Discrete Fourier Transform (NUDFT) of NED (Non-Equispaced Data) type, whose expression is

ẑ_{kl} = Σ_{i=1}^{N} z_i e^{−j2π(x̄_i k/N_1 + ȳ_i l/N_2)}.   (5)

In eq. (5), the z_i's represent the sequence to be transformed, sampled at the non-uniform spatial coordinates (x̄_i, ȳ_i), and the ẑ_{kl}'s represent the transformed one. Accordingly, the array factor (4) can be effectively and efficiently evaluated using a 2D NED-NUFFT algorithm [14], [16], [17].
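For reference, eq. (5) can be evaluated directly, at O(N·N_1·N_2) cost, with a few lines of NumPy (an illustrative brute-force evaluation, with names of our own choosing; the fast algorithm of the next subsection reproduces these values at a much lower cost):

```python
import numpy as np

def nudft2_ned(z, xbar, ybar, N1, N2):
    """Direct evaluation of the 2D NED-NUDFT of eq. (5):
    zhat[k, l] = sum_i z[i] * exp(-2j*pi*(xbar[i]*k/N1 + ybar[i]*l/N2)),
    for k = -N1/2..N1/2-1 and l = -N2/2..N2/2-1."""
    k = np.arange(-N1 // 2, N1 // 2)
    l = np.arange(-N2 // 2, N2 // 2)
    # phase of each element at each spectral sample, per dimension
    Ek = np.exp(-2j * np.pi * np.outer(xbar, k) / N1)   # (N, N1)
    El = np.exp(-2j * np.pi * np.outer(ybar, l) / N2)   # (N, N2)
    return np.einsum('i,ik,il->kl', z, Ek, El)
```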

B. THE 2D NED-NUFFT ALGORITHM
To guarantee fast and accurate processing, the "non-uniformly sampled" exponentials exp(−j2π x̄_i k/N_1) and exp(−j2π ȳ_i l/N_2) appearing in eq. (5) need to be properly interpolated. The idea behind NED NUFFTs is to use "uniformly sampled" exponentials exp(−jmξ), with m integer. The exponential representation is made possible by the use of the Poisson summation formula [17], [19]. Indeed, under general hypotheses, given a function f, then Σ_{m∈Z} f(ξ + 2mπ) converges absolutely almost everywhere to a 2π-periodic, locally integrable function simply expressed through a Fourier series:

Σ_{m∈Z} f(ξ + 2mπ) = Σ_{t∈Z} f̂(t) e^{jtξ},   (7)

f̂ being the Fourier transform of f (with the 1/(2π) normalization). Applying eq. (7) to f(ξ) = φ(ξ) exp(−jxξ), φ being a window function, exp(−jxξ) can be expressed as a series of uniformly sampled exponentials divided by the periodized window. In order to obtain a computationally convenient expression of exp(−jxξ), the denominator must be factored in the variables x and ξ. A straightforward way to achieve the factorization is to avoid that the replicas φ(ξ + 2mπ) exp(−jx(ξ + 2mπ)) overlap, which is related to the replication period 2π of the Poisson summation formula in eq. (7) and to a proper choice of the support of φ(ξ). In particular, we have:

exp(−j2π x̄_i k/N_1) = [1/φ(2πk/(cN_1))] Σ_{t=−K}^{K} φ̂(c x̄_i − μ_i − t) exp(−j2π(μ_i + t)k/(cN_1)),   (8)

where φ is the window function, with support in (−π/c, π/c), having Fourier transform φ̂ with support in (−K, K), c is an oversampling factor, and μ_i = Int[c x̄_i], Int[·] denoting the integer part. A similar expression, eq. (9), can be obtained for the exponential containing ȳ_i, by introducing a second window ψ and ν_i = Int[c ȳ_i]. On defining φ̂_it = φ̂(c x̄_i − μ_i − t) and ψ̂_it = ψ̂(c ȳ_i − ν_i − t), by assuming that both φ̂_it and ψ̂_it vanish outside −K ≤ t ≤ K, and by exploiting the fact that exp(−j2π(m + cN_1)k/(cN_1)) = exp(−j2πmk/(cN_1)) and exp(−j2π(n + cN_2)l/(cN_2)) = exp(−j2πnl/(cN_2)), eq. (5) can be written as

ẑ_{kl} = [1/(φ(2πk/(cN_1)) ψ(2πl/(cN_2)))] Σ_{m=0}^{cN_1−1} Σ_{n=0}^{cN_2−1} u_{mn} exp[−j2π(mk/(cN_1) + nl/(cN_2))],   (10)

where

u_{mn} = Σ_i z_i Σ_{p=−K}^{K} Σ_{q=−K}^{K} φ̂_ip ψ̂_iq δ_{m,(μ_i+p) mod cN_1} δ_{n,(ν_i+q) mod cN_2}.   (11)

Therefore, eq. (10) is essentially computed by three steps [17]:
• interpolation, to obtain the u_mn's (eq. (11));
• standard two-dimensional FFT on cN_1 × cN_2 points;
• scaling and decimation.
It should be noticed that, in eq. (11), the u_mn's receive contributions only from a very limited number of terms. Indeed, φ̂ and ψ̂ in eq. (11) are different from zero only for

μ_i − K ≤ m ≤ μ_i + K,   ν_i − K ≤ n ≤ ν_i + K,   (12)

where typical values for K are 3 and 6.
Finally, let us highlight that the obtained overall computational complexity is O((cN)² log(cN)) for N_1 ∼ N_2 ∼ N and K ≪ N_1, N_2.
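The three steps above (gridding interpolation, oversampled FFT, scaling and decimation) can be sketched in NumPy as follows. For simplicity, the sketch uses a Gaussian window truncated to (2K + 1) samples, whereas the implementation of [17] employs an improved, compactly supported window; the parameter values c, K and tau are therefore only indicative, and np.add.at plays the role of the atomic accumulation of the GPU implementation:

```python
import numpy as np

def nufft2_ned(z, xbar, ybar, N1, N2, c=2, K=6, tau=0.35):
    """Sketch of the 2D NED-NUFFT: gridding (eq. (11)), 2D FFT (eq. (10)),
    then scaling and decimation.  A Gaussian window phi(xi)=exp(-xi^2/(4*tau))
    is used, whose (1/(2*pi)-normalized) spectrum is sqrt(tau/pi)*exp(-tau*s^2)."""
    G1, G2 = c * N1, c * N2                       # oversampled grid sizes
    u = np.zeros((G1, G2), dtype=complex)         # gridded sequence u_mn
    t = np.arange(-K, K + 1)

    # Step 1: interpolation/gridding -- each input i contributes only to the
    # (2K+1) x (2K+1) outputs of eq. (12).
    for zi, xi, yi in zip(z, xbar, ybar):
        mu, nu = int(np.floor(c * xi)), int(np.floor(c * yi))
        wx = np.sqrt(tau / np.pi) * np.exp(-tau * (c * xi - (mu + t)) ** 2)
        wy = np.sqrt(tau / np.pi) * np.exp(-tau * (c * yi - (nu + t)) ** 2)
        m, n = (mu + t) % G1, (nu + t) % G2       # periodic wrap of eq. (11)
        # scatter/accumulate: the CUDA version performs this with atomics
        np.add.at(u, (m[:, None], n[None, :]), zi * np.outer(wx, wy))

    # Step 2: standard 2D FFT on cN1 x cN2 points.
    U = np.fft.fft2(u)

    # Step 3: scaling (deconvolution by the window spectra) and decimation.
    k = np.arange(-N1 // 2, N1 // 2)
    l = np.arange(-N2 // 2, N2 // 2)
    phik = np.exp(-(2 * np.pi * k / G1) ** 2 / (4 * tau))
    phil = np.exp(-(2 * np.pi * l / G2) ** 2 / (4 * tau))
    return U[np.ix_(k % G1, l % G2)] / np.outer(phik, phil)
```

Its output can be checked against a direct evaluation of eq. (5) on a small test case.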

C. ACCELERATION OF THE PATTERN EVALUATION
Having described the 2D NED-NUFFT algorithm, let us briefly sketch its GPU acceleration. The most difficult step to implement on GPU is the interpolation stage [8], [14]. Accordingly, only for this stage do we present some details of the strategy exploited to profit from the available massive parallelism. If we denote the input index by i and the output indices by m and n, according to eq. (12), a given input index contributes only to a small number of output terms. The adopted solution assigns each GPU thread to an input index. Unfortunately, with such a choice, race conditions occur, since different inputs can contribute to the same output. The problem is solved by profiting from a special feature of GPUs: atomic operations. They serialize the accesses of different threads to the same output location (in the GPU global memory). Dynamic Parallelism has been exploited to perform the two nested "for" loops with indices p and q occurring in eq. (11), each one spanning only 2K + 1 cycles. Indeed, CUDA Dynamic Parallelism allows each CUDA function (kernel) to give work to itself, thus making the easy implementation of recursive algorithms possible by exploiting different levels of parallelism.
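The need for serialized accumulation can be illustrated with a small NumPy analogue (illustrative only; the CUDA code relies on hardware atomic operations such as atomicAdd on global memory):

```python
import numpy as np

# Why atomics are needed: when different inputs collide on the same output
# index, a plain fancy-indexed update loses contributions (much like
# unsynchronized parallel writes), while np.add.at -- the analogue of an
# atomic add -- accumulates every contribution.
out_naive = np.zeros(4)
out_atomic = np.zeros(4)
idx = np.array([1, 1, 2])        # two inputs collide on output index 1
val = np.array([1.0, 2.0, 3.0])

out_naive[idx] += val            # buffered: only one write to index 1 survives
np.add.at(out_atomic, idx, val)  # unbuffered: both contributions accumulate
```

Here `out_naive[1]` ends up holding only the last colliding value, whereas `out_atomic[1]` holds the full sum.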
We now present some code to let the reader take a closer look at the details of the adopted coding strategy. In Listing 1, each parent thread is assigned to an input index i. The parent thread (first level of parallelism) generates (2K + 1) × (2K + 1) children (second level of parallelism) by calling the "series terms" kernel shown in Listing 2. Each child thread accounts for a single cycle of the double summation. The implemented NED-NUFFT provides a significant speedup of the array factor calculation, of one order of magnitude with respect to a CPU code, for an RA made of 100 × 100 elements.

III. PSO
PSO [13] is a global optimization algorithm working according to the behavior of flocks of birds in search of food. It shows appealing convergence properties that can be obtained by tuning its parameters in order to align the algorithm to the problem at hand. PSO exhibits a highly parallelizable computing pattern, making its implementation amenable to acceleration using GPUs [20]. This mitigates the cost of the large numbers of iterations, particle updates and fitness evaluations required [13]. The search-space dimension in PSO equals the overall number of unknowns N_unkn, i.e., the number of polynomial coefficients used to represent the command phases and the element coordinates. The particles' position and the so-called "velocity", X(t) and V(t), respectively, at the iteration step fictitiously represented by the time variable t, are N_unkn-dimensional vectors whose update rule is given by

V(t) = w V(t − 1) + C_1 R_1 [X_best(t − 1) − X(t − 1)] + C_2 R_2 [X_gbest(t − 1) − X(t − 1)],
X(t) = X(t − 1) + V(t).   (13)

In eqs. (13), the velocity is actually dealt with as a "displacement" vector. Furthermore, in eqs. (13), C_1 and C_2 are proper positive constants, R_1 and R_2 are two random numbers uniformly distributed in [0, 1], X_best(t − 1) is the best-fitness position (local best) reached by an individual particle up to time t − 1 and X_gbest(t − 1) is the best-fitness position (global best) ever found by the whole swarm up to t − 1. Moreover, the term X_best(t − 1) − X(t − 1) represents the cognitive contribution, while X_gbest(t − 1) − X(t − 1) is the social contribution. Finally, w is the so-called inertia weight [21], balancing global and local search. According to eqs. (13), the processing consists of five main steps: a) initialization; b) positions update; c) fitness evaluation; d) local best update kernel; e) global best update kernel. A code snippet of the PSO is illustrated in Listing 3.
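A minimal NumPy sketch of the update rule of eqs. (13) is reported below. It is an illustration, not the CUDA code of Listing 3: the swarm size and the values of w, C_1 and C_2 are those used in Section IV, while the stopping rule (a fixed number of iterations) and the test function of the usage example are our own choices:

```python
import numpy as np

def pso_minimize(fitness, dim, n_particles=100, n_iter=100,
                 w=0.721, c1=1.193, c2=1.193, bounds=(-5.0, 5.0), seed=0):
    """Sketch of eqs. (13).  Each row of X is a particle; in the GPU version
    the same update runs with one block per particle and one thread per
    particle dimension."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, (n_particles, dim))     # positions
    V = np.zeros((n_particles, dim))                # velocities
    Xb = X.copy()                                   # local bests
    fb = np.apply_along_axis(fitness, 1, X)         # local best fitnesses
    g = np.argmin(fb)
    Xg, fg = Xb[g].copy(), fb[g]                    # global best

    for _ in range(n_iter):
        R1 = rng.uniform(size=(n_particles, 1))     # one draw per particle
        R2 = rng.uniform(size=(n_particles, 1))
        V = w * V + c1 * R1 * (Xb - X) + c2 * R2 * (Xg - X)   # eqs. (13)
        X = X + V
        f = np.apply_along_axis(fitness, 1, X)
        improved = f < fb                           # local best update
        Xb[improved], fb[improved] = X[improved], f[improved]
        g = np.argmin(fb)                           # global best update
        if fb[g] < fg:
            Xg, fg = Xb[g].copy(), fb[g]
    return Xg, fg
```

On a simple convex test function, e.g. `pso_minimize(lambda x: float(np.sum(x ** 2)), dim=5)`, the swarm concentrates near the minimum within the 100 iterations.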
Note that, in the implementation set up, the PSO optimization can be stopped according to two criteria: a) a fixed number of iterations; b) the global optimum not changing for a certain number (e.g., 3) of consecutive iterations. In other words, the termination condition accounts for either a prefixed computation time, which is related to the overall number of iterations, or a persistent stagnation, i.e., the lack of a significant reduction of the functional with respect to the initial value. Obviously, a prefixed number of iterations does not ensure the convergence of the algorithm, but it indicates the depletion of an assigned resource, namely, computation time. We have anyway experienced that, for the synthesis cases of our interest, about 100 iterations are sufficient to achieve satisfactory results, so that criterion a) will be henceforth used. The implementation in Listing 3 corresponds to this choice. The five steps are detailed in the following subsections.

A. INITIALIZATIONS
Outside of the main loop, the particles are initialized by using routines from the cuRAND library. Furthermore, the fitness of each particle and the local particle bests are initialized. The latter will be illustrated in Subsection 3.4; the former is sketched in Listing 4.

B. POSITIONS UPDATE
Concerning the positions update, eq. (13) shows that, once the information related to the global best has been acquired, there is no need for any further communication among the particles. Accordingly, the update can occur in parallel both across the particles and across the particle dimensions. The kernel function performing the position update is schematically reported in Listing 4. For the execution of such a kernel, different blocks are assigned to handle different particles, while the corresponding block threads manage the particle dimensions. Care has been taken to store as much local (to the thread) temporary data as possible within the registers, while avoiding register spilling. It is noted that the kernel function in Listing 4 also performs the update of the personal best particle position. However, to save global memory storage time, such an update is performed only if the neighboring particles have better personal best positions than the particle at hand.

IV. RESULTS
The computational performance of the GPU implementation of the 2D NED-NUFFT algorithm has been compared with that of an analogous CPU implementation for 2D RAs having randomly located elements as well as random command phases. The codes have been run on an Intel Core i7-6700K (4 GHz, 4 cores, 8 logical processors), equipped with an NVIDIA GTX 960 card of compute capability 5.2. Fig. 3 shows the GPU vs. CPU speedup when RAs having N elements are considered. The achieved speedup is larger than 10 even though atomic operations are employed: atomic operations on modern GPU architectures are indeed very fast. The speedup saturates for values of N approaching millions of elements, due to the saturation of the GPU resources.
Let us now turn to the results of the RA synthesis. The antenna considered here is an aperiodic RA working at 14.25 GHz, with a circular footprint, z_0 = 58.3 cm, θ_f equal to about 11.2°, y_off = 11.6 cm, and feed shape factor m_f = 12.
The POS, carried out according to stages 1)-3) of Fig. 1, provides the number of elements, their coordinates, and the corresponding command phases. Indeed, the optimization of the spatial distribution of the elements moves them freely, but only those elements falling within the assigned circular antenna footprint are effectively retained as scatterers. However, constraints on the minimum and the maximum spacing, equal to 0.49λ and 0.7λ, respectively, are enforced in all the stages by discarding solutions not satisfying the requirements.
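The constraint handling just described can be sketched as a simple feasibility test on a candidate layout (a hypothetical helper: the nearest-neighbor reading of the minimum/maximum spacing constraints is our assumption for this sketch):

```python
import numpy as np

def satisfies_spacing(x, y, lam, dmin=0.49, dmax=0.7):
    """Hypothetical feasibility check mirroring the constraint handling
    described above: a candidate layout is discarded (False) when any
    nearest-neighbor spacing falls below dmin*lambda or above dmax*lambda.
    Interpreting both bounds as nearest-neighbor bounds is an assumption."""
    pts = np.column_stack((x, y))
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.min(axis=1)           # nearest-neighbor distance per element
    return bool(np.all(nn >= dmin * lam) and np.all(nn <= dmax * lam))
```

For instance, a regular grid with 0.6λ step passes the test, while a 0.3λ step violates the minimum-spacing bound.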
A target coverage of South America has been adopted for the test case, enforced by two mask functions having a flat-top behavior on the coverage and a nominal sidelobe level lower than −40 dB. PSO has been executed with a swarm of 100 particles, using 56 unknowns, w = 0.721 and C_1 = C_2 = 1.193. The optimization over 100 iterations takes about 2 hours on a PC equipped with an NVIDIA GTX 960. Accordingly, thanks to a proper parallelization of the global optimization and of the direct radiation evaluation, the code has run in a convenient time notwithstanding the computational burden of the problem. A total number of 2032 elements has been considered for the synthesized RA. Fig. 4 reports the directivity pattern synthesized by the first stage (see Fig. 1), together with the mask functions (black lines). In other words, the result in Fig. 4 is the one obtained by the global optimization (PSO) stage, which is tasked with operating on a rough model of the radiation in order to reduce the overall number of unknowns and to drive the solution as close as possible to the global optimum of the problem. Two observations can be made. First, the radiated pattern has a coverage corresponding to the prescribed one, which suggests that the stage has been successful in bringing the radiated solution close to the desired one. Second, as a result of the rough modeling, the pattern does not yet satisfactorily shape the target coverage. In order to refine the synthesis, the outcome in Fig. 4 has been used as input to stage 2), and the outcome of stage 2) as input to stage 3). The final synthesized pattern at the end of stage 3) is shown in Fig. 5. Thanks to the improved radiation model, the pattern shaping is now satisfactory. We finally want to underline how the use of an efficient global optimization algorithm such as PSO is crucial to achieve a satisfactory synthesis matching the design specifications.
To this end, we have considered a test case in which the first stage has been skipped, proceeding directly with a local algorithm using a phase representation based on Zernike polynomials first and on impulses afterwards. The results are shown in Figs. 6 and 7, respectively. As can be seen, the final result in Fig. 7 is significantly worse than that in Fig. 5.

V. CONCLUSIONS
In this paper, the use of GPU computing in reflectarray antenna synthesis has been discussed. The described approach relies on a multi-stage synthesis structure employing global and local optimizers. Global optimization is afforded by a GPU-based implementation of Particle Swarm Optimization. To further reduce the computational burden, the evaluation of the aperiodic array factor is performed by a GPU-implemented Non-Uniform FFT routine. The efficiency of the parallel approach for the analysis and synthesis of reflectarray antennas, and of the parallel PSO algorithm in particular, has been pointed out. More in detail, the direct, parallel evaluation of the radiated field has shown a large speedup in computation time with respect to the sequential case. Furthermore, we have shown how the use of PSO in the first algorithm stage is crucial to achieve a satisfactory synthesis matching the design specifications. We have also highlighted how, thanks to a proper parallelization of the computationally most demanding stage of the synthesis process, i.e., global optimization, as well as of the direct radiation evaluation, the code has run in a convenient time notwithstanding the computational burden of the problem. Finally, several implementation details have been provided to make the approach reproducible by any interested reader.