Input block remapping FFT method based on FPGA
1. An input block remapping FFT method based on FPGA is characterized by comprising a data input remapping module, a butterfly operation network module and a data output module, wherein:
the data input remapping module is used for optimizing input data into a data stream format of parallel blocks, namely outputting the parallel data of each block through the data input remapping module;
the butterfly operation network module comprises a plurality of FFT butterfly networks;
the data input remapping module maps the parallel data of each sub-block to the corresponding FFT butterfly network according to the designed output order;
performing FFT operation on data input into the FFT butterfly networks to respectively obtain discrete Fourier transform data output by each FFT butterfly network;
the data output module is used for outputting the discrete Fourier transform data output by each FFT butterfly network in parallel;
and the data input remapping module, the butterfly operation network module and the data output module are sequentially encapsulated into an IP core, wherein the compiling of the IP core is realized by an HLS compiling tool.
2. The input block remapping FFT method based on FPGA of claim 1, wherein the data input remapping module performs L times of data flow throughput when optimizing the input data, wherein L is the number of paths of ADC parallel output.
3. The input block remapping FFT method based on FPGA of claim 2, wherein the data input remapping module and the data output module are packaged as RAM communication interface on FPGA;
the input data size of the data input remapping module is 1 × 32, wherein 8 data are divided into a group to be input in parallel;
the butterfly operation network module comprises log₂N FFT butterfly networks, wherein N is the number of points of the FFT operation;
in the butterfly operation network module, after the input data of the ith FFT butterfly network is input, the data dimension is unchanged, and the calculation interval of the output data is 2^i, i = 1, 2, …, log₂N.
4. The input block remapping FFT method based on FPGA according to claim 3, wherein the parameter weights, the twiddle factors and the bias values of the butterfly operation network module are structured according to the hardware architecture of the FPGA and are quantized in fixed point.
5. The input block remapping FFT method based on FPGA of claim 4, wherein said twiddle factors are divided into real number set and imaginary number set, and are embedded into FPGA in the form of lookup table.
Background
The ultra-high-speed (Gsps) AD converter is widely applied in fields such as wireless communication, software radio, data acquisition, optical communication, and instrumentation. The speed of FPGA logic generally cannot keep up with the bus speed of high-speed ADCs, so most FPGAs have a serializer/deserializer (SERDES) module to convert the fast, narrow serial interface at the converter end into the slow, wide parallel interface at the FPGA end. In a system platform integrating a high-speed ADC and an FPGA, there are usually associated signal processing algorithms, and these algorithms often need to be improved to maximize efficiency when implemented on a specific platform. The Fast Fourier Transform (FFT) is a classical signal processing algorithm that enables the analysis of an input signal by mapping the input time-domain signal to the frequency domain.
At present, many high-speed ADCs output in a multi-channel parallel interleaved format; limited by the clock frequency of the FPGA, a common protocol such as LVDS converts the serial interface at the converter end into a parallel interface at the FPGA end. Correspondingly, the official FFT operation IP core provided for existing FPGA platforms such as Xilinx accepts sequential serial input, so the deserialized parallel data is usually written into a FIFO or RAM for temporary storage and converted back into a low-speed serial stream. In summary, this high-speed serial / low-speed parallel / low-speed serial process slows down the FFT operation of the system and does not maximally utilize the internal resources of the FPGA.
Disclosure of Invention
The invention aims to solve the problems that the FFT operation speed is low and the internal resources of an FPGA are not utilized to the maximum extent in the existing method, and provides an input block remapping FFT method based on the FPGA.
The technical scheme adopted by the invention for solving the technical problems is as follows:
an input block remapping FFT method based on FPGA comprises a data input remapping module, a butterfly operation network module and a data output module, wherein:
the data input remapping module is used for optimizing input data into a data stream format of parallel blocks, namely outputting the parallel data of each block through the data input remapping module;
the butterfly operation network module comprises a plurality of FFT butterfly networks;
the data input remapping module maps the parallel data of each sub-block to the corresponding FFT butterfly network according to the designed output order;
performing FFT operation on data input into the FFT butterfly networks to respectively obtain discrete Fourier transform data output by each FFT butterfly network;
the data output module is used for outputting the discrete Fourier transform data output by each FFT butterfly network in parallel;
and the data input remapping module, the butterfly operation network module and the data output module are sequentially encapsulated into an IP core, wherein the compiling of the IP core is realized by an HLS compiling tool.
The invention has the following beneficial effects: the invention provides an input block remapping FFT method based on FPGA, which adopts a butterfly algorithm with an improved input structure. Starting from the calculation process of the FFT and the hardware architecture of the FPGA, an HLS compiling tool is used to build the parallel-input data remapping module and the butterfly operation coefficients of the FFT operation into an IP core, realizing integration with the hardware. Experimental results show that the FFT method designed by the invention can perform FFT calculation on parallel input data on an FPGA platform, maximizes the FFT operation efficiency for parallel interleaved input signals on the FPGA platform, outperforms the officially provided IP core in timing performance, and realizes maximum utilization of FPGA internal resources.
The number of clock cycles used by the IP core designed by the invention is many times smaller than that of the official IP core.
Drawings
FIG. 1 is an IP core design framework for the FPGA-based input block remapping FFT method of the present invention;
FIG. 2 is a schematic diagram of an input data remapping table of the present invention;
FIG. 3 is a diagram of the IP interface generation of the present invention;
FIG. 4 is a timing diagram of the Xilinx official FFT function IP core;
FIG. 5 is a timing diagram of the FFT acceleration operation of the present invention;
FIG. 6 is a diagram of the integrated resources for accelerated operations according to the present invention.
Detailed Description
First embodiment: this embodiment is described with reference to FIG. 1. The input block remapping FFT method based on FPGA in this embodiment comprises a data input remapping module, a butterfly operation network module and a data output module, wherein:
the data input remapping module is used for optimizing input data into a data stream format of parallel blocks, namely outputting the parallel data of each block through the data input remapping module;
the butterfly operation network module comprises a plurality of FFT butterfly networks;
the data input remapping module maps the parallel data of each sub-block to the corresponding FFT butterfly network according to the designed output order;
performing FFT operation on data input into the FFT butterfly networks to respectively obtain discrete Fourier transform data output by each FFT butterfly network;
the data output module is used for outputting the discrete Fourier transform data output by each FFT butterfly network in parallel;
and the data input remapping module, the butterfly operation network module and the data output module are sequentially encapsulated into an IP core, wherein the compiling of the IP core is realized by an HLS compiling tool.
This embodiment directly matches the data stream format of the multi-channel input: after parallel ADC sampling, data is input into the IP core without rearrangement, thereby maximizing efficiency; the input data is block-mapped through the designed input remapping table. The data input remapping module is a further development of the traditional FFT input-output bit-reversal order relation: the digital signals deserialized at the ADC-FPGA interface accumulate sample points over a period of time through the RAM interface, are input to the top-level function of the IP core, and enter the butterfly operation module. The parallel block input mode reduces the clock cycles consumed by the FPGA and achieves maximum utilization of the matched parallel-output ADC data.
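As a minimal software sketch of the kind of input remapping table described above (not the hardware implementation), the following assumes a 32-point FFT with 8-way parallel input; the function names `bit_reverse` and `build_remap_table` are illustrative, not part of the IP core:

```python
import math

def bit_reverse(index, bits):
    """Reverse the low `bits` bits of `index` (standard radix-2 FFT input reordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def build_remap_table(n_points=32, n_lanes=8):
    """Group the bit-reversed input order into parallel blocks of n_lanes samples.

    Each row lists the source sample indices that one parallel block of the
    remapping module would forward to the butterfly network in one cycle.
    """
    bits = int(math.log2(n_points))
    order = [bit_reverse(i, bits) for i in range(n_points)]
    return [order[i:i + n_lanes] for i in range(0, n_points, n_lanes)]

table = build_remap_table()
# First block of the 32-point, 8-lane remapping table.
print(table[0])  # [0, 16, 8, 24, 4, 20, 12, 28]
```

In hardware, such a table would be held as a lookup table so that each group of 8 parallel samples is steered to the butterfly network without any serial rearrangement step.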
The IP core interface generation diagram is shown in fig. 3. Referring to fig. 3, after the FFT operation module is designed, the designed FFT computation unit can be synthesized into an IP core, and input signal data is fed into the IP core through the RAM interfaces. xin_0_address0-7 represent the 8 parallel input address lines, and ce0 and q0 are the interface enable and the input data channel respectively; xout_0_imag_address0 and xout_0_real_address0 represent the 8-way parallel FFT result output addresses, and ce0, we0 and d0 represent the interface enable, write enable and output data channel respectively. The remaining interfaces in the figure are the control interfaces of the IP core, with the default ap protocol port type: ap_clk is the synchronous sampling clock, ap_rst is connected to the system reset signal, ap_start is the reserved start signal, the output FFT calculation result is valid when the ap_done signal changes from low to high, ap_idle is pulled low to indicate that the module is no longer idle, and ap_ready is pulled high to indicate that the system can accept new input.
High-Level Synthesis (HLS) is an automated design process that interprets an algorithmic description of the desired behavior and creates digital hardware to implement that behavior, decoupling the behavior from timing (e.g., clock-level detail). HLS raises the abstraction level of system design above the register transfer level (RTL). The HLS tool is an IP core development tool released by Xilinx and integrated with the Vivado suite; it treats a function written in a high-level language (C, C++, etc.) as a functional module (IP core): the function is equivalent to an RTL description of the module's behavior, and a function call in the high-level language is equivalent to module instantiation in a circuit description language such as VHDL. This approach reduces the amount of code to be written, thereby significantly simplifying the structural code used for system description and ultimately speeding up the system assembly process.
The second embodiment is as follows: the difference between this embodiment and the specific embodiment is that, when the data input remapping module optimizes the input data, L times of data flow throughput is executed, where L is the number of parallel output paths of the ADC.
Other steps and parameters are the same as those in the first embodiment.
The third concrete implementation mode: the difference between the present embodiment and the first or second embodiment is that the data input remapping module and the data output module are packaged as RAM communication interfaces on the FPGA;
the data output module is packaged into an 8-path RAM communication interface.
The input data size of the data input remapping module is 1 × 32, wherein 8 data are divided into a group to be input in parallel;
the butterfly operation network module comprises log₂N FFT butterfly networks, wherein N is the number of points of the FFT operation;
in the butterfly operation network module, after the input data of the ith FFT butterfly network is input, the data dimension is unchanged, and the calculation interval of the output data is 2^i, i = 1, 2, …, log₂N.
The butterfly operation network module includes the structuring of parameters such as the twiddle factor operation size, and the calculation module program is written and optimized according to the directives. Parameter structuring: the designed twiddle factor weight parameters are extracted layer by layer and stored separately in lookup tables of the FPGA (field programmable gate array), and the obtained parameters are then quantized in (N, 0) fixed point. Calculation module programming: the programs are written in the HLS development tool according to the pre-input parameter sizes, one layer corresponding to one loop. The log₂N loops compute the radix-2 FFT through butterfly operations, and the last stage stores the output of the N-point FFT, converting the time-domain N-point sequence into the frequency-domain N-point complex sequence.
Calculation module programming means that independent calculation modules are written in HLS for the first, second and third layers of the FFT butterfly operation network, and each calculation module is optimized in HLS, including pipelining, array partitioning, function pipelining and the like. The calculation module contains the multiply-accumulate module for the twiddle factors. To maximize efficiency, the input data remapping module is designed and scheduled as a lookup-table structure, so that the parallel advantage of the input data is exploited to the greatest extent and the occupied memory space is reduced through reasonable utilization, while the operation function is expanded to the maximum extent under real resource constraints, thereby minimizing the operation time.
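A small software model of the twiddle-factor lookup tables described above, split into a real (cosine) set and an imaginary (negative sine) set and quantized to fixed point; the 14-bit fractional format and the helper names `quantize_fixed` and `twiddle_tables` are illustrative assumptions, not the quantization format of the patent:

```python
import math

def quantize_fixed(x, frac_bits=14):
    """Round x to a signed fixed-point integer with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return int(round(x * scale))

def twiddle_tables(n_points=32, frac_bits=14):
    """Build separate real (cos) and imaginary (-sin) lookup tables.

    Only N/2 twiddle factors W_N^k = e^(-j*2*pi*k/N), k = 0..N/2-1, are needed
    for a radix-2 FFT, which halves the stored table size.
    """
    half = n_points // 2
    real = [quantize_fixed(math.cos(2 * math.pi * k / n_points), frac_bits)
            for k in range(half)]
    imag = [quantize_fixed(-math.sin(2 * math.pi * k / n_points), frac_bits)
            for k in range(half)]
    return real, imag

real_lut, imag_lut = twiddle_tables()
print(real_lut[0], imag_lut[0])  # W_32^0 = 1 + 0j -> (16384, 0) at 14 fractional bits
```

On the FPGA the two tables would be embedded as ROM/lookup-table contents so that each butterfly stage reads its coefficients by address rather than computing them.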
Other steps and parameters are the same as those in the first or second embodiment.
The fourth concrete implementation mode: the difference between this embodiment and the first to third embodiments is that the parameter weights, twiddle factors and bias values of the butterfly operation network module are structured according to the hardware architecture of the FPGA, and (N, 0) fixed-point quantization is performed.
Other steps and parameters are the same as those in one of the first to third embodiments.
The fifth concrete implementation mode: the difference between this embodiment and one of the first to the fourth embodiments is that the twiddle factors are divided into a real number set and an imaginary number set, and are embedded into the FPGA in the form of a lookup table.
Other steps and parameters are the same as in one of the first to fourth embodiments.
Examples
In fig. 1, an input block remapping FFT method IP core design framework based on FPGA includes the following steps:
(1) data input, namely adjusting the size of an input signal, extracting sampling points of the signal, and performing block remapping on the extracted sampling points to a butterfly network operation unit;
(2) butterfly network operations comprising log₂N computation layers;
(3) parallel data output: outputting the complex result of the final Fourier transform.
Wherein, the step (1) is divided into the following steps:
firstly, structuring the parameter weights: the weight parameters of the FFT butterfly network are remapped one by one, and the obtained parameters are then quantized in fixed point under the FPGA (field programmable gate array) architecture;
input data remapping table calculation: as shown in fig. 2, the parallel, sequentially input data is mapped to the Fourier transform output order.
Writing the mapping table module program code: the mapping module code is written and optimized in the HLS tool according to the fixed-point quantized parameters, and each mapping module reads from the RAM according to its address;
wherein, the step (2) is divided into the following steps:
structuring the cosine and sine coefficients of the twiddle factors: the twiddle factors are converted into sine and cosine coefficients for calculation; exploiting the periodicity of the sine function and its central symmetry in the complex field, the occupied memory address space is reduced from N to N/2. The coefficients are then structured to match the FPGA architecture and converted into the ap_fifo format. Input data in this format is fed to the butterfly operation of the subsequent stage.
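The symmetry that halves the coefficient storage can be checked with a short sketch (illustrative function names, assuming the standard twiddle definition W_N^k = e^(-j2πk/N)): storing only the first N/2 factors suffices, since W_N^(k+N/2) = -W_N^k.

```python
import cmath

def twiddle(k, n):
    """Full twiddle factor W_n^k = e^(-j*2*pi*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)

def twiddle_from_half_table(k, n, half_table):
    """Recover W_n^k for any k from a table holding only k = 0..n/2-1 entries,
    using the central symmetry W_n^(k + n/2) = -W_n^k."""
    half = n // 2
    if k < half:
        return half_table[k]
    return -half_table[k - half]

N = 32
half_table = [twiddle(k, N) for k in range(N // 2)]
# All N factors are reproduced exactly from the half-size table.
for k in range(N):
    assert abs(twiddle_from_half_table(k, N, half_table) - twiddle(k, N)) < 1e-12
print("half-size table reproduces all", N, "twiddle factors")
```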
Time Decimation (DIT) radix-2 FFT divides the DFT recursion into two half-length DFTs of even-indexed and odd-indexed time samples. The output of these shorter FFTs can be reused to compute many outputs, thereby greatly reducing the overall computational cost.
The radix-2 DIT butterfly is calculated as:

X[k] = G[k] + W_N^k · H[k]
X[k + M] = G[k] − W_N^k · H[k],  with W_N^k = e^(−j2πk/N),

where k represents the time sequence number of the input digital signal, G[k] and H[k] are the half-length DFTs of the even- and odd-indexed samples, and M represents the number of decompositions in the butterfly operation; generally, M = N/2.
When the time decimation is performed with radix 2, the computational complexity is reduced from O(N²) for the direct DFT to O(N·log₂N).
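The DIT decomposition described above can be sketched as a short recursive reference model (plain-software illustration, not the hardware pipeline; `fft_dit` and `dft_direct` are illustrative names):

```python
import cmath

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT (len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return x[:]
    even = fft_dit(x[0::2])   # half-length DFT of even-indexed samples, G[k]
    odd = fft_dit(x[1::2])    # half-length DFT of odd-indexed samples, H[k]
    out = [0j] * n
    for k in range(n // 2):   # M = n/2 butterflies combine the two halves
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_n^k
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def dft_direct(x):
    """O(N^2) direct DFT used as a correctness reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

x = [complex(i % 5, 0) for i in range(32)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft_dit(x), dft_direct(x)))
print("radix-2 DIT FFT matches direct DFT on 32 points")
```

Reusing the two half-length DFT outputs for both X[k] and X[k + N/2] is exactly what cuts the cost from O(N²) to O(N·log₂N).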
Referring to table 1, a specific framework for performing FFT on the input data can be seen:
(1) inputting signals with the size of 1 × 32, wherein 8 signals are in parallel;
(2) passing data into the data remapping layer of FIG. 1; mapping to the order of outputs required by FT;
(3) inputting input data into a butterfly operation layer 1, keeping the data dimension unchanged, and setting the calculation interval of output data to be 2;
(4) inputting input data into a butterfly operation layer 2, keeping the data dimension unchanged, and setting the calculation interval of output data to be 4;
(5) inputting the data into butterfly operation layer i, keeping the data dimension unchanged, with the calculation interval of the output data being 2^i;
(6) outputting the parallel data.
TABLE 1
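The staged flow of steps (1)-(6) above can be sketched as an iterative software model (bit-reversed remapping layer followed by log₂N butterfly layers whose span doubles each layer); this is an illustrative reference, with the 8-way parallel grouping represented only by the input ordering, not by actual concurrency:

```python
import cmath

def fft_staged(x):
    """Iterative radix-2 FFT: bit-reversed input remapping, then log2(N)
    butterfly layers; layer i combines samples at distance 2**(i-1)."""
    n = len(x)
    bits = n.bit_length() - 1

    def rev(i):
        # data remapping layer: bit-reversed input order
        r = 0
        for _ in range(bits):
            r = (r << 1) | (i & 1)
            i >>= 1
        return r

    a = [x[rev(i)] for i in range(n)]
    span = 1
    while span < n:                    # one pass per butterfly layer
        step = span * 2                # output interval doubles each layer
        for start in range(0, n, step):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / step)  # twiddle W_step^k
                u = a[start + k]
                v = w * a[start + k + span]
                a[start + k] = u + v
                a[start + k + span] = u - v
        span = step
    return a

x = [complex((3 * i) % 7, 0) for i in range(32)]
ref = [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / 32) for t in range(32))
       for k in range(32)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft_staged(x), ref))
print("staged 32-point FFT matches direct DFT")
```

In the hardware version each `while` iteration corresponds to one butterfly network of the module, and the inner loops are unrolled and pipelined by HLS rather than executed sequentially.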
After the model is structured in the above manner, resource consumption differs greatly depending on whether the calculated data is of floating-point or fixed-point type, with floating point consuming more clock cycles. As shown in fig. 6, the hardware resources are FPGA on-chip resources: the BRAM can be configured as a dual-port RAM to increase the read-write speed, and the DSP48E is the FPGA on-chip resource consumed by the multipliers. After the data is written into the RAM in advance, each input of 8 groups of parallel data is equivalent to supplying an address for a table lookup: the content corresponding to the address is found and then output. The quantified hardware resource consumption can be seen in fig. 6.
Fig. 4 and fig. 5 compare the synthesized timing of the method used by the present invention with the traditional sequentially input Xilinx official FFT function for a 32-point FFT; the IP core generated by the above method of the invention uses 101 clock cycles, many times fewer than the 3900 clock cycles of the official core.
The input data remapping FFT method IP core based on FPGA provided by the invention was synthesized and implemented on the Xilinx V7 series chip xc7vx690t. Experimental data show that processing a 32-point FFT with this method achieves parallelization, and the processing time reaches a latency of 500 ns at a 5 ns clock period.
The above-described calculation examples of the present invention are merely to explain the calculation model and the calculation flow of the present invention in detail, and are not intended to limit the embodiments of the present invention. It will be apparent to those skilled in the art that other variations and modifications of the present invention can be made based on the above description, and it is not intended to be exhaustive or to limit the invention to the precise form disclosed, and all such modifications and variations are possible and contemplated as falling within the scope of the invention.