Deep convolutional neural network acceleration method, module, system and storage medium
1. A deep convolutional neural network acceleration method, comprising:
acquiring input features;
acquiring high-order features and low-order features of the input features;
performing convolution and maximum pooling on the high-order features to obtain a high-order pooling result;
obtaining a corresponding maximum pooling region according to the high-order pooling result;
convolving low-order corresponding features to obtain a low-order convolution result, wherein the low-order corresponding features are the data corresponding to the maximum pooling region in the low-order features;
and obtaining the maximum pooling result of the input features according to the high-order pooling result and the low-order convolution result.
2. The deep convolutional neural network acceleration method of claim 1, wherein the obtaining of the high-order features and the low-order features of the input features comprises:
acquiring a non-zero value of the input feature and a sparse mapping matrix of the input feature according to the input feature, wherein the input feature is a pixel matrix;
and acquiring the high-order features and the low-order features of the non-zero values.
3. The deep convolutional neural network acceleration method of claim 2, wherein before the convolving and max pooling of the high-order features, the method further comprises: determining the weights of the positions corresponding to the non-zero values in a convolution kernel according to the sparse mapping matrix;
the convolving of the high-order features comprises: convolving the high-order features of the non-zero values by the weights of the positions corresponding to the non-zero values in the convolution kernel;
the convolving of the low-order corresponding features comprises: convolving the low-order corresponding features by the weights of the positions corresponding to the non-zero values in the convolution kernel.
4. The deep convolutional neural network acceleration method of claim 2 or 3, wherein the convolving and max pooling of the high-order features to obtain a high-order pooling result comprises:
matching computing units into a matrix according to the size of the convolution kernel;
according to the sparse mapping matrix, sending the data of each convolution region in the high-order features to the corresponding computing units according to the position relation, and sending the weights corresponding to the non-zero-value positions in the convolution kernel to the corresponding computing units;
performing convolution and maximum pooling on the high-order features to obtain the high-order pooling result, wherein the computing units perform the convolution of the high-order features.
5. The deep convolutional neural network acceleration method of claim 2 or 3, wherein the convolving of the low-order corresponding features to obtain a low-order convolution result comprises:
matching computing units into a matrix according to the size of the convolution kernel;
according to the sparse mapping matrix, sending the data of each convolution region in the low-order corresponding features to the corresponding computing units according to the position relation, and sending the weights corresponding to the non-zero-value positions in the convolution kernel to the corresponding computing units;
and performing, by the computing units, the convolution on the low-order corresponding features to obtain the low-order convolution result.
6. An acceleration module for a deep convolutional neural network, characterized by comprising a control module and a computation module;
the control module is used for acquiring input features;
the computation module is used for acquiring high-order features and low-order features of the input features, and for performing convolution and maximum pooling on the high-order features to obtain a high-order pooling result;
the control module is further used for obtaining a corresponding maximum pooling region according to the high-order pooling result;
the computation module is further configured to perform convolution on low-order corresponding features to obtain a low-order convolution result, where the low-order corresponding features are the data corresponding to the maximum pooling region in the low-order features; and to obtain the maximum pooling result of the input features according to the high-order pooling result and the low-order convolution result.
7. The acceleration module of claim 6, wherein the computation module comprises a computing unit array consisting of M × M computing units, the computing unit array being configured to convolve the high-order features and the low-order corresponding features, wherein the computation module uses a convolution kernel of size N × N, M ≥ N, and M and N are both positive integers.
8. The acceleration module of claim 6, wherein the computation module obtaining the high-order features and the low-order features of the input features comprises:
the computation module acquires the non-zero values of the input features and the sparse mapping matrix of the input features, wherein the input features are a pixel matrix; and acquires the high-order features and the low-order features of the non-zero values.
9. An acceleration system for a deep convolutional neural network, comprising a processor, an off-chip storage module, and the acceleration module of any of claims 6 to 8; the acceleration module further comprises an on-chip storage module;
the processor is used for controlling data exchange between the off-chip storage module and the on-chip storage module, and/or the processor is a Rocket general-purpose RISC-V processor.
10. A computer-readable storage medium, characterized in that the medium has stored thereon a program which is executable by a processor to implement the method according to any one of claims 1-5.
Background
Deep convolutional neural networks (DCNNs) are very widely used in computer vision, speech recognition, and similar fields. With the popularity of deep convolutional neural network algorithms, hardware implementations of neural networks have also attracted increasing attention.
In conventional convolution and pooling calculations, 75% of the convolution outputs are discarded by maximum pooling (taking a 2 × 2 pooling window as an example); processing this portion of the data takes up most of the calculation time and consumes considerable power.
Therefore, increasing the computation speed and reducing the computation power consumption of deep convolutional neural networks has become a key challenge for the technology.
Disclosure of Invention
The invention provides a deep convolutional neural network acceleration method, module, system and storage medium, aiming to solve the problems of low calculation speed and high calculation power consumption of existing deep convolutional neural networks.
According to a first aspect, an embodiment provides a deep convolutional neural network acceleration method, including:
acquiring input features;
acquiring high-order features and low-order features of the input features;
performing convolution and maximum pooling on the high-order features to obtain a high-order pooling result;
obtaining a corresponding maximum pooling region according to the high-order pooling result;
convolving low-order corresponding features to obtain a low-order convolution result, wherein the low-order corresponding features are the data corresponding to the maximum pooling region in the low-order features;
and obtaining the maximum pooling result of the input features according to the high-order pooling result and the low-order convolution result.
According to a second aspect, an embodiment provides an acceleration module for a deep convolutional neural network, including a control module and a computation module;
the control module is used for acquiring input features;
the computation module is used for acquiring high-order features and low-order features of the input features, and for performing convolution and maximum pooling on the high-order features to obtain a high-order pooling result;
the control module is further used for obtaining a corresponding maximum pooling region according to the high-order pooling result;
the computation module is further used for performing convolution on low-order corresponding features to obtain a low-order convolution result, wherein the low-order corresponding features are the data corresponding to the maximum pooling region in the low-order features; and for obtaining the maximum pooling result of the input features according to the high-order pooling result and the low-order convolution result.
According to a third aspect, an embodiment provides an acceleration system for a deep convolutional neural network, which includes a processor, an off-chip storage module, and the acceleration module of the above technical solution; the acceleration module also comprises an on-chip storage module;
the processor is used for controlling data exchange between the off-chip storage module and the on-chip storage module, and/or the processor is a Rocket general-purpose RISC-V processor.
According to a fourth aspect, an embodiment provides a computer readable storage medium, on which a program is stored, the program being executable by a processor to implement the deep convolutional neural network acceleration method as described in the above technical solution.
With the deep convolutional neural network acceleration method, module, system and storage medium provided by the embodiments, input features are acquired; high-order features and low-order features of the input features are acquired; the high-order features are convolved and max pooled to obtain a high-order pooling result; a corresponding maximum pooling region is obtained from the high-order pooling result; the low-order corresponding features, i.e., the data corresponding to the maximum pooling region in the low-order features, are convolved to obtain a low-order convolution result; and the maximum pooling result of the input features is obtained from the high-order pooling result and the low-order convolution result. As can be seen, the high-order features of the input features are used for an approximate convolution that locates the maximum pooling region, and only the corresponding parts of the low-order features are convolved, which directly yields the low-order pooling result of the low-order features and finally the maximum pooling result of the input features. The 75% of redundant convolution multiply-add operations on the low-order features can thus be eliminated when max pooling the input features, which ultimately increases the speed and reduces the energy consumption of the convolution-pooling calculation.
Drawings
FIG. 1 is a schematic diagram of a convolution-max pooling calculation process of the prior art;
FIG. 2 is a schematic structural diagram of an acceleration module and system for a deep convolutional neural network according to an embodiment;
FIG. 3 is a schematic flow chart diagram of a deep convolutional neural network acceleration method provided by an embodiment;
FIGS. 4 to 6 and FIG. 8 are schematic diagrams illustrating the convolution-max pooling calculation process in a deep convolutional neural network acceleration method according to an embodiment;
FIG. 7 is a schematic diagram of a convolution-max pooling calculation process of the prior art;
FIG. 9 is a schematic diagram illustrating a process of generating a non-zero value and a sparse mapping matrix in an acceleration method for a deep convolutional neural network according to an embodiment;
FIG. 10 is a schematic diagram of a full zero-skip calculation process in a deep convolutional neural network acceleration method according to an embodiment;
FIG. 11 is a diagram illustrating the establishment of a logical mapping of a PE array in a deep convolutional neural network acceleration method according to an embodiment;
FIG. 12 is a diagram illustrating a structure of a computing unit in an acceleration module for a deep convolutional neural network according to an embodiment;
FIG. 13 is a schematic structural diagram of a sub-computation unit in an acceleration module for a deep convolutional neural network according to an embodiment;
FIG. 14 is a schematic structural diagram of an acceleration module and system for a deep convolutional neural network according to another embodiment;
FIG. 15 is a schematic structural diagram of the sparse mapping module in an acceleration module for a deep convolutional neural network according to an embodiment.
Reference numerals: 10-processor; 20-off-chip storage module; 30-acceleration module; 31-control module; 32-on-chip storage module; 33-computation module; 331-PE array; 3301-controller; 3302-multiplier; 3303-register; 3304-multiplexer; 332-activation function module; 34-sparse mapping module; 341-write address generation module; 342-write cache module; 343-asynchronous FIFO data module; 344-read address generation module.
Detailed Description
The present invention will be described in further detail below with reference to the detailed description and the accompanying drawings, wherein like elements in different embodiments are given like reference numerals. In the following description, numerous details are set forth to provide a better understanding of the present application. However, those skilled in the art will readily recognize that in some instances some of these features may be omitted or replaced by other elements, materials, or methods. In some instances, certain operations related to the present application are not shown or described in detail in order to avoid obscuring the core of the present application with excessive description; a detailed account of these operations is unnecessary for those skilled in the art, who can fully understand them from the description in the specification and the general knowledge in the art.
Furthermore, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Also, the order of the various steps or actions in the method descriptions may be swapped or adjusted in ways apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings are only for describing certain embodiments and do not imply a required order, unless it is otherwise stated that a certain order must be followed.
The numbering of components herein, such as "first" and "second", is used only to distinguish the objects described and does not carry any sequential or technical meaning. The terms "connected" and "coupled", as used in this application, include both direct and indirect connections (couplings) unless otherwise indicated.
Example one
As shown in fig. 1, conventional convolutional neural networks (CNNs) and deep convolutional neural networks (DCNNs) involve convolution and pooling calculations. Convolving an input feature (generally a pixel matrix) produces a corresponding convolution result; each pixel point (or datum) of the input feature generally serves as a convolution center, and convolving the edge data also involves padding. The convolution result is then max pooled, yielding a maximum pooling result. As can be seen, the input feature must be convolved 16 times to obtain the convolution result, while the maximum pooling extracts only 4 of those data. That is, in the conventional convolution-max pooling calculation, 12 of the 16 convolution results are redundant and never participate in the max pooling calculation: the redundant computation amounts to 12/16 = 75%. Meanwhile, each convolution result requires the convolution kernel to multiply and accumulate over a convolution region of the input feature; with a 3 × 3 kernel, the 9 weights are multiplied by the 9 data of the region and then summed, which illustrates how much redundant computation the existing convolution-max pooling procedure involves. In particular, as input features grow ever larger, existing deep convolutional neural networks suffer from low calculation speed and high calculation power consumption. Unless otherwise specified, the pooling referred to in this example is maximum pooling.
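To make the 75% figure concrete, the following is a minimal sketch (assuming, for illustration only, a 4 × 4 input, a 3 × 3 all-ones kernel with zero padding, and a 2 × 2 pooling window; the values are arbitrary):

```python
import numpy as np

def conv2d_same(x, k):
    """'Same' convolution: every pixel of x serves once as a convolution center."""
    n = k.shape[0]
    pad = n // 2
    xp = np.pad(x, pad)  # zero padding for the edge data
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + n, j:j + n] * k)
    return out

x = np.arange(16).reshape(4, 4)      # arbitrary 4 x 4 input feature
k = np.ones((3, 3), dtype=x.dtype)   # arbitrary 3 x 3 convolution kernel
conv = conv2d_same(x, k)             # 16 convolution results
pooled = conv.reshape(2, 2, 2, 2).max(axis=(1, 3))  # 2 x 2 max pooling keeps 4
print(conv.size, pooled.size)        # 16 4 -> 12 of the 16 results (75%) are discarded
```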
In view of this, the present invention provides a deep convolutional neural network acceleration method, module, system and storage medium, aiming to solve the problems of low calculation speed and high calculation power consumption of existing deep convolutional neural networks.
As shown in fig. 2, the acceleration module 30 for a deep convolutional neural network provided by the present invention includes a control module 31, an on-chip storage module 32, and a computation module 33, and the acceleration system for a deep convolutional neural network provided by the present invention includes the acceleration module 30, a processor 10, and an off-chip storage module 20.
As shown in fig. 3, the deep convolutional neural network acceleration method provided by the present invention adopts a two-stage cascade convolution mode, and includes the following steps:
step 1: the control module 31 obtains an input feature, for example, a pixel matrix of a picture, wherein the input feature is derived from the off-chip storage module 20, and the control module 31 sends a reading instruction to the off-chip storage module 20 through the processor 10, so as to transmit the input feature in the off-chip storage module 20 to the on-chip storage module 32.
Step 2: the computation module 33 obtains the high-order features and the low-order features of the input features. The input features may be processed by the activation function module 332, which may, for example, use a ReLU function, so as to obtain the high-order features and the low-order features of the input features.
The high-order feature and the low-order feature may be, respectively, the first half and the second half of the bits of the input feature. For example, when the pixel value (e.g., gray value) of each pixel of the input features is eight-bit data, the data corresponding to the high-order features is the first four bits (defined as high-order data), and the data corresponding to the low-order features is the last four bits (defined as low-order data). Specifically, taking binary as an example, when the pixel value of a pixel is 200, i.e. 200D = 11001000B, the high-order data of the pixel is 1100B = 12D and the low-order data is 1000B = 8D. The high-order data of all pixels in the input features form the high-order features, and the low-order data of all pixels form the low-order features. It should be understood that the above is only an example and not a limitation on the bit width of the input features; the acceleration method provided by the present invention applies to any even bit width, such as 8-bit, 16-bit or 32-bit data.
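A minimal sketch of this bit splitting for unsigned 8-bit pixel values (the helper name is illustrative):

```python
def split_high_low(pixel):
    """Split an 8-bit pixel value into high-order and low-order data (two nibbles)."""
    high = pixel >> 4    # first four bits, e.g. 200 -> 1100B = 12
    low = pixel & 0x0F   # last four bits,  e.g. 200 -> 1000B = 8
    return high, low

high, low = split_high_low(200)    # 200D = 11001000B
assert (high << 4) | low == 200    # the pixel is exactly recovered from both halves
```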
To further illustrate, as shown in fig. 4, the input feature is the pixel matrix of an image, and the pixel value of each pixel point may be a non-zero value (e.g., a, b, c … s as shown) or a zero value (0 as shown). The pixel point a in the input feature corresponds to the high-order data a1 in the high-order features and the low-order data a2 in the low-order features, which can be simply understood as a = a1 + a2 (numerically a = a1 × 2⁴ + a2, since a1 occupies the high four bits). As can be seen, the matrix positions of the data in the high-order features and the low-order features correspond one-to-one; that is, every datum or pixel point (zero or non-zero) of the input feature has its own position coordinates.
Step 3: the computation module 33 performs convolution and maximum pooling on the high-order features to obtain a high-order pooling result.
For example, as shown in fig. 5, each high-order datum in the high-order features serves as a convolution center, and the high-order data are convolved to obtain a high-order convolution result; for the data at the edges, the high-order features can be extended by padding before convolution. Maximum pooling is then performed on the high-order convolution result to obtain the high-order pooling result. The high-order pooling result may be kept in a register of the computation module 33, in the on-chip storage module of the acceleration module 30, or in the off-chip storage module of the acceleration system.
Step 4: the control module 31 obtains the corresponding maximum pooling region according to the high-order pooling result.
For example, as shown in fig. 5, the data in the high-order pooling result can be used to trace back the high-order data that served as the convolution centers in the high-order features. The datum a1 in the high-order pooling result shown in the figure corresponds to the datum a1 of the high-order convolution result, which is the convolution result computed with the datum a1 of the high-order features as the convolution center. Therefore, from the high-order pooling result, the region of the high-order convolution result and of the high-order features corresponding to the maximum pooling result (i.e., the shaded data in the figure) can be obtained (for convenience of description, this region is defined as the maximum pooling region). That is, the positions of the high-order data (as convolution centers) that produce the maximum pooling result can be obtained. From the above analysis, the high-order data of the high-order features correspond one-to-one to the input features and to the low-order data of the low-order features; the maximum pooling region can therefore be applied to the convolution calculation of the low-order features in the next step.
Step 5: the computation module 33 convolves the low-order corresponding features to obtain a low-order convolution result, wherein the low-order corresponding features are the data corresponding to the maximum pooling region in the low-order features.
For example, as shown in fig. 6, the control module 31 obtains, according to the maximum pooling region, the low-order data (shaded in the figure) corresponding to that region in the low-order features and sends them to the computation module 33. For convenience of description, the set of low-order data corresponding to the maximum pooling region in the low-order features is defined as the low-order corresponding features. The computation module 33 convolves only the low-order data of the low-order corresponding features, each as the convolution center of its corresponding convolution region, to obtain the low-order convolution result, which is equivalent to directly obtaining the low-order pooling result of the low-order features.
Further, as shown in fig. 7, in the prior-art convolution-max pooling calculation, every datum in the low-order features is convolved once as a convolution center, finally producing the illustrated low-order pooling result. As can be seen, all data of the low-order convolution result except the shaded part are redundant convolution results, a proportion of 75% (for a 2 × 2 pooling window); this portion of the data is plainly not needed to complete the maximum pooling of the input features.
Therefore, through step 5, the low-order convolution result shown in fig. 6 can be obtained directly, and it is equivalent to the low-order pooling result shown in fig. 7.
Step 6: the computation module 33 obtains the maximum pooling result of the input features according to the high-order pooling result and the low-order convolution result.
As shown in fig. 8, in the acceleration method the high-order features of the input features undergo an approximate convolution, the maximum pooling region is found from the high-order pooling result, and, using the correspondence between the high-order and low-order features, only the corresponding part of the low-order features is convolved. The low-order convolution result of the low-order features (equivalent to the result of a full convolution-max pooling calculation on the low-order features) is thus obtained directly, and the maximum pooling result of the input features is finally obtained from the high-order pooling result and the low-order convolution result. This correspondence between the high-order and low-order features, and between their convolutions, is the two-stage cascaded convolution scheme. In this way, 75% of the redundant convolution multiply-add calculations on the low-order features are eliminated when max pooling the input features, which ultimately increases the speed of the convolution-pooling calculation and reduces its energy consumption.
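A compact software sketch of the two-stage cascade (reusing the conv2d_same helper from the earlier sketch; a 2 × 2 pooling window and unsigned 8-bit input are assumed, and all helper names are illustrative):

```python
import numpy as np

def cascade_conv_maxpool(x, k):
    """Stage 1: convolve + max pool the high-order features; stage 2: convolve the
    low-order features only at the maximum pooling regions found in stage 1."""
    high, low = x >> 4, x & 0x0F
    hc = conv2d_same(high, k)              # high-order convolution result
    n, pad = k.shape[0], k.shape[0] // 2
    lp = np.pad(low, pad)                  # zero padding for the low-order features
    h, w = x.shape
    out = np.zeros((h // 2, w // 2), dtype=np.int64)
    for bi in range(h // 2):
        for bj in range(w // 2):
            block = hc[2 * bi:2 * bi + 2, 2 * bj:2 * bj + 2]
            di, dj = np.unravel_index(block.argmax(), block.shape)
            i, j = 2 * bi + di, 2 * bj + dj          # maximum pooling region position
            lo = np.sum(lp[i:i + n, j:j + n] * k)    # one low-order convolution per window
            out[bi, bj] = (int(block[di, dj]) << 4) + lo  # combine high- and low-order results
    return out
```

Since convolution is linear, (high-order result << 4) + low-order result equals the full-precision convolution at the chosen position; only the choice of the maximum pooling region, made from the high-order result alone, is approximate, which is exactly the trade-off the method describes.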
The acceleration method provided by this embodiment can be used for edge computing, for example in cloud offloading, video analysis and smart homes, and also in accelerated computation for face recognition in security cameras, access control, or on-site detection.
For example, in a face recognition application, when the acceleration method is implemented as a computer program, the program may be stored in a storage medium of a face recognition terminal and executed by the terminal's processor. The camera of the face recognition terminal captures a face image, the recognition calculation is performed via the acceleration method, the terminal produces a face recognition result and uploads it to a server or other system, and operations such as opening a door or playing a voice prompt are carried out according to the result.
Example two
In deep convolutional neural network applications, most pixel points in the input features are zero and non-zero values are few. In the prior art, a conventional acceleration method for convolution is to skip the calculation after judging that a value is zero; in that case one clock cycle is still needed to judge each zero value before the calculation is skipped.
To further increase the calculation speed of the deep convolutional neural network, the acceleration method provided by the present invention can be further improved as follows: a full zero-skip calculation mode is adopted to accelerate the convolution.
Step 2, in which the computation module 33 obtains the high-order features and the low-order features of the input features, may include:
step 201: the calculation module 33 obtains a non-zero value of the input feature and a sparse mapping matrix of the input feature according to the input feature.
Step 202: the calculation module 33 obtains the high-order features and the low-order features of the non-zero values.
As shown in fig. 9, a "1" data bit in the sparse mapping matrix represents a non-zero value and a "0" data bit a zero value; that is, "1" is the non-zero-value sparse mapping flag and "0" the zero-value sparse mapping flag. The "1"s and "0"s form a matrix following the position relation of the pixel points, and this matrix is defined as the sparse mapping matrix. From the non-zero values and the sparse mapping matrix, the pixel matrix of the input features can be restored: for example, the non-zero values are filled in order, left to right and top to bottom, into the "1" data bits of the sparse mapping matrix, yielding the input features.
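A minimal sketch of this representation (row-major order, i.e. left to right and top to bottom, is assumed; helper names are illustrative):

```python
import numpy as np

def to_sparse(x):
    """Encode a feature map as (non-zero values, sparse mapping matrix)."""
    mask = (x != 0).astype(np.uint8)  # '1' marks a non-zero value, '0' a zero value
    values = x[x != 0]                # non-zero values, left to right, top to bottom
    return values, mask

def from_sparse(values, mask):
    """Restore the pixel matrix by filling the values into the '1' data bits."""
    x = np.zeros(mask.shape, dtype=values.dtype)
    x[mask == 1] = values
    return x
```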
The data corresponding to the non-zero values are processed to obtain the high-order features and the low-order features of the non-zero data set. In the subsequent convolutions, all data participating in the convolution are non-zero and no zero values need to be judged, which accelerates the convolution calculation. In actual data storage, only non-zero values are stored, whether in the on-chip storage module 32 or the off-chip storage module 20, which reduces both computation time and data-reading time.
Similarly, the sparse mapping matrix can be applied to the feature maps corresponding to the high-order and low-order convolution results: when a high-order or low-order convolution result is used as a new input feature for a second convolution, the same acceleration method can be adopted to accelerate that second convolution. It can also be applied to other calculations such as pooling, i.e., only the non-zero values in the convolution results participate in pooling.
Before step 3, in which the computation module 33 performs convolution and maximum pooling on the high-order features, the acceleration method may further include:
step 203: the control module 31 determines the weight of the position corresponding to the non-zero value in the convolution kernel according to the sparse mapping matrix.
When the high-order data are convolved, every pixel point in the high-order features must still be convolved once as a convolution center; the convolution kernel moves point by point, and the convolution center of a convolution region of the high-order features may be a zero value or a non-zero value, but the data of each convolution region necessarily comprise non-zero values, zero values, or both. To increase the calculation speed, this embodiment provides an acceleration method based on full zero-skip calculation.
As shown in fig. 10, the positions of the non-zero and zero values in the next convolution region can be obtained from the sparse mapping matrix. For example, when the "0" in the high-order features (the shaded datum in the figure, second row, third column) is the convolution center, the convolution region and its data are determined by the size of the convolution kernel (3 × 3 in the figure). The non-zero values of that region (b1, e1 and h1 as shown) and their positions can be determined from the sparse mapping matrix. Likewise, the kernel weights (W, A and X as shown) that multiply those non-zero values when the kernel is convolved with this region can be determined from the sparse mapping matrix. The convolution result with this "0" datum as convolution center is then P(0) = W × b1 + A × e1 + X × h1, and the calculation of P(0) is completed in only three clock cycles.
In the conventional convolution process, the convolution result obtained with this "0" datum as the convolution center is P(0) = Q × 0 + W × b1 + E × 0 + A × e1 + S × 0 + D × 0 + Z × 0 + X × h1 + C × 0 (writing the nine kernel weights, numbered 1 to 9, as Q, W, E, A, S, D, Z, X, C per figs. 10 and 11), which occupies nine clock cycles although six of the products are zero.
The present method eliminates the redundant calculations produced by zero values during convolution and the clock cycles they occupy: the zero values in the high-order features (and correspondingly in the input features) are skipped. Skipping zero values is referred to simply as zero-skipping.
Therefore, with this technical solution, the weights of the positions corresponding to the non-zero values in the convolution kernel can be determined for every convolution region, realizing full zero-skip calculation on the high-order features and greatly increasing the calculation speed. Owing to the correspondence between the high-order and low-order features, full zero-skip calculation is likewise realized when convolving the low-order features, and thus full zero-skip calculation of the input features is finally achieved.
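A software sketch of the full zero-skip convolution at a single convolution center (the sparse mapping matrix `mask` and zero padding are assumed; in hardware, the zero positions are never dispatched at all, so no cycle is spent judging them):

```python
def zero_skip_conv_at(x, mask, k, ci, cj):
    """Convolve one center, issuing a multiply-add only where the sparse map is '1'."""
    n = len(k)
    pad = n // 2
    acc = 0
    for di in range(n):
        for dj in range(n):
            i, j = ci + di - pad, cj + dj - pad
            if 0 <= i < len(x) and 0 <= j < len(x[0]) and mask[i][j]:
                acc += k[di][dj] * x[i][j]  # one clock cycle per non-zero value only
    return acc

# e.g. for the '0' center of fig. 10 this issues exactly three multiply-adds:
# P(0) = W*b1 + A*e1 + X*h1
```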
Based on this, in step 3 of the above technical solution, the convolving of the high-order features may include: convolving the high-order features of the non-zero values by the weights of the positions corresponding to the non-zero values in the convolution kernel;
and in step 5 of the above technical solution, the convolving of the low-order corresponding features may include: convolving the low-order corresponding features by the weights of the positions corresponding to the non-zero values in the convolution kernel.
Example three
Conventional convolution kernels are generally 3 × 3 or 5 × 5, though larger sizes are also used. The computation module 33 usually performs the convolution with a computing unit array (PE array); when the matrix size of the PE array is smaller than the convolution kernel, the convolution region must be split and processed in several passes, which complicates the logic and slows the calculation. For example, when the convolution of a 5 × 5 kernel is performed by a 3 × 3 PE array, the array can process at most 9 data at a time, so the calculation must be completed in at least three passes.
To further increase the calculation speed of the deep convolutional neural network, the acceleration method provided by the present invention can be further improved as follows: a logical mapping is established between the convolution data (the convolution kernel weights and the non-zero values) and the PE array, simplifying the calculation logic.
Step 3, in which the high-order features are convolved and max pooled to obtain a high-order pooling result, may include:
step 301: the control module 31 matches the calculation units of the same matrix according to the size of the convolution kernel. The calculation units matched with the same matrix are the calculation units matched with the same number as the weight number of the convolution kernel, and each calculation unit calculates the weight of one position. For example, for a convolution kernel of 3 × 3, 9 compute units are matched to form a PE array for performing convolution operations. The convolution check should have 1 to 9 weights, numbered from left to right and top to bottom, with the compute units in the PE array numbered correspondingly. For example, the "Q" weight in the convolution kernel as in FIG. 11 can be numbered 1, i.e., 1 in the upper left corner of the PE array. That is, the number of calculation units cannot be smaller than the number of weights of the convolution kernel.
Step 302: according to the sparse mapping matrix, the control module 31 sends the data of each convolution region in the high-order features to the corresponding computing units according to the position relation, and sends the weights of the corresponding non-zero-value positions in the convolution kernel to the corresponding computing units.
For example, as shown in figs. 10 and 11, when the "0" datum is convolved as the convolution center, the kernel weights are sent, via the sparse mapping matrix and following the matrix position relation, to the corresponding computing units: weight "W" to computing unit number 2 (PE-2 in the drawing), weight "A" to computing unit number 4 (PE-4), and weight "X" to computing unit number 8 (PE-8).
Step 303: the computation module 33 performs convolution and maximum pooling on the high-order features to obtain the high-order pooling result, the computing units performing the convolution of the high-order features.
As shown in figs. 5, 10 and 11, by transmitting and computing data according to this positional logical mapping, the PE array completes the convolution calculation quickly, and the PE array size is guaranteed to always satisfy the convolution kernel size.
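A sketch of this number-for-number dispatch logic (a hypothetical PE abstraction; weights and region data travel to the PE whose number matches their matrix position):

```python
def dispatch_to_pes(region, mask_region, kernel):
    """Pair each non-zero datum of a convolution region with the identically
    numbered kernel weight; PEs receiving no pair stay idle for this region."""
    n = len(kernel)
    jobs = []
    for r in range(n):
        for c in range(n):
            if mask_region[r][c]:            # sparse mapping: '1' = non-zero value
                pe = r * n + c + 1           # numbering: left to right, top to bottom
                jobs.append((pe, kernel[r][c], region[r][c]))
    return jobs

# For the region of fig. 10, only PE-2, PE-4 and PE-8 receive work:
# [(2, 'W', 'b1'), (4, 'A', 'e1'), (8, 'X', 'h1')]
```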
Similarly, step 5, in which the low-order corresponding features are convolved to obtain a low-order convolution result, may include:
step 501: and matching the calculation units of the same matrix according to the size of the convolution kernel.
Step 502: according to the sparse mapping matrix, sending the data of each convolution region in the low-order corresponding features to the corresponding computing units according to the position relation, and sending the weights of the corresponding non-zero-value positions in the convolution kernel to the corresponding computing units.
Step 503: performing, by the computing units, the convolution on the low-order corresponding features.
The convolution of the high-order and low-order corresponding features has been illustrated here with the full zero-skip process, but the scheme applies equally to conventional convolution without zero-skipping. The simple mapping logic between the convolution kernel (or convolution region) and the PE array saves the time otherwise wasted on data movement, convolution kernel cropping and the like, and increases the calculation speed.
Combining the three embodiments, the deep convolutional neural network acceleration method provided by the present invention has been described from three aspects: the two-stage cascade (stepped convolution of high-order and low-order features), full zero-skip calculation, and the logical mapping of the computing units.
Example four
Various hardware structures can implement the acceleration method of the above embodiments. This embodiment provides an acceleration module 30 for a deep convolutional neural network, which may include a control module 31, a computation module 33 and an on-chip storage module 32. The acceleration module 30 completes the convolution calculation with the acceleration method, thereby increasing the speed of the convolution calculation.
The control module 31 is used for acquiring the input features and sending them to the computation module 33.
The computation module 33 is configured to obtain the high-order features and the low-order features of the input features, and to perform convolution and maximum pooling on the high-order features to obtain a high-order pooling result.
The control module 31 is further configured to obtain the corresponding maximum pooling region according to the high-order pooling result, and to send the low-order corresponding features to the computation module 33 according to the maximum pooling region, the low-order corresponding features being the data corresponding to the maximum pooling region in the low-order features.
The computation module 33 is further configured to convolve the low-order corresponding features to obtain a low-order convolution result, and to obtain the maximum pooling result of the input features according to the high-order pooling result and the low-order convolution result.
The on-chip storage module 32 is configured to store at least one of the input features, high-order features, low-order features, high-order pooling results, maximum pooling regions, low-order convolution results, and maximum pooling results.
As shown in figs. 2 and 11, the computation module 33 may include a computing unit array (PE array) composed of M × M computing units (PEs), configured to convolve the high-order features and the low-order corresponding features, where the convolution kernel adopted by the computation module 33 has size N × N, M ≥ N, and M and N are positive integers. With enough computing units guaranteed, a simpler mapping logic can be achieved: the weights at the different positions of the convolution kernel are applied directly to the identically numbered PEs for the convolution calculation. For example, with a 5 × 5 PE array 331 performing a convolution with a 3 × 3 kernel, the control module 31 has 9 computing units participate in the calculation while the remaining 16 sleep.
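A small sketch of the sizing rule M ≥ N (one possible assignment, using the top-left N × N sub-array; the numbering is illustrative):

```python
def active_pes(m, n):
    """For an N x N kernel on an M x M PE array, list the PEs that participate."""
    assert m >= n, "PE array must be at least as large as the convolution kernel"
    return [r * m + c + 1 for r in range(n) for c in range(n)]

print(len(active_pes(5, 3)))   # 9 units participate; the remaining 16 sleep
```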
As shown in fig. 12, each computing unit is composed of a plurality of sub-computing units; for example, each computing unit includes 36 sub-computing units (which may also be called Sub-bundles). As shown in fig. 13, each sub-computing unit may include a controller 3301, a multiplier 3302, a register 3303 and two multiplexers 3304 to complete the multiply-add operation on input data. The controller 3301 outputs the SEL and Out_SEL signals to select the source of the input signal and the destination of the output signal. The input sources include data from the partial-sum memory, data from the PE register, and data from the external register; the output destinations include the activation function module 332, the partial-sum memory responsible for storing partial sums, the PE register, and the external register. Of course, the sub-computing unit may be any existing minimal multiply-add unit that achieves the above effects.
Several sub-computing units may form a group, and one sub-computing unit or one group of sub-computing units calculates the weight of the corresponding position of the convolution kernel for one channel. Besides its size, a convolution kernel has a depth parameter, and one depth corresponds to one channel; when the convolution kernel has multiple channels, each sub-computing unit or group of sub-computing units computes one channel. The corresponding computing units and sub-computing units are matched according to the size and channel count of the actual convolution kernel.
Before convolution, the control module 31 may also enable the corresponding number of sub-computing units according to the number of channels (depth) of the convolution kernel. Thus, when one convolution region of the high-order or low-order features is convolved, the PE array 331, whose computing units contain multiple sub-computing units, matches the convolution kernels across the multiple channels and completes the convolution calculation in the shortest time.
The computation module 33 is configured to obtain the non-zero values of the input features and the sparse mapping matrix of the input features, where the input features are a pixel matrix, and to obtain the high-order and low-order features of the non-zero values.
In one implementation, as shown in fig. 2, the computation module 33 may further include an activation function module 332, configured to obtain the non-zero values of the input features, the non-zero-value sparse mapping flags and the sparse mapping matrix of the input features, where the input features are a pixel matrix, and to obtain the high-order and low-order features of the non-zero values. The activation function module 332 may use a ReLU function.
As shown in fig. 14, the acceleration module 30 provided by the present invention further includes a sparse mapping flag and data storage module (sparse mapping module 34 for short), which generates addresses according to the sparse mapping matrix and writes the non-zero values and the non-zero-value sparse mapping flags into the off-chip storage module 20. The sparse mapping module 34 is also used to read the non-zero values and the non-zero-value sparse mapping flags back from the off-chip storage module 20.
For example, as shown in fig. 15, the sparse mapping module 34 may include a write address generation module 341, a write cache module 342, an asynchronous FIFO data module 343 and a read address generation module 344. The input features pass through the activation function module 332, generating the output features (i.e., the high-order and low-order features), the sparse mapping flags, and the coordinate information associated with the output features. The output features and the sparse mapping flag data are fed to the write cache module 342 and expanded arithmetically into data of the corresponding bit width, while the associated coordinate information is fed to the write address generation module 341 and converted into write addresses by a specific formula. Since the clock frequency of the on-chip storage module 32 differs from that of the off-chip storage module 20, the asynchronous FIFO data module 343 handles the data interaction between them. The read address generation module 344 derives the read address from the coordinate information sent by the control module 31, reads the non-zero values and the sparse mapping flag data from the off-chip storage module 20 into the on-chip storage module 32, and sends a data-load-success signal to the control module 31.
More specifically, the coordinates of a pixel point in the input features (pixel matrix), obtained via the sparse mapping matrix, may be written (x, y, ch), where x is the abscissa, y the ordinate and ch the channel. With Y and CH denoting the height and channel count of the feature map, the write address of a sparse mapping flag is ad1 = (x × Y × CH + y × CH + ch)/16, and the write address of an output feature value (high-order or low-order) is ad2 = (x × Y × CH + y × CH + ch)/2.
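A sketch of this address generation under the reading above (assuming 16 one-bit sparse mapping flags, or two half-width feature values, are packed per memory word; Y and CH are the feature-map height and channel count):

```python
def flag_write_address(x, y, ch, Y, CH):
    """ad1: word address of a sparse mapping flag (16 one-bit flags per word)."""
    return (x * Y * CH + y * CH + ch) // 16

def feature_write_address(x, y, ch, Y, CH):
    """ad2: word address of an output feature value (two half-width values per word)."""
    return (x * Y * CH + y * CH + ch) // 2
```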
The on-chip storage module 32 may be partitioned by data type: one part stores the input features, high-order features and low-order features; another stores the convolution kernel data; another the sparse mapping matrix data; and the last the partial sums (Partial Sum) generated by the convolutions of the PE array 331.
Example five
As shown in fig. 2, the acceleration system for a deep convolutional neural network provided by the present invention includes a processor 10, an off-chip storage module 20, and an acceleration module 30 according to the above technical solution;
processor 10 is configured to control data exchanges between off-chip memory module 20 and on-chip memory module 32, and/or processor 10 is a socket general RISC-V processor.
The Rocket processor includes an open-source processor core, a data path, a data cache (DCache) and an instruction cache (ICache), and mainly serves to control the operating state of the acceleration module 30. Compared with other neural network acceleration systems, which adopt dedicated compilers, the Rocket processor is a RISC-V-based open-source project; this brings great convenience to development and application, reduces the cost of additional learning, and offers better universality and portability. Other types of processors are generally developed for particular computer languages, and some are strictly bound to those languages and are not open source, so their portability is low.
Example six
This embodiment further illustrates how the acceleration system for a deep convolutional neural network provided by the present invention implements the two-stage cascaded, full zero-skip acceleration method.
Step 10: the off-chip storage module 20 loads the input features, and the processor 10 pre-processes them so that the bit width of the pixel points in the input features matches the bit width computed by the acceleration module 30, e.g., converting the input features into 16-bit data.
Step 20: the processor 10 sends the input features to the on-chip storage module 32, and the control module 31 sends the input features to the computation module 33.
Step 30: the computation module 33 processes the input features with the activation function module, generating the non-zero high-order features and low-order features, the sparse mapping matrix and the coordinate information; the control module 31 sends these to the sparse mapping module 34.
Step 40: the sparse mapping module 34 generates the write and read addresses of the high-order and low-order data from the coordinate information of the high-order and low-order features, and stores the high-order and low-order features in the off-chip storage module 20 or the on-chip storage module 32.
Step 50: the control module 31 sends the high-order features to the computation module 33, which convolves and max pools them to obtain the high-order pooling result. During convolution, the participating data of the high-order features and the corresponding convolution kernel weights are sent to the corresponding computing units according to the position relation.
Step 60: the control module 31 obtains the maximum pooling region from the high-order pooling result and determines the data corresponding to that region in the low-order features, i.e., determines the low-order corresponding features from the maximum pooling region.
Step 70: the control module 31 sends the low-order corresponding features to the computation module 33, which convolves them to obtain the low-order convolution result. During convolution, the participating data of the low-order corresponding features and the corresponding convolution kernel weights are sent to the corresponding computing units according to the position relation.
Step 80: the computation module 33 superimposes the high-order pooling result and the low-order convolution result to obtain the maximum pooling result of the input features.
Step 90: the processor generates a recognition result from the maximum pooling result and sends it to a display, thereby presenting the recognition result of the input features.
Those skilled in the art will appreciate that all or part of the functions of the methods in the above embodiments may be implemented by hardware or by computer programs. When all or part of the functions are implemented by a computer program, the program may be stored in a computer-readable storage medium, such as a read-only memory, a random access memory, a magnetic disk, an optical disk or a hard disk, and executed by a computer to realize those functions. For example, the program may be stored in the memory of a device, and all or part of the functions are realized when the processor executes the program in the memory. The program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk or a removable hard disk, and downloaded or copied into the memory of a local device, or used to update the local device's system, the functions of the above embodiments being realized when the processor executes the program in the memory.
The present invention has been described with reference to specific examples, which merely aid its understanding and do not limit it. Those skilled in the art to which the invention pertains may make several simple deductions, modifications or substitutions according to the concept of the invention.