Fixed-point method and device for neural network
1. A fixed-point method for a neural network, the method comprising:
performing low-bit quantization on the neural network, wherein the low-bit quantization is at least one of the following:
performing a first low-bit quantization on convolutional layer input activation values in the neural network,
performing a second low-bit quantization on convolution kernel weights in the convolutional layer, and
performing a third low-bit quantization on input activation values of the non-convolution functional layers other than the convolutional layer in the neural network;
after any of the low-bit quantizations is performed, retraining based on the neural network after the current low-bit quantization;
performing fixed-point processing based on each low-bit quantization result in the retrained neural network; and
loading the fixed-point neural network;
wherein the first low bit, the second low bit and the third low bit are all within 1 bit to 8 bits.
2. The fixed-point method of claim 1, wherein performing the first low-bit quantization on the convolutional layer input activation values in the neural network comprises:
inputting training data to a neural network, and calculating a first activation value input by each convolutional layer;
for each first activation value of any convolutional layer:
determining a current quantization step size according to the distribution of the first activation values of the convolutional layer;
calculating the upper limit of the first activation values according to the current quantization step size and the first low bit;
limiting each first activation value of the convolutional layer according to the upper limit to obtain each second activation value of the convolutional layer; and
quantizing each second activation value of the convolutional layer to obtain the quantized activation value of the convolutional layer.
3. The fixed-point method of claim 1, wherein performing the second low-bit quantization on the convolution kernel weights in the convolutional layer comprises:
for any output channel in any convolutional layer:
obtaining each first weight of the convolution kernels corresponding to the output channel;
determining a current weight quantization threshold of the output channel according to the distribution of the first weights;
obtaining, based on the current weight quantization threshold, a current quantization weight for each first weight of the output channel, estimating the error between the current quantization weights and the first weights, and adjusting the current weight quantization threshold according to the error until the error between the current quantization weights and the first weights is minimal, and taking the quantization threshold corresponding to the minimum error as the weight quantization threshold coefficient of the output channel;
calculating a first amplitude of the output channel based on the weight quantization threshold coefficient;
based on the first amplitude of the output channel, expressing the first weights with the second low bit to obtain second weights; and quantizing the weight quantization threshold coefficient of the output channel according to the comparison result between the second weights and the first amplitude of the output channel to obtain a coefficient quantization value of the weight quantization threshold coefficient of the output channel.
4. The fixed-point method of claim 3, wherein determining the current weight quantization threshold of the output channel according to the distribution of the first weights comprises:
accumulating the squares of all first weights in the convolution kernels corresponding to the output channel to obtain an accumulated result of the output channel, and determining a current weight quantization threshold in a specified threshold range based on the accumulated result of the output channel;
wherein obtaining the current quantization weight of the output channel based on the current weight quantization threshold comprises:
calculating the product of the accumulated result of the output channel and the current weight quantization threshold value of the output channel to obtain the current amplitude value of the output channel,
comparing each first weight value in the convolution kernel corresponding to the output channel with the current amplitude value,
if the first weight is greater than the current amplitude, assigning the current amplitude to the first weight to obtain a current quantization weight;
if the first weight is smaller than the negative current amplitude, assigning the negative current amplitude to the first weight to obtain a current quantization weight;
if the first weight is greater than or equal to the negative current amplitude and less than or equal to the current amplitude, assigning 0 to the first weight to obtain a current quantization weight;
wherein estimating the error between the current quantization weights and the first weights and adjusting the current weight quantization threshold according to the error comprises:
calculating the average error or the accumulated error between the current quantization weights of the first weights of the output channel and the first weights, and if the error is larger than a set error range, re-executing the step of determining the current weight quantization threshold of the output channel according to the distribution of the first weights;
and wherein calculating the first amplitude of the output channel based on the weight quantization threshold coefficient comprises: calculating the product of the accumulated result of the output channel and the weight quantization threshold coefficient of the output channel to obtain the first amplitude.
5. The fixed-point method of claim 4, wherein expressing the first weights with the second low bit based on the first amplitude of the output channel comprises: for an insensitive convolutional layer, using a single second low-bit base to express the first weights; and for a sensitive convolutional layer, using a combination of two or more second low-bit bases to express the first weights;
wherein, after the first low-bit quantization of the convolutional layer input activation values and/or the second low-bit quantization of the convolution kernel weights in the convolutional layer, retraining based on the neural network after the current low-bit quantization comprises:
for any convolutional layer:
calculating each weight of each output channel according to the coefficient quantization value of the weight quantization threshold coefficient of each output channel in the convolutional layer and the first amplitude of each output channel to obtain first quantization weights, and substituting the first quantization weights of each output channel into the neural network for retraining.
6. The fixed-point method of claim 5, wherein expressing the first weights with a single second low-bit base for an insensitive convolutional layer comprises:
determining the fixed-point value of the weight quantization threshold coefficient within the fixed-point value range determined by the second low bit, and calculating the product of the fixed-point value and the first amplitude to obtain a second weight;
wherein quantizing the weight quantization threshold coefficient of the output channel according to the comparison result between the second weight and the first amplitude of the output channel to obtain the coefficient quantization value of the weight quantization threshold coefficient of the output channel comprises:
comparing the second weight with the first amplitude of the output channel,
if the second weight is greater than the first amplitude, taking the fixed-point value of the weight quantization threshold coefficient as the first coefficient quantization value of the weight quantization threshold coefficient;
if the second weight is smaller than the negative first amplitude, taking the negative fixed-point value of the weight quantization threshold coefficient as the first coefficient quantization value of the weight quantization threshold coefficient;
if the second weight is greater than or equal to the negative first amplitude and less than or equal to the first amplitude, taking 0 as the first coefficient quantization value of the weight quantization threshold coefficient;
wherein calculating each weight of each output channel according to the coefficient quantization value of the weight quantization threshold coefficient of each output channel in the convolutional layer and the first amplitude of each output channel to obtain the first quantization weights comprises:
for any output channel in any convolutional layer:
multiplying the first amplitude of the output channel by the first coefficient quantization value of the weight quantization threshold coefficient of the output channel to obtain a first quantization weight, which serves as the quantization result of each first weight in the convolution kernels corresponding to the output channel.
7. The fixed-point method of claim 6, wherein expressing the first weights with a combination of two or more second low-bit bases for a sensitive convolutional layer comprises:
determining two or more fixed-point values of the weight quantization threshold coefficient within the fixed-point value range determined by the second low bit to obtain combined weight quantization threshold coefficients;
selecting two or more combined output channels from the convolutional layer, the number of combined output channels being the same as the number of combined weight quantization threshold coefficients; and
multiplying the first amplitude of each combined output channel by the corresponding combined weight quantization threshold coefficient, and adding the multiplication results to obtain a second weight;
wherein quantizing the weight quantization threshold coefficient of the output channel according to the comparison result between the second weight and the first amplitude of the output channel to obtain the coefficient quantization value of the weight quantization threshold coefficient of the output channel comprises:
for any combined output channel in any convolutional layer:
taking the result of multiplying the first amplitude of the combined output channel by the combined weight quantization threshold coefficient of the combined output channel as a second amplitude;
comparing the second weight with the second amplitude of the combined output channel;
if the second weight is greater than the second amplitude, taking the fixed-point value of the combined weight quantization threshold coefficient as the second coefficient quantization value of the combined weight quantization threshold coefficient;
if the second weight is smaller than the negative second amplitude, taking the negative fixed-point value of the combined weight quantization threshold coefficient as the second coefficient quantization value of the combined weight quantization threshold coefficient;
if the second weight is greater than or equal to the negative second amplitude and less than or equal to the second amplitude, taking 0 as the second coefficient quantization value of the combined weight quantization threshold coefficient;
wherein calculating each weight of each output channel according to the coefficient quantization value of the weight quantization threshold coefficient of each output channel in the convolutional layer and the first amplitude of each combined output channel to obtain the first quantization weights comprises:
for any convolutional layer:
calculating the sum of the products of the first amplitude of each combined output channel and the second coefficient quantization value of that combined output channel to obtain a first quantization weight, which serves as the quantization result of each first weight in the convolution kernels corresponding to the output channel.
8. The fixed-point method of claim 1, wherein performing the third low-bit quantization on the input activation values of the non-convolution functional layers other than the convolutional layer in the neural network comprises:
acquiring a third activation value input by each non-convolution function layer;
for each third activation value input for any of the non-convolution functional layers:
determining a current quantization step size according to the distribution of the third activation values of the non-convolution functional layer;
calculating the upper limit of the third activation values according to the current quantization step size and the third low bit;
limiting each third activation value of the non-convolution functional layer according to the upper limit to obtain each fourth activation value of the non-convolution functional layer; and
quantizing each fourth activation value of the non-convolution functional layer to obtain the quantized activation value of the non-convolution functional layer.
9. The fixed-point method of claim 1, further comprising, after performing the third low-bit quantization on the input activation values of the non-convolution functional layers other than the convolutional layer in the neural network:
judging whether the activation function of an activation function layer in the neural network is a rectified linear unit (ReLU) function;
if yes, converting the quantized activation values output by the ReLU function from unsigned numbers to signed numbers, deleting the ReLU function, compensating the offset in the functional layer to which the activation values are input, and then executing the step of performing fixed-point processing based on each low-bit quantization result in the retrained neural network;
otherwise, directly executing the step of performing fixed-point processing based on each low-bit quantization result in the retrained neural network.
10. A fixed-point device for a neural network, comprising:
a quantization module for performing low-bit quantization on the neural network, comprising at least one of:
a convolutional layer input activation value quantization module for performing a first low-bit quantization on convolutional layer input activation values in the neural network,
a convolutional layer weight quantization module for performing a second low-bit quantization on convolution kernel weights in the convolutional layer, and
a non-convolution functional layer input activation value quantization module for performing a third low-bit quantization on input activation values of the non-convolution functional layers other than the convolutional layer in the neural network;
a retraining module for retraining based on the neural network after the low-bit quantization by any quantization module;
a fixed-point module for performing fixed-point processing based on each low-bit quantization result in the retrained neural network; and
a loading module for loading the fixed-point neural network;
wherein the first low bit, the second low bit and the third low bit are all within 1 bit to 8 bits.
Background
In a computer, numerical values occupy different numbers of bits depending on their type, such as integer or floating point; floating-point numbers are generally represented with high-precision 32 or 64 bits. Within an allowable range of precision loss, representing floating-point numbers with fewer bits, for example 4, 8 or 16 bits, is called quantization, and the values obtained through a dynamic quantization algorithm are discrete.
The neural network is an important machine learning algorithm and has promoted the development of the computer field. As research continues, the computational and storage complexity of the algorithm keeps growing, and this ever-expanding computational and space complexity challenges the performance of computing devices. Quantizing each parameter in the neural network while ensuring its target performance has therefore become the basis of neural network applications.
With the wide application of neural networks, more and more miniaturized devices, such as embedded systems, need to implement various scene applications through neural networks. Limited by hardware resources, they require a reduction in the processing resources occupied when the neural network is run.
Disclosure of Invention
The application provides a fixed-point method for a neural network, which is used for reducing the processing resources occupied by the neural network.
The application provides a fixed-point method for a neural network, comprising:
performing low-bit quantization on the neural network, wherein the low-bit quantization is at least one of the following:
performing a first low-bit quantization on convolutional layer input activation values in the neural network, performing a second low-bit quantization on convolution kernel weights in the convolutional layer, and
performing a third low-bit quantization on input activation values of the non-convolution functional layers other than the convolutional layer in the neural network;
after any of the low-bit quantizations is performed, retraining based on the neural network after the current low-bit quantization; and
performing fixed-point processing based on each low-bit quantization result in the retrained neural network;
wherein the first low bit, the second low bit and the third low bit are all within 1 bit to 8 bits.
The application also provides a fixed-point device for a neural network, comprising:
a quantization module for performing low-bit quantization on the neural network, comprising at least one of:
a convolutional layer input activation value quantization module for performing a first low-bit quantization on convolutional layer input activation values in the neural network,
a convolutional layer weight quantization module for performing a second low-bit quantization on convolution kernel weights in the convolutional layer, and
a non-convolution functional layer input activation value quantization module for performing a third low-bit quantization on input activation values of the non-convolution functional layers other than the convolutional layer in the neural network;
a retraining module for retraining based on the neural network after the low-bit quantization by any quantization module; and
a fixed-point module for performing fixed-point processing based on each low-bit quantization result in the retrained neural network;
wherein the first low bit, the second low bit and the third low bit are all within 1 bit to 8 bits.
A computer-readable storage medium is also provided, in which a computer program is stored; when executed by a processor, the computer program implements the steps of the fixed-point method for a neural network.
With the method, the activation values and convolution kernel weights in the final fixed-point neural network are represented by low-bit fixed-point numbers, so the network can be conveniently ported to embedded platforms and application-specific integrated circuits. Fixing the convolutional layer input activation values and/or convolution kernel weights to low bits greatly compresses the amount of computation and the storage space. The staged quantization mode requires neither layer-by-layer nor repeated calculation, while still ensuring low quantization error at each stage, and the functional layer parameters to be fixed can be selected flexibly according to the target performance requirements of the neural network. Retraining the neural network after quantizing the convolutional layer input activation values and the convolutional layer weights avoids the error accumulation caused by layer-by-layer training and realizes end-to-end retraining, effectively reducing the error of the whole network so that the quantized network still has sufficient target performance.
Drawings
FIG. 1 is a diagram of a hidden layer in a convolutional neural network model.
Fig. 2a and 2b are a flowchart of a fixed-point method for a convolutional neural network according to the first embodiment of the present application.
FIG. 3 is a diagram illustrating weights of convolution kernels corresponding to each output channel in a convolution layer.
FIG. 4 is a diagram illustrating the relationship between related parameters in the process of quantizing convolution kernel weights.
Fig. 5 is a flowchart of a fixed-point method for a convolutional neural network according to the second embodiment of the present application.
Fig. 6 is a flowchart of a fixed-point method for a convolutional neural network according to the third embodiment of the present application.
Fig. 7 is a schematic diagram of a fixed-point device for a neural network according to an embodiment of the present application.
Fig. 8 is a schematic diagram of a low-bit fixed-point neural network applied to target detection based on a vehicle-mounted camera according to an embodiment of the present application.
Fig. 9 is a schematic diagram of a low-bit fixed-point neural network applied to identification based on access-control camera images according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical means and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings.
The application provides a fixed-point method for neural network computation, which uses a low-bit fixed-point neural network to solve the problem that floating-point operations make the neural network occupy excessive hardware resources.
For ease of understanding, the following description takes the fixed-point conversion of a convolutional neural network as an example. Given that even the same neural network model involves different parameters in different scenarios, the following description centers on the neural network model itself in order to explain the technical solution of the present application.
A convolutional neural network model comprises an input layer, hidden layers and an output layer. The input layer processes multidimensional training data; for example, in a convolutional neural network applied to machine vision, the input layer normalizes the raw pixel values, distributed in the range 0-255, which helps improve the learning efficiency and performance of the network. The hidden layers train on the data from the input layer, and the output layer outputs the training result of the model; for an image classification problem, for example, the output layer outputs classification labels using a logistic function or a normalized exponential function (softmax).
Referring to fig. 1, fig. 1 is a schematic diagram of the hidden layers in a convolutional neural network model. The hidden layers generally comprise, in order, a convolutional layer 1 that receives the activation values of the previous activation function layer, a normalization layer 1, an activation function layer 1, a pooling layer 1, a convolutional layer 2, a normalization layer 2, an activation function layer 2, a convolutional layer 3, a normalization layer 3, an addition layer, and an activation function layer 4. Within the same convolutional layer, at least one activation value input to the layer can yield different output results through different convolution kernels, and each output result can be regarded as an output channel. It should be understood that the structure of the hidden layers is not limited to this; the non-convolution functional layers other than the convolutional layers, such as the pooling layers, normalization layers and addition layer, may be arranged as required (e.g., pooling layer 2 feeding the addition layer in the figure), but the convolutional layers are the core of the hidden layers. Convolution is the most time-consuming part of a convolutional neural network, so reducing its amount of computation reduces the computation of the whole network to the greatest extent.
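As an illustration only, the following PyTorch-style sketch mirrors the hidden-layer ordering of fig. 1; the channel counts and kernel sizes are hypothetical and not part of the present application:

```python
import torch.nn as nn

# Illustrative only: the hidden-layer ordering of fig. 1 with hypothetical sizes.
hidden = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer 1
    nn.BatchNorm2d(16),                           # normalization layer 1
    nn.ReLU(),                                    # activation function layer 1
    nn.MaxPool2d(2),                              # pooling layer 1
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # convolutional layer 2
    nn.BatchNorm2d(32),                           # normalization layer 2
    nn.ReLU(),                                    # activation function layer 2
    nn.Conv2d(32, 32, kernel_size=3, padding=1),  # convolutional layer 3
    nn.BatchNorm2d(32),                           # normalization layer 3
)
# The addition layer and activation function layer 4 combine this branch with
# another input (e.g., pooling layer 2 in the figure), which nn.Sequential
# cannot express; a residual-style forward() would be used instead.
```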
Example one
Referring to fig. 2a and 2b, fig. 2a and 2b are a flowchart of a fixed-point method for a convolutional neural network according to the first embodiment of the present application.
Step 201, inputting training data to the convolutional neural network and calculating, in the forward pass, the first activation values output by each activation function layer, i.e., the activation values output by at least one activation function in the activation function layer, which are also the activation values input to the convolutional layers.
The first activation values are represented by floating-point data, and the parameters in the convolutional neural network are stored as floating-point data.
Step 202, determining a current quantization step size of the first activation value output by each activation function layer according to the first activation value distribution output by each activation function layer.
In this step, for any activation function layer, the first activation value distribution output by the activation function layer is analyzed, for example, the first activation value distribution output by the activation function layer is sorted, and then a quantization step is selected as the current quantization step of the first activation value output by the activation function layer according to a given scale value.
Preferably, for any activation function layer, a quantization step size can be found by searching, such that the quantization error of the first activation values output by that activation function layer is minimized.
In this way, the current floating-point quantization step size of the first activation values output by each activation function layer can be obtained.
Step 203, for the first activation values output by each activation function layer, solving the upper limit of the first activation values according to the current quantization step size of each activation function layer determined in the previous step, and limiting the value range of the first activation values according to the solved upper limit to obtain the second activation values of each activation function layer.
In this step, for the first activation values output by any activation function layer, the upper limit β of the first activation values output by that activation function layer is obtained according to the floating-point quantization step size formula β = s · (2^b_a − 1), where s is the current quantization step size and b_a is the first bit number used for activation value quantization; b_a is a low bit number, i.e., within 8 bits, and b_a = 4 is typically taken.
Each first activation value is compared with the solved upper limit: if the first activation value is greater than the upper limit, the upper limit is taken as its value; otherwise the first activation value remains unchanged. The second activation values of the activation function layer, with a limited value range, are thereby obtained.
The mathematical expression is:

x_fi = min(x_i, β_i), with β_i = s_i · (2^b_ai − 1)

where the subscript i denotes activation function layer i, x_i denotes a first activation value output by activation function layer i, β_i is the upper limit of the first activation values output by activation function layer i, s_i is the current quantization step size of the first activation values output by activation function layer i, and b_ai is the number of quantization bits for the activation values output by activation function layer i, preferably the same for every activation function layer; x_fi denotes a second activation value output by activation function layer i and is a floating-point number.
Step 204, quantizing the second activation values of each activation function layer to obtain the quantized activation values.
In this step, the second activation values of any activation function layer are quantized according to the floating-point quantization formula x_q = floor(x_f / s), where the floor operation rounds the value down, x_f is a second activation value, and s is the current quantization step size. A b_a-bit fixed-point value lies in the range [0, 2^b_a − 1], so the quantized activation value is limited to this range.
The mathematical expression is:

x_qi = floor(x_fi / s_i)

where x_qi represents the quantized activation value of activation function layer i and x_fi is the second activation value of activation function layer i.
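To make steps 202-204 concrete, the following NumPy sketch clamps and quantizes a set of first activation values; the grid search over candidate step sizes is an assumption, since the text only says the step size is selected or searched:

```python
import numpy as np

def quantize_activations(x, b_a=4, num_candidates=256):
    """Sketch of steps 202-204: choose a step size s, clamp to the upper
    limit beta = s * (2**b_a - 1), then quantize with floor(x / s)."""
    levels = 2 ** b_a - 1
    best_s, best_err = None, np.inf
    # Grid search standing in for the step-size search of step 202.
    for s in np.linspace(x.max() / (4 * levels), x.max() / levels, num_candidates):
        beta = s * levels                     # upper limit of the first activation values
        x_f = np.minimum(x, beta)             # second activation values (clamped)
        x_q = np.floor(x_f / s)               # quantized activation values
        err = np.mean(np.abs(x_q * s - x))    # reconstruction error of this step size
        if err < best_err:
            best_s, best_err = s, err
    beta = best_s * levels
    return np.floor(np.minimum(x, beta) / best_s), best_s, beta

acts = 6.0 * np.random.rand(1000).astype(np.float32)  # hypothetical ReLU outputs
x_q, s, beta = quantize_activations(acts, b_a=4)
print(s, beta, x_q.min(), x_q.max())                  # x_q lies in [0, 2**4 - 1]
```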
Step 205, retraining based on the current convolutional neural network, in which the activation values input to each convolutional layer are the quantized activation values.
In this step, the upper limit β of the activation values is used as a trainable, iteratively updatable parameter and is updated along with the training of the whole network until convergence, so the quantization step size is updated accordingly.
Step 206, based on the current convolutional neural network, acquiring each first weight of each convolution kernel in each convolutional layer, the weights being floating-point data; accumulating the squares of the first weights of the convolution kernels corresponding to each output channel of each convolutional layer to obtain the accumulation result of the squared first weights of the output channel; and searching a quantization threshold within a specified threshold range based on the accumulation result, to serve as the current weight quantization threshold of the output channel.
Referring to fig. 3, fig. 3 is a schematic diagram of the weights of the convolution kernels corresponding to each output channel in a convolutional layer. Within the same convolutional layer, each output channel corresponds to k convolution kernels of the same size, each convolution kernel holding n first weights; the number of output channels, the number of convolution kernels and the convolution kernel size differ between convolutional layers, such as convolutional layer 1 and convolutional layer i in the figure.
In this step, for any output channel j of any convolutional layer i, the square of each first weight in each convolution kernel k corresponding to output channel j is calculated, and the squares of all first weights in all convolution kernels of the output channel are accumulated to obtain the accumulation result of output channel j;
the square mathematical expression of all the first weights in all convolution kernels of the accumulation output channel j is as follows:
wherein L isijRepresenting the accumulated result, w, of the convolutional layer i output channel jfijknThe output channel j in convolutional layer i corresponds to the nth first weight in convolutional kernel k.
Based on the accumulation result L_ij of output channel j of convolutional layer i, a quantization threshold is searched within the specified threshold range to serve as the current weight quantization threshold of output channel j of convolutional layer i.
Through this step, the current weight quantization threshold of each output channel in each convolutional layer can be obtained.
Step 207, based on the current weight quantization threshold of each output channel, quantizing each first weight in the convolution kernel corresponding to each output channel to obtain the current quantization weight of each first weight in the convolution kernel corresponding to each output channel.
In this step, for any convolutional layer i, the accumulation result of output channel j is multiplied by the current weight quantization threshold of the output channel, and the product is taken as the current amplitude of the output channel; each first weight in the convolution kernels corresponding to the output channel is then compared with the current amplitude: if the first weight is greater than the amplitude, it is assigned the amplitude; if it is smaller than the negative of the amplitude, it is assigned the negative amplitude; and first weights in between are assigned 0. The current quantization weights of the output channel are thereby obtained.
The mathematical expressions are:

λ_ij = L_ij × t_ij

where t_ij is the current weight quantization threshold of output channel j in convolutional layer i, and λ_ij is the current amplitude of output channel j in convolutional layer i; and

w_qijkn = λ_ij if w_fijkn > λ_ij; w_qijkn = −λ_ij if w_fijkn < −λ_ij; w_qijkn = 0 otherwise

where w_qijkn is the current quantization weight of the nth first weight of convolution kernel k corresponding to output channel j in convolutional layer i, and w_fijkn is the nth first weight in convolution kernel k corresponding to output channel j in convolutional layer i.
Step 208, estimating the error between the current quantization weights of the first weights of the output channel and the first weights; if the error is greater than a set error threshold, the current quantization threshold of the output channel does not meet the requirement, so a quantization threshold is searched again to serve as the current weight quantization threshold of the output channel and step 207 is re-executed, until the error is smaller than the set error threshold; the quantization threshold corresponding to the minimum error is taken as the weight quantization threshold coefficient.
In this step, the estimated error may be the average error or the accumulated error between the current quantization weights of the first weights of output channel j and the first weights, i.e., taken over all first weights of the channel, expressed mathematically as:

E_ij = Σ_k Σ_n |w_qijkn − w_fijkn|

or

E_ij = (1 / (k · n)) Σ_k Σ_n |w_qijkn − w_fijkn|
Through steps 207-208, the weight quantization threshold coefficient of each output channel in each convolution layer can be obtained.
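A minimal sketch of steps 206-208 for a single output channel follows; the grid of candidate thresholds is an assumption, as the text only specifies a search within a given threshold range:

```python
import numpy as np

def search_channel_threshold(w_f, candidates=np.linspace(1e-4, 1e-2, 100)):
    """Sketch of steps 206-208: accumulate squared first weights, then pick
    the weight quantization threshold whose ternary quantization minimizes
    the (here: average) error against the first weights."""
    L = np.sum(w_f ** 2)                          # accumulation result L_ij (step 206)
    best_T, best_err = None, np.inf
    for t in candidates:                          # candidate thresholds t_ij
        lam = L * t                               # current amplitude lambda_ij = L_ij * t_ij
        w_q = np.where(w_f > lam, lam,            # step 207: clamp to +/- lambda or 0
                       np.where(w_f < -lam, -lam, 0.0))
        err = np.mean(np.abs(w_q - w_f))          # step 208: average error
        if err < best_err:
            best_T, best_err = t, err
    return best_T, L

w = 0.05 * np.random.randn(8, 3, 3).astype(np.float32)  # hypothetical: k=8 kernels, 3x3
T_ij, L_ij = search_channel_threshold(w)
alpha_ij = L_ij * T_ij                            # first amplitude (step 209)
```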
Step 209, calculating the first amplitude of each output channel based on the weight quantization threshold coefficient and the accumulation result of each output channel.
in the step, for any convolution layer i, multiplying the accumulated result of the convolution layer output channel j by the result of the weight quantization threshold coefficient of the output channel to be used as a first amplitude value of the output channel;
the mathematical expression is:
αij=Lij×Tij
wherein, TijQuantizing the threshold coefficient, alpha, for the weight of the output channel j in convolutional layer iijIs a first amplitude of an output channel j in convolutional layer i;
step 210, based on the first amplitude of each output channel in each convolutional layer, adopting a second low bit to represent the first weight in the convolution kernel corresponding to each output channel in each convolutional layer, so as to obtain a second weight; that is, the second low bit quantization is performed for each first weight to fix the first weight.
To ensure the target performance of the convolutional neural network, the low-bit quantization mode is selected according to how sensitive the target performance of the network is to low-bit quantization of the layer. If the target performance degrades little, the convolutional layer is insensitive; if the target performance degrades considerably under low-bit quantization, the convolutional layer is sensitive.
For an insensitive convolutional layer, a single second low-bit base is used to express the first weights. Specifically, the first amplitude of an output channel is multiplied by the weight quantization threshold coefficient expressed with the second low bit b_w; the fixed-point value of this expression is T1, whose value range is [−2^(b_w − 1), 2^(b_w − 1) − 1]. For example, when the low bit b_w is 2, the value range of T1 is [−2, 1]. Thus, for the nth first weight w_fijkn in convolution kernel k corresponding to output channel j in convolutional layer i, the second weight w_ijkn is:

w_ijkn = α_ij × T1_ij

where α_ij is the first amplitude of output channel j in convolutional layer i; T1_ij is the low-bit expression of the weight quantization threshold coefficient of output channel j in convolutional layer i, i.e., the fixed-point weight quantization threshold coefficient; and w_ijkn is fixed-point data.
For a sensitive convolutional layer, a combination of two or more second low-bit bases is used to express the floating-point weights of the layer. Specifically, two or more output channels in the convolutional layer serve as combined output channels, whose weight quantization threshold coefficients are the combined weight quantization threshold coefficients; the first amplitudes of the two or more combined output channels are multiplied by the corresponding combined weight quantization threshold coefficients and all products are summed. The combined weight quantization threshold coefficients are expressed by a combination of two or more second low-bit bases, with fixed-point values T1, T2, ..., Tm, where m is the number of second low-bit bases in the combination, and each combined weight quantization threshold coefficient takes values in [−2^(b_w − 1), 2^(b_w − 1) − 1]. For example, when the second low bit b_w is 2, each combined weight quantization threshold coefficient takes values in [−2, 1]; for instance, each coefficient may take one of the three values −1, 0 and 1, in which case m is 3. For the nth first weight w_fijkn in convolution kernel k corresponding to output channel j in convolutional layer i, the second weight w_ijkn is:

w_ijkn = α_i1 × T1_ij + … + α_im × Tm_ij

where α_im is the first amplitude of the mth combined output channel among the J output channels in convolutional layer i; Tm_ij is the combined weight quantization threshold coefficient of output channel j in convolutional layer i expressed by the mth low-bit base of the combination, i.e., the fixed-point combined weight quantization threshold coefficient; m is the number of second low-bit bases in the combination and is less than or equal to J; J is the total number of output channels in convolutional layer i; and w_ijkn is fixed-point data.
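The combined expression reduces to a small dot product; a sketch with hypothetical values, assuming b_w = 2 and coefficients restricted to {−1, 0, 1} as in the example above:

```python
import numpy as np

alphas = np.array([0.12, 0.05, 0.02])  # hypothetical first amplitudes alpha_i1..alpha_i3
T_m = np.array([1, -1, 0])             # hypothetical combined threshold coefficients Tm_ij
w_second = float(np.dot(alphas, T_m))  # second weight: sum_m alpha_im * Tm_ij
print(w_second)                        # 0.07
```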
Step 211, quantizing the weight quantization threshold coefficient corresponding to each output channel to obtain a quantized weight quantization threshold coefficient corresponding to each output channel, that is, a coefficient quantization value.
In this step, for an insensitive convolutional layer and any output channel j, each second weight in the convolution kernels corresponding to the output channel is compared with the first amplitude: if the second weight is greater than the first amplitude, the first coefficient quantization value of the weight quantization threshold coefficient of the output channel is the fixed-point weight quantization threshold coefficient; if it is smaller than the negative of the amplitude, the first coefficient quantization value is the negative fixed-point weight quantization threshold coefficient; and if it lies in between, the weight quantization threshold coefficient of the output channel is assigned 0. The first coefficient quantization value of the weight quantization threshold coefficient of the output channel is thereby obtained.
The mathematical expression is:

T1q_ij = T1_ij if w_ijkn > α_ij; T1q_ij = −T1_ij if w_ijkn < −α_ij; T1q_ij = 0 otherwise

where w_ijkn is the nth second weight in convolution kernel k corresponding to output channel j in convolutional layer i, and T1q_ij is the first coefficient quantization value of the weight quantization threshold coefficient of output channel j in convolutional layer i, i.e., the first coefficient quantization value of output channel j.
For a sensitive convolutional layer, for any combined output channel m in any convolutional layer, the result of multiplying the first amplitude of the combined output channel by its combined weight quantization threshold coefficient is taken as the second amplitude, and the second amplitude is compared with the second weight: if the second weight is greater than the second amplitude, the combined weight quantization threshold coefficient is taken as the second coefficient quantization value; if the second weight is smaller than the negative of the second amplitude, the negative combined weight quantization threshold coefficient is taken as the second coefficient quantization value; and if it lies in between, 0 is taken as the second coefficient quantization value. The second coefficient quantization value of the combined weight quantization threshold coefficient of the combined output channel is thereby obtained.
The mathematical expression is:

Tq_im = Tm_ij if w_ijkn > α_im × Tm_ij; Tq_im = −Tm_ij if w_ijkn < −α_im × Tm_ij; Tq_im = 0 otherwise

where w_ijkn is the nth second weight in convolution kernel k corresponding to output channel j in convolutional layer i, and Tq_im is the second coefficient quantization value of the combined weight quantization threshold coefficient of the mth combined output channel among the J output channels of convolutional layer i, i.e., the second coefficient quantization value of combined output channel m.

For example, if the value range of the combined weight quantization threshold coefficient is the three values −1, 0 and 1, the second coefficient quantization value of any combined weight quantization threshold coefficient m likewise takes one of the values −1, 0 and 1 according to the comparison above.
and 212, calculating each weight of each output channel according to the coefficient quantization value of the weight quantization threshold coefficient of each output channel and the first amplitude of each output channel to obtain a first quantization weight, and substituting each first quantization weight of each output channel into the convolutional neural network for retraining.
In this step, for an insensitive convolutional layer, the first quantization weight of the nth first weight in convolution kernel k corresponding to output channel j in convolutional layer i is the product of the first amplitude of output channel j in convolutional layer i and the first coefficient quantization value, expressed mathematically as:

w_Qijkn = α_ij × T1q_ij

where w_Qijkn is the first quantization weight of the nth first weight in convolution kernel k corresponding to output channel j in convolutional layer i; this means that, for an insensitive convolutional layer, the first quantization weights of the first weights in the convolution kernels corresponding to the same output channel are the same.

For a sensitive convolutional layer, the first quantization weight of the nth first weight in convolution kernel k corresponding to output channel j in convolutional layer i is the sum of the products of the first amplitude of each combined output channel and the second coefficient quantization value of that combined output channel, expressed mathematically as:

w_Qijkn = Σ_m α_im × Tq_im
through steps 209-212, the weight originally stored as 32 bits is quantized to bwA bit. The data relationship in this process can be seen in fig. 4, where fig. 4 is a schematic diagram of the relationship of related parameters in the process of quantizing the convolution kernel weights.
Step 213, based on the retrained convolutional neural network, for a normalization layer that normalizes the output results of each output channel of a convolutional layer channel by channel, normalizing the output result distribution to zero mean or unit variance so that the distribution follows a normal distribution; and, for the first amplitude of each output channel of the convolutional layer, multiplying it by the corresponding per-channel parameter of the normalization layer connected to the convolutional layer to update the first amplitude.
In the classical neural network, normalization is performed after the convolutional layer, that is, the output channels of the convolutional layer normalize the output result data distribution to 0 mean or 1 variance according to each channel, so that the distribution satisfies the normal distribution.
For an insensitive convolutional layer, the first amplitude of an output channel is multiplied by the channel parameter of the corresponding channel of the normalization layer to obtain the updated first amplitude, expressed mathematically as:

α_ij ≡ α_ij × σ_ij

where σ_ij is the channel parameter of channel j in the normalization layer connected to convolutional layer i; the expression means that the parameter of channel j of the normalization layer connected to the output of convolutional layer i is multiplied by the first amplitude of output channel j of convolutional layer i, and the first amplitude is updated with the result of the multiplication.
For a sensitive convolutional layer, the first amplitude of each combined output channel is multiplied by the channel parameter of the corresponding channel of the normalization layer to obtain the updated first amplitude of the combined output channel, expressed mathematically as:

α_im ≡ α_im × σ_im

where σ_im represents the channel parameter in normalization layer i corresponding to combined output channel m of convolutional layer i; the formula means that the first amplitude of each combined output channel among the J output channels of convolutional layer i is multiplied by the corresponding channel parameter of the normalization layer connected to the output of convolutional layer i, and the first amplitude of the combined output channel is updated with the result of the multiplication.
By fusing the normalization parameters in this step, the first amplitudes are updated so as to remove the influence of the normalization operation on network performance.
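Folding the normalization into the amplitudes is a per-channel multiplication; a minimal sketch, assuming the normalization layer's per-channel multiplier σ has already been extracted (for batch normalization this would be gamma / sqrt(var + eps), an assumption not spelled out in the text):

```python
import numpy as np

alpha = np.array([0.08, 0.12, 0.05])  # first amplitudes, one per output channel
sigma = np.array([1.10, 0.95, 1.30])  # per-channel normalization multipliers sigma_ij
alpha = alpha * sigma                 # step 213 fusion: alpha_ij <- alpha_ij * sigma_ij
```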
Step 214, quantizing the input activation values of the non-convolution function layers in the neural network through a dynamic quantization algorithm.
The preceding steps address the convolution operation, which consumes the most storage space and computation, turning it into a convolution between b_a-bit activation values and b_w-bit weights. Besides the convolutional layers, a neural network also contains non-convolution functional layers such as addition layers and pooling layers, whose computation is very small compared with the convolution operation. To ensure the accuracy of the non-convolution functional layers, when their output activation values are not used as input to a convolutional layer, they are quantized with a third low bit at a higher precision, generally 8 bits; if the output activation values are used as input to a convolutional layer, they are quantized to the first low bit b_a.
The quantization process is the same as the quantization process for the activation values described in steps 202-204. The method specifically comprises the following steps:
when the first activation value input by each convolution layer is calculated in the forward direction, the third activation value input by each non-convolution function layer can also be calculated,
for each third activation value input for any of the non-convolution functional layers:
determining a current quantization step size according to the distribution of each third active value of the non-convolution functional layer, for example, sorting each third active value, and then selecting the current quantization step size according to a given ratio value, or searching a quantization step size which minimizes the quantization error of the third active value as the current quantization step size;
calculating the upper limit used to limit the value range of the third activation values of the non-convolution functional layer according to the current quantization step size and the third low bit; in this step, the upper limit is obtained through the floating-point quantization step size formula:

η_i = r_i · (2^b_pi − 1)

where η_i is the upper limit of the third activation values of non-convolution functional layer i, r_i is the current quantization step size of the third activation values of non-convolution functional layer i, and b_pi is the third low bit;
limiting each third activation value of the non-convolution functional layer according to the upper limit to obtain each fourth activation value of the non-convolution functional layer; in this step, the third activation value is compared with the upper limit: if the third activation value is greater than the upper limit, the upper limit is assigned to it as the fourth activation value; if it is less than or equal to the upper limit, the third activation value remains unchanged as the fourth activation value; expressed mathematically as:

y_fi = min(y_i, η_i)

where y_i denotes a third activation value of non-convolution functional layer i and y_fi is the corresponding fourth activation value;
quantizing the fourth activation values of the non-convolution functional layer to obtain the quantized activation values of the non-convolution functional layer; in this step, the fourth activation values are quantized according to the floating-point quantization formula:

y_qi = floor(y_fi / r_i)

where y_qi represents the quantized activation value of non-convolution functional layer i, y_fi is the fourth activation value of non-convolution functional layer i, r_i is the current quantization step size of the third activation values of non-convolution functional layer i, and floor denotes rounding the value down.
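Since this mirrors steps 202-204, the quantize_activations sketch given earlier can simply be reused with the third low bit; a usage example under that assumption:

```python
import numpy as np

# Assumes the quantize_activations sketch from steps 202-204 is in scope.
pool_out = np.random.rand(256).astype(np.float32)       # hypothetical pooling-layer output
y_q8, r8, eta8 = quantize_activations(pool_out, b_a=8)  # not feeding a convolutional layer
y_q4, r4, eta4 = quantize_activations(pool_out, b_a=4)  # feeding a convolutional layer
```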
Step 215, judging whether the activation function of an activation function layer in the neural network is a rectified linear unit (ReLU) function; if so, subtracting an offset from the quantized activation values output by the ReLU activation function of the activation function layer as a whole, making up the offset in the convolutional layer or other functional layer that the activation values feed, and then executing step 216; otherwise, executing step 216 directly.
Specifically, the data input to a convolutional layer from a ReLU function has a low-bit unsigned value range; for example, when the first low bit b_a for activation value quantization is 4, the fixed-point value range determined by the output characteristic of the ReLU function is [0, 15]. To port the network to a hardware platform more effectively, the low-bit unsigned numbers are converted into low-bit signed numbers: an offset is determined so that the quantized activation values output by the ReLU function are converted from unsigned to signed numbers, the activation values can be expressed with the first low bit as signed numbers, and the values input to the ReLU function are shifted by the offset. For example, when b_a is 4, 8 is subtracted from the input value of each ReLU function, so the value range becomes [−8, 7]; by the output characteristic of the ReLU function, the output value of the ReLU function is also reduced by 8, guaranteeing that the data can be stored as 4-bit signed numbers. To avoid the effect of the ReLU function clamping signed numbers to 0, each ReLU function layer is then deleted and the offset is compensated back in the layer following the deleted ReLU function layer; for example, if the next layer is a convolutional layer, the offset can be added to the bias data of the convolutional layer, and if the next layer is an addition layer, the offset is added to the preset value of that layer.
In this step, the input and output values of the ReLU activation function are changed from unsigned to signed numbers, and the offset produced by the change from unsigned to signed is compensated back in the next layer, which effectively reduces the resource consumption of unsigned numbers during porting and deployment.
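The shift-and-compensate bookkeeping of step 215 can be checked numerically; a sketch for b_a = 4 (offset 8), assuming for illustration that the next layer is a 1x1 convolution reduced to a dot product with bias b:

```python
import numpy as np

b_a = 4
offset = 2 ** (b_a - 1)                              # 8 when b_a = 4

x_unsigned = np.array([0.0, 3.0, 9.0, 15.0])         # quantized ReLU outputs in [0, 15]
x_signed = x_unsigned - offset                       # now in [-8, 7]: 4-bit signed storage

# Compensation in the next layer: with weights w_q and bias b, correcting the
# bias by offset * sum(w_q) makes the signed path match the unsigned one.
w_q = np.array([1.0, -1.0, 0.0, 1.0])                # hypothetical next-layer weights
b = 0.5
b_corrected = b + offset * w_q.sum()
assert np.isclose(np.dot(w_q, x_signed) + b_corrected,
                  np.dot(w_q, x_unsigned) + b)
```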
Step 216, fixing the parameters of the whole network to fixed-point form.
in this step, the low-bit quantized values transmitted between the functional layers in the neural network are converted into fixed-point quantized values, specifically, any low-bit quantized value obtained in the above step includes an input activation value after quantization of each convolutional layer, a first amplitude value (after updating of normalization layer fusion parameters) of a sensitive convolutional layer output channel, a first coefficient quantized value of a weight quantized threshold coefficient, a first amplitude value (after updating of normalization layer fusion parameters) of a combined output channel selected from each insensitive convolutional layer output channel, a second coefficient quantized value of a combined weight quantized threshold coefficient, an input activation value after quantization of a non-convolutional functional layer, a signed number of input values and/or output values of an activation function layer whose activation function is ReLU, an input value and/or output value after quantization of a deleted ReLU activation function layer, a fixed-point quantized value, and a fixed-point quantized value, The offset of the activation function layer of each activation function, which is ReLU, is converted into fixed-point numerical values by using low-bit quantized numerical values and stored, so that the low-bit quantized numerical values transmitted among the layers in the current neural network are converted into fixed-point numerical values, and the parameter fixed-point of the whole network is realized.
In steps 205 and 212 above, neural network retraining is performed after quantizing the convolutional layer input activation values and after quantizing the convolutional layer weights, respectively. The reason can be understood as follows: in a convolutional layer, the activation values are multiplied by the layer weights, and because the activation values are quantized (for example, a value a is quantized from 1.2 to 1), the product of a number b with the quantized a no longer matches the pre-quantization result; retraining gradually adjusts the values of a and b so that their product approaches the pre-quantization result, which helps guarantee the target performance of the neural network. It should be understood that retraining may instead be performed after only the activation value quantization or only the weight quantization; for example, step 205 may be omitted and retraining performed only in step 212, or alternatively step 205 performed and step 212 omitted.
This embodiment represents the activation values and convolution kernel weights of the final fixed-point neural network with low-bit fixed-point numbers, so the network can be conveniently ported to embedded platforms and application-specific integrated circuits. Fixing the convolutional layer input activation values and convolution kernel weights to low bits greatly compresses the amount of computation and the storage space. More computation is allocated to convolutional layer weights that are sensitive to quantization and less to insensitive layers, realizing non-uniform quantization of the convolution kernel weights, which effectively reduces the network quantization error and improves the target performance of the neural network. The staged quantization mode requires neither layer-by-layer nor repeated calculation, ensures low quantization error in both the activation value stage and the convolution kernel weight stage, and allows the functional layer parameters to be fixed to be selected flexibly according to the target performance requirements of the neural network. During fixed-point processing, the range of the ReLU activation function is changed from unsigned to signed and the resulting offset is compensated in the next functional layer, which effectively reduces the resource consumption of unsigned numbers during porting and deployment. Retraining the neural network after quantizing the input activation values and the weights of the convolutional layers avoids the error accumulation of layer-by-layer training and realizes end-to-end retraining, effectively reducing the error of the whole network so that the quantized network still has sufficient target performance.
Example two
Referring to fig. 5, fig. 5 is a flowchart of a fixed-point method for a convolutional neural network according to a second embodiment of the present application. In this embodiment, the input activation values of the convolutional layers are quantized with the first low bit, and the input activation values of the non-convolution functional layers other than the convolutional layers are quantized with the third low bit. Specifically, the method comprises the following steps.
Step 501, inputting training data to the convolutional neural network and calculating, in the forward direction, the first activation value output by each activation function layer, that is, the activation value output by at least one activation function in the activation function layer, which is also the activation value input to the convolutional layer.
Here, the first activation values and the third activation values are expressed as floating-point data, and the parameters in the convolutional neural network are stored as floating-point data.
Step 502, determining the current quantization step size of the first activation values output by each activation function layer according to the present distribution of the first activation values output by that layer.
In this step, for any activation function layer, the distribution of the first activation values output by the layer is analyzed; for example, the first activation values output by the layer are sorted, and a quantization step size is then selected as the current quantization step size of the first activation values output by the layer according to a given ratio value.
Preferably, for any activation function layer, the quantization step size can be determined by search, so that the quantization error of the first activation values output by the layer is minimized.
In this way, the current floating-point quantization step size of the first activation values output by each activation function layer is obtained.
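For illustration, a minimal sketch of such a step-size search is given below; the candidate grid, the mean-squared-error criterion, and all names are assumptions for illustration, not part of the claimed method:

```python
import numpy as np

def search_quantization_step(first_activations: np.ndarray, n_bits: int = 4,
                             n_candidates: int = 100) -> float:
    """Search a step size s minimizing the quantization error of the activations."""
    levels = 2 ** n_bits - 1                      # unsigned low-bit grid 0..2^b - 1
    max_val = float(first_activations.max())
    assert max_val > 0, "expects positive activations (e.g. ReLU outputs)"
    best_s, best_err = None, np.inf
    # candidate steps correspond to clipping upper limits from small up to max_val
    for upper in np.linspace(max_val / n_candidates, max_val, n_candidates):
        s = upper / levels
        clipped = np.clip(first_activations, 0.0, upper)
        quantized = np.floor(clipped / s + 0.5) * s   # round to the grid, de-quantize
        err = np.mean((quantized - first_activations) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s
```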
Step 503, for the first activation values output by each activation function layer, solving the upper value limit of the first activation values according to the current quantization step size of the layer determined in the previous step, and limiting the value range of the first activation values according to the solved upper limit to obtain the second activation values of each activation function layer.
In this step, for the first activation values output by any activation function layer, the upper value limit β of the first activation values output by the layer is obtained from the floating-point quantization step calculation formula β = s × (2^(b_a) − 1), where s is the current quantization step size and b_a is the first low bit number used for activation value quantization; b_a is a low bit number, i.e. within 8 bits, and in general b_a = 4 may be taken.
Each first activation value is then compared with the solved upper value limit: if the first activation value is greater than the upper limit, it is set to the upper limit; otherwise it is kept unchanged. Thereby the second activation values of the activation function layer, with limited value range, are obtained.
The mathematical expression is:

x_fi = min(x_i, β_i),  with β_i = s_i × (2^(b_ai) − 1)

wherein the subscript i denotes the activation function layer i, β_i represents the upper value limit of the first activation values output by the activation function layer i, s_i represents the current quantization step size of the first activation values output by the activation function layer i, and b_ai represents the number of quantization bits for the activation values output by the activation function layer i, which is preferably the same for each activation function layer; x_i denotes a first activation value output by the activation function layer i, and x_fi represents a second activation value output by the activation function layer i, the second activation value being a floating-point number.
Step 504, quantizing the second activation values of the activation function layers to obtain quantized activation values.
In this step, for the second activation value of any activation function layer, the quantized activation value is obtained according to the floating-point number quantization formula x_q = floor(x_f / s + 0.5), where the floor operation together with the added 0.5 rounds the value, x_f is a second activation value, and s is the current quantization step size; the fixed-point value of b_a bits is expressed in the range [0, 2^(b_a) − 1], and the quantized activation value is therefore limited to this range.
The mathematical expression is:

x_qi = floor(x_fi / s_i + 0.5)

wherein x_qi represents the quantized activation value of the activation function layer i, and x_fi is the second activation value of the activation function layer i.
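For illustration, the clipping and quantization of steps 503-504 may be sketched as follows; the function names and the 4-bit default are assumptions:

```python
import numpy as np

def clip_and_quantize_activations(x: np.ndarray, s: float, n_bits: int = 4):
    """Clip first activations to the upper value limit, then quantize to the low-bit grid."""
    beta = s * (2 ** n_bits - 1)          # upper value limit from step size and bit width
    x_f = np.minimum(x, beta)             # second activation values (range-limited)
    x_q = np.floor(x_f / s + 0.5)         # quantized activation values in [0, 2^b - 1]
    return x_q.astype(np.int32)
```

The pair (x_q, s) is what the following layer consumes; multiplying back by s recovers the clipped floating-point activation.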
Step 505, further, while the first activation values output by each activation function layer are calculated in the forward direction in step 501, the third activation values input into each non-convolution functional layer can also be calculated in the forward direction, and the input activation values of each non-convolution functional layer in the neural network are quantized through a dynamic quantization algorithm.
In a neural network there are, in addition to the convolutional layers, non-convolution functional layers such as element-wise addition layers and pooling layers, which occupy very little computation compared with the convolutional layer arithmetic. In order to ensure the accuracy of these parts, when their output activation values are not used as convolutional layer input, the third low bit is used to quantize them to a higher precision — generally, the third low bit is 8 bits; if they are used as convolutional layer input, they are quantized with the first low bit b_a.
The quantization process is the same as the quantization process for the activation values described in steps 502-504. The method specifically comprises the following steps:
for each third activation value input for any of the non-convolution functional layers:
determining the current quantization step size according to the distribution of the third activation values of the non-convolution functional layer, for example by sorting the third activation values and then selecting the current quantization step size according to a given ratio value, or by searching for the quantization step size which minimizes the quantization error of the third activation values;
calculating the upper value limit for limiting the value range of the third activation values of the non-convolution functional layer according to the current quantization step size and the third low bit; in this step, the upper value limit is obtained through the floating-point quantization step calculation formula:

η_i = r_i × (2^(b_pi) − 1)

wherein η_i is the upper value limit of the third activation values of the non-convolution functional layer i, r_i is the current quantization step size of the third activation values of the non-convolution functional layer i, and b_pi is the third low bit;
taking the value of each third activation value of the non-convolution functional layer according to the upper value limit to obtain each fourth activation value of the non-convolution functional layer; in this step, the third activation value is compared with the upper value limit: if the third activation value is greater than the upper limit, the upper limit is assigned to it as the fourth activation value, and if the third activation value is less than or equal to the upper limit, it is kept unchanged as the fourth activation value; expressed mathematically as:

y_fi = min(y_i, η_i)

wherein y_i denotes a third activation value and y_fi the corresponding fourth activation value of the non-convolution functional layer i;
quantizing the fourth activation values of the non-convolution functional layer to obtain the quantized activation values of the non-convolution functional layer; in this step, the fourth activation value is quantized according to the floating-point number quantization formula:

y_qi = floor(y_fi / r_i + 0.5)

wherein y_qi represents the quantized activation value of the non-convolution functional layer i, y_fi is the fourth activation value of the non-convolution functional layer i, r_i is the current quantization step size of the third activation values of the non-convolution functional layer i, and floor denotes rounding of the value.
Step 506, retraining based on the current convolutional neural network, wherein the activation values input by each functional layer in the current convolutional neural network are quantized activation values.
In this step, the upper limit values β and η of the activation values are used as trainable, iteratively updated parameters and are updated along with the training of the whole network until convergence, so that the quantization step sizes s and r are updated accordingly.
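For illustration, one possible realization of such a trainable upper limit is sketched below; the use of PyTorch and of a straight-through estimator for the rounding are assumptions, not part of the claimed method:

```python
import torch
import torch.nn as nn

class TrainableClipQuantizer(nn.Module):
    """Quantizer whose clipping upper limit beta is a trainable parameter."""
    def __init__(self, beta_init: float, n_bits: int = 4):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta_init))  # updated with the whole network
        self.levels = 2 ** n_bits - 1

    def forward(self, x):
        s = self.beta / self.levels                    # step size follows the upper limit
        x_f = torch.minimum(torch.relu(x), self.beta)  # limit the value range by beta
        x_q = torch.floor(x_f / s + 0.5) * s           # quantize to the low-bit grid
        # straight-through estimator: forward pass uses x_q, gradients flow through x_f
        return x_f + (x_q - x_f).detach()
```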
Step 507, judging whether the activation function of an activation function layer in the neural network is a ReLU; if so, subtracting an offset from the quantized activation values output by the ReLU activation function of the layer as a whole, then compensating the offset in the convolutional layer or other functional layer that the activation values are input to, and then executing step 508; otherwise, executing step 508 directly.
Specifically, the range of the data values input to the convolutional layer by the ReLU function is a low-bit unsigned number; for example, when the first low bit number b_a for activation value quantization is 4, the value range of the fixed-point values is [0, 15]. In order to transplant more effectively to a hardware platform, the low-bit unsigned number is converted into a low-bit signed number: an offset is determined so that the quantized activation values output by the ReLU function are converted from unsigned numbers into signed numbers, the activation values can be expressed by first-low-bit signed numbers, and the input values of the ReLU function are data-shifted according to the offset. For example, when b_a is 4, the input value of each ReLU function is reduced by 8, so that the value range becomes [−8, 7]; according to the output characteristic of the ReLU function, the output value of the ReLU function is also reduced by 8, which guarantees that the data can be stored as 4-bit signed numbers. To avoid the effect of the ReLU function forcing signed numbers to 0, each ReLU activation function layer is then deleted and the offset is compensated back in the layer next to the deleted ReLU activation function layer: for example, if the next layer is a convolutional layer, the offset can be added to the bias (offset) data of the convolutional layer; if the next layer is an addition layer, the offset is added to the preset value of that layer.
In this step, the input values and output values of the ReLU activation function are changed from unsigned numbers to signed numbers, and the offset produced by this change is compensated back in the next layer, so that the resource consumption of unsigned numbers can be effectively reduced during transplantation and deployment.
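For illustration, the unsigned-to-signed conversion and the bias compensation in a following convolutional layer may be sketched as follows; the names, the 4-bit case, and the bias-folding formula are assumptions consistent with the example above:

```python
import numpy as np

def relu_unsigned_to_signed(x_q: np.ndarray, n_bits: int = 4):
    """Shift quantized ReLU outputs from [0, 2^b - 1] to signed [-2^(b-1), 2^(b-1) - 1]."""
    offset = 2 ** (n_bits - 1)        # 8 when n_bits == 4
    return (x_q - offset).astype(np.int8), offset

def fold_offset_into_conv_bias(conv_weight_q, conv_bias, offset, scale):
    # Compensating the offset in the next convolutional layer: a constant shift of
    # every input element contributes offset * sum(weights) per output channel,
    # scaled by the (assumed) activation de-quantization step.
    per_channel_weight_sum = conv_weight_q.reshape(conv_weight_q.shape[0], -1).sum(axis=1)
    return conv_bias + scale * offset * per_channel_weight_sum
```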
Step 508, performing fixed-point processing on the relevant parameters based on the quantization results.
In this step, the low-bit quantized values obtained in the above steps — including the quantized input activation value of each convolutional layer, the quantized input activation values of the non-convolution functional layers, the signed input values and/or output values of each activation function layer whose activation function is a ReLU, the quantized input values and/or output values of each deleted ReLU activation function layer, and the low-bit quantized offset of each activation function layer whose activation function is a ReLU — are respectively converted into fixed-point values and stored, so that the low-bit quantized values of the activation values transmitted between the layers of the current neural network are converted into fixed-point values and network parameter fixed-pointing is realized.
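The text does not prescribe a particular fixed-point format; assuming the common integer-multiplier-plus-right-shift representation, the conversion may be sketched as follows:

```python
def to_fixed_point(value: float, shift_bits: int = 16):
    """Approximate a floating-point value by an integer mantissa and a power-of-two shift."""
    mantissa = int(round(value * (1 << shift_bits)))   # store as an integer
    return mantissa, shift_bits                        # value ~= mantissa / 2**shift_bits

def apply_fixed_point(x_int: int, mantissa: int, shift_bits: int) -> int:
    # integer-only multiply-then-shift, as typically executed on FPGA/ASIC targets
    return (x_int * mantissa) >> shift_bits
```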
It should be understood that, in the above steps, the retraining of step 506 may also be performed only after quantization of the convolutional layer input activation values, that is, after step 504 and before step 505, with the upper limit β of the activation values taken as a parameter that is updated iteratively during training of the whole network until convergence, so that the quantization step size s is updated as well; alternatively, the neural network retraining can be carried out both after the convolutional layer input activation values are quantized and after the non-convolution functional layer input activation values are quantized.
In this embodiment, the convolutional layer input activation values are converted to low-bit fixed point, compressing the amount of computation and the storage space; in the fixed-point process, the range of the ReLU activation function is changed from unsigned to signed and the resulting offset is compensated back in the next functional layer, so that the resource consumption of unsigned numbers can be effectively reduced during transplantation and deployment; and the neural network is retrained after quantization of the convolutional layer input activation values and/or quantization of the non-convolution functional layer input activation values, which effectively reduces the error of the whole network and ensures that the quantized network still has sufficient target performance.
Example three
Referring to fig. 6, fig. 6 is a flowchart of a fixed-point method for a convolutional neural network according to a third embodiment of the present application. In this embodiment, the weights of the convolution kernels in the convolutional layers are quantized with the second low bit.
Step 601, acquiring each first weight in each convolution kernel in each convolutional layer based on the current convolutional neural network, where the weights are floating-point data; accumulating the squares of the first weights of the convolution kernels corresponding to each output channel of each convolutional layer to obtain the accumulation result of the squared first weights of the output channel; and searching a quantization threshold within a specified threshold range based on the accumulation result, to be used as the current weight quantization threshold of the output channel.
In this step, for any output channel j of any convolutional layer i, calculating the square of each first weight in each convolutional kernel k corresponding to the output channel j to obtain the square number of each first weight in each convolutional kernel corresponding to the output channel j, and accumulating the square numbers of all the first weights in all the convolutional kernels of the output channel to obtain the accumulation result of the output channel j;
the square mathematical expression of all the first weights in all convolution kernels of the accumulation output channel j is as follows:
wherein L isijRepresenting the accumulated result, w, of the convolutional layer i output channel jfijknThe output channel j in convolutional layer i corresponds to the nth first weight in convolutional kernel k.
Based on the accumulation result L_ij of the output channel of the convolutional layer, a quantization threshold is searched within the specified threshold range and used as the current weight quantization threshold of the output channel j of the convolutional layer i.
Through this step, the current weight quantization threshold of each output channel in each convolutional layer can be obtained.
Step 602, quantizing each first weight in the convolution kernel corresponding to each output channel based on the current weight quantization threshold of each output channel, to obtain a current quantization weight of each first weight in the convolution kernel corresponding to each output channel.
In this step, for any convolutional layer i, the accumulation result of the output channel j of the convolutional layer is multiplied by the current weight quantization threshold of the output channel, and the product is taken as the current amplitude of the output channel; each first weight in the convolution kernels corresponding to the output channel is then compared with the current amplitude: if the first weight is larger than the amplitude it is assigned the amplitude, if it is smaller than the negative of the amplitude it is assigned the negative amplitude, and first weights in between are assigned 0. In this way the current quantization weights of the output channel are obtained.
The mathematical expression is:

λ_ij = L_ij × t_ij

wherein t_ij is the current weight quantization threshold of the output channel j in the convolutional layer i, and λ_ij is the current amplitude of the output channel j in the convolutional layer i;

wq_ijkn = λ_ij,  if w_fijkn > λ_ij
wq_ijkn = −λ_ij, if w_fijkn < −λ_ij
wq_ijkn = 0,     otherwise

wherein wq_ijkn is the current quantization weight of the n-th first weight of the convolution kernel k corresponding to the output channel j in the convolutional layer i, and w_fijkn is the n-th first weight in the convolution kernel k corresponding to the output channel j in the convolutional layer i.
Step 603, estimating the error between the current quantization weights of the output channel and the first weights; if the error is greater than a set error threshold, the current quantization threshold of the output channel does not meet the requirement, and a quantization threshold is searched again to serve as the current weight quantization threshold of the output channel, returning to step 601, until the error is less than the set error threshold; the quantization threshold corresponding to the minimum error is then taken as the weight quantization threshold coefficient of the output channel.
In this step, the estimated error may be the average error or the accumulated error between the current quantization weights of the first weights of the output channel and the first weights, that is, the accumulated error or the average error between the current quantization weights of all the first weights of the output channel j and all the first weights, expressed mathematically, for example with absolute differences, as:

E_ij = Σ_k Σ_n | wq_ijkn − w_fijkn |   (accumulated error)

or

E_ij = (1 / N_ij) Σ_k Σ_n | wq_ijkn − w_fijkn |   (average error, with N_ij the number of first weights of the output channel j)
Through steps 601-603, the weight quantization threshold coefficient of each output channel in each convolution layer can be obtained.
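For illustration, the per-channel threshold search of steps 601 to 603 may be sketched as follows; the candidate threshold range and the absolute-error measure are assumptions:

```python
import numpy as np

def search_channel_threshold_coefficient(w_f: np.ndarray, n_candidates: int = 50):
    """Search the weight quantization threshold coefficient of one output channel.

    w_f holds all first weights of the convolution kernels of the channel.
    """
    L = np.sum(w_f ** 2)                       # accumulation result of squared weights
    best_T, best_err = None, np.inf
    for t in np.linspace(1e-4, 1e-1, n_candidates):   # specified threshold range (assumed)
        lam = L * t                            # current amplitude of the channel
        w_q = np.where(w_f > lam, lam, np.where(w_f < -lam, -lam, 0.0))
        err = np.abs(w_q - w_f).sum()          # accumulated error vs. the first weights
        if err < best_err:
            best_T, best_err = t, err
    return best_T, L * best_T                  # threshold coefficient and first amplitude
```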
Step 604, calculating the first amplitude of each output channel based on the weight quantization threshold coefficient and the accumulation result of the output channel.
In this step, for any convolutional layer i, the accumulation result of the output channel j of the convolutional layer is multiplied by the weight quantization threshold coefficient of the output channel, and the product is used as the first amplitude of the output channel;
the mathematical expression is:
αij=Lij×Tij
wherein, TijQuantizing the threshold coefficient, alpha, for the weight of the output channel j in convolutional layer iijIs a first amplitude of an output channel j in convolutional layer i;
step 605, based on the first amplitude of each output channel in each convolutional layer, using a second low bit to represent the first weight in the convolution kernel corresponding to each output channel in each convolutional layer, so as to obtain a second weight; that is, the second low bit quantization is performed for each first weight to fix the first weight.
In order to ensure the target performance of the convolutional neural network, the low bit quantization mode is selected according to how sensitively the target performance of the convolutional neural network changes under low bit quantization: if low bit quantization reduces the target performance only slightly, the convolutional layer is insensitive; if it degrades the target performance more strongly, the convolutional layer is sensitive.
For an insensitive convolutional layer, a single second-low-bit base is used to express the first weights. The n-th first weight w_fijkn in the convolution kernel k corresponding to the output channel j in the convolutional layer i is expressed as the second weight w_ijkn:

w_ijkn = α_ij × T1_ij

wherein α_ij is the first amplitude of the output channel j in the convolutional layer i; T1_ij is the low-bit expression of the weight quantization threshold coefficient of the output channel j in the convolutional layer i, that is, the weight quantization threshold coefficient after fixed-pointing; and w_ijkn is fixed-point data.
For a sensitive convolutional layer, a combination of more than two second-low-bit bases is used to express the floating-point weights of the convolutional layer. The n-th first weight w_fijkn in the convolution kernel k corresponding to the output channel j in the convolutional layer i is expressed as the second weight w_ijkn:

w_ijkn = α_i1 × T1_ij + … + α_im × Tm_ij

wherein α_im is the first amplitude of the m-th combined output channel among the J output channels in the convolutional layer i; Tm_ij is the combined weight quantization threshold coefficient of the output channel j in the convolutional layer i expressed by the m-th low-bit base in the combination of more than two low-bit bases, that is, the combined weight quantization threshold coefficient after fixed-pointing; m is the number of second-low-bit bases in the combination and is less than or equal to J; J is the total number of output channels in the convolutional layer i; and w_ijkn is fixed-point data.
Step 606, quantizing the weight quantization threshold coefficient corresponding to each output channel to obtain the quantized weight quantization threshold coefficient of each output channel, that is, the coefficient quantization value.
In this step, for an insensitive convolutional layer and any output channel j, each second weight in the convolution kernels corresponding to the output channel is compared with the first amplitude: if the second weight is greater than the first amplitude, the first coefficient quantization value of the weight quantization threshold coefficient of the output channel is the fixed-point weight quantization threshold coefficient; if it is less than the negative of the amplitude, the first coefficient quantization value is the negative fixed-point weight quantization threshold coefficient; and if it lies in between, the value 0 is assigned. In this way the first coefficient quantization value of the weight quantization threshold coefficient corresponding to the output channel is obtained.
The mathematical expression is:

T1q_ijkn = T1_ij,  if w_ijkn > α_ij
T1q_ijkn = −T1_ij, if w_ijkn < −α_ij
T1q_ijkn = 0,      otherwise

wherein w_ijkn is the n-th second weight in the convolution kernel k corresponding to the output channel j in the convolutional layer i, and T1q_ijkn is the first coefficient quantization value of the weight quantization threshold coefficient of the output channel j in the convolutional layer i, i.e. the first coefficient quantization value of the output channel j.
For a sensitive convolutional layer and any combined output channel m in any convolutional layer, the product of the first amplitude of the combined output channel and the combined weight quantization threshold coefficient of the combined output channel is taken as the second amplitude, and the second amplitude is compared with the second weight: if the second weight is greater than the second amplitude, the combined weight quantization threshold coefficient is taken as the second coefficient quantization value; if the second weight is less than the negative of the second amplitude, the negative combined weight quantization threshold coefficient is taken as the second coefficient quantization value; and if the second weight lies in between, 0 is taken as the second coefficient quantization value. In this way the second coefficient quantization value of the combined weight quantization threshold coefficient corresponding to the combined output channel is obtained.
The mathematical expression is:

Tmq_ijkn = Tm_ij,  if w_ijkn > α_im × Tm_ij
Tmq_ijkn = −Tm_ij, if w_ijkn < −α_im × Tm_ij
Tmq_ijkn = 0,      otherwise

wherein w_ijkn is the n-th second weight in the convolution kernel k corresponding to the output channel j in the convolutional layer i, and Tmq_ijkn is the second coefficient quantization value of the combined weight quantization threshold coefficient of the m-th combined output channel among the J output channels of the convolutional layer i, i.e. the second coefficient quantization value of the combined output channel m.
For example, if the value range of the combined weight quantization threshold coefficient is the three numbers −1, 0, 1, the second coefficient quantization value for any combined weight quantization threshold coefficient m is:

Tmq_ijkn = 1,  if w_ijkn > α_im
Tmq_ijkn = −1, if w_ijkn < −α_im
Tmq_ijkn = 0,  otherwise
step 607, calculating each weight of each output channel according to the coefficient quantization value of the weight quantization threshold coefficient of each output channel and the first amplitude of each output channel to obtain a first quantization weight, and substituting each first quantization weight of each output channel into the convolutional neural network for retraining.
In this step, for an insensitive convolutional layer, the first quantization weight of the n-th first weight in the convolution kernel k corresponding to the output channel j in the convolutional layer i is the product of the first amplitude of the output channel j in the convolutional layer i and the first coefficient quantization value; expressed mathematically:

ŵ_ijkn = α_ij × T1q_ijkn

wherein ŵ_ijkn is the first quantization weight of the n-th first weight in the convolution kernel k corresponding to the output channel j in the convolutional layer i. This means that, for an insensitive convolutional layer, the first quantization weights of the first weights in the convolution kernels corresponding to the same output channel are determined by the same first amplitude and coefficient quantization value of that channel.
For a sensitive convolutional layer, the first quantization weight of the n-th first weight in the convolution kernel k corresponding to the output channel j in the convolutional layer i is the sum, over the combined output channels, of the products of the first amplitude of each combined output channel and the second coefficient quantization value of that combined output channel; expressed mathematically:

ŵ_ijkn = Σ_m α_im × Tmq_ijkn
through the steps 604-607, the weight originally saved as 32 bits is quantized to bwA bit.
Step 608, based on the retrained convolutional neural network, in the normalization layer that normalizes the output results of the output channels of a convolutional layer channel by channel, the output result distribution is normalized to zero mean and unit variance so that it satisfies a normal distribution; for the first amplitude of each output channel of the convolutional layer, the corresponding channel parameter of the normalization layer connected to the convolutional layer is used as a multiplier and multiplied with the first amplitude to update the first amplitude.
In a classical neural network, normalization is performed after a convolutional layer; that is, for the output channels of the convolutional layer, the output result data distribution is normalized channel by channel to zero mean and unit variance, so that the distribution satisfies the normal distribution.
For an insensitive convolutional layer, the first amplitude of an output channel is multiplied by the channel parameter of the corresponding channel of the normalization layer to obtain the updated first amplitude, expressed mathematically as:

α_ij ≡ α_ij × σ_ij

wherein σ_ij is the channel parameter of the channel j in the normalization layer connected to the convolutional layer i; the expression means that the parameter of channel j of the normalization layer connected to the output of the convolutional layer i is multiplied by the first amplitude of the output channel j of the convolutional layer i, and the first amplitude is updated with the result of the multiplication.
For a sensitive convolutional layer, the first amplitude of each combined output channel among the output channels is multiplied by the channel parameter of the corresponding channel of the normalization layer to obtain the updated first amplitude of the combined output channel, expressed mathematically as:

α_im ≡ α_im × σ_im

wherein σ_im represents the channel parameter, in the normalization layer, corresponding to the combined output channel m of the convolutional layer i; the formula means that the first amplitude of each combined output channel among the J output channels of the convolutional layer i is multiplied by the channel parameter of the corresponding combined output channel in the normalization layer connected to the convolutional layer i, and the first amplitude is updated with the result of the multiplication.
Through the normalization parameter fusion performed in this step, the first amplitudes are updated so that the effect of the normalization operation on the network is absorbed into the amplitudes without a separate normalization computation.
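For illustration, the fusion may be sketched as follows; treating the channel parameter σ as the batch-normalization multiplier γ/√(var + ε) is an assumption about a standard normalization layer:

```python
import numpy as np

def bn_channel_multiplier(gamma: np.ndarray, running_var: np.ndarray,
                          eps: float = 1e-5) -> np.ndarray:
    # For standard batch normalization, the per-channel multiplicative factor is
    # gamma / sqrt(var + eps); the channel parameter sigma plays this role here.
    return gamma / np.sqrt(running_var + eps)

def fuse_normalization_into_amplitudes(alpha: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Update each output channel's first amplitude with the normalization multiplier."""
    assert alpha.shape == sigma.shape      # one value per output channel
    return alpha * sigma                   # alpha_ij <- alpha_ij * sigma_ij
```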
Step 609, performing fixed-point processing on the convolution kernel parameters in the convolutional layers based on the updated first amplitudes and the coefficient quantization values of the weight quantization threshold coefficients.
In this step, the low-bit quantized values transmitted among the convolutional layers in the neural network are converted into fixed-point values. Specifically, the low-bit quantized values obtained in the above steps — including the first amplitude, after fusion of the normalization layer parameters, of each insensitive convolutional layer output channel together with its first coefficient quantization value, and the first amplitude, after fusion of the normalization layer parameters, of each combined output channel selected among the sensitive convolutional layer output channels together with its second coefficient quantization value — are respectively converted into fixed-point values and stored, so that the low-bit quantized values transmitted among the convolutional layers of the current neural network are converted into fixed-point values and the fixed-pointing of the network convolutional layer parameters is realized.
It should be understood that, in this embodiment, as in step 507 of Embodiment two, the input values and output values of the ReLU activation function may also be changed from unsigned numbers to signed numbers, with the offset produced by this change compensated back in the next layer, so as to effectively reduce the resource consumption of unsigned numbers during transplantation and deployment.
In this embodiment, the convolution kernel weights in the final fixed-point neural network are represented by low-bit fixed-point numbers, so that the network can be conveniently transplanted to an embedded platform or an application-specific integrated circuit; fixing the convolution kernel weights to low bits compresses the amount of computation and the storage space; more computation is allocated to convolutional layer weights that are sensitive to quantization and less to insensitive layers, realizing non-uniform quantization of the convolution kernel weights, which effectively reduces the network quantization performance error and improves the target performance of the neural network; by adopting a staged quantization mode, neither layer-by-layer calculation nor repeated calculation is required, and a low quantization performance error of each functional layer can be ensured; and the neural network is retrained after the convolutional layer weights are quantized, effectively reducing the error of the whole network and ensuring that the quantized network still has sufficient target performance.
Referring to fig. 7, fig. 7 is a schematic diagram of a neural network fixed-point apparatus according to an embodiment of the present application. The apparatus comprises:
a convolutional layer input activation value quantization module for performing a first low bit quantization on the convolutional layer input activation value in the neural network, and/or
A convolutional layer weight quantizing module for performing second low bit quantization on the convolutional core weight in the convolutional layer, and/or
a non-convolution functional layer input activation value quantization module, for performing the third low bit quantization on the input activation values of the non-convolution functional layers other than the convolutional layers in the neural network; and
a retraining module, for performing retraining based on the neural network after the convolutional layer input activation values are subjected to the first low bit quantization, and/or after the convolution kernel weights in the convolutional layers are subjected to the second low bit quantization, and/or after the non-convolution functional layer input activation values are subjected to the third low bit quantization;
the fixed-point module is used for carrying out fixed-point processing on the basis of a low bit quantization result of input activation values of all layers in the neural network, and/or a low bit quantization result of convolution kernel weights in the convolution layer, and/or a result of carrying out third low bit quantization on input activation values of the non-convolution function layer;
and the loading module is used for loading the fixed-point neural network. For example, into hardware such as an FPGA, DSP, firmware, or embedded system.
Wherein the first low bit, the second low bit, and the third low bit are all within 1 bit to 8 bits.

Referring to fig. 8, fig. 8 is a schematic diagram illustrating the application of the low-bit fixed-point neural network to vehicle-mounted camera-based target detection according to an embodiment of the present application. Images of the current road condition are obtained by several vehicle-mounted surround-view cameras; in an automatic parking terminal device realized on an FPGA chip, the neural network calculation is performed on the current road condition image based on the low-bit fixed-point neural network, so as to identify relevant detection targets such as pedestrians and vehicles in the current road condition; the vehicle is then controlled based on the current target detection result, including but not limited to determining whether the vehicle accelerates, decelerates, or turns the steering wheel. In this application scenario the hardware memory consumption is greatly reduced, and a several-fold speed-up in hardware bandwidth can be reached.
Referring to fig. 9, fig. 9 is a schematic diagram illustrating the application of the low-bit fixed-point neural network to access control camera image recognition according to an embodiment of the present application. The access control camera shoots a target to be recognized to obtain a target image; on a recognition terminal device realized on a DSP chip, the neural network calculation is performed on the current image based on the low-bit fixed-point neural network, the recognition features of the current image are obtained from the network calculation result, the features are compared with the existing features in a library, and whether the face recognition result is in the library is judged according to the comparison result. In this application scenario the hardware memory consumption is reduced, and a several-fold speed-up in hardware bandwidth can be reached.
It should be understood that the fixed-point neural network described in the present application is not limited to the above applications, but may also be applied to, including but not limited to, image-based target pedestrian detection, posture detection, road condition detection, video analysis, and the like. By loading the fixed-point neural network into the hardware implementing these applications, the data required by the various applications is processed through the neural network, adapting to the miniaturization of devices and reducing the consumption of hardware resources. For example, the low-bit fixed-point neural network is loaded into at least one of the following hardware:
camera, on-vehicle intelligent system, robot of sweeping the floor, cargo carrying robot, unmanned aerial vehicle, unmanned vending machine etc..
An embodiment of the present application further provides a computer-readable storage medium in which a computer program is stored, the computer program being executed by a processor to perform the steps of the above embodiments.
The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
For the device/network side device/storage medium embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for the relevant points, refer to the partial description of the method embodiment.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.