Fixed-point quantization method and device for a deep learning model
1. A fixed-point quantization method for a deep learning model, characterized in that calibration data are input into a target model, the model parameters and activation values of the target model are used in turn as quantization objects, and the following steps are executed:
inputting calibration set data, extracting the quantization objects of the target model layer by layer, obtaining a distribution histogram of the quantization objects, scaling the distribution histogram through an adaptive KL divergence equation, obtaining the KL divergence values corresponding to different decimal point positions based on a preset quantization bit number, and comparing them to obtain a first quantization result of the quantization objects.
2. The fixed-point quantization method of the deep learning model according to claim 1, further comprising the steps of:
determining a metric function according to the task type of the target model, and obtaining the metric loss degree of the target model through the metric function;
and selecting at least one of the model parameters or the activation values as an optimization object according to the metric loss degree, and optimizing the first quantization result of the optimization object based on the metric function.
3. The method of claim 2, wherein optimizing the first quantization result of the optimization object based on the metric function comprises:
adjusting the decimal point position corresponding to the first quantization result of the optimization object in the positive and negative directions respectively, predicting on the calibration set data to obtain metric fluctuation values, determining an adjustment direction based on a metric fluctuation threshold, and obtaining an optimal solution of the metric function according to the adjustment direction and the quantization bit number, so as to obtain a second quantization result of the optimization object.
4. The method of claim 3, wherein determining an adjustment direction based on the metric fluctuation threshold comprises:
if the metric fluctuation values corresponding to both the positive and negative adjustment directions are greater than the metric fluctuation threshold, the adjustment direction is configured as the direction corresponding to the larger of the two metric fluctuation values;
otherwise, the adjustment direction is configured as the direction whose metric fluctuation value is greater than the metric fluctuation threshold.
5. The method according to claim 3, wherein obtaining an optimal solution of the metric function according to the adjustment direction and the quantization bit number comprises:
if the adjustment direction is the positive direction, searching in sequence from the decimal point position corresponding to the first quantization result of the optimization object up to the largest decimal point position allowed by the quantization bit number, obtaining the value of the metric function on the calibration set data at each position, and comparing these values to obtain the optimal solution of the metric function;
if the adjustment direction is the negative direction, searching in sequence from the decimal point position corresponding to the first quantization result of the optimization object down to the smallest decimal point position allowed by the quantization bit number, obtaining the value of the metric function on the calibration set data at each position, and comparing these values to obtain the optimal solution of the metric function.
6. The method of claim 2, wherein determining a metric function according to the task type of the target model comprises:
if the task type of the target model is a classification task, the metric function is configured as a weighted sum of the probability that the first result is correct and the probability that the correct result appears among the top several results;
if the task type of the target model is a segmentation task, the metric function is configured as MIoU;
if the task type of the target model is a detection task, the metric function is configured as mAP;
if the task type of the target model is a recognition task, the metric function is configured as a weighted sum of the recognition accuracy and the edit distance.
7. The fixed-point quantization method of the deep learning model according to claim 1, wherein the adaptive KL divergence equation specifically is:
wherein y represents the distribution scaling equation of the adaptive KL divergence, exp() represents the exponential function with the natural constant e as its base, α and β are parameters of the adaptive KL divergence equation, and x represents the input, i.e., a value in the model parameter distribution histogram or the activation value distribution histogram.
8. An apparatus for fixed-point quantization of a deep learning model, the apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method steps of any one of claims 1 to 7 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
Background
To improve network accuracy, deep learning models typically have a large number of parameters and layers, which sharply increases model storage size and slows inference. Slow inference has long confined many high-accuracy deep learning networks to GPU systems with high computing power, making practical deployment difficult.
As deep learning has grown in popularity in recent years, demand for its deployment has steadily increased, and edge devices tend to use low-precision computing units to balance cost and speed. Compared with full-precision or mixed-precision (including half-precision) models, low-precision models can run inference faster and occupy less memory, so how to convert between full precision and low precision has become a subject of extensive research. A conventional full-precision deep learning model generally needs to be quantized to a lower-precision data type to meet deployment index requirements such as detection speed.
Given this urgent need for deployment, how to quickly perform fixed-point quantization on a deep learning model has become a challenging problem.
Disclosure of Invention
The present invention is directed to solving at least one of the problems existing in the prior art. To this end, the invention provides a fixed-point quantization method for a deep learning model, which can improve both the speed and the accuracy of fixed-point quantization of deep learning models.
The invention also provides a device for executing the fixed-point quantization method of the deep learning model.
The invention also provides a computer-readable storage medium storing a program for the fixed-point quantization method of the deep learning model.
According to the fixed-point quantization method for a deep learning model of the first aspect of the present invention, calibration data are input into a target model, the model parameters and activation values of the target model are used in turn as quantization objects, and the following steps are executed: inputting calibration set data, extracting the quantization objects of the target model layer by layer, obtaining a distribution histogram of the quantization objects, scaling the distribution histogram through an adaptive KL divergence equation, obtaining the KL divergence values corresponding to different decimal point positions based on a preset quantization bit number, and comparing them to obtain a first quantization result of the quantization objects.
The fixed-point quantization method of the deep learning model according to the embodiment of the invention has at least the following beneficial effects: through the adaptive KL divergence equation, the method overcomes the drawback that the KL divergence algorithm focuses only on probability during quantization; it can greatly increase the quantization speed while keeping the quantized model acceptably accurate, improving quantization efficiency and saving time.
According to some embodiments of the invention, the method further comprises the steps of: determining a metric function according to the task type of the target model, and obtaining the metric loss degree of the target model through the metric function; and selecting at least one of the model parameters or the activation values as an optimization object according to the metric loss degree, and optimizing the first quantization result of the optimization object based on the metric function.
According to some embodiments of the invention, optimizing the first quantization result of the optimization object based on the metric function comprises: adjusting the decimal point position corresponding to the first quantization result of the optimization object in the positive and negative directions respectively, predicting on the calibration set data to obtain metric fluctuation values, determining an adjustment direction based on a metric fluctuation threshold, and obtaining an optimal solution of the metric function according to the adjustment direction and the quantization bit number, so as to obtain a second quantization result of the optimization object.
According to some embodiments of the invention, determining an adjustment direction based on the metric fluctuation threshold comprises: if the metric fluctuation values corresponding to both the positive and negative adjustment directions are greater than the metric fluctuation threshold, the adjustment direction is configured as the direction corresponding to the larger of the two metric fluctuation values; otherwise, the adjustment direction is configured as the direction whose metric fluctuation value is greater than the metric fluctuation threshold.
According to some embodiments of the present invention, obtaining an optimal solution of the metric function according to the adjustment direction and the quantization bit number comprises: if the adjustment direction is the positive direction, searching in sequence from the decimal point position corresponding to the first quantization result of the optimization object up to the largest decimal point position allowed by the quantization bit number, obtaining the value of the metric function on the calibration set data at each position, and comparing these values to obtain the optimal solution of the metric function; if the adjustment direction is the negative direction, searching in sequence from the decimal point position corresponding to the first quantization result of the optimization object down to the smallest decimal point position allowed by the quantization bit number, obtaining the value of the metric function on the calibration set data at each position, and comparing these values to obtain the optimal solution of the metric function.
According to some embodiments of the invention, determining a metric function according to the task type of the target model comprises: if the task type of the target model is a classification task, the metric function is configured as a weighted sum of the probability that the first result is correct and the probability that the correct result appears among the top several results; if the task type of the target model is a segmentation task, the metric function is configured as MIoU; if the task type of the target model is a detection task, the metric function is configured as mAP; if the task type of the target model is a recognition task, the metric function is configured as a weighted sum of the recognition accuracy and the edit distance.
According to some embodiments of the invention, the adaptive KL divergence equation is specifically:
wherein y represents the distribution scaling equation of the adaptive KL divergence, exp() represents the exponential function with the natural constant e as its base, α and β are parameters of the adaptive KL divergence equation, and x represents the input, i.e., a value in the model parameter distribution histogram or the activation value distribution histogram.
The apparatus for performing fixed-point quantization on a deep learning model according to an embodiment of the second aspect of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method steps of the embodiment of the first aspect of the present invention when executing the computer program.
The fixed-point quantization device of the deep learning model according to the embodiment of the invention has at least the following beneficial effects: through the adaptive KL divergence equation, the device overcomes the drawback that the KL divergence algorithm focuses only on probability during quantization; it can greatly increase the quantization speed while keeping the quantized model acceptably accurate, improving quantization efficiency and saving time.
A computer-readable storage medium according to an embodiment of the third aspect of the invention has stored thereon a computer program which, when executed by a processor, implements a method according to an embodiment of the first aspect of the invention.
The computer-readable storage medium according to an embodiment of the present invention has at least the same advantageous effects as the method according to an embodiment of the first aspect of the present invention.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a first schematic diagram of an adaptive KL divergence function curve according to an embodiment of the present invention;
FIG. 3 is a second schematic diagram of an adaptive KL divergence function curve according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a model parameter distribution obtained by statistics in an embodiment of the present invention;
FIG. 5 shows quantization results of fixed-point quantization of the same license plate recognition model using different methods;
FIG. 6 is a schematic block diagram of an apparatus of an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
In the description of the present invention, "several" means one or more, and "a plurality of" means two or more; "greater than", "less than", "exceeding", and the like are understood as excluding the stated number, while "above", "below", "within", and the like are understood as including the stated number. If "first" and "second" are described, they are used only to distinguish technical features, and are not to be understood as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating the precedence of the indicated technical features. In the description of the present invention, step numbers are used merely for convenience of description or reference; the sequence numbers of the steps do not imply an execution order, and the execution order of the steps should be determined by their functions and inherent logic, without constituting any limitation on the implementation of the embodiments of the present invention.
Explanation of terms:
The edit distance, also called the Levenshtein distance, is a quantitative measure of the difference between two strings (e.g., English text): the minimum number of single-character edits required to change one string into the other.
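For illustration, a minimal Python sketch of the Levenshtein distance as defined above, computed by dynamic programming over single-character insertions, deletions, and substitutions:

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]

# e.g. levenshtein("kitten", "sitting") == 3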
Referring to FIG. 1, a method according to an embodiment of the invention comprises: inputting calibration data into a target model (i.e., the deep learning model to be quantized), and quantizing the model parameters and activation values of the target model in turn. For the model parameters: calibration set data are input, the model parameters of the target model are extracted layer by layer, a model parameter distribution histogram is obtained and scaled through the adaptive KL divergence equation, KL divergence values corresponding to different decimal point positions are obtained based on a preset quantization bit number, and they are compared to obtain the first quantization result of the model parameters. For the activation values: calibration set data are input, the activation values of the target model are extracted layer by layer, an activation value distribution histogram is obtained and scaled through the adaptive KL divergence equation, KL divergence values corresponding to different decimal point positions are obtained based on the preset quantization bit number, and they are compared to obtain the first quantization result of the activation values.
The fixed-point quantization method according to the embodiment of the present invention is described in detail below, taking a deep-learning-based license plate recognition model as the object. In this embodiment, PyTorch is used as the training framework of the pre-trained license plate recognition model, and Python is used as the implementation language. The pre-trained license plate recognition model comprises 20 convolution modules and two fully connected layer branches that produce the recognized license plate text. A first number of license plate images is obtained as the data set, covering various types of license plates, for example, plates from different regions, military plates, civilian plates, and the like.
The specific fixed-point quantization steps are as follows:
(1) First, the license plate recognition model to be quantized is obtained, and a first number (e.g., 1000) of representative license plate images are used as the calibration set.
(2) The preset quantization bit number is 8; that is, the original data are compressed into 8-bit binary numbers, of which the first bit is a sign bit. The quantization result is an integer from 0 to 7 that represents the position of the decimal point of the binary number, so the quantization process can also be viewed as a search for the optimal decimal point position. The metric fluctuation threshold is set to 0.001 and is used to decide whether to search. The model to be quantized is preprocessed with an operator-fusion configuration, i.e., all BatchNorm layers are folded into the convolution layers. The quantization step is divided into model parameter quantization and model activation value quantization, which correspond to two different groups of optimal quantization results.
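As an illustration of this representation, the following minimal Python sketch simulates 8-bit signed fixed-point quantization for a given decimal point position; the exact rounding and saturation behavior of the original implementation is not specified in this text, so this is an assumption-level sketch:

import numpy as np

def fixed_point_quantize(x, frac_bits, bit_width=8):
    """Simulate bit_width-bit signed fixed-point quantization with the
    decimal point at position frac_bits (0..bit_width-1).
    Rounding and clipping behavior are assumptions, not the patent's spec."""
    scale = 2.0 ** frac_bits
    qmin = -(2 ** (bit_width - 1))
    qmax = 2 ** (bit_width - 1) - 1
    q = np.clip(np.round(np.asarray(x) * scale), qmin, qmax)
    return q / scale  # dequantized value used for simulation

# e.g. fixed_point_quantize(0.7331, frac_bits=7) == 0.734375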
(3) The metric function is determined by the task of the object being quantized (i.e., the target model). In this embodiment, the quantization object is a license plate recognition model. Conventional quantization often uses recognition accuracy as the quantization target, but in practice the edit distance can also serve as a finer-grained index for measuring model accuracy and improving recognition accuracy. In this embodiment, the following formula is adopted as the optimization target, i.e., the metric function:
where Ed represents the edit distance and Acc represents the recognition accuracy.
For comparison, the formula z = Acc, which uses accuracy alone as the quantization index, is also evaluated.
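Since the formula image for z is not reproduced in this text, the sketch below shows one assumed form of such a metric: a weighted combination of exact-match accuracy and normalized edit distance. The weights w_acc and w_ed and the exact way Ed enters the formula are illustrative assumptions, not the patent's values; it reuses the levenshtein() sketch given earlier, and setting w_acc=1.0, w_ed=0.0 recovers the comparison baseline z = Acc:

def recognition_metric(preds, labels, w_acc=0.5, w_ed=0.5):
    """HYPOTHETICAL form of the metric z: weighted combination of
    exact-match accuracy Acc and normalized edit distance Ed.
    Weights and formula shape are assumptions; higher z is better."""
    acc = sum(p == t for p, t in zip(preds, labels)) / len(labels)
    ed = sum(levenshtein(p, t) / max(len(t), 1)
             for p, t in zip(preds, labels)) / len(labels)
    return w_acc * acc + w_ed * (1.0 - ed)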
(4) For the license plate recognition model, the model parameters of each layer are extracted layer by layer, and the distribution of each layer's model parameters is scaled through the adaptive KL divergence equation;
then, based on the preset quantization bit number, the model parameter distribution histograms of the layers are each scaled through the adaptive KL divergence equation, i.e., the histogram bins are merged; in this embodiment, the preset quantization bit number is 8 and the corresponding quantization result ranges from 0 to 7, i.e., there are 8 candidate decimal point positions, and the model parameter distribution histogram of each layer is processed for each of the 8 positions;
finally, the KL divergence is calculated for each of the 8 cases and the results are compared; the decimal point position corresponding to the minimum KL divergence is the first quantization result of the current layer's model parameters.
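A minimal sketch of this per-layer search, under the assumption that the comparison is made between the (adaptively scaled) original histogram and the histogram rebuilt from quantized bin centers; the patent's exact binning and bin-merging procedure is not reproduced, so the details are illustrative (weight_fn stands for the adaptive scaling y, discussed below):

import numpy as np

def first_quantize_layer(values, bit_width=8, n_bins=2048, weight_fn=None):
    """Sketch of the first quantization of one layer: for each candidate
    decimal point position, compare the (adaptively scaled) original
    distribution with its quantized counterpart via KL divergence and
    keep the position with the minimum divergence."""
    hist, edges = np.histogram(values, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    p = hist.astype(np.float64)
    if weight_fn is not None:
        p *= weight_fn(centers)          # adaptive KL scaling of the bins
    p /= p.sum()

    best_pos, best_kl = 0, np.inf
    qmin, qmax = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    for frac_bits in range(bit_width):   # decimal point positions 0..7
        scale = 2.0 ** frac_bits
        q_centers = np.clip(np.round(centers * scale), qmin, qmax) / scale
        # rebuild the quantized distribution over the same bins
        q_hist, _ = np.histogram(q_centers, bins=edges, weights=p)
        q = q_hist / q_hist.sum()
        kl = np.sum(p * np.log((p + 1e-12) / (q + 1e-12)))
        if kl < best_kl:
            best_pos, best_kl = frac_bits, kl
    return best_pos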
In this embodiment, the equation of the adaptive KL divergence is:
where y represents the distribution scaling equation of the adaptive KL divergence, exp() represents the exponential function with the natural constant e as its base, α and β are hyperparameters whose mathematical meaning is as the parameters of the adaptive KL divergence equation, and x represents a value in the original distribution histogram. In this embodiment, x corresponds to a value in the model parameter distribution histogram or, alternatively, in the activation value distribution histogram.
Referring to FIG. 2 and FIG. 3, α controls how sharp the curve of the adaptive KL divergence equation is, and β controls how large the processing range of the scaling function is. The determination formula for the parameter α is:
where bit_width is the quantization bit number, and the proportion of the distribution falling within (−β, β) relative to the total distribution is used to measure the sharpness of the distribution. β is set such that Sum(P(x)) ≥ 0.5 over −β < x < β, i.e., the interval (−β, β) covers more than half of the total distribution count; β thereby determines the range handled by the adaptive KL divergence scaling function.
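The rule for β described above can be sketched directly (a minimal sketch; the open-interval boundary handling is simplified):

import numpy as np

def determine_beta(values, mass=0.5):
    """Smallest beta with at least `mass` of all values inside (-beta, beta),
    per the rule Sum(P(x)) >= 0.5 over -beta < x < beta described above."""
    abs_vals = np.sort(np.abs(np.asarray(values, dtype=np.float64)))
    k = int(np.ceil(mass * len(abs_vals)))   # samples that must be covered
    return abs_vals[k - 1]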
In this embodiment, a new distribution is obtained by multiplying y with the corresponding bin values of the original distribution histogram, and the KL divergence is then computed on this new distribution. A model parameter distribution histogram obtained by statistics typically has a shape close to a normal distribution; see FIG. 4. As FIG. 2 and FIG. 3 show, the curve of the adaptive KL divergence equation is roughly the inverse of the distribution histogram. Scaling in this way therefore increases the weight of large values in the distribution and suppresses the weight that values near 0 occupy in the KL divergence calculation, thereby remedying the defect that the KL divergence calculation is concerned only with probability.
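The analytic form of y is not reproduced in this text. Purely to make the first_quantize_layer sketch above executable, the following stand-in mimics only the described shape (exp-based, small near 0 and growing with |x|, with α controlling sharpness and β the active range); it is a hypothetical placeholder, not the patent's equation:

import numpy as np

def adaptive_weight(x, alpha=0.5, beta=1.0):
    """HYPOTHETICAL stand-in for the scaling equation y: the patent's exact
    form is not reproduced here. Shape only: inverse of a bell-shaped
    histogram, exp-based, parameterized by alpha (sharpness) and beta
    (range). Can be passed as weight_fn to first_quantize_layer above."""
    x = np.asarray(x, dtype=np.float64)
    return np.exp(alpha * np.minimum(np.abs(x), beta))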
Since, for the license plate recognition model, quantization with the adaptive KL divergence alone already yields a model with good recognition accuracy (see FIG. 3 for the result), in this embodiment no local-search optimization is performed on the model parameters; however, if the accuracy drops too much, a local search may be performed using the same scheme as the activation value search (i.e., the same process as step (6) below).
Step (5) is the process of performing the first quantization on the activation values in this embodiment; step (6) is the process of performing a second quantization on the activation values based on step (5), i.e., the local-search optimization in this embodiment. Obviously, if the metric loss degree calculated with the metric function after the quantization of step (5) already meets the requirement, step (6) need not be executed; in that case the first quantization result is the final quantization result of the activation values.
(5) The calibration set data are forward-propagated through the target model, i.e., the target model predicts on the calibration set data; the activation values of each layer are collected layer by layer to obtain activation value distribution histograms, and the adaptive KL divergence scaling method of step (4) is applied to the activation values to obtain the first quantization result of the activation values.
That is: calibration set data are input, the activation values of the target model are extracted layer by layer, an activation value distribution histogram is obtained and scaled through the adaptive KL divergence equation, KL divergence values corresponding to different decimal point positions are obtained based on the preset quantization bit number, and the values are compared; the decimal point position corresponding to the minimum KL divergence is the first quantization result of the current layer's activation values.
(6) Step (5) yields the first quantization result of each layer's activation values, i.e., the decimal point position corresponding to each layer's activation values. Starting from that position, the first quantization result of the activation values is adjusted layer by layer in the positive and negative directions (plus and minus). That is, if the first quantization result of the first layer's activation values is 4, then from 4 it is adjusted in the two directions, i.e., to 3 and to 5 respectively.
After adjustment, the calibration set data are again forward-propagated through the model, and the metric function values before and after adjustment are compared. If the metric fluctuation value is greater than the preset metric fluctuation threshold, the search proceeds in the corresponding adjustment direction (i.e., in one direction, plus or minus), trying all possible decimal point positions in that direction, so as to obtain the optimal solution of the metric function and determine the optimal quantization result of each layer's activation values. For example, if the first quantization result of the first layer's activation values is 4, and after adjusting to 3 and 5 respectively the metric fluctuation in the direction of 5 is found to be greater than the preset metric fluctuation threshold, then decimal point positions 6 and 7 are tried in turn. Finally, among the metric function values corresponding to all traversed decimal point positions, the optimal one is selected; the corresponding decimal point position is the second quantization result (i.e., the final quantization result) of that layer's activation values.
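A sketch of this directional local search, under the assumption that evaluate(positions) forward-propagates the calibration set with the given per-layer decimal point positions and returns the metric value; evaluate and positions are illustrative names, not the patent's API:

def local_search_layer(layer, positions, evaluate, bit_width=8,
                       fluct_thresh=0.001):
    """Sketch of step (6) for one layer. positions: dict layer -> decimal
    point position (the first quantization result); evaluate: assumed
    callback running the calibration set and returning the metric value."""
    base = positions[layer]
    base_score = evaluate(positions)

    # probe one step in each direction and record the metric fluctuation
    fluct = {}
    for direction in (+1, -1):
        cand = base + direction
        if 0 <= cand < bit_width:
            positions[layer] = cand
            fluct[direction] = abs(evaluate(positions) - base_score)
    positions[layer] = base

    # claim 4's rule: search only in a direction whose fluctuation exceeds
    # the threshold; if both exceed it, the larger fluctuation wins
    viable = {d: f for d, f in fluct.items() if f > fluct_thresh}
    if not viable:
        return base                  # keep the first quantization result
    direction = max(viable, key=viable.get)

    # claim 5's rule: try every decimal point position in that direction
    best_pos, best_score = base, base_score
    cand = base + direction
    while 0 <= cand < bit_width:
        positions[layer] = cand
        score = evaluate(positions)
        if score > best_score:
            best_pos, best_score = cand, score
        cand += direction
    positions[layer] = best_pos
    return best_pos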
In detail, in a classification task the metric function is set to 0.7·Top1 + 0.3·Top5; the weights may vary with the particular data set and can be determined empirically. Here Top-K denotes the probability that the correct result appears among the first k predictions: Top1 means the first prediction is correct, and Top5 means the correct result is among the first 5. In a segmentation task, the metric function is set to MIoU (Mean Intersection over Union), and in a detection task to mAP (Mean Average Precision).
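For example, the classification metric above can be computed as follows (a short sketch using the embodiment's example weights 0.7 and 0.3):

import numpy as np

def topk_accuracy(logits, labels, k):
    """Share of samples whose true label is among the top-k predictions."""
    topk = np.argsort(logits, axis=1)[:, -k:]
    return float(np.mean([label in row for label, row in zip(labels, topk)]))

def classification_metric(logits, labels):
    # example weights from the embodiment: 0.7*Top1 + 0.3*Top5
    return 0.7 * topk_accuracy(logits, labels, 1) \
         + 0.3 * topk_accuracy(logits, labels, 5)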
(7) After quantization is finished, the quantization results are obtained, including the quantization result of each layer's model parameters and the quantization result of each layer's activation values.
In the embodiment of the invention, weighting the KL divergence calculation through the adaptive KL divergence equation overcomes the defect that the KL divergence algorithm focuses only on probability: the improved adaptive KL divergence algorithm raises the weight of large values in the original distribution and suppresses the weight that values near 0 occupy in the KL divergence calculation, thereby providing a better first quantization result. In addition, the task-based metric function allows different metric functions to be set for different tasks, which helps to weigh the accuracy of the quantization parameters to different degrees for different tasks; the accuracy is then used to decide whether to perform a local search to determine the second quantization result. The embodiment of the invention increases the fixed-point quantization speed while remaining essentially consistent in accuracy with the conventional global-search fixed-point quantization method.
FIG. 5 shows the quantization results of performing fixed-point quantization on the same license plate recognition model using different methods. The first row uses conventional global search and optimizes accuracy; reaching a quantization accuracy of 95.88% takes as long as 7 days. The second row uses the method of the embodiment of the invention, obtaining the optimal fixed-point quantization result through the adaptive KL divergence and optimizing accuracy, i.e., using the formula z = Acc as the optimization target (metric function); it takes 4 hours and reaches a quantization accuracy of 95.34%. The third row uses the method of the embodiment of the invention, obtaining the optimal fixed-point quantization result through the adaptive KL divergence and optimizing with the edit distance, i.e., using the weighted Acc/Ed formula above as the optimization target (metric function); it takes 4 hours and reaches a quantization accuracy of 95.88%. Clearly, the embodiment of the invention greatly increases the fixed-point quantization speed while keeping the accuracy essentially consistent with the conventional global-search fixed-point quantization method.
An embodiment of the present invention further provides a fixed-point quantization apparatus for a deep learning model. Referring to FIG. 6, the apparatus comprises a memory 100, a processor 200, and a computer program stored in the memory and executable on the processor. When the processor 200 executes the computer program, the above method steps are implemented, namely: inputting calibration data into a target model, using the model parameters and activation values of the target model in turn as quantization objects, and executing the following steps: inputting calibration set data, extracting the quantization objects of the target model layer by layer, obtaining a distribution histogram of the quantization objects, scaling the distribution histogram through the adaptive KL divergence equation, obtaining KL divergence values corresponding to different decimal point positions based on a preset quantization bit number, and comparing them to obtain a first quantization result of the quantization objects.
Although specific embodiments have been described herein, those of ordinary skill in the art will recognize that many other modifications or alternative embodiments are equally within the scope of this disclosure. For example, any of the functions and/or processing capabilities described in connection with a particular device or component may be performed by any other device or component. In addition, while various illustrative implementations and architectures have been described in accordance with embodiments of the present disclosure, those of ordinary skill in the art will recognize that many other modifications of the illustrative implementations and architectures described herein are also within the scope of the present disclosure.
Certain aspects of the present disclosure are described above with reference to block diagrams and flowchart illustrations of systems, methods, apparatuses, and/or computer program products according to example embodiments. It will be understood that one or more blocks of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, respectively, can be implemented by executing computer-executable program instructions. Also, according to some embodiments, some blocks of the block diagrams and flow diagrams may not necessarily be performed in the order shown, or may not necessarily be performed in their entirety. In addition, additional components and/or operations beyond those shown in the block diagrams and flow diagrams may be present in certain embodiments.
Accordingly, blocks of the block diagrams and flowchart illustrations support combinations of means for performing the specified functions, combinations of elements or steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, can be implemented by special purpose hardware-based computer systems that perform the specified functions, elements or steps, or combinations of special purpose hardware and computer instructions.
Program modules, applications, etc. described herein may include one or more software components, including, for example, software objects, methods, data structures, etc. Each such software component may include computer-executable instructions that, in response to execution, cause at least a portion of the functionality described herein (e.g., one or more operations of the illustrative methods described herein) to be performed.
The software components may be encoded in any of a variety of programming languages. An illustrative programming language may be a low-level programming language, such as assembly language associated with a particular hardware architecture and/or operating system platform. Software components that include assembly language instructions may need to be converted by an assembler program into executable machine code prior to execution by a hardware architecture and/or platform. Another exemplary programming language may be a higher level programming language, which may be portable across a variety of architectures. Software components that include higher level programming languages may need to be converted to an intermediate representation by an interpreter or compiler before execution. Other examples of programming languages include, but are not limited to, a macro language, a shell or command language, a job control language, a scripting language, a database query or search language, or a report writing language. In one or more exemplary embodiments, a software component containing instructions of one of the above programming language examples may be executed directly by an operating system or other software component without first being converted to another form.
The software components may be stored as files or other data storage constructs. Software components of similar types or related functionality may be stored together, such as in a particular directory, folder, or library. Software components may be static (e.g., preset or fixed) or dynamic (e.g., created or modified at execution time).
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.