Model training method, related system and storage medium

Document No. 7822 · Published 2021-09-17

1. A method of model training, comprising:

determining a reference hyper-parameter for the i-th training stage according to the model performance score obtained by each of M models P_{i-2}' in each training process of the (i-1)-th training stage, the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, and the number T_i of training processes of the i-th training stage, wherein M, i, and T_i are integers not less than 2;

determining, according to M models P_{i-1} and the reference hyper-parameter for the i-th training stage, a hyper-parameter of each of M models P_{i-1}' in the i-th training stage; wherein the M models P_{i-1} are obtained by the M models P_{i-2}' in the last training process of the (i-1)-th training stage, and the M models P_{i-1}' are obtained by processing the M models P_{i-1};

in the i-th training stage, performing T_i training processes on each of the M models P_{i-1}' according to the hyper-parameter of each model in the i-th training stage, to obtain M models P_i, and obtaining the model performance score of each of the M models P_{i-1}' in each training process of the i-th training stage;

and when the i-th training stage is the last training stage, determining a target model from the M models P_i based on the model performance scores of the M models P_i.

2. The method according to claim 1, wherein the determining a reference hyper-parameter for the i-th training stage according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage, the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, and the number T_i of training processes of the i-th training stage comprises:

obtaining a model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage;

and processing the model performance estimation function according to the number T_i of training processes of the i-th training stage and a model performance score obtained by at least one of the M models P_{i-2}' in at least one training process of the (i-1)-th training stage, to obtain the reference hyper-parameter for the i-th training stage.

3. The method according to claim 2, wherein when i is not less than 3, the obtaining a model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage comprises:

obtaining the model performance estimation function according to the model performance score obtained by each of M models P_{i-3}' in the last training process of the (i-2)-th training stage, the hyper-parameter of each of the M models P_{i-3}' in the (i-2)-th training stage, the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage, and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, wherein the M models P_{i-2}' are obtained by processing the M models P_{i-3}'.

4. The method according to claim 2, wherein the obtaining a model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage comprises:

obtaining the model performance estimation function according to the model performance scores obtained by each of M models P_0 in each training process of the first i-1 training stages and the hyper-parameters of each of the M models P_0 in each of the first i-1 training stages, wherein the M models P_0 are the initial models.

5. The method according to any one of claims 2 to 4, further comprising:

processing the model performance estimation function according to N length ranges to obtain N processed model performance estimation functions, wherein N is an integer not less than 2;

wherein the processing the model performance estimation function according to the number T_i of training processes of the i-th training stage and a model performance score obtained by at least one of the M models P_{i-2}' in at least one training process of the (i-1)-th training stage, to obtain the reference hyper-parameter for the i-th training stage, comprises:

processing each of the N processed model performance estimation functions according to the number T_i of training processes of the i-th training stage and the model performance score obtained by the at least one of the M models P_{i-2}' in the at least one training process of the (i-1)-th training stage, to obtain N initial hyper-parameters;

and processing the N initial hyper-parameters to obtain the reference hyper-parameter for the i-th training stage.

6. The method according to any one of claims 1 to 5, wherein the determining, according to the M models P_{i-1} and the reference hyper-parameter for the i-th training stage, a hyper-parameter of each of the M models P_{i-1}' in the i-th training stage comprises:

obtaining K first models and M-K second models according to the model performance scores of the M models P_{i-1}, wherein the K first models are models among the M models P_{i-1} whose model performance scores are smaller than a first preset threshold, the M-K second models are models whose model performance scores are not smaller than the first preset threshold, K is an integer not smaller than 1, and K is smaller than M;

updating the parameters of each of the K first models according to the parameters of models whose model performance scores are larger than a second preset threshold, to obtain K updated first models, wherein the second preset threshold is not smaller than the first preset threshold;

and determining the reference hyper-parameter for the i-th training stage as the hyper-parameter of each of the K updated first models in the i-th training stage; wherein the M models P_{i-1}' comprise the K updated first models and the M-K second models, and the hyper-parameter of each of the M-K second models in the i-th training stage is the same as the hyper-parameter of that model in the (i-1)-th training stage.

7. The method according to any one of claims 1 to 6, further comprising:

obtaining the data volume of the first i-1 training stages according to the number T_j of training processes of each of the first i-1 training stages, wherein j is a positive integer;

and confirming that the data volume of the first i-1 training stages does not exceed a preset value.

8. The method of any one of claims 1 to 7, wherein the target model is applied to an image processing system or a recommendation system.

9. The method of any one of claims 1 to 8, wherein the hyper-parameters comprise at least one of:

a learning rate, a batch size, a dropout rate, a weight decay coefficient, and a momentum coefficient.

10. A model training apparatus, comprising:

a first determination module, configured to determine a reference hyper-parameter for the i-th training stage according to the model performance score obtained by each of M models P_{i-2}' in each training process of the (i-1)-th training stage, the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, and the number T_i of training processes of the i-th training stage, wherein M, i, and T_i are integers not less than 2;

a second determination module, configured to determine, according to the model performance scores of the M models P_{i-1} and the reference hyper-parameter for the i-th training stage, a hyper-parameter of each of M models P_{i-1}' in the i-th training stage; wherein the M models P_{i-1} are obtained by the M models P_{i-2}' in the last training process of the (i-1)-th training stage, and the M models P_{i-1}' are obtained by processing the M models P_{i-1};

a model training module, configured to perform, in the i-th training stage, T_i training processes on each of the M models P_{i-1}' according to the hyper-parameter of each model in the i-th training stage, to obtain M models P_i, and to obtain the model performance score of each of the M models P_{i-1}' in each training process of the i-th training stage;

and a model determination module, configured to determine, when the i-th training stage is the last training stage, a target model from the M models P_i based on the model performance scores of the M models P_i.

11. The apparatus of claim 10, wherein the first determining module is configured to:

obtain a model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage;

and process the model performance estimation function according to the number T_i of training processes of the i-th training stage and a model performance score obtained by at least one of the M models P_{i-2}' in at least one training process of the (i-1)-th training stage, to obtain the reference hyper-parameter for the i-th training stage.

12. The apparatus of claim 11, wherein when i is not less than 3, the first determining module is further configured to:

obtain the model performance estimation function according to the model performance score obtained by each of M models P_{i-3}' in the last training process of the (i-2)-th training stage, the hyper-parameter of each of the M models P_{i-3}' in the (i-2)-th training stage, the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage, and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, wherein the M models P_{i-2}' are obtained by processing the M models P_{i-3}'.

13. The apparatus of claim 11, wherein the first determining module is further configured to:

obtain the model performance estimation function according to the model performance scores obtained by each of M models P_0 in each training process of the first i-1 training stages and the hyper-parameters of each of the M models P_0 in each of the first i-1 training stages, wherein the M models P_0 are the initial models.

14. The apparatus according to any one of claims 11 to 13, further comprising a processing module configured to:

process the model performance estimation function according to N length ranges to obtain N processed model performance estimation functions, wherein N is an integer not less than 2;

the first determining module is further configured to:

process each of the N processed model performance estimation functions according to the number T_i of training processes of the i-th training stage and the model performance score obtained by at least one of the M models P_{i-2}' in at least one training process of the (i-1)-th training stage, to obtain N initial hyper-parameters;

and process the N initial hyper-parameters to obtain the reference hyper-parameter for the i-th training stage.

15. The apparatus of any one of claims 10 to 14, wherein the second determining module is configured to:

obtain K first models and M-K second models according to the model performance scores of the M models P_{i-1}, wherein the K first models are models among the M models P_{i-1} whose model performance scores are smaller than a first preset threshold, the M-K second models are models whose model performance scores are not smaller than the first preset threshold, K is an integer not smaller than 1, and K is smaller than M;

update the parameters of each of the K first models according to the parameters of models whose model performance scores are larger than a second preset threshold, to obtain K updated first models, wherein the second preset threshold is not smaller than the first preset threshold;

and determine the reference hyper-parameter for the i-th training stage as the hyper-parameter of each of the K updated first models in the i-th training stage; wherein the M models P_{i-1}' comprise the K updated first models and the M-K second models, and the hyper-parameter of each of the M-K second models in the i-th training stage is the same as the hyper-parameter of that model in the (i-1)-th training stage.

16. The apparatus according to any one of claims 10 to 15, further comprising a confirmation module configured to:

obtain the data volume of the first i-1 training stages according to the number T_j of training processes of each of the first i-1 training stages;

and confirm that the data volume of the first i-1 training stages does not exceed a preset value.

17. The apparatus of any one of claims 10 to 16, wherein the target model is applied to an image processing system or a recommendation system.

18. The apparatus of any one of claims 10 to 17, wherein the hyper-parameters comprise at least one of:

a learning rate, a batch size, a dropout rate, a weight decay coefficient, and a momentum coefficient.

19. A model training apparatus comprising a processor and a memory; wherein the memory is configured to store program code and the processor is configured to invoke the program code to perform the method of any of claims 1 to 9.

20. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 9.

21. A computer program product which, when run on a computer, causes the computer to carry out the method according to any one of claims 1 to 9.

Background

Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, basic AI theory, and the like.

Neural networks have been successfully applied in many fields, including computer vision, machine translation, speech recognition, and so on. Successfully training a neural network often requires choosing appropriate hyper-parameters. A hyper-parameter is a parameter set before training of a deep neural network begins; it is not a network weight parameter of the neural network, but a parameter used to control the training process. Hyper-parameters are not learned during training; they are configuration variables.

Appropriate hyper-parameters have a significant influence on the performance of the trained neural network. Therefore, automating the hyper-parameter selection process is a commercially valuable technology.

Currently, the Population-Based Bandits algorithm (PB2) splits the training process into several training stages, each comprising a plurality of training processes; in each training process, the model traverses the sample set once. When selecting hyper-parameters, PB2 selects them according to an estimate of the model performance after the first training process of the next training stage. However, because it can only predict the model performance after the first training process of the next stage, its prediction target is too limited: it cannot predict the model performance after the last training process of the next stage. Since the model performance after the last training process of each stage matters most when selecting appropriate hyper-parameters, this approach provides little guidance for obtaining a neural network with better performance.

Disclosure of Invention

The present application discloses a model training method, a related system, and a storage medium, which can help improve model computation efficiency.

In a first aspect, an embodiment of the present application provides a model training method, including: determining a reference hyper-parameter for the i-th training stage according to the model performance score obtained by each of M models P_{i-2}' in each training process of the (i-1)-th training stage, the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, and the number T_i of training processes of the i-th training stage, wherein M, i, and T_i are integers not less than 2; determining, according to M models P_{i-1} and the reference hyper-parameter for the i-th training stage, a hyper-parameter of each of M models P_{i-1}' in the i-th training stage, wherein the M models P_{i-1} are obtained by the M models P_{i-2}' in the last training process of the (i-1)-th training stage, the M models P_{i-1}' are obtained by processing the M models P_{i-1}, the M models P_{i-1}' correspond one-to-one to the M models P_{i-1}, and the M models P_{i-1} correspond one-to-one to the M models P_{i-2}'; in the i-th training stage, performing T_i training processes on each of the M models P_{i-1}' according to the hyper-parameter of each model in the i-th training stage, to obtain M models P_i, and obtaining the model performance score of each of the M models P_{i-1}' in each training process of the i-th training stage; and when the i-th training stage is the last training stage, determining the target model from the M models P_i based on the model performance scores of the M models P_i.

In the embodiment of the present application, the hyper-parameter of each of the M models in the i-th training stage is determined according to the model performance score obtained by each of the M models in each training process of the (i-1)-th training stage, the hyper-parameter of each model in the (i-1)-th training stage, and the number of training processes of the i-th training stage; the M models are then trained based on these hyper-parameters to obtain the target model. Because the number of training processes of the i-th training stage is taken into account when determining the hyper-parameter for the i-th training stage, the hyper-parameter determination is more comprehensive and more accurate, which helps obtain a neural network with better performance and improves model computation efficiency.
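For illustration only (this sketch is not part of the claimed embodiment; the toy "model", its scalar parameter, and all numeric values are hypothetical), the staged loop described above can be outlined as follows: in each stage, each of M models runs T training processes under its own hyper-parameter, per-process scores are recorded, and before each new stage the weakest model adopts the parameters and a perturbed hyper-parameter derived from the strongest one.

```python
import random

def train_one_process(params, lr):
    """Hypothetical training process: nudge the scalar 'weights' toward 1.0."""
    return params + lr * (1.0 - params)

def score(params):
    """Toy model performance score: closer to 1.0 is better."""
    return 1.0 - abs(1.0 - params)

def staged_training(num_stages=3, M=4, T=2, seed=0):
    rng = random.Random(seed)
    models = [0.0] * M                                   # M initial models P_0 (toy scalars)
    hypers = [rng.uniform(0.05, 0.5) for _ in range(M)]  # one hyper-parameter per model
    history = []                                         # per stage: (hyper, per-process scores)
    for stage in range(num_stages):
        if stage >= 1:
            last = history[-1]
            best = max(range(M), key=lambda m: last[m][1][-1])
            worst = min(range(M), key=lambda m: last[m][1][-1])
            models[worst] = models[best]                         # exploit: copy parameters
            hypers[worst] = hypers[best] * rng.uniform(0.8, 1.2)  # explore near reference
        stage_records = []
        for m in range(M):
            scores = []
            for _ in range(T):                           # T_i training processes this stage
                models[m] = train_one_process(models[m], hypers[m])
                scores.append(score(models[m]))
            stage_records.append((hypers[m], scores))
        history.append(stage_records)
    return max(models, key=score)                        # target model after last stage

target = staged_training()
```

The key point mirrored from the text is that hyper-parameters are revised between stages using the scores recorded in the previous stage, not fixed once at the start.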

As an optional implementation, the determining a reference hyper-parameter for the i-th training stage according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage, the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, and the number T_i of training processes of the i-th training stage includes:

obtaining a model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage; and processing the model performance estimation function according to the number T_i of training processes of the i-th training stage and a model performance score obtained by at least one of the M models P_{i-2}' in at least one training process of the (i-1)-th training stage, to obtain the reference hyper-parameter for the i-th training stage.

By learning the corresponding information of each model in the (i-1)-th training stage, a model performance estimation function is obtained, which can predict the model performance scores of training processes a prediction interval Δt ahead. This improves model computation efficiency.
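One possible way to realize such a model performance estimation function (a sketch under assumptions: the embodiment does not prescribe a particular regressor, and the Gaussian-process fit, kernel length-scale, and all observed values below are illustrative) is to regress performance scores on (hyper-parameter, prediction interval Δt) pairs and then maximize the prediction T_i processes ahead:

```python
import numpy as np

def rbf(X, Y, length=0.3):
    """RBF kernel between rows of X and rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

def fit_gp(X, y, noise=1e-3, length=0.3):
    """Return the posterior-mean predictor of a GP fitted on (X, y)."""
    K = rbf(X, X, length) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, y)
    return lambda Xq: rbf(Xq, X, length) @ alpha

# Illustrative observations from stage i-1: (hyper-parameter, Δt) -> performance score.
X_obs = np.array([[0.1, 1], [0.1, 2], [0.4, 1], [0.4, 2], [0.8, 1], [0.8, 2]], dtype=float)
y_obs = np.array([0.50, 0.55, 0.60, 0.72, 0.40, 0.35])

predict = fit_gp(X_obs, y_obs)

# Reference hyper-parameter: maximize the prediction T_i = 2 processes ahead.
candidates = np.linspace(0.05, 0.9, 50)
T_i = 2
queries = np.stack([candidates, np.full_like(candidates, T_i)], axis=1)
pred = predict(queries)
ref_hyper = float(candidates[int(np.argmax(pred))])
```

Conditioning the predictor on Δt is what lets the selection target the last training process of the coming stage rather than only the first one.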

Here, processing the model performance estimation function according to the number T_i of training processes of the i-th training stage and a model performance score obtained by at least one of the M models P_{i-2}' in at least one training process of the (i-1)-th training stage to obtain the reference hyper-parameter for the i-th training stage may be: obtaining the reference hyper-parameter according to the model performance score obtained by one of the M models P_{i-2}' in the last training process of the (i-1)-th training stage.

The reference hyper-parameter may also be obtained according to the model performance scores obtained by a plurality of the M models P_{i-2}' in the last training process of the (i-1)-th training stage.

The reference hyper-parameter may also be obtained according to the model performance scores obtained by the M models P_{i-2}' in any training process of the (i-1)-th training stage.

There may be one or more reference hyper-parameters; this embodiment does not specifically limit this.

As an alternative implementation, when i is not less than 3, the obtaining a model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage includes:

obtaining the model performance estimation function according to the model performance score obtained by each of M models P_{i-3}' in the last training process of the (i-2)-th training stage, the hyper-parameter of each of the M models P_{i-3}' in the (i-2)-th training stage, the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage, and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, wherein the M models P_{i-2}' are obtained by processing the M models P_{i-3}' and correspond one-to-one to the M models P_{i-3}'.

In this embodiment, by learning the model performance score obtained in the last training process of the previous training stage, the hyper-parameters of the previous training stage, the model performance scores obtained in each training process of the current training stage, and the hyper-parameters of the current training stage, cross-stage model performance estimation can be performed, which improves the accuracy of the model performance estimation.

As another alternative implementation, the obtaining a model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage includes:

obtaining the model performance estimation function according to the model performance scores obtained by each of M models P_0 in each training process of the first i-1 training stages and the hyper-parameters of each of the M models P_0 in each of the first i-1 training stages, wherein the M models P_0 are the initial models.

In this embodiment, the reference hyper-parameter for the i-th training stage is obtained by learning the performance scores and hyper-parameters of the models over the first i-1 training stages. By learning a large amount of data, the accuracy of the model performance estimation is improved, which improves the reliability of hyper-parameter selection and the efficiency of obtaining a model with better performance.

This embodiment is described taking the training data of the first i-1 training stages as an example; the estimation function may also be obtained by learning the training data of any plurality of training stages, for example at least 3 training stages, or the 3rd through (i-1)-th training stages, which is not limited in this embodiment.

As an optional implementation, the method further includes: processing the model performance estimation function according to N length ranges to obtain N processed model performance estimation functions, wherein N is an integer not less than 2. The processing the model performance estimation function according to the number T_i of training processes of the i-th training stage and a model performance score obtained by at least one of the M models P_{i-2}' in at least one training process of the (i-1)-th training stage, to obtain the reference hyper-parameter for the i-th training stage, includes: processing each of the N processed model performance estimation functions according to the number T_i of training processes of the i-th training stage and the model performance score obtained by the at least one of the M models P_{i-2}' in the at least one training process of the (i-1)-th training stage, to obtain N initial hyper-parameters; and processing the N initial hyper-parameters to obtain the reference hyper-parameter for the i-th training stage.

By searching for an appropriate hyper-parameter over a plurality of length ranges, the stability of the model can be improved.
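As a hedged sketch of how N length ranges might be realized (assuming the length ranges are kernel length-scales of the estimation function, which the text does not fix; the observed values and the median-combining step are illustrative):

```python
import numpy as np

def fit_gp(X, y, length, noise=1e-3):
    """1-D GP posterior-mean predictor with an RBF kernel of the given length-scale."""
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / length**2) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, y)
    return lambda q: np.exp(-0.5 * (q[:, None] - X[None, :]) ** 2 / length**2) @ alpha

X = np.array([0.1, 0.3, 0.5, 0.7, 0.9])       # observed hyper-parameters (illustrative)
y = np.array([0.42, 0.66, 0.71, 0.52, 0.30])  # observed performance scores (illustrative)

lengths = [0.1, 0.2, 0.4]                     # N = 3 length ranges
grid = np.linspace(0.05, 0.95, 91)

# One initial hyper-parameter per length range: the maximizer of that fit.
initial = [float(grid[int(np.argmax(fit_gp(X, y, ell)(grid)))]) for ell in lengths]

# Process the N initial hyper-parameters into the single reference hyper-parameter.
ref_hyper = float(np.median(initial))
```

Using several length-scales hedges against a single badly chosen smoothness assumption, which is one plausible reading of the stability benefit claimed above.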

As an optional implementation, the determining, according to the M models P_{i-1} and the reference hyper-parameter for the i-th training stage, a hyper-parameter of each of the M models P_{i-1}' in the i-th training stage includes: obtaining K first models and M-K second models according to the model performance scores of the M models P_{i-1}, wherein the K first models are models among the M models P_{i-1} whose model performance scores are smaller than a first preset threshold, the M-K second models are models whose model performance scores are not smaller than the first preset threshold, K is an integer not smaller than 1, and K is smaller than M; updating the parameters of each of the K first models according to the parameters of models whose model performance scores are larger than a second preset threshold, to obtain K updated first models, wherein the second preset threshold is not smaller than the first preset threshold; and determining the reference hyper-parameter for the i-th training stage as the hyper-parameter of each of the K updated first models in the i-th training stage; wherein the M models P_{i-1}' comprise the K updated first models and the M-K second models, and the hyper-parameter of each of the M-K second models in the i-th training stage is the same as the hyper-parameter of that model in the (i-1)-th training stage.

In this embodiment, the models are updated based on the model performance scores of the M models P_{i-1} to obtain the M models P_{i-1}' used for training in the next training stage. Through this periodic model update, the parameters of models with poor performance scores are replaced by the parameters of models with high performance scores, so that the next training stage starts from well-performing models, improving the efficiency of obtaining a model with good performance.
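A minimal sketch of this update step (the thresholds, the dictionary model representation, and the function name `exploit` are all hypothetical):

```python
import copy

def exploit(models, scores, hypers, ref_hyper, low_thresh, high_thresh):
    """Replace parameters of models scoring below low_thresh (the 'first models')
    with those of a model scoring above high_thresh, and give the updated models
    the reference hyper-parameter; other models (the 'second models') keep both
    their parameters and their hyper-parameters."""
    assert high_thresh >= low_thresh
    donors = [i for i, s in enumerate(scores) if s > high_thresh]
    assert donors, "need at least one model above the second threshold"
    new_models, new_hypers = [], []
    for i, s in enumerate(scores):
        if s < low_thresh:                        # a first model: update it
            donor = max(donors, key=lambda d: scores[d])
            new_models.append(copy.deepcopy(models[donor]))
            new_hypers.append(ref_hyper)          # adopts the reference hyper-parameter
        else:                                     # a second model: unchanged
            new_models.append(models[i])
            new_hypers.append(hypers[i])
    return new_models, new_hypers

models = [{"w": 0.1}, {"w": 0.9}, {"w": 0.4}]
scores = [0.2, 0.8, 0.6]
hypers = [0.01, 0.10, 0.05]
new_models, new_hypers = exploit(models, scores, hypers,
                                 ref_hyper=0.2, low_thresh=0.5, high_thresh=0.7)
```

Here the weakest model (score 0.2) inherits the parameters of the strongest (score 0.8) and switches to the reference hyper-parameter, while the others are untouched.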

This embodiment is described taking one reference hyper-parameter as an example, which is determined as the hyper-parameter of each of the K updated first models in the i-th training stage.

There may also be a plurality of reference hyper-parameters. For example, one reference hyper-parameter may be assigned to the model with the worst performance score, another to the model with the second-worst performance score, and so on. Other forms are possible; this solution does not specifically limit this.

As an optional implementation, the method further includes: obtaining the data volume of the first i-1 training stages according to the number T_j of training processes of each of the first i-1 training stages, wherein j is a positive integer (j = 1, 2, 3, ...) and T_j is a positive integer; and confirming that the data volume of the first i-1 training stages does not exceed a preset value.

By checking in real time whether excessive training data has been collected during model training, initialization processing can be performed when the data volume becomes excessive. This effectively improves computation efficiency.
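A toy sketch of the data-volume check (the accounting of data volume as processes × samples, and the reset-to-empty initialization, are assumptions for illustration):

```python
def data_volume(process_counts, samples_per_process):
    """Total data volume over the first i-1 training stages,
    given T_j training processes per stage (toy accounting)."""
    return sum(t * samples_per_process for t in process_counts)

def maybe_reset(history, process_counts, samples_per_process, budget):
    """Clear accumulated tuning records when the data volume exceeds the budget."""
    if data_volume(process_counts, samples_per_process) > budget:
        return []          # initialization: drop collected (hyper, score) records
    return history

history = [("h1", 0.4), ("h2", 0.6)]
kept = maybe_reset(history, process_counts=[2, 3, 2], samples_per_process=100, budget=1000)
dropped = maybe_reset(history, process_counts=[4, 4, 4], samples_per_process=100, budget=1000)
```

An alternative to a full reset, as noted below, would be returning only the records of the most recent stages instead of an empty list.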

As an optional implementation, when the data volume of the first i-1 training stages exceeds the preset value, the hyper-parameters are initialized.

As another optional implementation, when the data volume of the first i-1 training stages exceeds the preset value, part of the data may be cleared; for example, only the training data from the 5th training stage to the (i-1)-th training stage may be retained. Other processing is also possible.

The target model provided by the scheme is applied to an image processing system or a recommendation system.

The image processing system may be used for image recognition, instance segmentation, object detection, and the like. The recommendation system may be used for commodity recommendation or for entertainment recommendation such as movie and music recommendation, for example recommendation based on click-through-rate prediction.

The hyper-parameters include at least one of: a learning rate, a batch size, a dropout rate, a weight decay coefficient, and a momentum coefficient.
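For concreteness, such a hyper-parameter set might look like the following (all values are illustrative defaults, not taken from the embodiment):

```python
# A hypothetical hyper-parameter configuration covering the listed categories.
hyperparams = {
    "learning_rate": 0.01,   # step size of the optimizer
    "batch_size": 128,       # samples per gradient update
    "dropout_rate": 0.5,     # fraction of activations randomly dropped
    "weight_decay": 1e-4,    # L2 regularization coefficient
    "momentum": 0.9,         # momentum coefficient of the optimizer
}

# Stage-wise tuning might, for example, perturb the learning rate multiplicatively
# while leaving the rest unchanged (a common but not mandated choice).
perturbed = dict(hyperparams, learning_rate=hyperparams["learning_rate"] * 1.2)
```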

In a second aspect, the present application provides a model training apparatus, including: a first determination module, configured to determine a reference hyper-parameter for the i-th training stage according to the model performance score obtained by each of M models P_{i-2}' in each training process of the (i-1)-th training stage, the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, and the number T_i of training processes of the i-th training stage, wherein M, i, and T_i are integers not less than 2; a second determination module, configured to determine, according to the model performance scores of the M models P_{i-1} and the reference hyper-parameter for the i-th training stage, a hyper-parameter of each of M models P_{i-1}' in the i-th training stage, wherein the M models P_{i-1} are obtained by the M models P_{i-2}' in the last training process of the (i-1)-th training stage, the M models P_{i-1}' are obtained by processing the M models P_{i-1}, the M models P_{i-1}' correspond one-to-one to the M models P_{i-1}, and the M models P_{i-1} correspond one-to-one to the M models P_{i-2}'; a model training module, configured to perform, in the i-th training stage, T_i training processes on each of the M models P_{i-1}' according to the hyper-parameter of each model in the i-th training stage, to obtain M models P_i, and to obtain the model performance score of each of the M models P_{i-1}' in each training process of the i-th training stage; and a model determination module, configured to determine, when the i-th training stage is the last training stage, the target model from the M models P_i based on the model performance scores of the M models P_i.

According to the embodiments of the application, the hyper-parameter of each of the M models in the i-th training stage is determined from the model performance score obtained by each of the M models in each training process in the (i-1)-th training stage, the hyper-parameter of each model in the (i-1)-th training stage, and the number of training processes of the i-th training stage; the M models are then trained with these hyper-parameters to obtain the target model. Because the number of training processes of the i-th training stage is taken into account when determining the hyper-parameter for that stage, the hyper-parameter determination is more comprehensive and more accurate, yielding a neural network with better performance and improving the computational efficiency of the model.
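As a purely illustrative sketch of this staged procedure (the function names, the toy "training process", and the scoring rule below are hypothetical stand-ins, not the application's implementation):

```python
def score(m):
    # Hypothetical performance score: higher is better, best at weight == 1.0.
    return -abs(m["weight"] - 1.0)

def run_stage(models, hparams, t_i):
    """Run one training stage: T_i training processes per model, recording
    a (model id, process index, score) triple after each process."""
    history = []
    for step in range(t_i):
        for m in models:
            m["weight"] += hparams[m["id"]]["lr"]  # stand-in for one training process
            history.append((m["id"], step, score(m)))
    return models, history

# M = 4 models trained over 3 stages with T_i = 2 training processes each.
M = 4
models = [{"id": k, "weight": 0.0} for k in range(M)]
hparams = {k: {"lr": 0.1 * (k + 1)} for k in range(M)}
for stage in range(3):
    models, history = run_stage(models, hparams, t_i=2)
# Target model: the best-scoring model after the last stage.
target = max(models, key=score)
```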

As an optional implementation manner, the first determining module is configured to: obtain a model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process in the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage; and process the model performance estimation function according to the number of training processes T_i of the i-th training stage and the model performance score obtained by at least one of the M models P_{i-2}' in at least one training process in the (i-1)-th training stage, to obtain the reference hyper-parameter in the i-th training stage.

By learning the corresponding information of each model, a model performance estimation function is obtained, which can predict the model performance score a prediction interval Δt of training processes ahead. This improves the computational efficiency of the model.
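The application does not fix the form of this estimation function; one minimal, assumed realization fits a logarithmic learning curve to the scores observed in the (i-1)-th stage and extrapolates Δt training processes ahead:

```python
import numpy as np

def fit_performance_estimator(steps, scores):
    """Fit score ≈ a*log(step + 1) + b (a common learning-curve shape)
    and return a function that predicts the score at any future step."""
    x = np.log(np.asarray(steps, dtype=float) + 1.0)
    a, b = np.polyfit(x, np.asarray(scores, dtype=float), 1)
    return lambda step: a * np.log(step + 1.0) + b

# Scores observed for one model during the (i-1)-th training stage.
observed_steps = [0, 1, 2, 3]
observed_scores = [0.50, 0.62, 0.69, 0.74]
estimate = fit_performance_estimator(observed_steps, observed_scores)
predicted = estimate(3 + 5)  # estimated score Δt = 5 training processes later
```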

As an optional implementation manner, when i is not less than 3, the first determining module is further configured to: obtain the model performance estimation function according to the model performance score obtained by each of M models P_{i-3}' in the last training process in the (i-2)-th training stage, the hyper-parameter of each of the M models P_{i-3}' in the (i-2)-th training stage, the model performance score obtained by each of the M models P_{i-2}' in each training process in the (i-1)-th training stage, and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, wherein the M models P_{i-2}' are obtained by processing the M models P_{i-3}', and the M models P_{i-2}' correspond one-to-one to the M models P_{i-3}'.

According to the embodiments of the application, by learning the model performance score obtained in the last training process of the previous training stage, the hyper-parameter of the previous training stage, the model performance scores obtained in each training process of the current training stage, and the hyper-parameter of the current training stage, model performance can be estimated across stages, which improves the accuracy of the estimation.

As another optional implementation manner, the first determining module is further configured to: obtain the model performance estimation function according to the model performance scores obtained by each of M models P_0 in each training process in the first i-1 training stages and the hyper-parameter of each of the M models P_0 in each of the first i-1 training stages, wherein the M models P_0 are the initial models.

According to the embodiments of the application, the reference hyper-parameter in the i-th training stage is obtained by learning the performance scores and hyper-parameters of the models over the first i-1 training stages. Learning from this larger body of data improves the accuracy of the performance estimation, and hence the reliability of the hyper-parameter selection and the efficiency of obtaining a well-performing model.

As an optional implementation manner, the apparatus further includes a processing module configured to: process the model performance estimation function according to N length ranges to obtain N processed model performance estimation functions, wherein N is an integer not less than 2. The first determining module is further configured to: process each of the N processed model performance estimation functions according to the number of training processes T_i of the i-th training stage and the model performance score obtained by at least one of the M models P_{i-2}' in at least one training process in the (i-1)-th training stage, to obtain N initial hyper-parameters; and process the N initial hyper-parameters to obtain the reference hyper-parameter in the i-th training stage.

By searching for a suitable hyper-parameter over multiple length ranges in this way, the stability of the model can be improved.
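A hypothetical reading of this step, with N = 3 length ranges: each range looks back over a different amount of score history, proposes an initial hyper-parameter, and the N proposals are aggregated (here by simple averaging) into the reference hyper-parameter. The proposal rule and values are illustrative only:

```python
def propose_from_range(history, length):
    """Hypothetical proposal rule: look at the last `length` scores and
    suggest a higher learning rate if scores are still improving,
    a lower one otherwise."""
    window = history[-length:]
    improving = window[-1] > window[0]
    return 0.2 if improving else 0.05

history = [0.50, 0.62, 0.69, 0.74, 0.75, 0.75]   # scores from stage i-1
lengths = [2, 4, 6]                               # N = 3 length ranges
candidates = [propose_from_range(history, n) for n in lengths]
reference_lr = sum(candidates) / len(candidates)  # aggregate N initial values
```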

As an optional implementation manner, the second determining module is configured to: obtain K first models and M-K second models according to the model performance scores of the M models P_{i-1}, wherein the K first models are the models among the M models P_{i-1} whose model performance score is smaller than a first preset threshold, the M-K second models are the models whose model performance score is not smaller than the first preset threshold, K is an integer not smaller than 1, and K is smaller than M; update the parameters of each of the K first models with the parameters of a model whose model performance score is larger than a second preset threshold, to obtain K updated first models, wherein the second preset threshold is not smaller than the first preset threshold; and determine the reference hyper-parameter in the i-th training stage as the hyper-parameter in the i-th training stage of each of the K updated first models; wherein the M models P_{i-1}' include the K updated first models and the M-K second models, and the hyper-parameter of each of the M-K second models in the i-th training stage is the same as its hyper-parameter in the (i-1)-th training stage.

According to the embodiments of the application, the M models P_{i-1} are updated according to their model performance scores to obtain the M models P_{i-1}' used for training in the next training stage. Through this periodic model update, the parameters of models with poor performance scores are replaced by the parameters of models with high performance scores, so the next training stage proceeds from well-performing models, improving the efficiency of obtaining a model with good performance.
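This periodic update resembles the exploit step of population-based training; a minimal sketch, with illustrative thresholds and with each model reduced to a single parameter:

```python
import copy

def exploit(models, scores, low_thresh, high_thresh):
    """Replace the parameters of models scoring below low_thresh with the
    parameters of a model scoring above high_thresh (high_thresh >= low_thresh)."""
    donors = [m for m, s in zip(models, scores) if s > high_thresh]
    updated = []
    for m, s in zip(models, scores):
        if s < low_thresh and donors:
            updated.append(copy.deepcopy(donors[0]))  # copy a strong model's parameters
        else:
            updated.append(m)
    return updated

models = [{"w": 0.1}, {"w": 0.9}, {"w": 0.5}]
scores = [0.2, 0.95, 0.6]
population = exploit(models, scores, low_thresh=0.3, high_thresh=0.8)
```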

As an optional implementation manner, the apparatus further includes a confirmation module configured to: acquire the data volume of the first i-1 training stages according to the number of training processes T_j of each of the first i-1 training stages; and confirm that the data volume of the first i-1 training stages does not exceed a preset value.

By checking in real time whether excessive training data has been collected during model training, initialization processing can be performed when the data volume becomes excessive. This effectively improves computational efficiency.
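The bookkeeping itself is simple; one assumed form, counting one recorded score per model per training process over the first i-1 stages:

```python
def data_volume(process_counts, num_models):
    """Total number of recorded (model, process, score) entries over the
    first i-1 stages, given T_j for each stage j."""
    return num_models * sum(process_counts)

t_per_stage = [3, 4, 5]              # T_j for stages 1 .. i-1 (illustrative)
volume = data_volume(t_per_stage, num_models=8)
within_budget = volume <= 100        # preset value, illustrative
```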

The object model is applied to an image processing system or a recommendation system.

The image processing system may be used for image recognition, instance segmentation, object detection, and the like. The recommendation system may be used for commodity recommendation and for entertainment recommendation such as movie and music recommendation, for example recommendation based on click-through-rate prediction.

The hyper-parameters include at least one of: learning rate, batch size, dropout rate, weight decay coefficient, and momentum coefficient.

In a third aspect, the present application provides a model training apparatus comprising a processor and a memory; wherein the memory is configured to store program code and the processor is configured to invoke the program code to perform the method.

In a fourth aspect, the present application provides a computer storage medium comprising computer instructions that, when executed on an electronic device, cause the electronic device to perform the method as provided in any one of the possible embodiments of the first aspect.

In a fifth aspect, the embodiments of the present application provide a computer program product, which when run on a computer, causes the computer to execute the method as provided in any one of the possible embodiments of the first aspect.

In a sixth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform the method provided in any one of the possible implementation manners of the first aspect.

Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, and when the instructions are executed, the processor is configured to execute the method provided in any one of the possible implementation manners of the first aspect.

It is to be understood that the apparatus of the second aspect, the apparatus of the third aspect, the computer storage medium of the fourth aspect, the computer program product of the fifth aspect, or the chip of the sixth aspect provided above are all adapted to perform the method provided in any of the first aspects.

Therefore, the beneficial effects achieved by the method can refer to the beneficial effects in the corresponding method, and are not described herein again.

Drawings

The drawings used in the embodiments of the present application are described below.

FIG. 1 is a schematic diagram of an artificial intelligence framework according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an application environment according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present invention;

fig. 5 is a schematic structural diagram of a neural network processor according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating a model training architecture according to an embodiment of the present disclosure;

FIG. 7 is a schematic flow chart diagram illustrating a model training method according to an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of a learning method provided by an embodiment of the present application;

FIG. 9 is a schematic diagram of another learning method provided by embodiments of the present application;

FIG. 10 is a schematic diagram of a model training method provided in an embodiment of the present application;

FIG. 11 is a schematic flow chart diagram illustrating a model training method according to an embodiment of the present disclosure;

FIG. 12 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;

fig. 13 is a schematic structural diagram of another model training device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

FIG. 1 shows a schematic diagram of an artificial intelligence body framework that describes the overall workflow of an artificial intelligence system, applicable to the general artificial intelligence field requirements.

The artificial intelligence topic framework described above is set forth below in terms of two dimensions, the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).

The "smart information chain" reflects the series of processes that begins with the acquisition of data, covering the general stages of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process.

The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying AI infrastructure and information (the technology for providing and processing it) up to the industrial ecology of the system.

(1) Infrastructure:

the infrastructure provides computing power support for the artificial intelligent system, realizes communication with the outside world, and realizes support through a foundation platform. Communicating with the outside through a sensor; the computing power is provided by an intelligent chip (a hardware acceleration chip such as a Central Processing Unit (CPU), an embedded neural Network Processor (NPU), a Graphic Processing Unit (GPU), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), and the like); the basic platform comprises distributed computing framework, network and other related platform guarantees and supports, and can comprise cloud storage and computing, interconnection and intercommunication networks and the like. For example, sensors and external communications acquire data that is provided to intelligent chips in a distributed computing system provided by the base platform for computation.

(2) Data

Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.

(3) Data processing

Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.

The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.

Inference refers to the process of simulating human intelligent inference in a computer or intelligent system: using formalized information, the machine thinks about and solves problems according to an inference control strategy; typical functions are searching and matching.

The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.

(4) General capabilities

After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.

(5) Intelligent product and industrial application

Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, commercializing intelligent information decision-making and realizing practical applications. The application fields mainly include: intelligent manufacturing, intelligent transportation, smart home, intelligent medical treatment, intelligent security, automatic driving, safe city, intelligent terminal, and the like.

Referring to fig. 2, a system architecture 200 is provided in accordance with an embodiment of the present invention. Data collection facility 260 is used to collect, for example, image data and store in database 230, and training facility 220 generates target model/rule 201 based on the image data maintained in database 230. How the training device 220 derives the target model/rule 201 based on the image data will be described in more detail below, and the target model/rule 201 can be applied to an image processing system, or a recommendation system, or the like.

Wherein the training device 220 determines a reference hyper-parameter in the i-th training stage according to the model performance score obtained by each of M models P_{i-2}' in each training process in the (i-1)-th training stage, the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, and the number of training processes T_i of the i-th training stage, wherein M, i and T_i are integers not less than 2; determines the hyper-parameter of each of M models P_{i-1}' in the i-th training stage according to the M models P_{i-1} and the reference hyper-parameter in the i-th training stage, wherein the M models P_{i-1} are obtained by the M models P_{i-2}' in the last training process of the (i-1)-th training stage, the M models P_{i-1}' are obtained by processing the M models P_{i-1}, the M models P_{i-1}' correspond one-to-one to the M models P_{i-1}, and the M models P_{i-1} correspond one-to-one to the M models P_{i-2}'; in the i-th training stage, runs T_i training processes on each of the M models P_{i-1}' according to the hyper-parameter of that model in the i-th training stage to obtain M models P_i, and obtains the model performance score of each of the M models P_{i-1}' in each training process in the i-th training stage; and, when the i-th training stage is the last training stage, determines a target model from the M models P_i based on the model performance scores of the M models P_i.

Wherein the training device 220 is further configured to: obtain a model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process in the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage; and process the model performance estimation function according to the number of training processes T_i of the i-th training stage and the model performance score obtained by at least one of the M models P_{i-2}' in at least one training process in the (i-1)-th training stage, to obtain the reference hyper-parameter in the i-th training stage.

As an alternative implementation, when i is not less than 3, obtaining the model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process in the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage includes: obtaining the model performance estimation function according to the model performance score obtained by each of M models P_{i-3}' in the last training process in the (i-2)-th training stage, the hyper-parameter of each of the M models P_{i-3}' in the (i-2)-th training stage, the model performance score obtained by each of the M models P_{i-2}' in each training process in the (i-1)-th training stage, and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, wherein the M models P_{i-2}' are obtained by processing the M models P_{i-3}', and the M models P_{i-2}' correspond one-to-one to the M models P_{i-3}'.

As another alternative implementation, obtaining the model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process in the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage includes: obtaining the model performance estimation function according to the model performance scores obtained by each of M models P_0 in each training process in the first i-1 training stages and the hyper-parameter of each of the M models P_0 in each of the first i-1 training stages, wherein the M models P_0 are the initial models.

As an optional implementation manner, the method further includes: processing the model performance estimation function according to N length ranges to obtain N processed model performance estimation functions, wherein N is an integer not less than 2. Processing the model performance estimation function according to the number of training processes T_i of the i-th training stage and the model performance score obtained by at least one of the M models P_{i-2}' in at least one training process in the (i-1)-th training stage to obtain the reference hyper-parameter in the i-th training stage includes: processing each of the N processed model performance estimation functions according to the number of training processes T_i of the i-th training stage and the model performance score obtained by at least one of the M models P_{i-2}' in at least one training process in the (i-1)-th training stage, to obtain N initial hyper-parameters; and processing the N initial hyper-parameters to obtain the reference hyper-parameter in the i-th training stage.

As an optional implementation, determining the hyper-parameter of each of the M models P_{i-1}' in the i-th training stage according to the M models P_{i-1} and the reference hyper-parameter in the i-th training stage includes: obtaining K first models and M-K second models according to the model performance scores of the M models P_{i-1}, wherein the K first models are the models among the M models P_{i-1} whose model performance score is smaller than a first preset threshold, the M-K second models are the models whose model performance score is not smaller than the first preset threshold, K is an integer not smaller than 1, and K is smaller than M; updating the parameters of each of the K first models with the parameters of a model whose model performance score is larger than a second preset threshold, to obtain K updated first models, wherein the second preset threshold is not smaller than the first preset threshold; and determining the reference hyper-parameter in the i-th training stage as the hyper-parameter in the i-th training stage of each of the K updated first models; wherein the M models P_{i-1}' include the K updated first models and the M-K second models, and the hyper-parameter of each of the M-K second models in the i-th training stage is the same as its hyper-parameter in the (i-1)-th training stage.

Wherein the training apparatus 220 is further configured to:

acquire the data volume of the first i-1 training stages according to the number of training processes T_j of each of the first i-1 training stages; and confirm that the data volume of the first i-1 training stages does not exceed a preset value.

The operation of each layer in a deep neural network can be described by the mathematical expression y = a(W·x + b). At the physical level, the work of each layer can be understood as completing a transformation from input space to output space (i.e., from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. scaling; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are performed by W·x, operation 4 by +b, and operation 5 by a(). The word "space" is used here because the object being classified is not a single thing but a class of things; space refers to the collection of all individuals of that class. W is a weight matrix, each value of which represents the weight of one neuron in that layer of the neural network. W determines the spatial transformation from input space to output space described above, i.e., the weight W of each layer controls how the space is transformed. The purpose of training a deep neural network is ultimately to obtain the weight matrices of all layers of the trained network (the weight matrices formed by the vectors W of many layers). Thus the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
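A small numerical check of y = a(W·x + b), taking ReLU as the activation a(·) (the matrix and vectors are arbitrary illustrative values):

```python
import numpy as np

W = np.array([[1.0, -2.0],
              [0.5,  1.0]])         # weight matrix: dimension change/scaling/rotation
b = np.array([0.5, -1.0])           # bias: translation
relu = lambda z: np.maximum(z, 0.0) # nonlinearity: the "bending"

x = np.array([1.0, 1.0])
y = relu(W @ x + b)                 # y = a(W·x + b)
```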

Because it is desirable that the output of the deep neural network be as close as possible to the value actually desired, the weight vector of each layer can be updated by comparing the network's current predicted value with the desired target value and adjusting the weights according to the difference between them (there is usually an initialization process before the first update, i.e., parameters are pre-configured for each layer of the deep neural network). It is therefore necessary to define in advance how to measure the difference between the predicted value and the target value; this is the role of the loss function (loss function) or objective function (objective function), important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the deep neural network becomes the process of reducing the loss as much as possible.
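As a toy illustration of "reducing the loss": a single weight w in the model pred = w·x is repeatedly adjusted against the gradient of the mean-squared-error loss (the choice of MSE and the learning rate 0.1 are illustrative, not specified by the document):

```python
import numpy as np

def mse(pred, target):
    """Mean-squared-error loss: larger value means larger difference."""
    return float(np.mean((pred - target) ** 2))

x = np.array([1.0, 2.0])
target = np.array([2.0, 4.0])   # true relationship is target = 2 * x
w = 0.0
for _ in range(50):
    pred = w * x
    grad = np.mean(2 * (pred - target) * x)  # dL/dw for the MSE loss
    w -= 0.1 * grad                          # step against the gradient
final_loss = mse(w * x, target)
```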

The target models/rules obtained by the training device 220 may be applied in different systems or devices. In FIG. 2, the execution device 210 is configured with an I/O interface 212 to interact with data from an external device, and a "user" may input data to the I/O interface 212 via a client device 240.

The execution device 210 may call data, code, etc. from the data storage system 250 and may store data, instructions, etc. in the data storage system 250.

The calculation module 211 performs image recognition processing or recommendation on the input data using the target model/rule 201.

The correlation function 213 is configured to extract features of the received data and perform a normalization operation.

The correlation function module 214 is configured to process the result output by the calculation module.

Finally, the I/O interface 212 returns the results of the processing to the client device 240 for presentation to the user.

Further, the training device 220 may generate corresponding target models/rules 201 based on different data for different targets to provide better results to the user.

In the case shown in FIG. 2, the user may manually specify data to be input into the execution device 210, for example by operating in an interface provided by the I/O interface 212. Alternatively, the client device 240 may automatically enter data into the I/O interface 212 and obtain the results; if such automatic data entry requires the user's authorization, the user may set the corresponding permissions in the client device 240. The user can view the result output by the execution device 210 at the client device 240, where the specific presentation form may be display, sound, action, and the like. The client device 240 may also serve as a data collection end, storing the collected image data in the database 230.

It should be noted that fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the position relationship between the devices, modules, etc. shown in the diagram does not constitute any limitation, for example, in fig. 2, the data storage system 250 is an external memory with respect to the execution device 210, and in other cases, the data storage system 250 may also be disposed in the execution device 210.

The following is presented by taking convolutional neural network as an example:

a Convolutional Neural Network (CNN) is a deep neural network with a Convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to learning at multiple levels in different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons respond to overlapping regions in an image input thereto.

As shown in fig. 3, Convolutional Neural Network (CNN)100 may include an input layer 110, a convolutional/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.

Convolutional layer/pooling layer 120:

Convolutional layers:

as shown in FIG. 3, convolutional layer/pooling layer 120 may include, for example, 121-126 layers, in one implementation, 121 layers are convolutional layers, 122 layers are pooling layers, 123 layers are convolutional layers, 124 layers are pooling layers, 125 layers are convolutional layers, and 126 layers are pooling layers; in another implementation, 121, 122 are convolutional layers, 123 are pooling layers, 124, 125 are convolutional layers, and 126 are pooling layers. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.

Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels, whose role in image processing is to act as filters extracting specific information from the input image matrix. A convolution operator is essentially a weight matrix, usually predefined; during the convolution operation on an image, the weight matrix is typically slid over the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride) in the horizontal direction, so as to complete the task of extracting a specific feature from the image.

The size of the weight matrix should be related to the size of the image, and it should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix will produce a single depth dimension of the convolved output, but in most cases not a single weight matrix is used, but a plurality of weight matrices of the same dimension are applied. The outputs of each weight matrix are stacked to form the depth dimension of the convolved image.
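This can be checked with a small example: sliding two 2×2 weight matrices over one 4×4 input (stride 1, no padding) yields an output whose depth dimension equals the number of kernels. Plain NumPy, no framework assumed; as in common CNN usage, "convolution" here is implemented as cross-correlation:

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Valid cross-correlation of a 2-D image with one kernel."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            patch = image[r*stride:r*stride+kh, c*stride:c*stride+kw]
            out[r, c] = np.sum(patch * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernels = [np.eye(2), np.ones((2, 2))]   # two weight matrices of the same dimension
# Stacking each kernel's output forms the depth dimension of the result.
feature_maps = np.stack([conv2d_single(image, k) for k in kernels], axis=-1)
```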

Different weight matrixes can be used for extracting different features in an image, for example, one weight matrix is used for extracting image edge information, another weight matrix is used for extracting specific colors of the image, and another weight matrix is used for blurring unwanted noise points in the image.

The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 to make correct prediction.

When convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (e.g., 121) tends to extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network 100 increases, the more convolutional layers (e.g., 126) that go further back extract more complex features, such as features with high levels of semantics, the more highly semantic features are more suitable for the problem to be solved.

A pooling layer:

Since it is often necessary to reduce the number of training parameters, a pooling layer is often periodically introduced after a convolutional layer. In the layers 121-126 illustrated by 120 in fig. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for downsampling the input image to a smaller size. The average pooling operator computes the average of the pixel values within a particular range, and the maximum pooling operator takes the pixel with the largest value within a particular range as the pooling result.

In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output by the pooling layer may be smaller than that of the input image, and each pixel in the output image represents the average or maximum value of a corresponding sub-region of the input image.
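A minimal sketch of the two pooling operators described above, assuming non-overlapping windows whose size evenly divides the image:

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Reduce the spatial size: each output pixel is the max (or mean)
    of a size x size sub-region of the input image."""
    h, w = image.shape
    out = np.zeros((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            region = image[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

img = np.arange(16.0).reshape(4, 4)
print(pool2d(img, 2, "max"))   # each output pixel: max of a 2x2 sub-region
print(pool2d(img, 2, "avg"))   # each output pixel: mean of a 2x2 sub-region
```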

The neural network layer 130:

After processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet ready to output the required output information, because, as described above, the convolutional layer/pooling layer 120 only extracts features and reduces the number of parameters brought by the input image. To generate the final output information (class information or other relevant information as needed), the convolutional neural network 100 uses the neural network layer 130 to generate one output or a set of outputs whose number equals the number of required classes.

Accordingly, a plurality of hidden layers (131, 132 to 13n shown in fig. 3) and an output layer 140 may be included in the neural network layer 130, and parameters included in the hidden layers may be pre-trained according to the related training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.

The last layer of the whole convolutional neural network 100, following the hidden layers in the neural network layer 130, is the output layer 140. The output layer 140 has a loss function similar to categorical cross-entropy and is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 100 (propagation from 110 to 140 in fig. 3) is completed, backward propagation (propagation from 140 to 110 in fig. 3) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.

It should be noted that the convolutional neural network 100 shown in fig. 3 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 4, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the overall neural network layer 130 for processing.

Fig. 5 is a block diagram of a neural network processor according to an embodiment of the present invention. The neural network processor (NPU) 50 is mounted on a main CPU (Host CPU) as a coprocessor, and tasks are allocated by the Host CPU. The core portion of the NPU is the arithmetic circuit 503; the controller 504 controls the arithmetic circuit 503 to extract data from a memory (weight memory or input memory) and perform operations.

In some implementations, the arithmetic circuit 503 internally includes a plurality of processing units (PEs). In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.

For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory 501, performs a matrix operation with matrix B, and stores the partial or final results of the resulting matrix in the accumulator 508.

The vector calculation unit 507 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 507 may be used for network calculation of non-convolution/non-FC layers in a neural network, such as Pooling (Pooling), Batch Normalization (Batch Normalization), Local Response Normalization (Local Response Normalization), and the like.

In some implementations, the vector calculation unit 507 stores the processed output vector to the unified buffer 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 503, for example for use in subsequent layers in a neural network.

The unified memory 506 is used to store input data as well as output data.

A Direct Memory Access Controller (DMAC) 505 transfers input data in the external memory to the input memory 501 and/or the unified memory 506, stores weight data from the external memory into the weight memory 502, and stores data in the unified memory 506 into the external memory.

A Bus Interface Unit (BIU) 510, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 509 through a Bus.

An instruction fetch buffer 509 connected to the controller 504 for storing instructions used by the controller 504;

the controller 504 is configured to call the instructions cached in the instruction fetch memory 509 to control the working process of the operation accelerator.

Generally, the unified Memory 506, the input Memory 501, the weight Memory 502, and the instruction fetch Memory 509 are On-Chip memories, the external Memory is a Memory outside the NPU, and the external Memory may be a Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable memories.

It should be noted that the hyper-parameters in the present solution are used to adjust the whole network model training process, such as the number of hidden layers of the neural network and the size and number of kernels. The hyper-parameters are configuration variables and do not directly participate in the training process.

The hyper-parameter may be any of:

1) optimizer algorithm (optimizer)

Refers to a machine learning algorithm that updates network weights. Such as a Stochastic Gradient Descent (SGD) algorithm.

2) Learning rate

Refers to the magnitude by which the parameters are updated at each iteration of the optimization algorithm, also called the step size. When the step size is too large, the algorithm does not converge and the objective function oscillates; when the step size is too small, the model converges too slowly.
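The effect of the step size can be seen on a toy one-dimensional objective f(x) = x squared, whose gradient is 2x; the objective, the starting point, and the step counts below are illustrative assumptions, not part of the described scheme:

```python
def gradient_descent(lr, steps=50, x0=5.0):
    """Minimize f(x) = x^2 (gradient 2x) with a fixed step size lr."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

# small step: convergence toward 0, but slow
print(abs(gradient_descent(lr=0.01)))
# well-chosen step: converges immediately for this toy objective
print(abs(gradient_descent(lr=0.5)))
# too-large step: the iterate oscillates in sign and diverges
print(abs(gradient_descent(lr=1.1)))
```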

3) Activating a function

Refers to the nonlinear function applied at each neuron, which is the key to the nonlinearity of the neural network. Commonly used activation functions include sigmoid, relu, and tanh.

Loss function: the objective function of the optimization process; the smaller the loss, the better. The training process is the process of minimizing the loss function.

Commonly used loss functions are logarithmic loss functions, quadratic loss functions, exponential loss functions, and the like.
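For reference, the three loss families named above can be sketched for a single example; the binary-label conventions below are illustrative assumptions:

```python
import math

def log_loss(y_true, p):
    """Logarithmic (cross-entropy) loss for a binary label y_true in {0, 1}
    and a predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def quadratic_loss(y_true, y_pred):
    """Quadratic (squared-error) loss."""
    return 0.5 * (y_true - y_pred) ** 2

def exponential_loss(y_true, score):
    """Exponential loss (as in AdaBoost), with y_true in {-1, +1}."""
    return math.exp(-y_true * score)

print(round(log_loss(1, 0.9), 4))          # small loss: confident, correct
print(quadratic_loss(3.0, 2.0))            # 0.5 * (3 - 2)^2
print(round(exponential_loss(1, 2.0), 4))  # decays as the margin grows
```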

The hyper-parameters in the embodiment of the present application may also be:

4) batch size (batch size)

Refers to the amount of data used per gradient descent update.

5) Dropout rate (dropout rate)

Refers to the fraction of network weights randomly masked during model training.

6) Weight attenuation coefficient (weight decay coeffient)

Refers to the regularization term coefficients of the network weights.

7) Coefficient of momentum (momentum coefficient)

Refers to the momentum term coefficient when the gradient is decreased and the network weight is updated.

The present disclosure is only described by way of example, and other hyper-parameters may also be used, which is not limited in the present disclosure.
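Several of the hyper-parameters above (the momentum coefficient, the weight-decay coefficient, and the dropout rate) appear together in a single SGD-style update. The following is a hedged sketch of one such update, not the scheme's prescribed optimizer; all numeric values are illustrative:

```python
import numpy as np

def sgd_step(w, grad, velocity, lr=0.1, momentum=0.9, weight_decay=1e-4):
    """One SGD update combining the momentum coefficient (velocity term)
    and the weight-decay coefficient (L2 regularization term on w)."""
    grad = grad + weight_decay * w             # regularization term coefficient
    velocity = momentum * velocity - lr * grad  # momentum term coefficient
    return w + velocity, velocity

def apply_dropout(activations, rate, rng):
    """Randomly mask units with probability `rate` during training
    (inverted dropout: surviving units are rescaled by 1/(1-rate))."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

w, v = np.array([1.0, -2.0]), np.zeros(2)
w, v = sgd_step(w, grad=np.array([0.5, -0.5]), velocity=v)
print(w)  # weights moved opposite the (decayed) gradient
```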

Fig. 6 is a schematic diagram of a model training architecture according to an embodiment of the present disclosure. The model training proceeds through Z training stages based on M models, finally obtaining the target model. Each of the M models is trained in the Z training stages.

Wherein each training phase comprises at least two training processes. A training procedure here may be understood as the model traversing a sample set once.

As an alternative implementation, the M models are identical in scale and structure.

The following describes in detail the model training method provided in the embodiments of the present application.

The execution subject of the embodiment of the present application may be a server or the like.

Fig. 7 is a schematic flow chart of a model training method according to an embodiment of the present disclosure. The method comprises steps 701-704, which are as follows:

701. Determine a reference hyper-parameter for the i-th training stage according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage, the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, and the number of training processes T_i of the i-th training stage, where M, i, and T_i are integers not less than 2;

Each of the M models P_{i-2}' undergoes T_{i-1} training processes in the (i-1)-th training stage. Each training process may be understood as one traversal of the sample set by the model, and each training process yields one model performance score.

Specifically, in one training process, a new model is obtained after model parameters such as weights are updated using the training data set, and a model performance score can be obtained by computing the performance score of the new model on a validation data set, where the validation data set is a data set for evaluating model performance.

As an alternative implementation, determining the reference hyper-parameter for the i-th training stage according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage, the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, and the number of training processes T_i of the i-th training stage includes:

obtaining a model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage;

processing the model performance estimation function according to the number of training processes T_i of the i-th training stage and the model performance score obtained by at least one of the M models P_{i-2}' in at least one training process of the (i-1)-th training stage, to obtain the reference hyper-parameter for the i-th training stage.

By performing machine learning on the model performance scores obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage and on the hyper-parameters of each of the M models P_{i-2}' in the (i-1)-th training stage, that is, by continually learning the model performance score of the training process Δt processes later from the model performance score obtained in any training process of the (i-1)-th training stage and the corresponding hyper-parameter, a model performance estimation function f(y_t, Δt, A) is obtained.

Here, y_t is a model performance score; Δt is a positive integer representing the interval between the training process corresponding to y_t and the training process whose model performance score is being predicted; and A is the hyper-parameter of the training stage corresponding to y_t.

Specifically, refer to the learning process diagram shown in fig. 8, which illustrates an example with 6 training processes in the (i-1)-th training stage. Learning is performed on the processes at intervals of 1, 2, 3, 4, and 5 training processes from the first training process, and likewise on the processes at intervals of 1, 2, 3, and 4 training processes from the second training process.

Similarly, learning is performed on the processes at intervals of 1, 2, and 3 training processes from the third training process (not shown), at intervals of 1 and 2 training processes from the fourth training process (not shown), and at an interval of 1 training process from the fifth training process (not shown).

By learning the corresponding information of each model in the (i-1)-th training stage, the model performance score of the training process Δt processes later can be predicted from different performance scores under different hyper-parameters. This approach improves computational efficiency.
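The interval-based learning of fig. 8 amounts to collecting, from one stage's score sequence, every (y_t, Δt, A) to y_{t+Δt} training example. A sketch, assuming one scalar hyper-parameter per stage and counting only within-stage intervals (the counting convention for cross-stage intervals may differ):

```python
def build_training_pairs(scores, hyperparam):
    """From one model's per-process scores within a stage, emit every
    ((y_t, dt, A), y_{t+dt}) example used to learn f(y_t, dt, A)."""
    examples = []
    T = len(scores)
    for t in range(T):
        for dt in range(1, T - t):
            examples.append(((scores[t], dt, hyperparam), scores[t + dt]))
    return examples

scores = [0.50, 0.58, 0.63, 0.66, 0.68, 0.69]   # 6 training processes
pairs = build_training_pairs(scores, hyperparam=0.01)
print(len(pairs))   # 5 + 4 + 3 + 2 + 1 = 15 within-stage examples
```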

Processing the model performance estimation function may be solving for the extremum of the model performance estimation function to obtain the reference hyper-parameter A*, which may be done as follows:

A* = argmax_A f(y_t, Δt, A);

This indicates that, for given y_t and Δt, the value of A that maximizes f is the reference hyper-parameter A*.
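With a fitted estimation function in hand, the argmax can be approximated by evaluating f over a candidate set of hyper-parameters; the estimator f below is a toy stand-in for the learned function, and the candidate values are illustrative assumptions:

```python
def select_reference_hyperparam(f, y_t, dt, candidates):
    """A* = argmax_A f(y_t, dt, A): pick the candidate hyper-parameter
    that maximizes the estimated future performance score."""
    return max(candidates, key=lambda A: f(y_t, dt, A))

# Toy estimator (an assumption for illustration): the predicted score
# improves most for a learning rate near 0.1.
f = lambda y_t, dt, A: y_t + dt * 0.01 * (1 - abs(A - 0.1))

best = select_reference_hyperparam(f, y_t=0.7, dt=5, candidates=[0.01, 0.1, 0.5])
print(best)  # 0.1
```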

After the model performance estimation function is obtained, an extremum solution of the model performance estimation function is carried out based on the model performance score obtained by at least one of the M models P_{i-2}' in at least one training process of the (i-1)-th training stage and the number of training processes T_i of the i-th training stage, to obtain the hyper-parameter for which the model performance score obtained in the last training process of the i-th training stage is highest.

The at least one of the M models P_{i-2}' may be the model among the M models P_{i-2}' with the highest model performance score in the last training process of the (i-1)-th training stage. Optionally, it may be randomly selected from the models whose model performance score in the last training process exceeds a preset threshold, or determined in other selection manners, which is not specifically limited in this embodiment.

Obtaining the hyper-parameter for which the model performance score in the last training process of the i-th training stage is highest may be based on the model performance score obtained in a single training process, or on the model performance scores obtained in multiple training processes. The present solution is not particularly limited in this respect.

Specifically, based on the model performance score obtained by each of the M models P_{i-2}' in the last training process of the (i-1)-th training stage, the model with the highest model performance score is obtained; the reference hyper-parameter is then obtained based on this model and the number of training processes T_i of the i-th training stage.

This embodiment only takes the model with the highest model performance score as an example; the model may be selected in any other preset manner, for example, randomly from the four models with the highest model performance scores, and the present solution is not limited in this respect.

The above description takes the model performance score obtained in the last training process of a model in the (i-1)-th training stage as an example, but the reference hyper-parameter may also be calculated from the model performance score obtained in any training process of the (i-1)-th training stage other than the last one, which is not specifically limited in this embodiment.

The reference hyper-parameter can also be calculated based on the model performance scores obtained by multiple models in any training process of the (i-1)-th training stage; for example, the hyper-parameters obtained from the models can be assigned different weights and combined to obtain the reference hyper-parameter.

The above is only described by taking an example of obtaining the reference hyper-parameter through processing based on a model performance estimation function, and the reference hyper-parameter may also be obtained through processing based on a performance estimation model, which is not specifically limited in this embodiment.

It should be noted that, the reference hyper-parameter in the ith training phase determined by the present solution may be one or multiple, and the present solution is not particularly limited to this.

As an alternative implementation, when i is not less than 3, obtaining the model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage may include:

obtaining the model performance estimation function according to the model performance score obtained by each of the M models P_{i-3}' in the last training process of the (i-2)-th training stage, the hyper-parameter of each of the M models P_{i-3}' in the (i-2)-th training stage, the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage, and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, where the M models P_{i-2}' are obtained by processing based on the M models P_{i-3}' and correspond one to one to the M models P_{i-3}'.

That is, the model performance estimation function is obtained by learning based on the model performance score obtained from the last training process in the previous training stage, the hyper-parameter in the previous training stage, the model performance score obtained from each training process in the current training stage, and the hyper-parameter in the current training stage.

Fig. 9 is a schematic diagram of a learning process provided in an embodiment of the present application. Fig. 9 illustrates an example with 8 training processes in the (i-2)-th training stage and 6 training processes in the (i-1)-th training stage. Learning is performed from the last training process of the (i-2)-th training stage on the processes at intervals of 1 training process (i.e., the 1st training process of the (i-1)-th training stage), 2 training processes (i.e., the 2nd training process of the (i-1)-th training stage), 3, 4, 5, and 6 training processes, and likewise from the 1st training process of the (i-1)-th training stage on the processes at intervals of 1, 2, 3, 4, and 5 training processes.

Correspondingly, learning is performed from the 2nd training process of the (i-1)-th training stage on the processes at intervals of 1, 2, 3, and 4 training processes (not shown), from the 3rd training process at intervals of 1, 2, and 3 training processes (not shown), and so on.

By learning the corresponding information of each model, the model performance scores of training processes with the interval delta t can be predicted based on different performance scores under different hyper-parameters.

According to the method and the device, the model performance score obtained by the last training process in the previous training stage, the hyper-parameter in the previous training stage, the model performance score obtained by each training process in the current training stage and the hyper-parameter in the current training stage are learned, so that the cross-stage model performance estimation can be performed, and the accuracy of the model performance estimation is improved.

The above embodiment obtains the model performance estimation function using the model performance score obtained by each of the M models P_{i-3}' in the last training process of the (i-2)-th training stage.

Optionally, the model performance score obtained by each of the M models P_{i-3}' in each training process of the (i-2)-th training stage may also be used to obtain the model performance estimation function. The present solution is not particularly limited in this respect.

That is, the model performance estimation function may be derived based on learning of the first two training phases.

As another alternative implementation, obtaining the model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage includes:

obtaining the model performance estimation function according to the model performance score obtained by each of the M models P_0 in each training process of the first i-1 training stages and the hyper-parameter of each of the M models P_0 in each of the first i-1 training stages, where the M models P_0 are the initial models, i.e., the models corresponding to i = 2.

Specifically, by learning the respective model performance scores and hyper-parameters of the first i-1 training stages, the reference hyper-parameter for the i-th training stage can then be obtained based on the model performance score obtained by each of the M models P_{i-2}' in the last training process of the (i-1)-th training stage and the number of training processes of the i-th training stage.

According to the embodiment of the application, the reference hyperparameters in the ith training stage are obtained by learning the performance scores and the hyperparameters of the models in the first i-1 training stages. By adopting the method, the accuracy of model performance estimation is improved by learning a large amount of data, so that the reliability of selecting the hyper-parameters is improved, and the efficiency of obtaining the model with better performance is improved.

The above description is only given in several different implementation manners, wherein the learning may be performed based on each model performance score and each hyper-parameter of any several training stages, so as to obtain a reference hyper-parameter in the ith training stage. The present solution is not particularly limited to this.

It should be noted that, when determining the reference hyperparameter, the embodiment of the present application may directly predict the model performance scores of the training procedures at intervals Δ t based on one of the model performance scores. It may also be that intermediate model performance scores for the training routines at intervals of Δ t' are predicted based on one of the model performance scores, and a final model performance score is predicted based on the intermediate model performance scores.

For example, the model performance score of the 10 th training process of the next training stage is predicted based on the 3 rd model performance score of the current training stage, the model performance score of the 2 nd training process of the next training stage may be predicted based on the 3 rd model performance score of the current training stage, and the model performance score of the 10 th training process may be predicted based on the model performance score of the 2 nd training process.

That is, the scheme may perform learning based on multiple intermediate predictions, and then obtain a model performance estimation function. The above description is only an embodiment, and other forms are also possible, and the present disclosure is not limited to this specifically.

As an optional implementation manner, after obtaining the model performance estimation function, the method further includes:

processing the model performance estimation function according to N length scales to obtain N processed model performance estimation functions, where N is an integer not less than 2;

processing each of the N processed model performance estimation functions according to the number of training processes T_i of the i-th training stage and the model performance score obtained by at least one of the M models P_{i-2}' in at least one training process of the (i-1)-th training stage, to obtain N initial hyper-parameters;

and processing the N initial hyper-parameters to obtain the reference hyper-parameter in the ith training stage.

Here, length scale refers to the length-scale term in the Gaussian kernel, used to control the degree of correlation between two quantities. The N length scales may be obtained, for example, by dividing the range of 0.5 to 1.5 into N values.

Processing the model performance estimation function according to the N length scales may be a Gaussian process fit of the model performance estimation function using Gaussian kernels of different length scales. Each length scale corresponds to one fitted model performance estimation function, so fitting with the different length scales yields N processed model performance estimation functions.

Then the extremum solution described above is carried out for each of the N processed model performance estimation functions to obtain N initial hyper-parameters; multi-centroid clustering is performed on these hyper-parameters, and the hyper-parameter located at the center is selected as the reference hyper-parameter, so as to adjust the network training of the next training stage.

Of course, the hyper-parameters corresponding to the plurality of centers formed by clustering may all be used as reference hyper-parameters for the i-th training stage. That is, there may be one or more reference hyper-parameters, and the present solution is not particularly limited in this respect.

By searching for an appropriate hyper-parameter across multiple length scales in this way, the stability of the model can be improved.
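The length-scale ensemble can be sketched as follows: solve the extremum once per length scale to get N initial hyper-parameters, then take the one nearest the cluster center. The single-centroid simplification (using the mean instead of multi-centroid clustering) and the toy extremum solver are assumptions for illustration:

```python
import numpy as np

def reference_from_length_scales(solve_extremum, length_scales):
    """One initial hyper-parameter per length scale; return the one
    closest to the cluster center (here simplified to the mean)."""
    initial = np.array([solve_extremum(ls) for ls in length_scales])
    center = initial.mean()
    return initial[np.argmin(np.abs(initial - center))]

# Toy extremum solver (assumption): the optimum drifts slightly with
# the length scale used in the Gaussian process fit.
solve = lambda ls: 0.1 + 0.01 * (ls - 1.0)

scales = np.linspace(0.5, 1.5, 5)   # N = 5 length scales in [0.5, 1.5]
print(reference_from_length_scales(solve, scales))
```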

As an optional implementation manner, before step 701, the method further includes:

obtaining the data amount of the first i-1 training stages according to the number of training processes T_j of each of the first i-1 training stages;

and confirming that the data volume of the first i-1 training stages does not exceed a preset value.

The data amount represents the sum, over each of the first i-1 training stages, of the number of training data items used to obtain the model performance scores of training processes at intervals of Δt based on any one model performance score within that stage.

For example, if there are 4 training processes in the 3rd training stage, the number of training data items corresponding to that stage is 4 × 5 / 2 = 10. The data amount is obtained by summing the number of training data items of each training stage.
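Following the worked example above, the per-stage count is T_j(T_j + 1)/2, and the total data amount is the sum over stages:

```python
def data_amount(training_counts):
    """Sum over stages of T_j * (T_j + 1) / 2, the number of training
    data items collectable in a stage of T_j training processes."""
    return sum(t * (t + 1) // 2 for t in training_counts)

print(data_amount([4]))        # 4 * 5 / 2 = 10, matching the example
print(data_amount([4, 6, 8]))  # 10 + 21 + 36 = 67
```

A preset limit would then simply be checked as `data_amount(counts) <= threshold` before step 701.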

If the preset value is not exceeded, step 701 is executed.

Optionally, the model performance score, the hyper-parameter, and the like are obtained each time and stored.

If the preset value is exceeded, the memory buffer storing the training data is reset and the hyper-parameters are initialized. Alternatively, only the most recent training data is retained and training continues.

By judging whether excessive training data are collected in the model training process in real time, corresponding processing is carried out when the data volume is excessive. By adopting the method, the calculation efficiency can be effectively improved.

702. Determine, according to the M models P_{i-1} and the reference hyper-parameter for the i-th training stage, the hyper-parameter of each of the M models P_{i-1}' in the i-th training stage;

where the M models P_{i-1} are the models obtained by the M models P_{i-2}' in the last training process of the (i-1)-th training stage; the M models P_{i-1}' are obtained by processing based on the M models P_{i-1}; the M models P_{i-1}' correspond one to one to the M models P_{i-1}; and the M models P_{i-1} correspond one to one to the M models P_{i-2}';

fig. 10 is a schematic diagram of a model training method according to an embodiment of the present application.

Fig. 10 illustrates one model s among the M models as an example. Model P_{i-3,s}' undergoes T_{i-2} training processes in the (i-2)-th training stage to obtain model P_{i-2,s}, which is processed to obtain model P_{i-2,s}'. Model P_{i-2,s}' undergoes T_{i-1} training processes in the (i-1)-th training stage to obtain model P_{i-1,s}, which is processed to obtain model P_{i-1,s}'. Model P_{i-1,s}' undergoes T_i training processes in the i-th training stage to obtain model P_{i,s}.

That is, model P_{i-2,s}' corresponds to model P_{i-1,s}, and model P_{i-1,s} corresponds to model P_{i-1,s}'.

As an optional implementation, the M models P_{i-1}' being obtained by processing based on the M models P_{i-1} can be understood as follows: among the M models P_{i-1}', the K models corresponding to the K first models are obtained by updating the parameters of the K first models with lower scores among the M models P_{i-1}; the M-K models corresponding to the M-K second models are obtained by keeping the parameters of the M-K second models with higher scores unchanged. That is, part of the models for the i-th training stage can be obtained by keeping the higher-scoring models unchanged.

Of course, the M models P_{i-1}' may also be obtained by other processing methods, which is not specifically limited in the present solution.

As an alternative implementation, determining, according to the M models P_{i-1} and the reference hyper-parameter, the hyper-parameter of each of the M models P_{i-1}' in the i-th training stage includes:

obtaining K first models and M-K second models according to the model performance scores of the M models P_{i-1}, wherein the K first models are models among the M models P_{i-1} whose model performance score is smaller than a first preset threshold, the M-K second models are models whose model performance score is not smaller than the first preset threshold, K is an integer not smaller than 1, and K is smaller than M;

updating the parameters of each of the K first models according to the parameters of the models whose model performance scores are larger than a second preset threshold, to obtain K updated first models, wherein the second preset threshold is not smaller than the first preset threshold;

determining the reference hyper-parameter in the i-th training stage as the hyper-parameter in the i-th training stage of each of the K updated first models; wherein the M models P_{i-1}' include the K updated first models and the M-K second models, and the hyper-parameter of each of the M-K second models in the i-th training stage is the same as its hyper-parameter in the (i-1)-th training stage.

That is, for the part of the M models P_{i-1} with poor performance scores (such as the above K first models), the reference hyper-parameter is determined as the hyper-parameter in the i-th training stage. Specifically, when there is only one reference hyper-parameter, it may be determined as the hyper-parameter of each model with a poor performance score. When there are multiple reference hyper-parameters, the hyper-parameter of each model with a poor performance score may be determined based on the ranking of performance scores. For example, the allocation may be random or sequential; the present solution does not specifically limit this.

For each model with a better performance score (such as the M-K second models), the hyper-parameter in the i-th training stage is consistent with its hyper-parameter in the (i-1)-th training stage.

Updating the parameters of each of the K first models according to the parameters of the models whose model performance scores are greater than the second preset threshold may be understood as determining the parameters of the models with poor performance scores based on the parameters of the models with good performance scores.

For example, the parameters of each model with a poor model performance score may be updated to the parameters of the model with the highest model performance score. Alternatively, the parameters of the model with the worst model performance score may be updated to the parameters of the model with the highest score, and the parameters of the model with the second-worst score may be updated to the parameters of the model with the second-highest score. The above are only examples; other forms are also possible, and the present solution does not specifically limit this.
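The update step described above can be sketched as follows. This is a minimal illustration only, assuming models are represented as dicts of parameters plus a single learning-rate hyper-parameter; the function name `exploit` and the dict layout are hypothetical, not part of the original scheme:

```python
def exploit(models, scores, ref_hyperparam, low_threshold, high_threshold):
    """Build the next-stage population: models scoring below low_threshold
    inherit parameters from the best model scoring above high_threshold and
    receive the reference hyper-parameter; the rest are kept unchanged."""
    # Candidate donor models: performance score above the second threshold.
    donors = [s for s in range(len(models)) if scores[s] > high_threshold]
    best = max(donors, key=lambda s: scores[s])
    new_models = []
    for s, model in enumerate(models):
        if scores[s] < low_threshold:
            # Poor model: copy the best model's parameters and take the
            # reference hyper-parameter for the next training stage.
            new_models.append({"params": dict(models[best]["params"]),
                               "lr": ref_hyperparam})
        else:
            # Good model: keep both parameters and hyper-parameter.
            new_models.append({"params": dict(model["params"]),
                               "lr": model["lr"]})
    return new_models

models = [{"params": {"w": 0.1}, "lr": 0.01},
          {"params": {"w": 0.9}, "lr": 0.10}]
scores = [0.3, 0.8]
updated = exploit(models, scores, ref_hyperparam=0.05,
                  low_threshold=0.5, high_threshold=0.5)
```

Here the low-scoring model 0 inherits model 1's parameters and the reference hyper-parameter, while model 1 is left untouched.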

Optionally, the second preset threshold may be the same as the first preset threshold, or the second preset threshold may be larger than the first preset threshold.

In the embodiment of the present application, the models are updated based on the model performance scores of the M models P_{i-1} to obtain the M models P_{i-1}' for training in the next training stage. Through this periodic model update, the parameters of models with poor performance scores are replaced by the parameters of models with high performance scores, the next training stage is carried out on well-performing models, and the efficiency of obtaining a well-performing model is improved.

It should be noted that, in the embodiment of the present application, the parameters of a model include the network weights of the model, and may also include other information, which the present solution does not limit.

703. In the i-th training stage, perform T_i training processes on each of the M models P_{i-1}' according to the hyper-parameter of each model in the i-th training stage, to obtain M models P_i and the model performance score obtained by each of the M models P_{i-1}' in each training process of the i-th training stage;

By determining the hyper-parameter of each of the M models P_{i-1}' for the i-th training stage, T_i training processes are then performed on each of the M models P_{i-1}'.

If the i-th training stage is not the last training stage, let i = i + 1 and repeat steps 701-703, that is, determine the hyper-parameters and the models for the next training stage and then perform training.

This repeats until the i-th training stage is the last training stage, at which point step 704 is entered.
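The control flow of steps 701-704 can be summarized as a loop over training stages. The sketch below is schematic only: `train_stage` and `next_generation` are hypothetical stand-ins for the training and model/hyper-parameter update operations described in the steps above, here reduced to trivial arithmetic just to exercise the loop:

```python
def run_stages(models, num_stages, train_stage, next_generation):
    """Alternate training stages (step 703) with model and hyper-parameter
    updates (steps 701-702); the final stage skips the update."""
    for i in range(1, num_stages + 1):
        models, scores = train_stage(models, stage=i)
        if i < num_stages:
            # Derive the population for the next stage from this
            # stage's performance scores.
            models = next_generation(models, scores)
    return models, scores

# Trivial stand-ins just to exercise the control flow.
def train_stage(models, stage):
    return models, [m + stage for m in models]

def next_generation(models, scores):
    return [m + 1 for m in models]

final_models, final_scores = run_stages([0, 1], num_stages=3,
                                        train_stage=train_stage,
                                        next_generation=next_generation)
```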

704. When the i-th training stage is the last training stage, determine the target model from the M models P_i based on the model performance scores of the M models P_i.

Specifically, after the last training process of the last training stage is completed, the finally obtained M models P_i are sorted based on the model performance score of each model, and the model with the highest model performance score is determined as the target model.

The embodiment of the present application is described by taking only the model with the highest model performance score as the target model; a model whose model performance score exceeds a preset value may also be determined as a target model. That is, there may be one or more target models, and the present solution does not specifically limit this.
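Both target-model selection variants just described can be sketched together. The helper name `select_targets` is hypothetical; it returns the single best model, or all models above a preset value:

```python
def select_targets(scores, threshold=None):
    """Return indices of target models: the single highest-scoring model,
    or every model whose score exceeds the given preset value."""
    if threshold is None:
        return [max(range(len(scores)), key=lambda s: scores[s])]
    return [s for s, v in enumerate(scores) if v > threshold]

# Final-stage scores of M = 4 models.
final_scores = [0.71, 0.88, 0.65, 0.84]
single = select_targets(final_scores)                 # best model only
several = select_targets(final_scores, threshold=0.8) # all above preset value
```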

According to the embodiment of the present application, the hyper-parameter of each of the M models in the i-th training stage is determined according to the model performance score obtained by each of the M models in each training process of the (i-1)-th training stage, the hyper-parameter of each model in the (i-1)-th training stage, and the number of training processes of the i-th training stage; the M models are then trained based on these hyper-parameters to obtain the target model. By adopting this method, the number of training processes of the i-th training stage is taken into account when determining the hyper-parameter of the i-th training stage, so that the hyper-parameter determination process is more comprehensive and accurate, a neural network with better performance is obtained, and model computation efficiency is improved.

The following introduces an example of the application of the model training method of the present scheme to an image processing scenario:

referring to fig. 11, a model training method applied to image recognition is provided in an embodiment of the present application. The training method comprises the following steps 1101-1108:

1101. acquiring an image classification sample set;

as an optional implementation, training data are collected for a supervised learning task, preprocessing operations such as normalization are performed on the data, and the data are organized into samples to obtain the image classification sample set.

It may also be in other ways to obtain the image classification sample set.

The image classification sample set comprises different types of data, such as a table picture, a chair picture, an airplane picture and the like.

1102. Initialize each of the M models to obtain M models P_0;

The initialization process may be setting an initial parameter and an initial hyper-parameter. The initial hyper-parameter is the hyper-parameter of the model in the first training phase.

The M models may be deep neural network models, or the like.
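The initialization of the population can be sketched as follows. This is a toy stand-in: real models would be deep neural networks, and the names `init_population` and `init_lr`, and the use of a single scalar weight, are illustrative assumptions only:

```python
import random

def init_population(m, init_lr=0.01, seed=0):
    """Create M models, each with initial parameters and an initial
    hyper-parameter (the hyper-parameter for the first training stage)."""
    rng = random.Random(seed)
    return [{"params": {"w": rng.uniform(-0.1, 0.1)},  # initial parameters
             "lr": init_lr}                            # initial hyper-parameter
            for _ in range(m)]

population = init_population(4)
```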

1103. According to the initial hyper-parameters of the M models P_0, perform T_1 training processes on the M models P_0 in the first training stage to obtain M models P_1 and the model performance score obtained by each of the M models P_0 in each training process of the first training stage;

i.e., when i is 1, the training in the first training phase is performed.

The hyper-parameter learning rate is taken as an example. The learning rate controls the speed at which the model parameters, i.e., the network weights, are updated. For example, when the learning rate of the i-th training stage is smaller than that of the (i-1)-th training stage, the network weights of the i-th training stage are updated more slowly.

In each training process of the i-th training stage, the model extracts a batch of image samples multiple times; each time, the loss value of the batch of image data is calculated according to a predefined image classification evaluation function (i.e., a loss function), the weight update direction of the image classification model on the batch of image data is calculated through an optimizer algorithm, and the model weights are updated in combination with the learning rate.
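To illustrate how the learning rate enters the weight update, a single optimizer step can be sketched with a toy scalar weight and a squared loss (this is not the actual image classification model; `sgd_step` is a hypothetical name for a plain gradient-descent step):

```python
def sgd_step(w, grad, lr):
    """One optimizer step: move the weight against the gradient,
    scaled by the learning rate."""
    return w - lr * grad

# Squared loss L(w) = (w - 3)^2 has gradient 2 * (w - 3).
w = 0.0
for _ in range(100):
    grad = 2 * (w - 3.0)
    w = sgd_step(w, grad, lr=0.1)
# Each step moves w toward the minimizer 3.0; a smaller lr would
# move it more slowly, as described for the learning rate above.
```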

And in each training process, traversing the image classification sample set once by each model.

1104. According to the model performance score obtained by each of the M models P_0 in each training process of the first training stage, the initial hyper-parameters of the M models P_0, and the number of training processes T_2 of the second training stage, obtain M models P_1' for the second training stage and the hyper-parameter of each of the M models P_1';

Specifically, from the model performance score obtained by each model in each training process of the first training stage and the hyper-parameter of each of the M models P_0 in the first training stage, the model performance scores obtained at intervals of Δt training processes under the action of a model's hyper-parameter can be learned and predicted from the specific model performance scores of any model, so that a model performance estimation function can be obtained.

Then, based on a specific model performance score (for example, the highest model performance score among the scores obtained by the M models P_0 in the last training process of the first training stage), the hyper-parameter of the corresponding model in the first training stage, and a specific Δt (such as the number of training processes T_2 of the second training stage), an extremum solution of the model performance estimation function can be computed, yielding the reference hyper-parameter corresponding to the highest model performance score obtainable in the last training process of the second training stage.

Based on the reference hyper-parameter and the model performance score obtained by each of the M models P_0 in the last training process of the first training stage, the M models P_1' for the second training stage and the hyper-parameter of each of the M models P_1' are obtained.
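The idea of solving a performance estimation function for its extremum can be illustrated with a simple stand-in. The sketch below fits a concave quadratic of the hyper-parameter to observed scores and takes its maximizer as the reference hyper-parameter; this drops the Δt dependence of the real estimation function, and the function name and quadratic form are illustrative assumptions:

```python
import numpy as np

def reference_hyperparam(hyperparams, scores):
    """Fit score as a quadratic function of the hyper-parameter and
    return the hyper-parameter at the fitted maximum."""
    a, b, c = np.polyfit(hyperparams, scores, 2)
    # For a concave parabola (a < 0) the maximum is at -b / (2a).
    assert a < 0, "expected a concave fit"
    return -b / (2 * a)

# Scores observed for three models trained at different learning rates.
lrs = [0.01, 0.1, 0.5]
obs = [0.6, 0.9, 0.4]
ref = reference_hyperparam(lrs, obs)
# ref lies between the observed learning rates, near the best-scoring one.
```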

The M models P_1' of the second training stage are obtained based on the M models P_1. Specifically, among the M models P_1, the parameters of the models with lower performance scores are updated to the parameters of the models with higher performance scores among the M models P_1.

Furthermore, both the parameters and the hyper-parameters of the models with higher performance scores remain unchanged, while the hyper-parameters of the models with lower performance scores are obtained according to the model performance estimation function.

The detailed processing procedure can refer to the description in step 702, and is not described herein again.

1105. According to the hyper-parameter of each of the M models P_1', train each of the M models P_1' in the second training stage to obtain M models P_2 and the model performance score obtained by each of the M models P_1' in each training process of the second training stage;

i.e., when i is 2, the training in the second training phase is performed.

1106. Determining whether the ith training phase is the last training phase;

at this time, it is determined whether the second training phase is the last training phase, and if not, step 1107 is performed; if yes, go to step 1108.

1107. Let i = i + 1, repeat steps 1104-1106, and perform training in the next training stage until the last training stage is reached.

1108. Determine the target model according to the model performance score obtained by each model P_{i-1}' in the last training process of the i-th training stage.

The target model may be the model with the highest model performance score among the scores obtained in the last training process of the i-th training stage.

It should be noted that, for the M models P_0 of the first training stage, P_0' is set to be the same as P_0.

The above specific implementation process may refer to the description of the foregoing embodiments, and is not described herein again.

Based on the model training method, a target model for image recognition can be obtained.

The following introduces the application of the scheme to a recommendation system:

the embodiment of the application provides a model training method. The training method is applied to a recommendation system and comprises the following steps of C1-C8:

C1, acquiring a recommendation data sample set;

the recommendation data sample set includes at least one of a user gender, a user age, and a user historical purchasing behavior.

C2, initializing each of the M models to obtain M models P_0;

C3, according to the initial hyper-parameters of the M models P_0, performing T_1 training processes on the M models P_0 in the first training stage to obtain M models P_1 and the model performance score obtained by each of the M models P_0 in each training process of the first training stage;

the description will be given by taking the example where the hyperparameter is a weighted attenuation coefficient. Wherein the weight attenuation coefficient controls the regularization term coefficient of the network weight. For example, when the weight attenuation coefficient of the ith training stage is greater than that of the (i-1) th training stage, the ith training stage has a more strict weight penalty, so that the weight of the ith training stage becomes smaller and has better generalization.

In each training process of the i-th training stage, the model extracts a batch of user-data samples multiple times; each time, the loss value of the batch of user data is calculated according to a predefined commodity recommendation evaluation function (i.e., a loss function), and the weights of the recommendation model are updated through an optimizer algorithm.

C4, according to the model performance score obtained by each of the M models P_0 in each training process of the first training stage, the initial hyper-parameters of the M models P_0, and the number of training processes T_2 of the second training stage, obtaining M models P_1' for the second training stage and the hyper-parameter of each of the M models P_1';

C5, training each of the M models P_1' in the second training stage according to the hyper-parameter of each of the M models P_1', to obtain M models P_2 and the model performance score obtained by each of the M models P_1' in each training process of the second training stage;

c6, determining whether the ith training stage is the last training stage;

at this time, it is determined whether the second training stage is the last training stage; if not, step C7 is performed; if yes, step C8 is performed.

C7, letting i = i + 1, repeating steps C4-C6, and performing training in the next training stage until the last training stage is reached.

C8, determining the target model according to the model performance score obtained by each model P_{i-1}' in the last training process of the i-th training stage.

The above specific implementation process may refer to the description of the foregoing embodiments, and is not described herein again.

Based on the model training method, a target model for commodity recommendation and the like can be obtained.

Referring to fig. 12, a model training apparatus provided in this embodiment of the present application includes a first determining module 1201, a second determining module 1202, a model training module 1203, and a model determining module 1204, which specifically includes the following:

a first determining module 1201, configured to determine a reference hyper-parameter in the i-th training stage according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage, the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage, and the number of training processes T_i of the i-th training stage, wherein M, i, and T_i are integers not less than 2;

a second determining module 1202, configured to determine, according to the M models P_{i-1} and the reference hyper-parameter in the i-th training stage, the hyper-parameter of each of the M models P_{i-1}' in the i-th training stage; wherein the M models P_{i-1} are obtained from the last training process of the M models P_{i-2}' in the (i-1)-th training stage, the M models P_{i-1}' are obtained by processing the M models P_{i-1}, the M models P_{i-1}' correspond one-to-one to the M models P_{i-1}, and the M models P_{i-1} correspond one-to-one to the M models P_{i-2}';

a model training module 1203, configured to, in the i-th training stage, perform T_i training processes on each of the M models P_{i-1}' according to the hyper-parameter of each model, to obtain M models P_i and the model performance score obtained by each of the M models P_{i-1}' in each training process of the i-th training stage;

a model determining module 1204, configured to, when the i-th training stage is the last training stage, determine the target model from the M models P_i based on the model performance scores of the M models P_i.

As an optional implementation manner, the first determining module 1201 is configured to:

determining a model performance estimation function according to the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage; and processing the model performance estimation function according to the number of training processes T_i of the i-th training stage and the model performance score obtained by at least one of the M models P_{i-2}' in at least one training process of the (i-1)-th training stage, to obtain the reference hyper-parameter in the i-th training stage.

As another optional implementation manner, when i is not less than 3, the first determining module 1201 is further configured to:

obtaining a model performance estimation function according to the model performance score obtained by each of the M models P_{i-3}' in the last training process of the (i-2)-th training stage, the hyper-parameter of each of the M models P_{i-3}' in the (i-2)-th training stage, the model performance score obtained by each of the M models P_{i-2}' in each training process of the (i-1)-th training stage, and the hyper-parameter of each of the M models P_{i-2}' in the (i-1)-th training stage; wherein the M models P_{i-2}' are obtained by processing based on the M models P_{i-3}', and the M models P_{i-2}' correspond one-to-one to the M models P_{i-3}'.

As yet another optional implementation manner, the first determining module 1201 is further configured to:

obtaining a model performance estimation function according to the model performance score obtained by each of the M models P_0 in each training process of the first i-1 training stages and the hyper-parameter of each of the M models P_0 in each of the first i-1 training stages, wherein the M models P_0 are the initial models.

The apparatus further comprises a processing module configured to:

processing the model performance estimation function according to the N length ranges to obtain N processed model performance estimation functions, wherein N is an integer not less than 2;

the first determining module 1201 is further configured to:

processing, according to the number of training processes T_i of the i-th training stage and the model performance score obtained by at least one of the M models P_{i-2}' in at least one training process of the (i-1)-th training stage, the N processed model performance estimation functions respectively, to obtain N initial hyper-parameters; and processing the N initial hyper-parameters to obtain the reference hyper-parameter in the i-th training stage.

As another optional implementation manner, the second determining module 1202 is configured to:

obtaining K first models and M-K second models according to the model performance scores of the M models P_{i-1}, wherein the K first models are models among the M models P_{i-1} whose model performance score is smaller than a first preset threshold, the M-K second models are models whose model performance score is not smaller than the first preset threshold, K is an integer not smaller than 1, and K is smaller than M; updating the parameters of each of the K first models according to the parameters of the models whose model performance scores are larger than a second preset threshold, to obtain K updated first models, wherein the second preset threshold is not smaller than the first preset threshold; and determining the reference hyper-parameter in the i-th training stage as the hyper-parameter in the i-th training stage of each of the K updated first models; wherein the M models P_{i-1}' include the K updated first models and the M-K second models, and the hyper-parameter of each of the M-K second models in the i-th training stage is the same as its hyper-parameter in the (i-1)-th training stage.

As an optional implementation manner, the apparatus further includes a confirmation module, configured to:

acquiring the data volume of the first i-1 training stages according to the number of training processes T_j of each of the first i-1 training stages;

and confirming that the data volume of the first i-1 training stages does not exceed a preset value.

Wherein the hyper-parameters comprise at least one of: learning rate, batch size, discard rate, weight decay coefficient, momentum coefficient.

Wherein the target model can be applied to an image processing system, a recommendation system, or the like.

It should be noted that the first determining module 1201, the second determining module 1202, the model training module 1203 and the model determining module 1204 shown in fig. 12 are used for executing relevant steps of the model training method.

For example, the first determining module 1201 is used to execute the relevant content of step 701, the second determining module 1202 is used to execute the relevant content of step 702, the model training module 1203 is used to execute the relevant content of step 703, and the model determining module 1204 is used to execute the relevant content of step 704.

In this embodiment, the model training apparatus is presented in the form of a module. A "module" herein may refer to an application-specific integrated circuit (ASIC), a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other devices that may provide the described functionality. Further, the above first determining module 1201, second determining module 1202, model training module 1203 and model determining module 1204 may be implemented by a processor 1302 of the model training apparatus shown in fig. 13.

Fig. 13 is a schematic hardware configuration diagram of another model training apparatus according to an embodiment of the present application. The model training apparatus 1300 shown in fig. 13 (the apparatus 1300 may be a computer device) includes a memory 1301, a processor 1302, a communication interface 1303, and a bus 1304. The memory 1301, the processor 1302, and the communication interface 1303 are communicatively connected to each other through a bus 1304.

The Memory 1301 may be a Read Only Memory (ROM), a static Memory device, a dynamic Memory device, or a Random Access Memory (RAM).

The memory 1301 may store a program, and when the program stored in the memory 1301 is executed by the processor 1302, the processor 1302 and the communication interface 1303 are configured to perform each step of the model training method according to the embodiment of the present application.

The processor 1302 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more Integrated circuits, and is configured to execute related programs to implement functions required to be executed by units in the model training apparatus according to the embodiment of the present disclosure, or to execute the model training method according to the embodiment of the present disclosure.

The processor 1302 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the model training method of the present application may be completed by integrated logic circuits of hardware in the processor 1302 or by instructions in the form of software. The processor 1302 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly embodied as being performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or registers. The storage medium is located in the memory 1301; the processor 1302 reads the information in the memory 1301 and, in combination with its hardware, completes the functions to be performed by the units included in the model training apparatus of the embodiment of the present application, or performs the model training method of the method embodiment of the present application.

Communication interface 1303 enables communication between apparatus 1300 and other devices or communication networks using transceiver means, such as, but not limited to, a transceiver. For example, data may be acquired through the communication interface 1303.

Bus 1304 may include pathways for communicating information between various components of device 1300, such as memory 1301, processor 1302, and communication interface 1303.

It should be noted that although the apparatus 1300 shown in fig. 13 shows only memories, processors, and communication interfaces, in a specific implementation, those skilled in the art will appreciate that the apparatus 1300 also includes other components necessary for normal operation. Also, those skilled in the art will appreciate that the apparatus 1300 may also include hardware components for performing other additional functions, according to particular needs. Furthermore, those skilled in the art will appreciate that apparatus 1300 may also include only those components necessary to implement embodiments of the present application, and need not include all of the components shown in FIG. 13.

The embodiment of the application also provides a chip system, and the chip system is applied to the electronic equipment; the chip system comprises one or more interface circuits, and one or more processors; the interface circuit and the processor are interconnected through a line; the interface circuit is to receive a signal from a memory of the electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory; the electronic device performs the method when the processor executes the computer instructions.

The embodiment of the application also provides a model training device, which comprises a processor and a memory; wherein the memory is configured to store program code and the processor is configured to invoke the program code to perform the model training method.

Embodiments of the present application also provide a computer-readable storage medium having stored therein instructions, which when executed on a computer or processor, cause the computer or processor to perform one or more steps of any one of the methods described above.

The embodiment of the application also provides a computer program product containing instructions. The computer program product, when run on a computer or processor, causes the computer or processor to perform one or more steps of any of the methods described above.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

It should be understood that in the description of the present application, unless otherwise indicated, "/" indicates a relationship where the objects associated before and after are an "or", e.g., a/B may indicate a or B; wherein A and B can be singular or plural. Also, in the description of the present application, "a plurality" means two or more than two unless otherwise specified. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple. In addition, in order to facilitate clear description of technical solutions of the embodiments of the present application, in the embodiments of the present application, terms such as "first" and "second" are used to distinguish the same items or similar items having substantially the same functions and actions. Those skilled in the art will appreciate that the terms "first," "second," etc. do not denote any order or quantity, nor do the terms "first," "second," etc. denote any order or importance. Also, in the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as examples, illustrations or illustrations. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion for ease of understanding.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the division of the unit is only one logical function division, and other division may be implemented in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. The shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.

Units described as separate parts may or may not be physically separate, and parts presented as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted via a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a read-only memory (ROM) or a random access memory (RAM); a magnetic medium, such as a floppy disk, hard disk, magnetic tape, or magnetic disk; an optical medium, such as a Digital Versatile Disc (DVD); or a semiconductor medium, such as a Solid State Disk (SSD).

The above description is only a specific implementation of the embodiments of the present application, but the protection scope of the embodiments of the present application is not limited thereto; any change or substitution within the technical scope disclosed in the embodiments of the present application shall be covered by the protection scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.
