Model deployment method and device
1. A method of model deployment, comprising:
acquiring historical access information of an algorithm model library and performance information of each algorithm model in the algorithm model library;
determining the types of target models to be deployed and the number of copies of each target model according to the historical access information and the performance information of each algorithm model;
determining the memory occupation amount of each target model;
and deploying the target models on the server cluster according to the memory occupation amount of each target model, the copy number of each target model and the total memory resources of the server cluster.
2. The method of claim 1, wherein determining the types of target models to be deployed and the number of copies of each target model according to the historical access information and the performance information of each algorithm model comprises:
screening, from the algorithm model library according to the historical access information, at least one category of target model whose access count within a preset period is greater than or equal to a preset number of times, and acquiring the load bearing capacity of each category of target model;
counting, according to the historical access information, the actual load that the server cluster needs to provide for each category of target model;
and determining the number of copies of each target model to be deployed according to the actual load and the load bearing capacity.
3. The method of claim 1, wherein said determining a memory footprint for each of said target models comprises:
acquiring the disk occupation amount and the preset memory expansion rate of each target model;
and determining the memory occupation amount of each target model according to the disk occupation amount and the memory expansion rate.
4. The method of claim 3, wherein the step of presetting the memory expansion rate comprises:
obtaining at least one sample algorithm model of each algorithm in the algorithm model library;
for each algorithm, starting a model deployment service on a preset server, and recording the first memory occupation amount of the preset server when no algorithm model is deployed;
loading the sample algorithm model of each algorithm on the preset server, and recording the second memory occupation amount of the preset server when the sample algorithm model is deployed;
and determining, for each algorithm, the difference obtained by subtracting the first memory occupation amount from the second memory occupation amount, and dividing the difference by the disk occupation amount of the sample algorithm model to obtain the memory expansion rate of the sample algorithm model.
5. The method of claim 4, wherein said obtaining at least one sample algorithm model for each algorithm in said library of algorithm models comprises:
training at least one sample algorithm model for each algorithm in the algorithm model library.
6. The method of claim 4, wherein said obtaining at least one sample algorithm model for each algorithm in said library of algorithm models comprises:
obtaining at least one sample algorithm model corresponding to each algorithm from the algorithm model library.
7. The method of claim 1, wherein said deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster comprises:
determining the total memory occupation amount when all the target models are deployed according to the memory occupation amount of each target model and the copy number of each target model;
determining whether the total memory occupation amount is greater than or equal to a total load threshold of the server cluster;
and when the total memory occupation amount is smaller than the total load threshold value of the server cluster, deploying all the target models into the server cluster according to the memory resource information of each node in the server cluster.
8. The method of claim 7, wherein said deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster comprises:
when the total memory occupation amount is greater than or equal to the total load threshold of the server cluster, sequentially removing the algorithm models with the lowest historical access frequency from all the target models until the total memory occupation amount of the remaining target models is smaller than the total load threshold, deploying the remaining target models into the server cluster according to the memory resource information of each node in the server cluster, and marking the removed algorithm models as high-priority deployment.
9. The method of claim 1, further comprising, after said deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster:
obtaining a model access request of the server cluster, the current total load capacity of the server cluster and the current load capacity of each node in the server cluster;
when the model to be accessed specified in the access request is not deployed in the server cluster and the current total load is smaller than the total load threshold of the server cluster, deploying the model to be accessed on the node with the minimum current memory occupancy rate in the server cluster.
10. The method of claim 9, further comprising, after said deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster:
when the model to be accessed specified in the access request is not deployed in the server cluster and the current total load is greater than or equal to the total load threshold value of the server cluster, unloading the algorithm model with the minimum access frequency deployed in the server cluster and deploying the model to be accessed to the corresponding node.
11. The method of claim 9, further comprising, after said deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster:
when the current load of a first node is greater than or equal to a load threshold of the first node and the current total load is smaller than the total load threshold of the server cluster, adjusting and deploying the algorithm model with the minimum memory footprint deployed on the first node to a second node, wherein the current load of the second node is smaller than the load threshold of the second node.
12. The method of claim 9, further comprising, after said deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster:
when the current total load amount is larger than or equal to a total load threshold value of the server cluster, and the current load amount of each node in the server cluster is larger than or equal to a load threshold value of the node, unloading the algorithm model with the minimum access frequency deployed in the server cluster, and marking the algorithm model with the minimum access frequency as high-priority deployment.
13. The method of claim 8 or 9, further comprising:
when a new node is added to the server cluster, deploying the algorithm model marked as high-priority deployment to the new node.
14. The method of claim 11, wherein said adjusting and deploying the algorithm model with the minimum memory footprint deployed on the first node to the second node comprises:
when a plurality of algorithm models with the minimum memory footprint exist, adjusting and deploying, among them, the algorithm model that consumes the least deployment time to the second node.
15. The method of claim 1, wherein said deploying the target model on the server cluster comprises:
when the file of the target model exists locally, reading the locally stored file of the target model for deployment.
16. An electronic device, comprising:
a memory to store a computer program;
a processor to execute the computer program to implement the method of any one of claims 1 to 15.
Background
Artificial Intelligence (AI) is the discipline that studies how to make computers simulate certain human mental processes and intelligent behaviors (such as learning, reasoning, thinking, and planning). In artificial intelligence, the corresponding functions are usually realized through various computer algorithms. To improve efficiency, an algorithm model, such as one based on deep learning, can be trained in advance for an algorithm and deployed on a server of a cloud platform, so that users can obtain the related algorithm model services. An algorithm model may be a mathematical model used to describe the objective world. In practical scenarios, in order to adapt to a variety of situations, one algorithm may often build one or more algorithm models.
As the field of artificial intelligence matures and its implementation scenarios multiply, the scale of many artificial intelligence cloud platforms is expanding day by day, and the number of algorithm models generated on these cloud platforms is also increasing rapidly. Moreover, as the number of algorithm model parameters gradually increases, the algorithm models used in the industry grow ever larger; that is, the resources (here, the CPU and memory resources of the server) occupied by each algorithm model also keep growing.
The rapid development of artificial intelligence algorithm models also brings various problems. Because a model must first be deployed in server memory before it can provide a service, deploying all the models of a cloud platform in server memory would occupy a large amount of memory; yet, owing to the characteristics of a cloud platform, hot-spot models account for only a small fraction of all models, so the utilization rate of server memory and CPU would be extremely low.
Disclosure of Invention
Embodiments of the present application aim to provide a model deployment method and device, so as to improve the utilization rate of server system resources during model deployment and to improve the efficiency of model deployment.
A first aspect of an embodiment of the present application provides a model deployment method, including: acquiring historical access information of an algorithm model library and performance information of each algorithm model in the algorithm model library; determining the type of the target model to be deployed and the copy number of each target model according to the historical access information and the performance information of each algorithm model; determining the memory occupation amount of each target model; and deploying the target models on the server cluster according to the memory occupation amount of each target model, the copy number of each target model and the total memory resources of the server cluster.
In an embodiment, the determining, according to the historical access information and the performance information of each algorithm model, the types of target models to be deployed and the number of copies of each target model includes: screening, from the algorithm model library according to the historical access information, at least one category of target model whose access count within a preset period is greater than or equal to a preset number of times, and acquiring the load bearing capacity of each category of target model; counting, according to the historical access information, the actual load that the server cluster needs to provide for each category of target model; and determining the number of copies of each target model to be deployed according to the actual load and the load bearing capacity.
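By way of illustration only (not part of the claimed method), the copy-count determination in this embodiment can be sketched in Python. The names `actual_load` and `bearing_capacity` are hypothetical stand-ins for the per-category statistics described above; the idea is simply that enough copies are deployed so that their combined bearing capacity covers the actual load:

```python
import math

def copies_needed(actual_load: float, bearing_capacity: float) -> int:
    """Smallest copy count such that copies * bearing_capacity >= actual_load.

    actual_load: load the server cluster must provide for this model category.
    bearing_capacity: load one deployed copy of the target model can bear.
    """
    if bearing_capacity <= 0:
        raise ValueError("bearing capacity must be positive")
    return max(1, math.ceil(actual_load / bearing_capacity))
```

For example, a category with an actual load of 250 requests/s served by copies bearing 100 requests/s each would need 3 copies.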
In an embodiment, the determining the memory footprint of each of the target models includes: acquiring the disk occupation amount and the preset memory expansion rate of each target model; and determining the memory occupation amount of each target model according to the disk occupation amount and the memory expansion rate.
In an embodiment, the step of presetting the memory expansion rate includes: obtaining at least one sample algorithm model of each algorithm in the algorithm model library; starting a model deployment service on a preset server, and recording a first memory occupation amount of the preset server when no algorithm model is deployed; loading the sample algorithm model of each algorithm on the preset server, and recording a second memory occupation amount of the preset server when the sample algorithm model is deployed; and determining the difference obtained by subtracting the first memory occupation amount from the second memory occupation amount, and dividing the difference by the disk occupation amount of the sample algorithm model to obtain the memory expansion rate of the sample algorithm model.
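An illustrative Python sketch of the measurement above (hypothetical function names; in practice the two memory amounts would be read from the preset server before and after loading the sample model). The expansion rate then lets the memory footprint of any model of the same algorithm be estimated from its disk size:

```python
def memory_expansion_rate(mem_before: float, mem_after: float,
                          disk_size: float) -> float:
    """(second memory occupation - first memory occupation) / disk occupation."""
    if disk_size <= 0:
        raise ValueError("disk occupation must be positive")
    return (mem_after - mem_before) / disk_size

def estimated_memory_footprint(disk_size: float, expansion_rate: float) -> float:
    """Predicted memory occupation of a model of the same algorithm."""
    return disk_size * expansion_rate
```

For instance, if loading a 256 MB (on disk) sample model raises server memory use from 1024 MB to 1536 MB, the expansion rate is 2.0, and a 100 MB model of the same algorithm is estimated to occupy 200 MB of memory.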
In an embodiment, the obtaining at least one sample algorithm model for each algorithm in the algorithm model library includes: training at least one sample algorithm model for each algorithm in the algorithm model library.
In an embodiment, the obtaining at least one sample algorithm model for each algorithm in the algorithm model library includes: obtaining at least one sample algorithm model corresponding to each algorithm from the algorithm model library.
In an embodiment, the deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster includes: determining the total memory occupation amount when all the target models are deployed, according to the memory occupation amount of each target model and the number of copies of each target model; determining whether the total memory occupation amount is greater than or equal to a total load threshold of the server cluster; and when the total memory occupation amount is smaller than the total load threshold of the server cluster, deploying all the target models into the server cluster according to the memory resource information of each node in the server cluster.
In an embodiment, the deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster includes: when the total memory occupation amount is greater than or equal to the total load threshold of the server cluster, sequentially removing the algorithm models with the lowest historical access frequency from all the target models until the total memory occupation amount of the remaining target models is smaller than the total load threshold, deploying the remaining target models into the server cluster according to the memory resource information of each node in the server cluster, and marking the removed algorithm models as high-priority deployment.
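A minimal, illustrative sketch of the eviction logic in this embodiment, with hypothetical field names: the least-accessed target models are removed one by one until the total memory occupation of the remaining models fits under the total load threshold, and the removed models are returned for high-priority marking:

```python
def fit_models_to_cluster(models, total_load_threshold):
    """models: list of dicts with 'name', 'memory' (per copy),
    'copies', and 'access_freq' (historical access frequency).
    Returns (remaining, evicted): the models to deploy and the
    removed models to be marked as high-priority deployment."""
    remaining = sorted(models, key=lambda m: m["access_freq"])  # least-accessed first
    evicted = []

    def total_memory(ms):
        return sum(m["memory"] * m["copies"] for m in ms)

    while remaining and total_memory(remaining) >= total_load_threshold:
        evicted.append(remaining.pop(0))  # drop the lowest-frequency model
    return remaining, evicted
```

For example, with a model of total footprint 20 accessed often and one of footprint 30 accessed rarely, against a threshold of 40, the rarely accessed model is evicted and the other is kept.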
In an embodiment, after the deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster, the method further includes: obtaining a model access request of the server cluster, the current total load capacity of the server cluster and the current load capacity of each node in the server cluster; when the model to be accessed specified in the access request is not deployed in the server cluster and the current total load is smaller than the total load threshold of the server cluster, deploying the model to be accessed on the node with the minimum current memory occupancy rate in the server cluster.
In an embodiment, after the deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster, the method further includes: when the model to be accessed specified in the access request is not deployed in the server cluster and the current total load amount is greater than or equal to the total load threshold value of the server cluster, unloading the algorithm model with the minimum access frequency deployed in the server cluster and deploying the model to be accessed to the corresponding node.
In an embodiment, after the deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster, the method further includes: when the current load of a first node is greater than or equal to a load threshold of the first node and the current total load is smaller than the total load threshold of the server cluster, adjusting and deploying the algorithm model with the minimum memory footprint deployed on the first node to a second node, wherein the current load of the second node is smaller than the load threshold of the second node.
In an embodiment, after the deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster, the method further includes: when the current total load amount is larger than or equal to a total load threshold value of the server cluster, and the current load amount of each node in the server cluster is larger than or equal to a load threshold value of the node, unloading the algorithm model with the minimum access frequency deployed in the server cluster, and marking the algorithm model with the minimum access frequency as high-priority deployment.
In one embodiment, the method further comprises: when a new node is added to the server cluster, deploying the algorithm model marked as high-priority deployment to the new node.
In an embodiment, the adjusting and deploying the algorithm model with the minimum memory footprint deployed on the first node to the second node includes: when a plurality of algorithm models with the minimum memory footprint exist, adjusting and deploying, among them, the algorithm model that consumes the least deployment time to the second node.
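The selection rule of this embodiment — choose the model with the smallest memory footprint, breaking ties by the least deployment time — can be sketched as follows (illustrative Python with hypothetical field names, not part of the claims):

```python
def pick_model_to_migrate(models_on_node):
    """models_on_node: non-empty list of dicts with 'memory'
    (memory footprint) and 'deploy_time' (time to deploy).
    Returns the model to move to the second node: minimum memory
    footprint, and among ties the least deployment time."""
    return min(models_on_node, key=lambda m: (m["memory"], m["deploy_time"]))
```

With two candidates tied at the minimum footprint, the one that is faster to redeploy is migrated, keeping the adjustment as cheap as possible.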
In an embodiment, the deploying the target model on the server cluster includes: when the file of the target model exists locally, reading the locally stored file of the target model for deployment.
A second aspect of the embodiments of the present application provides a model deployment apparatus, including: the first acquisition module is used for acquiring historical access information of an algorithm model library and performance information of each algorithm model in the algorithm model library; the determining module is used for determining the types of the target models to be deployed and the copy number of each target model according to the historical access information and the performance information of each algorithm model; the calculation module is used for determining the memory occupation amount of each target model; and the deployment module is used for deploying the target models on the server cluster according to the memory occupation amount of each target model, the copy number of each target model and the total memory resources of the server cluster.
In one embodiment, the determining module is configured to: screen, from the algorithm model library according to the historical access information, at least one category of target model whose access count within a preset period is greater than or equal to a preset number of times, and acquire the load bearing capacity of each category of target model; count, according to the historical access information, the actual load that the server cluster needs to provide for each category of target model; and determine the number of copies of each target model to be deployed according to the actual load and the load bearing capacity.
In one embodiment, the calculation module is configured to: acquiring the disk occupation amount and the preset memory expansion rate of each target model; and determining the memory occupation amount of each target model according to the disk occupation amount and the memory expansion rate.
In one embodiment, the system further comprises a presetting module for: obtaining at least one sample algorithm model of each algorithm in the algorithm model library; starting a model deployment service on a preset server, and recording a first memory occupation amount of the preset server when an algorithm model is not deployed; loading the sample algorithm model of each algorithm on the preset server, and recording the second memory occupation amount of the preset server when the sample algorithm model is deployed; and determining the difference value of the second memory occupation amount minus the first memory occupation amount, and dividing the difference value by the disk occupation amount of the sample algorithm model to obtain the memory expansion rate of the sample algorithm model.
In an embodiment, the obtaining at least one sample algorithm model for each algorithm in the algorithm model library includes: training at least one sample algorithm model for each algorithm in the algorithm model library.
In an embodiment, the obtaining at least one sample algorithm model for each algorithm in the algorithm model library includes: obtaining at least one sample algorithm model corresponding to each algorithm from the algorithm model library.
In one embodiment, the deployment module is configured to: determine the total memory occupation amount when all the target models are deployed, according to the memory occupation amount of each target model and the number of copies of each target model; determine whether the total memory occupation amount is greater than or equal to a total load threshold of the server cluster; and when the total memory occupation amount is smaller than the total load threshold of the server cluster, deploy all the target models into the server cluster according to the memory resource information of each node in the server cluster.
In one embodiment, the deployment module is configured to: when the total memory occupation amount is greater than or equal to the total load threshold of the server cluster, sequentially remove the algorithm models with the lowest historical access frequency from all the target models until the total memory occupation amount of the remaining target models is smaller than the total load threshold, deploy the remaining target models into the server cluster according to the memory resource information of each node in the server cluster, and mark the removed algorithm models as high-priority deployment.
In one embodiment, the method further comprises: a second obtaining module, configured to obtain, after the target models are deployed on the server cluster according to the memory occupancy of each target model, the number of copies of each target model, and the total memory resources of the server cluster, a model access request for the server cluster, a current total load amount of the server cluster, and a current load amount of each node in the server cluster; the deployment module is further to: when the model to be accessed specified in the access request is not deployed in the server cluster and the current total load is smaller than the total load threshold of the server cluster, deploying the model to be accessed on the node with the minimum current memory occupancy rate in the server cluster.
In an embodiment, after the deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster, the deploying module is further configured to: when the model to be accessed specified in the access request is not deployed in the server cluster and the current total load amount is greater than or equal to the total load threshold value of the server cluster, unloading the algorithm model with the minimum access frequency deployed in the server cluster and deploying the model to be accessed to the corresponding node.
In an embodiment, after the deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster, the deployment module is further configured to: when the current load of a first node is greater than or equal to a load threshold of the first node and the current total load is smaller than the total load threshold of the server cluster, adjust and deploy the algorithm model with the minimum memory footprint deployed on the first node to a second node, wherein the current load of the second node is smaller than the load threshold of the second node.
In an embodiment, after the deploying the target models on the server cluster according to the memory footprint of each of the target models, the number of copies of each of the target models, and the total memory resources of the server cluster, the deploying module is further configured to: when the current total load amount is larger than or equal to a total load threshold value of the server cluster, and the current load amount of each node in the server cluster is larger than or equal to a load threshold value of the node, unloading the algorithm model with the minimum access frequency deployed in the server cluster, and marking the algorithm model with the minimum access frequency as high-priority deployment.
In one embodiment, the deployment module is further configured to: when a new node is added to the server cluster, deploy the algorithm model marked as high-priority deployment to the new node.
In an embodiment, the adjusting and deploying the algorithm model with the minimum memory footprint deployed on the first node to the second node includes: when a plurality of algorithm models with the minimum memory footprint exist, adjusting and deploying, among them, the algorithm model that consumes the least deployment time to the second node.
In one embodiment, the deployment module is further configured to: when the file of the target model exists locally, read the locally stored file of the target model for deployment.
A third aspect of embodiments of the present application provides an electronic device, including: a memory to store a computer program; a processor configured to perform the method of the first aspect of the embodiments of the present application and any of the embodiments of the present application.
A fourth aspect of embodiments of the present application provides a non-transitory electronic-device-readable storage medium, including: a program which, when executed by an electronic device, causes the electronic device to perform the method of the first aspect of the embodiments of the present application and any embodiment thereof.
According to the model deployment method and device of the present application, the target models that need to be deployed in the server cluster, and the number of copies of each, are determined by analyzing the historical access information of the algorithm model library and the performance information of each algorithm model; the memory occupied by each target model is then determined, and a subset of the models is deployed into the memory of the server cluster in light of the system memory and CPU load conditions of the cluster. In this way, the actual usage of each model is fully considered and the limited resources are used reasonably in combination with the memory resources of the servers, thereby improving the utilization rate of server system resources during model deployment and improving the efficiency of model deployment.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be regarded as limiting the scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating a model deployment method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram illustrating a model deployment method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a model deployment apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the present application, the terms "first," "second," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, the present embodiment provides an electronic device 1 including: at least one processor 11 and a memory 12 (one processor is taken as an example in fig. 1). The processor 11 and the memory 12 are connected by a bus 10. The memory 12 stores instructions executable by the processor 11; when the instructions are executed by the processor 11, the electronic device 1 can perform all or part of the flow of the methods in the embodiments described below, so as to improve the utilization rate of server system resources during model deployment and improve the efficiency of model deployment.
In an embodiment, the electronic device 1 may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, or a mainframe computing system composed of multiple computers.
Please refer to fig. 2, which is a model deployment method according to an embodiment of the present application, and the method may be executed by the electronic device 1 shown in fig. 1, and may be applied to a scenario in which an algorithm model is deployed in a server cluster, so as to improve a utilization rate of server system resources during model deployment and improve efficiency of model deployment.
The method comprises the following steps:
step 201: and acquiring historical access information of the algorithm model library and performance information of each algorithm model in the algorithm model library.
In this step, an algorithm model may be a prediction model based on a neural network or deep learning, and the algorithm model library contains a plurality of algorithm models to be deployed. The historical access information at least includes a record of the historical accesses to each algorithm model in the model library, such as access counts, access times, and access events. Each algorithm model has its own performance information, which may include the size of the memory occupied by the model and the time required to deploy it. The performance information may be stored in the model library or separately in a performance information database. The historical access information and the performance information may be read from the corresponding databases.
In an embodiment, a model traffic statistics module may be added to the server cluster system to record the number of accesses of each model within a certain period, such as one hour, one day, or one week; the specific time granularity is configurable and may be changed according to actual service requirements. In this way, the historical access information of each algorithm model is collected.
Step 202: and determining the type of the target model to be deployed and the copy number of each target model according to the historical access information and the performance information of each algorithm model.
In this step, the types of target models to be deployed and the number of copies of each target model may be determined according to the historical access information and the performance information of each algorithm model acquired in step 201, together with a preset screening rule. For example, the screening rule may be: only an algorithm model whose access rate reaches a specific threshold is deployed. The screened target models thus reflect the actual usage of the models in real scenarios, which avoids blindly deploying rarely used models on the server cluster and wasting server resources. The actual performance information of each model is also taken into account when determining the final number of copies to deploy, which further improves the resource utilization of the servers.
In one embodiment, step 202 may comprise: and screening at least one type of target model of which the number of times of access is greater than or equal to the preset number of times within a preset time period from the algorithm model base according to the historical access information, and acquiring the load bearing capacity of the target model of each type. And according to the historical access information, counting the actual load quantity which needs to be provided by the server cluster aiming at the target model of each category. And determining the copy number of each target model to be deployed according to the actual load capacity and the load bearing capacity.
In this step, the preset time period may be one hour, one day, or one week, and the preset number of times may be 3 times, 10 times, or the like, as the case may be; alternatively, the preset time period and the preset number of times may be set based on big-data analysis of model usage. The algorithm models in the algorithm model library can be classified according to the historical access information, for example into three categories: must-deploy, deployable, and not-to-be-deployed (a fourth category of deployable-or-not may also be used), where the must-deploy models are the target models. When classifying, a model accessed within one hour, or accessed more than a specific number of times within one day, is classified as a target model that must be deployed, and a model with no access within three days is classified as a model that does not need to be deployed. Based on these rules, the target models can be screened from the algorithm model library. The load bearing capacity of each type of target model can be obtained from the performance information of the target model; it may be the maximum load a single target model can bear, can be measured by existing model stress-testing methods, and is stored in the performance information.
In an actual scenario, the number of copies of a target model is determined by the load bearing capacity of a single copy of the target model and the actual CPU load that the server cluster needs to provide for that class of target models. The actual load amount may be determined from the historical access information recorded by the traffic statistics module. For example, statistics such as the maximum load, the average load, and the 99th-percentile load (the load that is not exceeded 99% of the time and exceeded 1% of the time) provided by the target model's servers may be collected from the historical access information, and the actual load that the server cluster needs to provide for the target model may be obtained according to a certain policy (for example, taking the average or the harmonic mean of these load amounts). Then, for each target model, formula (2): number of copies = actual load amount / load bearing capacity, gives the corresponding number of copies.
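As a hedged illustration, the copy calculation described above can be sketched as follows. The function name and the ceiling rounding are assumptions not specified in the source; formula (2) itself only states the ratio:

```python
import math

def copy_count(actual_load: float, load_capacity: float) -> int:
    # Formula (2): number of copies = actual load amount / load bearing capacity.
    # Rounding up is an assumption here, so that the deployed copies can always
    # carry the measured load; `actual_load` may be the maximum, average, or
    # 99th-percentile load derived from the traffic statistics.
    if load_capacity <= 0:
        raise ValueError("load bearing capacity must be positive")
    return math.ceil(actual_load / load_capacity)
```

For example, an actual load of 250 requests/s against a per-copy capacity of 100 requests/s would yield 3 copies.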
Step 203: and determining the memory occupation amount of each target model.
In this step, the memory occupation amount refers to the amount of memory occupied on a server when a target model is deployed on that server. After the target models to be deployed and their copy numbers are obtained, it should also be considered that the memory resources of the server cluster may not be able to bear all the target models to be deployed; therefore, the memory occupation amount of each target model needs to be determined first.
In one embodiment, step 203 may comprise: and acquiring the disk occupation amount and the preset memory expansion rate of each target model. And determining the memory occupation amount of each target model according to the disk occupation amount and the memory expansion rate.
In practical scenarios, it is often difficult to calculate the memory occupation of an algorithm model exactly, so an estimation approach may be employed. The size of an algorithm model after being persisted (stored) to a file is relatively easy to obtain (for example, directly from the attributes of the stored model file). It is assumed that, for each algorithm, the ratio between the size of a model persisted to disk and its size in memory is fixed, with the model occupying more space in memory; this ratio is therefore called the expansion rate, and each algorithm corresponds to one expansion rate. This gives formula (1): memory occupation amount = disk occupation amount × expansion rate. After the disk occupation amount and the preset memory expansion rate of each target model are obtained, the memory occupation amount of each target model can be computed with formula (1).
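Formula (1) can be written as a one-line sketch (the function name is illustrative):

```python
def memory_occupation(disk_occupation: float, expansion_rate: float) -> float:
    # Formula (1): memory occupation amount = disk occupation amount x expansion rate.
    # Sizes here are in the same unit (e.g. gigabytes); the rate is dimensionless.
    return disk_occupation * expansion_rate
```

For instance, a model file of 0.5G with an expansion rate of 2.0 would be estimated to occupy 1.0G of memory.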
In one embodiment, the step of presetting the memory expansion ratio may include: at least one sample algorithm model for each algorithm in the algorithm model library is obtained. And starting the model deployment service on the preset server, and recording the first memory occupation amount of the preset server when the algorithm model is not deployed. And loading the sample algorithm model of each algorithm on the preset server, and recording the second memory occupation amount of the preset server when the sample algorithm model is deployed. And determining the difference value of the second memory occupation amount minus the first memory occupation amount, and dividing the difference value by the disk occupation amount of the sample algorithm model to obtain the memory expansion rate of the sample algorithm model.
In this embodiment, the algorithm model library includes one or more models of each algorithm, and in view of making reasonable use of resources, at least one sample algorithm model corresponding to each algorithm may be pulled directly from the algorithm model library. If, in an actual scenario, the models in the algorithm model library cannot meet the requirements, one or more sample algorithm models may be trained in advance for each algorithm (the more sample algorithm models, the more accurate the estimation); sample algorithm models obtained by dedicated training are highly controllable and therefore easily meet the accuracy requirement of the expansion rate. After the sample algorithm models are determined, the model deployment service is started and the first memory occupation amount of the server with no model deployed is recorded; for example, the service is deployed alone on a preset server and the memory occupation of the whole preset server is recorded. Then, for each algorithm, the sample algorithm models are loaded on the preset server and the second memory occupation amount of the preset server is recorded again. The difference obtained by subtracting the first memory occupation amount from the second memory occupation amount is the increase in memory occupation before and after loading, and dividing this difference by the disk occupation amount of the sample algorithm models yields the expansion rate corresponding to the algorithm.
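The measurement procedure above can be sketched as follows. The function names, and averaging over several sample models per algorithm, are illustrative assumptions:

```python
def measure_expansion_rate(first_mem: float, second_mem: float,
                           disk_occupation: float) -> float:
    # Expansion rate = (second occupation - first occupation) / disk occupation,
    # where the first value is taken with no model deployed and the second
    # after loading the sample algorithm model on the preset server.
    if disk_occupation <= 0:
        raise ValueError("disk occupation amount must be positive")
    return (second_mem - first_mem) / disk_occupation

def average_expansion_rate(samples) -> float:
    # With several sample models per algorithm, averaging the measured rates
    # is one plausible way to improve the estimate (an assumption, not stated
    # in the source). Each sample is (first_mem, second_mem, disk_occupation).
    rates = [measure_expansion_rate(a, b, d) for a, b, d in samples]
    return sum(rates) / len(rates)
```

A server idling at 4.0G that rises to 6.0G after loading a 1.0G model file would give an expansion rate of 2.0 for that algorithm.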
Step 204: and deploying the target models on the server cluster according to the memory occupation amount of each target model, the copy number of each target model and the total memory resources of the server cluster.
In this step, the memory occupancy amount when all the target models are deployed can be obtained according to the target models to be deployed and the number of copies thereof, and the target models are deployed into the memory of the server cluster in a balanced manner by combining the memory resources of each node in the server cluster.
The model deployment method determines the number of target models and copies thereof which need to be deployed in the server cluster by counting the historical access information of the algorithm model library and the performance information of each algorithm model, then determines the size of the memory occupied by each target model, and deploys part of the models into the memory of the server cluster by combining the memory of the server cluster system and the CPU load condition. Therefore, the actual use condition of each model is fully considered, the limited resources are reasonably utilized by combining the memory resources of the server, the utilization rate of the server system resources during model deployment is improved, and the model deployment efficiency is improved.
Please refer to fig. 3, which is a model deployment method according to an embodiment of the present application, and the method may be executed by the electronic device 1 shown in fig. 1, and may be applied to a scenario in which an algorithm model is deployed in a server cluster, so as to improve a utilization rate of server system resources during model deployment and improve efficiency of model deployment.
The method comprises the following steps:
step 301: and acquiring historical access information of the algorithm model library and performance information of each algorithm model in the algorithm model library. See the description of step 201 in the above embodiments for details.
Step 302: and determining the type of the target model to be deployed and the copy number of each target model according to the historical access information and the performance information of each algorithm model. See the description of step 202 in the above embodiments for details.
Step 303: and determining the memory occupation amount of each target model. See the description of step 203 in the above embodiments for details.
Step 304: and determining the total memory occupation amount when all the target models are deployed according to the memory occupation amount of each target model and the copy number of each target model.
In this step, when deploying the target model, the memory carrying capacity of the server needs to be considered, so the total memory occupation amount when deploying all the target models needs to be determined first. For a target model, the product of the memory occupancy of a single target model and the number of copies of the target model is the total memory occupancy of the target model, and the sum of the total memory occupancy of each target model is the total memory occupancy when all target models are deployed.
For example, assuming that three algorithms a1, a2 and A3 exist in the algorithm model library, the model performance information is as follows:
1. Algorithm A1 has 3 models, model M1-1 to model M1-3, each occupying 1G of memory.
2. Algorithm A2 has 3 models, model M2-1 to model M2-3, each occupying 2G of memory.
3. Algorithm A3 has 3 models, model M3-1 to model M3-3, each occupying 3G of memory.
The larger a model's memory occupation, the longer its deployment time; models with smaller memory occupation deploy faster.
Through step 302, the screening rule is that a model accessed within one day must be deployed, and an unused model with no access within one week is not deployed. Assume that the finally screened target models to be deployed, all accessed within the last day, are: model M1-1, model M1-2, model M2-1, and model M3-1. The total memory occupation amount can then be calculated to be 7G.
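The total in step 304 is the sum over target models of per-copy memory occupation times copy count, which can be checked with a short sketch (names are illustrative):

```python
def total_memory_occupation(targets) -> float:
    # targets maps model name -> (per-copy memory occupation in G, copy count);
    # total = sum over target models of (memory occupation x number of copies).
    return sum(mem * copies for mem, copies in targets.values())

# Worked example from the text: M1-1 and M1-2 occupy 1G each, M2-1 occupies 2G,
# M3-1 occupies 3G, one copy of each, giving 7G in total.
example = {"M1-1": (1, 1), "M1-2": (1, 1), "M2-1": (2, 1), "M3-1": (3, 1)}
```
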
Step 305: and judging whether the total memory occupation amount is greater than or equal to the total load threshold of the server cluster. If yes, go to step 307, otherwise go to step 306.
In this step, the load threshold is the load allowed while the server remains in a good operating state, and may be set in advance based on the actual condition of the server. A server cluster may include a plurality of server nodes, each node being implemented by one or more computer devices. For example, assume a cluster of 3 servers: node S1, node S2, and node S3. The available memory of each server node is 5G, the memory load threshold is 3G, and the imbalance threshold is 2G (that is, when the difference between the available memories of any two nodes reaches this threshold, the two nodes are considered imbalanced and the deployment nodes of the models need to be adjusted; this threshold may be configured adaptively based on the actual scenario). The total load threshold of the server cluster is therefore 9G. Taking the target models in step 304, namely model M1-1, model M1-2, model M2-1, and model M3-1, as an example: if the total memory occupation amount of 7G is smaller than the total load threshold of 9G, step 306 is entered; otherwise, step 307 is entered.
Step 306: and deploying all the target models into the server cluster according to the memory resource information of each node in the server cluster.
In this step, the memory resource information of each node at least includes the total available memory of the node and the configured memory load threshold of the node, and may further include the configured imbalance threshold between every two nodes. Following the example in step 305, the server cluster includes 3 nodes: node S1, node S2, and node S3; the available memory of each node is 5G, the memory load threshold is 3G, and the imbalance threshold is 2G. When the total memory occupation amount is smaller than the total load threshold of the server cluster, all the target models screened in step 302 can be deployed into the server cluster. For example, for the screened model M1-1, model M1-2, model M2-1, and model M3-1, the total memory occupation amount of 7G is smaller than the total load threshold of 9G, so all the models can be deployed. To balance the memory utilization of each node, the imbalance threshold of 2G between every two nodes may be taken into account, that is, the difference between the available memories of any two nodes is kept below 2G. A first deployment scheme can then be obtained as follows:
Node S1: deploy model M1-1 and model M1-2, occupying 2G of memory.
Node S2: deploy model M2-1, occupying 2G of memory.
Node S3: deploy model M3-1, occupying 3G of memory.
In an embodiment, when the total amount of memory occupied is smaller than the total load threshold of the server cluster, in order to fully utilize the server resources, a model with higher historical access traffic may be selected from the unselected algorithm models and deployed together.
Step 307: and sequentially removing the algorithm model with the lowest historical access frequency from all the target models until the total memory occupation amount of the remaining target models is smaller than a total load threshold value, deploying the remaining target models into the server cluster according to the memory resource information of each node in the server cluster, and marking the removed algorithm model as high-priority deployment.
In this step, when the total memory occupation amount is greater than or equal to the total load threshold of the server cluster, the target models with the lowest historical access frequency (or the lowest access traffic within a period of time) may be removed one by one, and the remaining target models are then deployed to the server cluster. To ensure that the removed target models can be deployed in time once resources become available, they may be marked as high-priority deployment; as soon as the system becomes idle, these models are deployed into the server memory.
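The eviction loop of step 307 might look like the following sketch; the function name and data shapes are assumptions for illustration:

```python
def trim_to_threshold(targets, total_load_threshold):
    # targets: model name -> (memory occupation in G, historical access frequency).
    # Repeatedly remove the least-frequently-accessed model until the total
    # memory occupation of the rest falls below the threshold; removed models
    # are returned so they can be marked as high-priority deployment.
    kept = dict(targets)
    removed = []
    while kept and sum(m for m, _ in kept.values()) >= total_load_threshold:
        victim = min(kept, key=lambda name: kept[name][1])
        removed.append(victim)
        del kept[victim]
    return kept, removed
```
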
In one embodiment, model loading may be balanced among the nodes of the distributed server cluster so that memory and CPU utilization are optimized and adjustments are made quickly as nodes change. For example, after the above steps, part of the target models have been selected for deployment; at this point, a timed task may be configured to recalculate the must-deploy, deployable, and not-to-be-deployed models and adjust the deployment policy, or to handle nodes newly added to or deleted from the cluster. This is called self-balancing. Therefore, after step 307 or step 306, the method may further include:
step 308: and acquiring a model access request of the server cluster, the current total load capacity of the server cluster and the current load capacity of each node in the server cluster.
In this step, self-balancing may obtain the required data by monitoring the real-time load of the server cluster system, or may obtain the latest access data from the model traffic statistics module.
Step 309: and when the model to be accessed specified in the access request is not deployed in the server cluster and the current total load is smaller than the total load threshold of the server cluster, deploying the model to be accessed on the node with the minimum current memory occupancy rate in the server cluster.
In this step, when the model to be accessed specified in the access request is not deployed in the server cluster, the model to be accessed needs to be deployed in order to serve the request. Both the total load threshold and the current memory occupancy rate of each node need to be considered during deployment: if the current total load amount is smaller than the total load threshold of the server cluster, the node with the smallest current memory occupancy rate is selected for deploying the model to be accessed; otherwise, step 311 is entered.
Based on the embodiment of step 306, assume that an access request for model M2-2 is detected. Model M2-2 was not deployed in step 306, so it needs to be deployed to satisfy the request. The currently least occupied node, node S1 or node S2, is selected for deployment; assuming node S1 is selected, a second deployment scheme is obtained as follows:
Node S1: deploy model M1-1, model M1-2, and model M2-2, occupying 4G of memory.
Node S2: deploy model M2-1, occupying 2G of memory.
Node S3: deploy model M3-1, occupying 3G of memory.
As can be seen from the above deployment result, node S1 exceeds its own load threshold of 3G but is still within the available memory range, which is acceptable. To make reasonable use of server resources, the deployment may be adjusted through a self-balancing task. Therefore, in an embodiment, after step 308, the method may further include:
step 310: when the current load capacity of the first node is larger than or equal to the load threshold of the first node and the current total load capacity is smaller than the total load threshold of the server cluster, the algorithm model which occupies the minimum memory and is deployed on the first node is adjusted and deployed to the second node, and the current load capacity of the second node is smaller than the load threshold of the second node.
In this step, based on the above embodiment, the node S1 is the first node, and the node S2 is the second node, and the memory usage of 1G can be adjusted from the node S1 to the node S2 for self-balancing.
In one embodiment, step 310 may comprise: and when a plurality of algorithm models which occupy the minimum memory exist, adjusting and deploying the algorithm model which occupies the minimum memory and consumes the minimum deployment time to the second node.
In this step, for example, in the second deployment scheme there are two algorithm models with the smallest memory occupation, namely model M1-1 and model M1-2. Assuming model M1-2 deploys faster, model M1-2 is selected and adjusted to node S2, giving a third deployment scheme as follows:
Node S1: deploy model M1-1 and model M2-2, occupying 3G of memory.
Node S2: deploy model M2-1 and model M1-2, occupying 3G of memory.
Node S3: deploy model M3-1, occupying 3G of memory.
That is, when the traffic distribution of the deployed models is not uniform, the CPU resource usage of the servers becomes unbalanced; without adjusting the number of deployed models and copies, the model traffic is balanced as much as possible by adjusting the nodes on which the deployed models are located.
In an embodiment, after step 308, the method further includes:
step 311: and when the model to be accessed specified in the access request is not deployed in the server cluster and the current total load amount is greater than or equal to the total load threshold value of the server cluster, unloading the algorithm model with the minimum access frequency deployed in the server cluster, and deploying the model to be accessed to the corresponding node.
In this step, if the model to be accessed is not deployed in the server cluster and the current total load amount is greater than or equal to the total load threshold of the server cluster, the load on the server cluster has increased; some deployed algorithm models with the smallest traffic (i.e., the smallest access frequency) need to be dynamically unloaded to reduce the system load, after which the model to be accessed is deployed on the corresponding node.
For example, in the third deployment scheme of step 309, an access request for model M3-2 is detected. Model M3-2 is not deployed, and no node in the server cluster can accommodate the 3G of memory that model M3-2 occupies, so an already deployed model needs to be unloaded to free resources. Assuming model M2-1 has not been accessed for more than 1 day (making it the least-accessed algorithm model), model M2-1 may be unloaded from node S2 and model M3-2 deployed to node S2, giving a fourth deployment scheme as follows:
Node S1: deploy model M1-1 and model M2-2, occupying 3G of memory.
Node S2: deploy model M1-2 and model M3-2, occupying 4G of memory.
Node S3: deploy model M3-1, occupying 3G of memory.
In an embodiment, the self-balancing task discovers that node S2 exceeds its own load threshold of 3G while the other nodes are fully loaded, so a deployed algorithm model needs to be unloaded to balance the model traffic. Therefore, after step 308, the method may further include:
step 312: and when the current total load capacity is greater than or equal to the total load threshold of the server cluster, and the current load capacity of each node in the server cluster is greater than or equal to the load threshold of the node, unloading the algorithm model with the minimum access frequency deployed in the server cluster, and marking the algorithm model with the minimum access frequency as high-priority deployment.
In this step, for example, based on the fourth deployment scheme in step 311, node S2 exceeds its own load threshold of 3G and the other nodes are already fully loaded. Assuming model M3-1 has not been accessed for the longest time (i.e., its access frequency is the smallest), model M3-1 may be unloaded from node S3 and one model on node S2 transferred to node S3. The candidate models to move are model M1-2 and model M3-2; since the deployment time of model M3-2 is greater than that of model M1-2, model M1-2 is preferentially transferred to node S3, giving a fifth deployment scheme as follows:
Node S1: deploy model M1-1 and model M2-2, occupying 3G of memory.
Node S2: deploy model M3-2, occupying 3G of memory.
Node S3: deploy model M1-2, occupying 1G of memory.
In addition, the unloaded model M3-1 is marked as high-priority deployment.
In one embodiment, when a node is added to the cluster, new server resources become available, and some of the models on the other nodes, or the models marked as high-priority deployment, are moved to the new node to balance the model traffic. Therefore, after step 308, the method may further include:
step 313: and when a new node is added into the server cluster, deploying the algorithm model marked as high-priority deployment into the new node.
In this step, if a new node is added to the server cluster, new resources are available; at this point the current load of the system drops below the load threshold, and the system can check whether any unloaded target models exist and, if so, load them immediately. To improve the service efficiency of the models, the algorithm models marked as high-priority deployment may be deployed into the new node first. For example, suppose a new node S4 is added to the server cluster, also with 5G of memory and a load threshold of 3G, and node S4 currently occupies 0G of memory. When the self-balancing task discovers that a new node has been added with 0G of memory in use, the model M3-1 that was marked as high-priority deployment in the fifth deployment scheme of step 312 may be pulled back into the server memory, giving a sixth deployment scheme as follows:
Node S1: deploy model M1-1 and model M2-2, occupying 3G of memory.
Node S2: deploy model M3-2, occupying 3G of memory.
Node S3: deploy model M1-2, occupying 1G of memory.
Node S4: deploy model M3-1, occupying 3G of memory.
Similarly, when a node in the server cluster is deleted, the models on the deleted node need to be moved to other nodes, and the moving principle should comprehensively consider the resource utilization of each node.
It should be noted that steps 309 to 313 may all be executed after step 308 and in any order; the flow order shown in this embodiment is only an optional execution order given as an example, and this embodiment does not limit the execution order of steps 309 to 313.
In an embodiment, in order to minimize the model deployment time, when deploying a target model on the server cluster or adjusting the deployment through the self-balancing task, if a file of the target model exists locally, the locally stored file of the target model may be read first for deployment. For example, the above embodiments involve a large number of model deployment and transfer operations; to reduce the model deployment time, a local disk may be used as a cache to speed up model deployment.
Assume that in general the model deployment process can be as follows:
a. Pull the model file from the model server to the local machine.
b. The program reads the model file and deploys it into the server memory.
c. Delete the model file.
When the model is large, step a is time-consuming and should not be overlooked.
Therefore, the file system of the local disk can be used as a layer of cache, and the adjusted model deployment process is as follows:
a. Check whether the model file is cached on the local disk; if so, jump to step c, otherwise go to step b.
b. Pull the model file from the model server to the local machine.
c. The program reads the model file from the local disk and deploys it into the server memory.
d. Determine whether the file needs to be deleted (this step may follow any manner common in the art).
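The adjusted flow can be sketched as follows; `fetch` and `load` stand in for the pull and in-memory deployment operations and are hypothetical callbacks, not part of the source:

```python
import os

def deploy_with_cache(model_name, cache_dir, fetch, load):
    # Cache-aware deployment (steps a-d above): pull the model file only on
    # a cache miss, then deploy from the local disk copy.
    path = os.path.join(cache_dir, model_name)
    if not os.path.exists(path):   # step a: check the local disk cache
        fetch(model_name, path)    # step b: pull from the model server
    load(path)                     # step c: read the local file and deploy
    # step d: whether to delete the cached file is left to local policy
```

On a second deployment of the same model, step b is skipped entirely, which is where the time saving for large model files comes from.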
Please refer to fig. 4, which is a model deployment apparatus 400 according to an embodiment of the present application, and the apparatus may be applied to the electronic device 1 shown in fig. 1, and may be applied to a scenario in which an algorithm model is deployed in a server cluster, so as to improve a utilization rate of server system resources during model deployment and improve efficiency of model deployment. The device includes: the system comprises a first obtaining module 401, a determining module 402, a calculating module 403 and a deploying module 404, wherein the principle relationship of the modules is as follows:
the first obtaining module 401 is configured to obtain historical access information of the algorithm model library and performance information of each algorithm model in the algorithm model library. A determining module 402, configured to determine, according to the historical access information and the performance information of each algorithm model, a type of the target model to be deployed and a number of copies of each target model. And a calculating module 403, configured to determine a memory usage amount of each target model. A deployment module 404, configured to deploy the target models on the server cluster according to the memory occupancy amount of each target model, the number of copies of each target model, and the total memory resources of the server cluster.
In one embodiment, the determining module 402 is configured to: and screening at least one type of target model of which the number of times of access is greater than or equal to the preset number of times within a preset time period from the algorithm model base according to the historical access information, and acquiring the load bearing capacity of the target model of each type. And according to the historical access information, counting the actual load quantity which needs to be provided by the server cluster aiming at the target model of each category. And determining the copy number of each target model to be deployed according to the actual load capacity and the load bearing capacity.
In one embodiment, the calculation module 403 is configured to: and acquiring the disk occupation amount and the preset memory expansion rate of each target model. And determining the memory occupation amount of each target model according to the disk occupation amount and the memory expansion rate.
In one embodiment, the apparatus further includes a preset module 405 configured to: obtain at least one sample algorithm model for each algorithm in the algorithm model library; start the model deployment service on a preset server and record the first memory occupation amount of the preset server when no algorithm model is deployed; load the sample algorithm model of each algorithm on the preset server and record the second memory occupation amount of the preset server when the sample algorithm model is deployed; and determine the difference obtained by subtracting the first memory occupation amount from the second memory occupation amount, and divide the difference by the disk occupation amount of the sample algorithm model to obtain the memory expansion rate of the sample algorithm model.
In one embodiment, obtaining at least one sample algorithm model for each algorithm in the algorithm model library comprises: training at least one sample algorithm model for each algorithm in the algorithm model library.
In an embodiment, obtaining at least one sample algorithm model for each algorithm in the algorithm model library comprises: obtaining at least one sample algorithm model corresponding to each algorithm from the algorithm model library.
In one embodiment, the deployment module 404 is configured to: determine the total memory occupation amount when all the target models are deployed, according to the memory occupation amount and the number of copies of each target model; judge whether the total memory occupation amount is greater than or equal to the total load threshold of the server cluster; and when the total memory occupation amount is smaller than the total load threshold, deploy all the target models into the server cluster according to the memory resource information of each node in the server cluster.
In one embodiment, the deployment module 404 is configured to: when the total memory occupation amount is greater than or equal to the total load threshold of the server cluster, remove the algorithm model with the lowest historical access frequency from the target models one by one until the total memory occupation amount of the remaining target models is smaller than the total load threshold, deploy the remaining target models into the server cluster according to the memory resource information of each node, and mark each removed algorithm model for high-priority deployment.
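The eviction loop in the two embodiments above might be sketched as follows. The field names and figures are hypothetical; each model record carries its per-copy memory occupation, copy count, and historical access count:

```python
def plan_deployment(models: list, total_threshold_mb: float):
    """Evict the least-accessed models until the total memory occupation of the
    remaining models fits under the cluster's total load threshold.
    Evicted models are marked for high-priority deployment later."""
    total = sum(m["mem_mb"] * m["copies"] for m in models)
    # Least-accessed models are first in line for removal.
    remaining = sorted(models, key=lambda m: m["access_count"])
    deferred = []
    while remaining and total >= total_threshold_mb:
        victim = remaining.pop(0)
        total -= victim["mem_mb"] * victim["copies"]
        victim["high_priority"] = True  # redeploy first when capacity appears
        deferred.append(victim)
    return remaining, deferred

# Hypothetical cluster: 700 MB of demand against a 600 MB threshold.
models = [
    {"name": "a", "mem_mb": 100, "copies": 2, "access_count": 5},
    {"name": "b", "mem_mb": 300, "copies": 1, "access_count": 50},
    {"name": "c", "mem_mb": 50, "copies": 4, "access_count": 10},
]
deployed, deferred = plan_deployment(models, 600)
# "a" (lowest access count) is evicted; "c" and "b" fit in 500 MB.
```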
In one embodiment, the apparatus further comprises a second obtaining module 406, configured to obtain, after the target models are deployed on the server cluster according to the memory occupation amount of each target model, the number of copies of each target model, and the total memory resources of the server cluster, a model access request to the server cluster, the current total load of the server cluster, and the current load of each node in the server cluster. The deployment module 404 is further configured to: when the model to be accessed specified in the access request is not deployed in the server cluster and the current total load is smaller than the total load threshold of the server cluster, deploy the model to be accessed on the node with the smallest current memory occupancy in the server cluster.
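The on-demand placement decision above can be sketched as a small function. Names and thresholds are hypothetical; it returns the chosen node, or `None` when the model is already deployed or the cluster has no headroom (the latter case falls to the eviction path of the next embodiment):

```python
def place_on_access(requested: str, deployed: set, node_loads: dict,
                    total_threshold_mb: float):
    """Pick a node for an access request to a not-yet-deployed model:
    the node with the smallest current load, if the cluster has headroom."""
    if requested in deployed:
        return None  # already served somewhere in the cluster
    if sum(node_loads.values()) >= total_threshold_mb:
        return None  # no headroom; handled by the eviction path instead
    return min(node_loads, key=node_loads.get)

# Hypothetical cluster state: two nodes, 700 MB used of a 1000 MB threshold.
target = place_on_access("new_model", {"old_model"},
                         {"node1": 500, "node2": 200}, 1000)
# node2 has the smallest current load, so it is chosen.
```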
In an embodiment, after the target models are deployed on the server cluster according to the memory occupation amount of each target model, the number of copies of each target model, and the total memory resources of the server cluster, the deployment module 404 is further configured to: when the model to be accessed specified in the access request is not deployed in the server cluster and the current total load is greater than or equal to the total load threshold of the server cluster, unload the deployed algorithm model with the lowest access frequency and deploy the model to be accessed to the corresponding node.
In an embodiment, after the target models are deployed on the server cluster according to the memory occupation amount of each target model, the number of copies of each target model, and the total memory resources of the server cluster, the deployment module 404 is further configured to: when the current load of a first node is greater than or equal to the load threshold of the first node and the current total load is smaller than the total load threshold of the server cluster, re-deploy the algorithm model with the smallest memory occupation on the first node to a second node, the current load of the second node being smaller than the load threshold of the second node.
In an embodiment, after the target models are deployed on the server cluster according to the memory occupation amount of each target model, the number of copies of each target model, and the total memory resources of the server cluster, the deployment module 404 is further configured to: when the current total load is greater than or equal to the total load threshold of the server cluster, and the current load of every node in the server cluster is greater than or equal to that node's load threshold, unload the deployed algorithm model with the lowest access frequency and mark it for high-priority deployment.
In one embodiment, the deployment module 404 is further configured to: when a new node joins the server cluster, deploy the algorithm models marked for high-priority deployment onto the new node.
In an embodiment, re-deploying the algorithm model with the smallest memory occupation on the first node to the second node comprises: when multiple algorithm models tie for the smallest memory occupation, re-deploying the one with the shortest deployment time to the second node.
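The node-level rebalancing of the embodiments above, including the deployment-time tie-break, might be sketched as follows. The node layout, field names, and thresholds are hypothetical:

```python
def rebalance(nodes: dict, node_threshold: dict):
    """Move the smallest model (shortest deployment time on ties) off an
    overloaded node onto the least-loaded node that still has headroom.
    Returns (model, source, destination) or None if nothing was moved."""
    for src_name, src in nodes.items():
        if src["load_mb"] < node_threshold[src_name] or not src["models"]:
            continue  # node is within its threshold, or has nothing to move
        # Smallest memory occupation first; break ties by deployment time.
        model = min(src["models"], key=lambda m: (m["mem_mb"], m["deploy_secs"]))
        dst_name = min((n for n in nodes if nodes[n]["load_mb"] < node_threshold[n]),
                       key=lambda n: nodes[n]["load_mb"], default=None)
        if dst_name is None:
            return None  # every node is at or over its threshold
        src["models"].remove(model)
        src["load_mb"] -= model["mem_mb"]
        nodes[dst_name]["models"].append(model)
        nodes[dst_name]["load_mb"] += model["mem_mb"]
        return (model["name"], src_name, dst_name)
    return None

# Hypothetical state: node1 is over its 800 MB threshold; "y" ties "x" on
# memory but deploys faster, so "y" is moved to node2.
nodes = {
    "node1": {"load_mb": 900, "models": [
        {"name": "x", "mem_mb": 100, "deploy_secs": 5},
        {"name": "y", "mem_mb": 100, "deploy_secs": 2},
    ]},
    "node2": {"load_mb": 300, "models": []},
}
moved = rebalance(nodes, {"node1": 800, "node2": 800})
```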
In one embodiment, the deployment module 404 is further configured to: when the file of a target model already exists locally, read the locally stored file of the target model for deployment.
For a detailed description of the model deployment apparatus 400, refer to the description of the corresponding method steps in the above embodiments.
An embodiment of the present invention further provides a non-transitory storage medium readable by an electronic device, comprising a program that, when run on an electronic device, causes the electronic device to perform all or part of the procedures of the methods in the above embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like, or a combination of the above memories.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.