Unrelated parallel machine dynamic hybrid flow shop scheduling method based on deep Q network
1. A deep Q network based scheduling method for a dynamic hybrid flow shop with unrelated parallel machines, characterized by comprising the following steps: the existing hybrid flow shop scheduling problem is extended by considering the dynamic event of randomly arriving workpieces and the real-world constraint of unrelated parallel machines. Taking the minimum average weighted tardiness as the scheduling objective, a neural network is combined with reinforcement learning: a mathematical model of the problem is built and implemented in code as an environment, the model is trained online, and after training is finished the optimal action combination is selected at every decision point of the scheduling system. The method specifically comprises the following steps:
step 1: for the hybrid flow shop problem considered by the invention, two constraints, unrelated parallel machines and randomly arriving workpieces, are introduced, an application framework of reinforcement learning for this scheduling problem is established, and a new mathematical model is built with the minimum average weighted tardiness as the optimization objective. On this basis, a discrete simulation environment is implemented in code, whose parameters include the number of workpieces, the workpiece types, the number of machine tools, the number of stages and a machining-completion flag done. Go to step 2;
step 2: perform shop simulation environment initialization: initialize an experience replay pool D with capacity MEMORY_SIZE, and randomly initialize the state-value neural network V(θ) and the target network V(θ)'. D is used to store the decision data of every rescheduling point. Go to step 3;
step 3: when a dynamic event such as the arrival of a workpiece occurs at some stage of the system, the agent selects an action At according to the system state St of the rescheduling point with an ε-greedy policy, obtains a reward Rt after executing the action, and the system reaches the next state St+1. The current state feature St, the selected action At, the obtained reward Rt, the next state St+1 and the machining-completion flag done are put into the experience replay pool D as the sample (St, At, Rt, St+1, done) of this decision. The flag done is Boolean: it is true only after the last workpiece is finished at the last stage, and false otherwise. The decision-step counter is increased by 1; go to step 4;
and 4, step 4: before the training of the agent, the MEMORY _ warning _ SIZE decision sample data needs to be stored in the experience playback pool D in advance. Judging whether the number of data D reaches a threshold value MEMORY _ WARMUP _ SIZE:
if not, repeating the step 3; otherwise, go to step 5
step 5: judge whether the decision-step counter count is an integer multiple of LEARN_FREQ, where LEARN_FREQ means that the state-value neural network V(θ) is trained once every LEARN_FREQ decision steps:
if yes, go to step 6; otherwise, go to step 7
step 6: randomly draw BATCH_SIZE samples from the experience replay pool D, use them to train the state-value neural network V(θ), update the parameters of V(θ) by gradient descent, and at the same time judge whether the decision-step counter count is an integer multiple of 1000:
if count is an integer multiple of 1000, deep-copy the network parameters to the target network V(θ)' and go to step 7; otherwise, go to step 7 directly.
step 7: judge whether the machining-completion flag done is true; if yes, increase the episode counter episode by 1 and go to step 8; otherwise, repeat step 3;
step 8: judge whether the episode counter episode reaches the threshold Max_episode:
if not, re-initialize the shop simulation environment and repeat step 3; otherwise, end the training.
2. The deep Q network based scheduling method for a dynamic hybrid flow shop with unrelated parallel machines according to claim 1, characterized in that: in step 1, the established reinforcement learning application framework is as follows:
The scheduling process comprises S stages; each stage s ∈ [1, S] has M_s machines m_i (1 ≤ i ≤ M_s). Starting from time zero, n types of jobs arrive at independent intervals, and the inter-arrival time of each job type j ∈ n follows a uniform distribution. Each workpiece must be machined sequentially from stage 1 to the last stage, and at every stage it is machined on one machine tool. A workpiece can be processed once it has arrived; if it cannot be processed immediately, it is stored in the buffer BF. Whenever a machine finishes a job or a new job arrives, this moment is called a rescheduling point t; the state feature St of the current stage is input into the DQN, the DQN selects the most suitable scheduling rule as the action according to St, and the workpiece to be processed and the machine that processes it are then chosen by this scheduling rule. When the current stage reaches the next rescheduling point t', the state feature St' of t' is input into the DQN and St' is taken as the next_state of St. As soon as a workpiece finishes processing at the current stage it is put into the BF of the next stage. After the workpiece finishes processing at the last stage it is placed in the delivery area for delivery. In addition, there are some assumptions and hard constraints, some of which are listed below:
all workpieces are independent, and tardiness is allowed;
the power of the machine tools differs and is known in advance;
no workpiece is in any buffer at time zero, and all workpieces arrive randomly;
for any stage, a workpiece newly added to its buffer is regarded as a workpiece-arrival event of that stage (i.e., putting a workpiece into the buffer of the next stage after it finishes processing at the previous stage is also treated as a workpiece-arrival event);
each machine can start the next workpiece only after the current workpiece is finished;
each workpiece can be processed on only one machine tool at a time;
as long as it is not occupied, any machine of a stage can process any workpiece waiting at that stage;
the storage capacity between any two stages is infinite (i.e., a workpiece can wait for an arbitrarily long time between two operations);
the setup time of a workpiece and the transport time between two successive stages are included in the queuing time in the buffer of the corresponding stage.
3. The deep Q network based scheduling method for a dynamic hybrid flow shop with unrelated parallel machines according to claim 1, characterized in that: on the basis of the description of claim 2, the mathematical model is established as follows:
the notation used for the problem of the present invention is given first:
J_jk: the k-th workpiece of type j
SJ_j: the set of workpieces of type j
I_si: the i-th machine tool at stage s
Estimated processing time of a workpiece of type j at stage s
P_jsi: actual machining time of a workpiece of type j on the i-th machine tool at stage s
P_jk: total processing time of the k-th workpiece of type j
A_jks: time at which the k-th workpiece of type j reaches the buffer of stage s
D_jk: delivery date of the k-th workpiece of type j
Ideal power of machine tool i
w_i: actual power of machine tool i
x_jksi: if job J_jk is assigned to machine i at stage s, then x_jksi = 1; otherwise x_jksi = 0.
More precisely, the mathematical model can be represented as follows:
ST_jksi = A_jks + Q_jks    (4)
C_jksi = ST_jksi + P_jksi    (5)
ST_j'k'si > C_jksi    (6)
A_j,k,1 > 0    (7)
A_j,k,s+1 = C_j,k,s    (8)
ST_j,k,s+1 > C_j,k,s    (9)
the objective function (1) is to minimize the average weighted tardiness after all workpieces have been machined. Formula (2) describes the characteristic of unrelated parallel machines: a workpiece of any type has different processing times on different machine tools of the same stage, and the actual processing time is determined by the product of the ratio of the actual power to the ideal power and the estimated processing time. Formula (3) describes the actual average machining time of each workpiece type over all machines. Formula (4) states that the stage start-processing time of a workpiece equals the sum of its arrival time at the buffer and its queuing time in the buffer. Formula (5) states that the stage completion time of a workpiece equals the sum of its stage start-processing time and its actual machining time on the machine tool. Constraint (6) ensures that, for two jobs processed consecutively on the same machine, the next job can be started only after the previous job has been completed. Constraint (7) requires that the buffers contain no workpiece at the initial moment and that all workpieces arrive randomly. Constraint (8) ensures that a workpiece enters the buffer queue of the next stage immediately after its processing at one stage is completed. Constraint (9) is the process routing restriction: for two successive operations of one job, the next operation can be started only after the previous one has been completed. Constraint (10) ensures that every job passes through all stages and is processed by exactly one machine at each stage.
4. The deep Q network based scheduling method for a dynamic hybrid flow shop with unrelated parallel machines according to claim 1, characterized in that: based on the mathematical expression of claim 3, the stage states are characterized as follows:
the state features defined by the invention are stage-wise: at every rescheduling point the state features of one stage are acquired. Each feature vector contains 8 types of feature parameters, 4n + 3M_s + 4 parameters in total. For example, for machine m_si at any stage s, the k-th feature concerning the processing of the j-th workpiece is denoted a_k,j,s. The defined set of state features together represents the local information and the global information of the environment.
The definition of the state features is shown in Table 1.
Table 1. Machine state feature definition table
The parameters used in the table are described uniformly: assuming a dynamic event occurs at some stage s, the above state features of the buffer and the machines of that stage are extracted as the state vector. t is the environment time of the current decision point, n is the total number of workpiece types, J_jk denotes the k-th workpiece of type j, N_j denotes the number of type-j workpieces in the buffer, N_g denotes the number of workpieces in the buffer whose remaining delivery time falls within a specified interval, P_js denotes the average machining time of type-j workpieces over all machines at stage s, and D_jks denotes the delivery date of workpiece J_jk at stage s; the delivery date of any workpiece is calculated according to the following formula:
where A_jks is the time at which the workpiece reaches the buffer of stage s, and h is a preset delivery-date tightness factor.
5. The deep Q network based scheduling method for a dynamic hybrid flow shop with unrelated parallel machines according to claim 1, characterized in that: in step 3, the candidate action set is as follows:
Table 2. Candidate action set of the decision process
The scheduling rules applied by the invention comprise two parts, workpiece assignment rules and machine tool selection rules. Because the invention studies hybrid flow shop scheduling with unrelated parallel machines, different machine tools of one stage have different processing capabilities; therefore the most suitable machine tool has to be selected as well as the most suitable workpiece.
6. The deep Q network based scheduling method for a dynamic hybrid flow shop with unrelated parallel machines according to claim 1, characterized in that: for the scheduling rules defined above, the implementation steps are as follows:
assume that the set of idle workpieces in the buffer of stage s is denoted O_s = {J_jk}, and the set of idle machine tools of stage s is denoted F_s = {m_si}; each element of F_s is the number m_si of an idle machine tool of stage s at decision point t.
7. The deep Q network based scheduling method for a dynamic hybrid flow shop with unrelated parallel machines according to claim 1, characterized in that: in step 3, the reward function is defined as follows:
the reward is divided into the immediate reward and the cumulative reward. The design of the reward function should follow several rules: the reward function should indicate the immediate effect of an action, i.e., associate the action with an immediate reward; the cumulative reward should represent the objective function value, i.e., the agent receives a larger average reward when the average weighted tardiness is minimal; in addition, the reward function should be easy to compute and applicable to problems of different scales. The reward function is defined as follows:
r_sq = -w_j (C_jksi - D_jks)
where r_sq denotes the reward obtained at each rescheduling point of stage s after workpiece J_jk is assigned to machine m_i for processing, w_j is the weight coefficient of the workpiece type to which J_jk belongs, C_jksi is the completion time of workpiece J_jk at stage s, and D_jks is the prescribed completion time of workpiece J_jk at stage s.
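As an illustrative, non-limiting sketch (in Python) of how the immediate reward defined above could be evaluated at a rescheduling point; the function name and argument layout are assumptions made only for illustration:

# Hypothetical sketch of the reward r_sq = -w_j * (C_jksi - D_jks); names are illustrative.
def immediate_reward(w_j: float, completion_time: float, stage_due_date: float) -> float:
    """Reward obtained after workpiece J_jk is finished on machine m_i at stage s."""
    return -w_j * (completion_time - stage_due_date)

# Example: a workpiece type with weight 2.0 finishing 5 time units late yields a reward of -10.0.
print(immediate_reward(2.0, 105.0, 100.0))  # -> -10.0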
Proof 1: minimizing the average weighted tardiness is equivalent to maximizing the time-average reward over an infinitely long horizon.
where τ denotes the time for which the scheduling process has run, and NT(τ) denotes the number of system state transitions by time τ.
Suppose deltajksi(v) Is a workpiece JjkThe delay information indication function of (2) is defined as follows:
Suppose SJ1 denotes the set of workpieces that have incurred tardiness at stage s, and SJ2 denotes the set of workpieces that have not; NJ(t_q, j, s) is the number of type-j workpieces that have reached the buffer of stage s by time t_q. The cumulative reward obtained at stage s up to decision step q is then expressed as follows:
thus, for each step the reward is:
the total reward obtained at all stages at time τ is
On the other hand, from formula (1) we can obtain
Thus:
8. The deep Q network based scheduling method for a dynamic hybrid flow shop with unrelated parallel machines according to claim 1, characterized in that: the specific parameters involved in steps 2, 4 and 5 are as follows:
the network structures of V(θ) and V(θ)' are identical: a deep neural network composed of 7 fully connected layers, used to fit the value function and to perform the asynchronous update. It comprises one input layer, one output layer and 5 hidden layers; the number of input-layer nodes equals the number of state features, the number of output-layer nodes equals the number of actions, each hidden layer has 64 nodes, the ReLU activation function is used for the input and hidden layers, and softmax is used for the output layer.
Second, background Art
A hybrid flow shop is a production shop organized as a flow line with several stages, each of which contains one or more parallel machines; scheduling it is a complex multi-stage decision process, and studying this shop structure can effectively balance machine utilization and increase productivity. Current research on the HFSP mostly assumes a static manufacturing environment in which the shop information is completely known, and mostly studies the idealized standard HFSP. However, today's manufacturing systems are constantly exposed to various dynamic disturbances, such as the random arrival of workpieces; such disturbances can cause the actual production to deviate considerably from the scheduled production plan and seriously affect production efficiency. Meanwhile, different machine tools at the same stage often do not have the same processing capability when machining the same part. The scheduling problem described above is referred to as the dynamic hybrid flow shop scheduling problem with unrelated parallel machines (DHFSP-UPM). Therefore, research on an online scheduling method for the dynamic HFSP with unrelated parallel machines is of great significance for improving the ability to handle disturbance events and enhancing the robustness of real production systems.
For a long time, many scholars have been devoted to applying reinforcement learning algorithms to dynamic scheduling problems, and many classes of classical dynamic scheduling problems have already been solved successfully. Reinforcement learning is a decision method that selects the optimal action on the basis of continuous states: according to the system state information at each decision point, the scheduling rule best suited to the current production environment is selected so as to optimize the specified index. Because production information is complex and varied, the number of scheduling rules is huge, and the decision mechanism of reinforcement learning requires a certain continuity of the acquired system states, current studies mainly solve single-stage dynamic scheduling problems, i.e., one buffer stores the workpieces and several machine tools select workpieces according to the system state information, such as the single machine problem, the parallel machine shop problem, the job shop problem and the flow shop problem.
However, the reinforcement learning decision mechanism in the above research can hardly solve the multi-stage decision problem of hybrid flow shop scheduling addressed by the present invention: in the HFSP, the machine tools of each stage select parts only from the buffer of their own stage; after a decision at one stage is completed, the next state obtained may be the buffer and machine state of any stage; during decision making the agent plays the role of the machines of every stage simultaneously, and the system state obtained consists of the buffer and machine information of every stage at once, so the states obtained by two consecutive decisions are not correlated. Therefore, the dynamic hybrid flow shop scheduling problem is difficult to solve with existing RL decision methods.
Summarizing and analyzing the existing related results, the research on the scheduling problem of the mixed flow shop mainly has the following problems:
(1) traditional hybrid flow shop scheduling research that considers parallel machine characteristics mostly assumes that the parallel machines have identical processing capability, or that their processing times differ only according to the machining efficiency of the machine tool, which is somewhat limited. Considering completely unrelated parallel machines better matches real machining scenarios.
(2) In current research on solving scheduling problems with deep reinforcement learning algorithms, the single machine, parallel machine and flow shop scheduling problems have been solved successfully, but research on solving the hybrid flow shop scheduling problem proposed by the invention by dynamically selecting scheduling rules with the DQN algorithm is still blank. The state features of different stages of hybrid flow shop scheduling are not correlated and are difficult to handle with existing algorithms.
(3) Complex machining conditions in a real production environment are difficult to express with a limited number of parameter indices. Especially for the hybrid flow shop scheduling problem, it is necessary to construct a state feature vector that can express both the information of each processing stage and the overall machining information. Meanwhile, the reward function has to be designed according to the specified optimization index, and a unified reference is lacking.
Third, Summary of the Invention
The invention aims to provide a dynamic hybrid flow shop scheduling method based on deep Q learning that considers unrelated parallel machines.
The technical solution for achieving the purpose of the invention is as follows: a dynamic hybrid flow shop scheduling method based on deep Q learning that considers workpiece arrivals and unrelated parallel machine characteristics, which takes the minimum average weighted tardiness as the scheduling objective, uses a deep neural network to fit the value function, takes the state of the machining system at every decision point as the input for training the model, takes combinations of workpiece sequencing rules and machine assignment rules as the candidate action set, and, combined with the reward-and-punishment mechanism of reinforcement learning, selects the optimal action combination for every scheduling decision. It specifically comprises the following steps:
step 1: for the hybrid flow shop problem considered by the invention, two constraints, unrelated parallel machines and randomly arriving workpieces, are introduced, an application framework of reinforcement learning for this scheduling problem is established, and a new mathematical model is built with the minimum average weighted tardiness as the optimization objective. On this basis, a discrete simulation environment is implemented in code, whose parameters include the number of workpieces, the workpiece types, the number of machine tools, the number of stages and a machining-completion flag done.
Go to step 2;
step 2: perform shop simulation environment initialization: initialize an experience replay pool D with capacity MEMORY_SIZE, and randomly initialize the state-value neural network V(θ) and the target network V(θ)'. D is used to store the decision data of every rescheduling point. Go to step 3;
step 3: when a dynamic event such as the arrival of a workpiece occurs at some stage of the system, the agent selects an action At according to the system state St of the rescheduling point with an ε-greedy policy, obtains a reward Rt after executing the action, and the system reaches the next state St+1. The current state feature St, the selected action At, the obtained reward Rt, the next state St+1 and the machining-completion flag done are put into the experience replay pool D as the sample (St, At, Rt, St+1, done) of this decision. The flag done is Boolean: it is true only after the last workpiece is finished at the last stage, and false otherwise. The decision-step counter is increased by 1; go to step 4;
and 4, step 4: before the training of the agent, the MEMORY _ warning _ SIZE decision sample data needs to be stored in the experience playback pool D in advance. Judging whether the number of data D reaches a threshold value MEMORY _ WARMUP _ SIZE:
if not, repeating the step 3; otherwise, go to step 5
step 5: judge whether the decision-step counter count is an integer multiple of LEARN_FREQ, where LEARN_FREQ means that the state-value neural network V(θ) is trained once every LEARN_FREQ decision steps:
if yes, go to step 6; otherwise, go to step 7
step 6: randomly draw BATCH_SIZE samples from the experience replay pool D, use them to train the state-value neural network V(θ), update the parameters of V(θ) by gradient descent, and at the same time judge whether the decision-step counter count is an integer multiple of 1000:
if yes, deep-copy the network parameters to the target network V(θ)' and go to step 7; otherwise, go to step 7 directly.
step 7: judge whether the machining-completion flag done is true; if yes, increase the episode counter episode by 1 and go to step 8; otherwise, repeat step 3;
step 8: judge whether the episode counter episode reaches the threshold Max_episode:
if not, re-initialize the shop simulation environment and repeat step 3; otherwise, end the training.
Compared with the prior art, the invention has the remarkable advantages that:
(1) the invention overcomes the problem, existing in the prior art, that the state features of the HFSP at different processing stages are not correlated, and provides a solution mechanism based on deep Q learning.
(2) A mathematical model of dynamic hybrid flow shop scheduling considering unrelated parallel machines and random workpiece arrivals is provided, together with a programming approach that builds the training environment on the discrete-event simulation principle.
(3) A reward function characterizing the minimum average weighted tardiness is presented, as well as a state vector that can represent both local and global features of the complex machining process. The plausibility of the proposed reward function is verified by mathematical derivation.
Fourth, Description of the Drawings
FIG. 1 is a comparison graph of the proposed algorithm for solving the scheduling problem of HFSP and the current reinforcement learning solution.
FIG. 2 is a flowchart of the DQN-based solution of the present invention for dynamic hybrid pipelining plant scheduling considering the characteristics of unrelated parallel machines.
Fig. 3 is a diagram of a model training process assuming simultaneous machining of 10 types of workpieces.
Fig. 4 is a diagram of a model training process assuming simultaneous machining of 15 types of workpieces.
Fifth, detailed description of the invention
The invention is explained in further detail below with reference to the drawings in which:
The invention relates to a deep Q network based scheduling method for a dynamic hybrid flow shop with unrelated parallel machines, which comprises the following steps:
step 1: for the hybrid flow shop problem considered by the invention, the two constraints of unrelated parallel machines and randomly arriving workpieces are introduced, and a new mathematical model is built with the minimum average weighted tardiness as the optimization objective. An application framework of reinforcement learning for this scheduling problem is established, and on this basis the programming of a discrete simulation environment, covering the number of workpieces, the number of workpiece types, the number of machine tools and the number of stages, is completed around the decision function env.step() of each rescheduling point. Go to step 2;
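As one possible, non-limiting way to organize the discrete simulation environment around the per-rescheduling-point decision function env.step(), the following Python skeleton is given; the class name, attribute names and placeholder bodies are illustrative assumptions rather than the invention's exact implementation:

# Illustrative skeleton of the shop simulation environment (assumed structure, not the
# invention's exact code). State extraction, rule execution and reward use the definitions
# given earlier; here they are left as placeholders.
import random

class HybridFlowShopEnv:
    def __init__(self, num_jobs=50, num_types=10, num_stages=3, machines_per_stage=8):
        self.num_jobs = num_jobs                  # total number of workpieces
        self.num_types = num_types                # number of workpiece types
        self.num_stages = num_stages              # number of processing stages
        self.machines_per_stage = machines_per_stage
        self.done = False                         # machining-completion flag
        self.reset()

    def reset(self):
        """Re-initialize buffers, machines and the event calendar; return the first state."""
        self.time = 0.0
        self.done = False
        self.buffers = [[] for _ in range(self.num_stages)]   # one buffer BF per stage
        return self._state_of_stage(stage=0)

    def step(self, action):
        """Apply the scheduling rule indexed by `action` at the current rescheduling point,
        advance the simulation to the next rescheduling point, and return the
        (next_state, reward, done) triple used to build a replay sample."""
        reward = 0.0                                     # placeholder: -w_j * (C - D) of finished jobs
        next_stage = random.randrange(self.num_stages)   # placeholder: stage of the next event
        next_state = self._state_of_stage(next_stage)
        return next_state, reward, self.done

    def _state_of_stage(self, stage):
        """Placeholder for the 4n + 3*M_s + 4 stage feature vector of Table 1."""
        return [0.0] * (4 * self.num_types + 3 * self.machines_per_stage + 4)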
further, the objective function of the scheduling system in step 1 is to minimize the average weighted lag, and the built reinforcement learning application framework is shown in fig. X: the scheduling process comprises S stages, and each stage belongs to [1, S ]]With MsA machine, each machine mi(1≤i≤Ms) Starting from zero time, the n types of operation arrive at independent intervals, and the arrival time interval of each type of operation j e n is subject to uniform distribution. Each workpiece must be machined sequentially from stage 1 to stage 3, each workpiece being machined on a machine tool at each stage. The work can be processed after it arrives, and if the work cannot be immediately processed, the work is stored in the buffer BF. Whenever the machine finishes a job, or a new job arrives, this moment is called the rescheduling point t, the status of the current stage is characterized by StInput into DQN, DQN being dependent on StAnd selecting the most suitable scheduling rule as an action, and selecting the workpiece to be processed and the machine for processing the workpiece according to the scheduling rule. When the current stage reaches the next rescheduling point t ', the state characteristic S of t' is determinedt' input into DQN, and convert St' as StNext _ stat ofe. The workpiece is put into the BF of the next stage immediately after the completion of the processing of the current stage. When the machine at the final stage finishes processing, the workpiece is placed in a delivery area for delivery.
Further, the mathematical model described in step 1 is built as follows; first, some necessary assumptions and hard constraints are made, some of which are listed below:
all workpieces are independent, and tardiness is allowed;
the power of the machine tools differs and is known in advance;
no workpiece is in any buffer at time zero, and all workpieces arrive randomly;
for any stage, a workpiece newly added to its buffer is regarded as a workpiece-arrival event of that stage (i.e., putting a workpiece into the buffer of the next stage after it finishes processing at the previous stage is also treated as a workpiece-arrival event);
each machine can start the next workpiece only after the current workpiece is finished;
each workpiece can be processed on only one machine tool at a time;
as long as it is not occupied, any machine of a stage can process any workpiece waiting at that stage;
the storage capacity between any two stages is infinite (i.e., a workpiece can wait for an arbitrarily long time between two operations);
the setup time of a workpiece and the transport time between two successive stages are included in the queuing time in the buffer of the corresponding stage.
The symbols used for the study problem of the present invention are given below:
J_jk: the k-th workpiece of type j
SJ_j: the set of workpieces of type j
I_si: the i-th machine tool at stage s
Estimated processing time of a workpiece of type j at stage s
P_jsi: actual machining time of a workpiece of type j on the i-th machine tool at stage s
P_jk: total processing time of the k-th workpiece of type j
A_jks: time at which the k-th workpiece of type j reaches the buffer of stage s
D_jk: delivery date of the k-th workpiece of type j
Ideal power of machine tool i
w_i: actual power of machine tool i
x_jksi: if job J_jk is assigned to machine i at stage s, then x_jksi = 1; otherwise x_jksi = 0.
More precisely, the mathematical model can be represented as follows:
ST_jksi = A_jks + Q_jks    (4)
C_jksi = ST_jksi + P_jksi    (5)
ST_j'k'si > C_jksi    (6)
A_j,k,1 > 0    (7)
A_j,k,s+1 = C_j,k,s    (8)
ST_j,k,s+1 > C_j,k,s    (9)
the objective function (1) is to minimize the average weighted tardiness after all workpieces have been machined. Formula (2) describes the characteristic of unrelated parallel machines: a workpiece of any type has different processing times on different machine tools of the same stage, and the actual processing time is determined by the product of the ratio of the actual power to the ideal power and the estimated processing time. Formula (3) describes the actual average machining time of each workpiece type over all machines. Formula (4) states that the stage start-processing time of a workpiece equals the sum of its arrival time at the buffer and its queuing time in the buffer. Formula (5) states that the stage completion time of a workpiece equals the sum of its stage start-processing time and its actual machining time on the machine tool. Constraint (6) ensures that, for two jobs processed consecutively on the same machine, the next job can be started only after the previous job has been completed. Constraint (7) requires that the buffers contain no workpiece at the initial moment and that all workpieces arrive randomly. Constraint (8) ensures that a workpiece enters the buffer queue of the next stage immediately after its processing at one stage is completed. Constraint (9) is the process routing restriction: for two successive operations of one job, the next operation can be started only after the previous one has been completed. Constraint (10) ensures that every job passes through all stages and is processed by exactly one machine at each stage.
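Since formula (2) is only described in words above, the following Python sketch illustrates one literal reading of it, namely that the actual machining time is the estimated stage processing time scaled by the ratio of the chosen machine's actual power to its ideal power; the direction of the ratio is therefore an assumption:

# Sketch of constraint (2) as described in the text: actual machining time = estimated
# processing time * (actual power / ideal power) of the chosen machine. Whether the ratio
# should be inverted depends on the original formula (2), which is not reproduced here.
def actual_processing_time(estimated_time_js: float, actual_power_i: float, ideal_power_i: float) -> float:
    return estimated_time_js * (actual_power_i / ideal_power_i)

# Example: the same workpiece type gets different times on two machines of the same stage.
print(actual_processing_time(10.0, actual_power_i=4.0, ideal_power_i=5.0))  # 8.0
print(actual_processing_time(10.0, actual_power_i=6.0, ideal_power_i=5.0))  # 12.0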
Further, the shop simulation environment initialization defined in step 1 is carried out in the following steps:
step 2: perform shop simulation environment initialization and DQN algorithm modeling: initialize an experience replay pool D with capacity MEMORY_SIZE, and randomly initialize the state-value evaluation network V(θ) and the target network V(θ)'. At the same time, complete the modeling of the machining-system state that is used to obtain the system state at each decision point. D is used to store the decision data of every rescheduling point. Go to step 3;
further, the DQN algorithm described in step 2 is modeled as follows: DQN is an improved algorithm of Q-RL reinforcement learning, the problem of fitting a cost function under a high-dimensional condition is solved by combining DL deep learning, and the main improvement comprises two aspects: experience playback and asynchronous updating of the target network structure. The motivation for empirical playback is: the deep learning is supervised learning, and the data is required to meet the characteristics of large scale, independence and same distribution and no relevance between data. However, in reinforcement learning, the state and the reward are obtained through the real-time decision of each decision step, and the difficulty is often high in order to obtain a large amount of training data. Therefore, in order to overcome the problems of low correlation of empirical data and low data utilization rate, the method stores the sample data of MEMORY _ WARMUP _ SIZE into an empirical playback pool through simulation experiments in advance, and randomly extracts BATCH _ SIZE data from the empirical playback pool every LEARN _ FREQ decision steps for training when the above conditions are met. The value of LEARN _ FREQ is usually 100-200; the value of MEMORY _ WARMUP _ SIZE is usually 10-100 times of the total number of decision steps of each epsilon; the value of BATCH _ SIZE is usually a multiple of 2, and is usually 32, 64, 128, 256 and the like; for the MEMORY upper limit MEMORY _ warm of the empirical playback pool, the empirical playback pool is usually configured as a stack structure, and when the MEMORY reaches the upper limit, the earliest sample data is pushed out, which may cause loss of effective experience, and therefore is usually as large as the computer performance allows.
Further, the dual-network structure in step 2 is optimized as follows: in the traditional DQN algorithm, both the estimate of the return and the estimate of the action value depend on the weight matrix W of the neural network; when the weights change, both estimates change, which easily causes instability. Therefore, a target network with the same structure is built in addition to the original neural network. After each training step only the weights of the evaluation network V(θ) (the original network) are updated; after a certain number of updates, the weights of the evaluation network are copied to the target network V(θ)', and then the next round of updates is performed.
step 3: when a dynamic event such as the arrival of a workpiece occurs at some stage of the system, the agent selects an action At from the candidate action set with an ε-greedy policy according to the system state St of the rescheduling point, obtains a reward Rt after executing the action, and the system reaches the next state St+1. The current state feature St, the selected action At, the obtained reward Rt, the next state St+1 and the machining-completion flag done are put into the experience replay pool D as the sample (St, At, Rt, St+1, done) of this decision. The flag done is Boolean: it is true only after the last workpiece is finished at the last stage, and false otherwise. The decision-step counter is increased by 1; go to step 4;
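One non-limiting Python sketch of a single rescheduling-point decision of step 3 (ε-greedy selection over the candidate actions followed by storing the transition) is as follows; q_function stands for the value network evaluated for every candidate action and, like the other callables, is an assumption of this sketch:

# Sketch of one rescheduling-point decision (step 3): epsilon-greedy action selection and
# storage of the transition (St, At, Rt, St+1, done). `q_function(state, action)` stands for
# the state-value network V(theta) evaluated per candidate action; its exact form is assumed.
import random

def decide_and_store(state, q_function, num_actions, epsilon, env_step, replay_append):
    if random.random() < epsilon:                       # explore with probability epsilon
        action = random.randrange(num_actions)
    else:                                               # otherwise pick the highest-valued action
        action = max(range(num_actions), key=lambda a: q_function(state, a))
    next_state, reward, done = env_step(action)         # execute the chosen scheduling rule
    replay_append(state, action, reward, next_state, done)
    return next_state, done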
further, the state features in step 3 are stage-by-stage, the state features of a certain stage are obtained by each rescheduling point, and each feature vector contains 8 types of features, which are 4n +3M in totals+4 parameters. E.g. for machine m at any stage ssiThe kth characteristic of the processing of the jth workpiece is designated as ak,j,sThe defined set of state features together represent local information and global information where the environment is located.
The definition of the state features is shown in Table 1.
Table 1. Machine state feature definition table
The parameters used in the table are described uniformly: assuming a dynamic event occurs at some stage s, the above state features of the buffer and the machines of that stage are extracted as the state vector. t is the environment time of the current decision point, n is the total number of workpiece types, J_jk denotes the k-th workpiece of type j, N_j denotes the number of type-j workpieces in the buffer, N_g denotes the number of workpieces in the buffer whose remaining delivery time falls within a specified interval, P_js denotes the average machining time of type-j workpieces over all machines at stage s, and D_jks denotes the delivery date of workpiece J_jk at stage s; the delivery date of any workpiece is calculated according to the following formula:
where A_jks is the time at which the workpiece reaches the buffer of stage s, and h is a preset delivery-date tightness factor.
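Because Table 1 itself is not reproduced here, the following Python sketch only assembles the parameters named in the preceding paragraph (t, N_j, N_g, P_js and per-machine features) into a flat per-stage vector; the exact composition of the 4n + 3M_s + 4 features is an assumption:

# Hypothetical assembly of a per-stage state vector from the quantities named in the text.
# The real feature set of Table 1 (8 feature types, 4n + 3*M_s + 4 values) is not reproduced
# here, so this layout is illustrative only.
def stage_state_vector(t, counts_per_type, near_due_count, avg_proc_time_per_type, machine_features):
    """counts_per_type       : N_j, number of type-j workpieces in the stage buffer (length n)
    near_due_count           : N_g, workpieces whose remaining delivery time falls in a given interval
    avg_proc_time_per_type   : P_js, average machining time of type j over the stage's machines (length n)
    machine_features         : per-machine features of the stage (length proportional to M_s)"""
    return [float(t), float(near_due_count)] + list(map(float, counts_per_type)) \
           + list(map(float, avg_proc_time_per_type)) + list(map(float, machine_features))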
Further, the set of candidate actions in step 3 is as follows:
Table 2. Candidate action set of the decision process
The invention selects scheduling rules as the candidate action set, mainly divided into two categories: workpiece assignment rules and machine tool selection rules. Therefore, the most suitable machine tool has to be selected as well as the most suitable workpiece.
For the above defined scheduling rule, the implementation steps are as follows:
assume that the set of idle workpieces in the buffer of stage s is denoted O_s = {J_jk}, and the set of idle machine tools of stage s is denoted F_s = {m_si}; each element of F_s is the number m_si of an idle machine tool of stage s at decision point t.
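Since the concrete rules of Table 2 are not reproduced here, the following Python sketch shows one assumed composite action built from the sets O_s and F_s (earliest stage delivery date for the workpiece, shortest actual machining time for the machine); the rule choice is illustrative only:

# Assumed example of one composite scheduling rule acting on the idle sets O_s and F_s:
# workpiece rule = earliest stage delivery date, machine rule = shortest actual machining time.
# The actual rules of Table 2 are not reproduced in the text, so these are placeholders.
def apply_composite_rule(idle_jobs, idle_machines, due_date_of, proc_time_on):
    """idle_jobs      : O_s, workpieces waiting in the buffer of stage s at decision point t
    idle_machines     : F_s, numbers m_si of the idle machine tools of stage s
    due_date_of(j)    : D_jks of workpiece j at this stage
    proc_time_on(j,m) : actual machining time P_jsi of workpiece j on machine m"""
    job = min(idle_jobs, key=due_date_of)                              # pick the most urgent workpiece
    machine = min(idle_machines, key=lambda m: proc_time_on(job, m))   # pick its fastest machine
    return job, machine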
step 4: before the agent is trained, judge whether the amount of data in the experience replay pool D reaches the threshold MEMORY_WARMUP_SIZE:
if not, repeating the step 3; otherwise, go to step 5
step 5: judge whether the decision-step counter count is an integer multiple of LEARN_FREQ, where LEARN_FREQ means that the state-value neural network V(θ) is trained once every LEARN_FREQ decision steps:
if yes, go to step 6; otherwise, go to step 7
step 6: randomly draw BATCH_SIZE samples from the experience replay pool D, use them to train the state-value neural network V(θ), update the parameters of V(θ) by gradient descent, and at the same time judge whether the decision-step counter count is an integer multiple of 1000:
if yes, deep-copy the network parameters to the target network V(θ)' and go to step 7; otherwise, go to step 7 directly.
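One possible form of the training step of step 6, written in Python with PyTorch and assuming a standard DQN-style temporal-difference target (the text itself only specifies gradient descent on V(θ) and a parameter copy every 1000 decision steps, so the loss form and optimizer are assumptions), is as follows; the mini-batch is assumed to have already been converted to tensors:

# Sketch of one training step of step 6 (standard DQN-style update assumed).
import copy
import torch
import torch.nn.functional as F

def learn_step(eval_net, target_net, optimizer, batch, gamma=0.9, step_count=0, copy_every=1000):
    states, actions, rewards, next_states, dones = batch        # tensors built from replay samples
    q_sa = eval_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * q_next       # finished episodes get no bootstrap term
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step_count % copy_every == 0:                            # asynchronous update of V(theta)'
        target_net.load_state_dict(copy.deepcopy(eval_net.state_dict()))
    return loss.item()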
step 7: judge whether the machining-completion flag done is true; if yes, increase the episode counter episode by 1 and go to step 8; otherwise, repeat step 3;
step 8: judge whether the episode counter episode reaches the threshold Max_episode:
if not, re-initialize the shop simulation environment and repeat step 3; otherwise, end the training.
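Putting steps 2 to 8 together, one illustrative outer training loop is sketched below in Python; it assumes the HybridFlowShopEnv, ReplayPool, decide_and_store and learn_step sketches given earlier in this description, and the concrete hyper-parameter values are assumptions:

# Illustrative outer training loop (steps 2-8). HybridFlowShopEnv, ReplayPool, decide_and_store
# and learn_step refer to the sketches given earlier and are assumptions, not the exact code.
MAX_EPISODE = 3000           # Max_episode, the episode threshold of step 8 (value assumed)
MEMORY_WARMUP_SIZE = 2000    # warm-up threshold of step 4 (value assumed)
LEARN_FREQ = 100             # training frequency of step 5 (value assumed)

def train(env, pool, eval_net, target_net, optimizer, q_function, num_actions, epsilon, batch_size=64):
    step_count = 0
    for episode in range(MAX_EPISODE):
        state, done = env.reset(), False                # step 2: re-initialize the shop simulation
        while not done:                                 # an episode ends when done becomes true (step 7)
            state, done = decide_and_store(state, q_function, num_actions, epsilon,
                                           env.step, pool.append)            # step 3
            step_count += 1
            if pool.ready(MEMORY_WARMUP_SIZE) and step_count % LEARN_FREQ == 0:   # steps 4-5
                batch = pool.sample(batch_size)         # stacking the tuples into tensors is omitted
                learn_step(eval_net, target_net, optimizer, batch, step_count=step_count)  # step 6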
Sixth, example
In order to verify the effectiveness and generality of the proposed method under different production configurations, the performance of the DQN decision system is compared with that of various single and composite scheduling rules under different production configurations.
First, for the agent parameters of the DQN algorithm: the discount factor γ measures how strongly future rewards influence the decision; the smaller the value, the more the decision process emphasizes the immediate decision benefit. It is generally set between 0.9 and 0.999, and here γ = 0.9. The exploration rate ε means that with probability ε the agent selects an action at random, and otherwise selects the action with the highest value; ε is usually set large at the beginning so that the agent can explore the environment sufficiently, and is then gradually reduced by a decrement after every decision until it reaches 1e-3, so that the optimal action can be selected stably after convergence; here ε = 0.4 and the decrement is 1e-6. The training batch size batch_size denotes the number of samples drawn from the replay pool at each model training step, usually 32, 64 or 128; here it is 64.
Second, for the neural network parameters of the DQN algorithm: the network is set as fully connected layers, comprising 1 input layer, 7 hidden layers and 1 output layer. The input layer contains 4n + 8m + 4 nodes, consistent with the number of system state parameters acquired at a decision point; each hidden layer contains 64 nodes with the ReLU activation function; the output layer contains 9 nodes, consistent with the number of actions. The learning rate indicates the degree of model correction in each training step; here lr = 0.001.
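A possible PyTorch realization of the fully connected network described above (an input layer sized to the state vector, hidden layers of 64 ReLU units, and one output node per action) is sketched below; the framework choice and the builder function are assumptions, only the layer sizes come from the text:

# Illustrative PyTorch version of the fully connected value network described above:
# one input layer, 7 hidden layers of 64 ReLU units, and one output node per action.
import torch.nn as nn

def build_value_network(state_dim, num_actions=9, hidden=64, num_hidden_layers=7):
    layers = [nn.Linear(state_dim, hidden), nn.ReLU()]
    for _ in range(num_hidden_layers - 1):
        layers += [nn.Linear(hidden, hidden), nn.ReLU()]
    layers.append(nn.Linear(hidden, num_actions))   # one output per candidate scheduling rule
    return nn.Sequential(*layers)

# Example: n = 10 workpiece types and m = 8 machines per stage give 4*10 + 8*8 + 4 = 108 inputs.
net = build_value_network(state_dim=4 * 10 + 8 * 8 + 4)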
Finally, the test instances used here are generated randomly: assume the shop starts empty (time 0) without any workpiece; the n types of workpieces then each arrive according to a Poisson process, and the inter-arrival time of two consecutive new jobs of any type j follows the uniform distribution U(1, 10); the processing times of each workpiece J_jk on different machine tools of the same processing stage are completely independent, and the processing time P_jki follows the uniform distribution U(1, 20); the weights W of the different workpiece types follow the uniform distribution U(1, 5). Assume the total number of machining stages S = 3 and the number of machine tools per stage m = 8. Twelve groups of comparison experiments are set up according to three parameters: the total number of workpieces N (50, 100 and 200), the number of workpiece types n (10 and 15) and the delivery-date tightness factor K (1 and 2); the specific parameter design of the comparison experiments is shown in the table. For each group of experiments 30 instances are generated randomly, and the average value is taken after the computation is finished.
Table 3. Parameter design of the comparison experiments
Table 4. Comparison of the results of the comparison experiments
As can be seen from Table 4, compared with single scheduling rules, the DQN-based scheduling method provided by the invention outperforms the single scheduling rules under most machining conditions and has the ability to select the scheduling rule in real time according to the machining conditions. It is close to the optimal solution under all machining conditions and runs stably. Although the training process usually takes considerable time (limited by machine performance, the training time on an i5 processor is about 10-30 h), once training is completed a decision can be made in a very short time in practical applications, which fully meets the real-time requirement of dynamic scheduling.
FIG. 3 shows the change of the average weighted tardiness of the workpieces as the agent is trained, with the number of workpiece types set to 10. Each time all workpieces have been processed counts as one episode; in the initial exploration stage the curve fluctuates strongly. However, when the exploration rate drops to 0.001 the agent tends to select the optimal scheduling rule; as the number of episodes keeps increasing the algorithm gradually converges, the average weighted tardiness of the workpieces drops to its lower bound, and after 2150 episodes the model can obtain the optimal solution stably.
FIG. 4 shows the change of the average weighted tardiness of the workpieces as the agent is trained, with the number of workpiece types set to 15. As the number of workpiece types increases, the model converges with more difficulty; after 6000 episodes the model can obtain the optimal solution stably.