Deep space probe soft landing path planning method based on multitask deep reinforcement learning
1. A deep space probe soft landing path planning method based on multitask deep reinforcement learning, characterized by comprising the following steps:
step 1: defining the nodes of the deep space probe and the obstacles in the deep space environment as agents;
step 2: on the basis of the deep deterministic policy gradient (DDPG) reinforcement learning model, constructing a multi-agent reinforcement learning model by adopting multi-task learning, which specifically comprises the following steps:
the DDPG model consists of an actor network that models the policy and a critic network that models the Q function, wherein the actor network comprises an online policy network and a target policy network, and the critic network comprises an online Q network and a target Q network;
the online policy network and the target policy network are each composed of two multi-layer perceptrons (MLPs); the agents share the parameters of the first 5 layers of the MLP by adopting a hard-parameter-sharing multi-task learning method; through multi-task learning, cooperation among the agents is realized; when one agent is learning, the other agents serve as supervision signals to improve the learning capability of the current agent;
step 3: when the MLP performs feature extraction, a self-attention mechanism fusing temporal context information is adopted to improve the MLP, as shown in formulas 1, 2 and 3:
Λ_i = softmax(f(F_{i-1}(o_i, a_i)))    (1)
F_i = Λ_i * F_i    (2)
F_i = F_i + F_{i-1}    (3)
where o_i represents the observed value of the i-th agent, a_i represents the behavior of the i-th agent, f represents the ReLU activation function, F_{i-1} represents the features of the (i-1)-th layer, Λ_i represents the normalized output, and F_i represents the features of the i-th layer;
step 4: the actor network generates a random process according to the current online policy μ and the random noise, and selects an action a_t^i for each agent according to the random process, a_t^i being the action of the i-th agent at time t; the agent then executes a_t^i in the environment in its current state, and the environment returns a reward r_t^i and a new state; the reward function is set as shown in formula 4:
where d_t represents the distance between the agent and the asteroid at time t, and d_{t-1} represents the distance between the agent and the asteroid at time t-1; d_body represents the distance of the agent from the probe body, and d_{agent_i} represents the distance of the i-th agent from the probe body; ω_{agent_t} represents the acceleration of the agent at time t, and ω_{agent_t-1} represents the acceleration of the agent at time t-1; v_{agent_t} represents the speed of the agent at time t, and v_{agent_t-1} represents the speed of the agent at time t-1;
step 5: the actor network stores the transition of each agent into an experience pool D as the data set for training the online policy network, where D = (x, x', a_1, ..., a_N, r_1, ..., r_N) includes the observations, behaviors, and rewards of all agents;
where x represents the observed values of the agents, x' represents the updated observed values, a_N represents the action of the N-th agent, and r_N represents the reward of the N-th agent;
step 6: each agent randomly samples N pieces of data from the corresponding experience pool D as one mini-batch of training data for the online policy network and the online Q network;
step 7: calculating the gradient of the online Q network by using the mean square error defined by formula 5;
where θ_i represents the parameters of the policy function μ_{θ_i} of the i-th agent; Q_i^μ represents the Q function value of the i-th agent under the policy μ, the observations x and the actions a, with (a_1, ..., a_N) representing the actions of the 1st through N-th agents; y represents the target value; E_{x,a,r,x'} represents the expected value under the observation x, action a, reward r, and new observation x'; L(θ_i) represents the loss function with respect to θ_i; r_i represents the reward obtained by the i-th agent; γ represents the discount factor; Q_i^{μ'} represents the Q function value of the i-th agent under the new policy μ', with (a'_1, ..., a'_N) representing the new actions of the 1st through N-th agents;
step 8: updating the online Q network;
step 9: approximating the policy of each agent, where φ denotes the parameters of the approximate policy; the approximate policy of the agent is shown in formulas 7 and 8:
where φ_j^i represents the approximate policy parameters of the j-th agent at the i-th iteration, and L(φ_j^i) represents the loss function with respect to φ_j^i; the approximate policy function gives the probability of executing the action a_j conditioned on the observed value o_j of the j-th agent; H denotes the entropy of the approximate policy; λ represents the discount coefficient; E_{o_j,a_j} represents the expected value with respect to the observation o_j and the action a_j; ŷ represents the approximate target value; r_i represents the reward value; Q_i^{μ'} represents the Q function value after the policy update; x' represents the updated observed value; and the approximate joint policy is evaluated at the observations (o_1, ..., o_i, ..., o_N) of the agents;
step 10: the maximum expected reward of each agent is shown in formula 9, and the gradient of the policy network is calculated using formula 10:
where μ_i represents the policy function of the i-th agent; R_i(s, a) represents the reward obtained by executing the action a in the state s; the expectation in formula 9 is taken with the index k of the sub-policy obeying the uniform distribution unif(1, K), the state s following the distribution p^μ, and the action a executed under the k-th sub-policy; ∇ indicates that a gradient is calculated; D_i^k represents the experience pool of the sub-policy μ_i^k; J_e(μ_i) represents the expected value of R_i(s, a) when the state s follows the distribution p^μ; K represents the number of sub-policies; the expectation in formula 10 is taken over the observed value x and the action a sampled from the experience pool of the k-th sub-policy; μ_i^k(a_i | o_i) represents the policy function of the i-th agent in the k-th sub-policy, conditioned on the observed value o_i with the action a_i, where o_i represents the observed value of the i-th agent and a_i represents the behavior of the i-th agent; Q^{μ_i}(x, a_1, ..., a_N) represents the value of the Q function when the policy μ_i is executed with the observed value x and the actions (a_1, ..., a_N); and μ_i^k(o_i) represents the policy function of the k-th sub-policy of the i-th agent at the observed value o_i;
step 11: updating the online policy network;
step 12: updating the parameters of the target policy network by adopting the soft update of formula 11:
soft update: θ^{Q'} ← τθ^Q + (1-τ)θ^{Q'}, θ^{μ'} ← τθ^μ + (1-τ)θ^{μ'}, where τ represents the adjustment coefficient, θ^Q represents the parameters of the Q function, θ^{Q'} represents the parameters of the updated Q function, θ^μ represents the parameters of the policy function μ, and θ^{μ'} represents the parameters of the updated policy function μ'.
2. The deep space probe soft landing path planning method based on multitask deep reinforcement learning according to claim 1, characterized in that step 8 specifically adopts an Adam optimizer to update θ^Q, where θ^Q represents the parameters of the online Q network.
3. The deep space probe soft landing path planning method based on multitask deep reinforcement learning according to claim 1, characterized in that step 11 specifically adopts an Adam optimizer to update θ^μ, where θ^μ represents the parameters of the policy function μ.
Background
Asteroid exploration is a comprehensive, multidisciplinary, high-technology systems engineering undertaking that embodies a country's overall strength and competitiveness. Asteroid exploration not only helps humanity further understand the origin and evolution of the solar system, but also promotes the development and verification of new aerospace theories and technologies, drives scientific and technological innovation, and further enhances comprehensive national power.
Traditional deep space probes mainly rely on human prior knowledge to formulate a flight strategy before landing. However, during the landing process, due to the lack of autonomous path planning and the particularity and unknown nature of asteroids, the probe may lose control, roll over, or overturn.
Solving the landing problem of deep space probes is one of the key tasks in realizing deep space exploration. Because of the complex deep space environment and the weak gravitation of asteroids, reducing the probe's dependence on ground-based human prior knowledge and achieving soft landing through autonomously planned paths is the key to realizing autonomous sampling by the probe. In addition, because many obstacles exist in deep space, performing path planning during the soft landing process allows the probe to avoid collisions with obstacles and other bodies in deep space, thereby improving the landing success rate.
Currently, existing probe landing path planning methods include a planetary rover path planning method based on the D3QN-PER algorithm, optics-based autonomous navigation, landing with fixed-time control, landing with an attachment strategy, and the like. However, these methods are either only applicable to static environments or lack autonomous planning capability, and have difficulty coping with the complex deep space environment, especially when facing asteroids with unknown parameters, which easily leads to landing failure.
Disclosure of Invention
The invention aims to solve the technical problem of the high landing failure rate of deep space probes, caused by factors such as the long flight distance, the long communication delay with the ground, the lack of autonomous planning capability due to heavy reliance on human experience, and the unknown and particular characteristics of asteroids during the landing process, and creatively provides a deep space probe soft landing path planning method based on multitask deep reinforcement learning.
The innovation of the invention is as follows: based on DDPG (Deep Deterministic Policy Gradient), multi-task learning and a self-attention mechanism fusing temporal context information are adopted to realize stable landing of the deep space probe, laying a foundation for subsequent asteroid exploration, autonomous sampling, and astronaut landing activities.
The invention is realized by adopting the following technical scheme.
A deep space probe soft landing path planning method based on multitask deep reinforcement learning comprises the following steps:
First, on the basis of the DDPG model, a multi-agent reinforcement learning model is constructed by adopting multi-task learning. The DDPG algorithm comprises two parts, an actor network and a critic network: the actor comprises an online policy network and a target policy network, the critic comprises an online Q network and a target Q network, and the online and target networks are each formed by two MLPs. On the basis of DDPG, the MLP is improved by adopting a multi-task learning mode based on hard parameter sharing, and further improved by adopting a self-attention mechanism fusing temporal context information, so that during learning each agent focuses more on the information that brings it the maximum return.
The online policy network and random noise generate a random process, and the actor network selects an action for each agent according to this random process; the action interacts with the environment, which returns a reward and a new state. The transition of each agent is stored into an experience pool as the data set for training the online networks.
Then, each agent randomly samples N pieces of data from the corresponding experience pool as mini-batch training data for the online policy network and the online Q network.
Next, the gradient of the online Q network is calculated using the mean square error, and the online Q network is updated. The gradient of the policy network is calculated using the Monte Carlo method, and the online policy network is updated.
Finally, the parameters of the target policy network are updated by adopting a soft update to complete the path planning.
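As an illustrative aid only, the overall procedure can be sketched as the following training loop. The environment, network, and reward stand-ins (toy_env_step, policy, and the placeholder reward) are assumptions introduced for illustration and are not the ones disclosed by the invention.

```python
# Minimal sketch of the overall training loop described above (assumption:
# dimensions, the toy environment, and the placeholder reward are illustrative
# stand-ins, not the disclosed embodiment).
import numpy as np

N_AGENTS, OBS_DIM, ACT_DIM = 3, 6, 3   # assumed toy dimensions

def toy_env_step(actions):
    """Stand-in for the deep-space environment: returns next observations and rewards."""
    next_obs = np.random.randn(N_AGENTS, OBS_DIM)
    rewards = -np.linalg.norm(actions, axis=1)   # placeholder reward, not formula 4
    return next_obs, rewards

def policy(obs, noise_scale=0.1):
    """Stand-in online policy mu plus exploration noise."""
    action = np.tanh(obs[:ACT_DIM])              # placeholder deterministic policy
    return action + noise_scale * np.random.randn(ACT_DIM)

replay = []                                      # experience pool D
obs = np.random.randn(N_AGENTS, OBS_DIM)
for step in range(1000):
    actions = np.stack([policy(o) for o in obs])           # one action per agent
    next_obs, rewards = toy_env_step(actions)
    replay.append((obs, actions, rewards, next_obs))       # store (x, a, r, x')
    obs = next_obs
    if len(replay) >= 64:
        batch = [replay[i] for i in np.random.randint(len(replay), size=64)]
        # ... the critic update, actor update, and soft update of the target
        # networks would use `batch` here.
```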
Advantageous effects
Compared with the prior art, the method has the following advantages:
1. By adopting a multi-task learning mode, the adversarial and cooperative relations between the agents are fully utilized, the ability of each agent to deal with uncertain conditions is further improved, and the overall generalization performance of the model is improved.
2. By adopting a self-attention mechanism fusing temporal context information, the agent is prevented from falling into a local optimum and can focus on the information that helps it obtain the maximum return during learning, thereby further improving the landing success rate of the probe.
Drawings
FIG. 1 is a schematic diagram of the model structure of the present invention.
FIG. 2 is a diagram of an agent's multitask learning architecture based on hard parameter sharing.
FIG. 3 is a diagram of the structure of the deep reinforcement learning DDPG model employed by the method.
FIG. 4 is a graph of experimental results comparing this method with other methods.
Detailed Description
The method of the present invention will be described in further detail with reference to the accompanying drawings.
As shown in FIG. 1, a deep space probe soft landing path planning method based on multitask deep reinforcement learning comprises the following steps.
Step 1: the nodes of the deep space probe and the obstacles in the deep space environment are defined as agents.
Step 2: on the basis of the DDPG model, a multi-agent reinforcement learning model is constructed by adopting multi-task learning, as shown in FIG. 2. The specific steps are as follows:
the DDPG model consists of an actor network simulating the strategy and a critic network simulating the Q function. The operator network comprises an online policy network and a target policy network, and the critical network comprises an online Q function and a target Q network. As shown in fig. 3.
The online policy network and the target policy network are each composed of two multi-layer perceptrons (MLPs). The agents share the parameters of the first 5 layers of the MLP by adopting a hard-parameter-sharing multi-task learning method. Through multi-task learning, cooperation among the agents is achieved: when one agent is learning, the other agents serve as supervision signals to improve the learning ability of the current agent.
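As an illustrative aid, the hard-parameter-sharing scheme can be sketched in Python as follows. The class name SharedTrunkPolicy, the layer sizes, and the use of two shared hidden layers are assumptions introduced only for illustration; the embodiment itself specifies sharing the first 5 layers of the MLP.

```python
# Minimal sketch of hard parameter sharing across agents: a trunk shared by all
# agents plus one private head per agent (layer sizes are illustrative).
import torch
import torch.nn as nn

class SharedTrunkPolicy(nn.Module):
    def __init__(self, n_agents, obs_dim, act_dim, hidden=64):
        super().__init__()
        # Trunk shared by every agent (hard parameter sharing).
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One private output head per agent.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, act_dim) for _ in range(n_agents)]
        )

    def forward(self, agent_idx, obs):
        features = self.shared(obs)   # gradients from every agent update these shared weights
        return torch.tanh(self.heads[agent_idx](features))

# usage: actions for agent 0 from a batch of observations
policy = SharedTrunkPolicy(n_agents=3, obs_dim=6, act_dim=3)
a0 = policy(0, torch.randn(32, 6))
```

Because the shared trunk receives gradients from every agent's loss, each agent effectively acts as a supervision signal for the others, which is the cooperation described above.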
Step 3: when the MLP performs feature extraction, a self-attention mechanism fusing temporal context information is adopted to improve the MLP, as shown in formulas 1, 2 and 3:
Λ_i = softmax(f(F_{i-1}(o_i, a_i)))    (1)
F_i = Λ_i * F_i    (2)
F_i = F_i + F_{i-1}    (3)
where o_i represents the observed value of the i-th agent, a_i represents the behavior of the i-th agent, f represents the ReLU activation function, F_{i-1} represents the features of the (i-1)-th layer, Λ_i represents the normalized output, and F_i represents the features of the i-th layer.
By using the self-attention mechanism, each agent pays more attention, during multi-task learning, to the information that helps it obtain the maximum return. Meanwhile, fusing the temporal context information prevents the agent from falling into a local optimum.
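A minimal sketch of the attention layer of formulas 1 to 3 is given below, assuming that f is a ReLU-activated linear layer, that the product in formula 2 is element-wise, and that the residual addition in formula 3 fuses the previous layer's (temporal context) features; the dimensions are illustrative.

```python
# Minimal sketch of the attention-augmented MLP layer of formulas 1-3.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)   # assumed form of f in formula 1

    def forward(self, prev_features):
        # Formula 1: normalized attention weights from the previous layer's features.
        weights = torch.softmax(F.relu(self.fc(prev_features)), dim=-1)
        # Formula 2: re-weight the features with the attention weights (assumed element-wise).
        attended = weights * prev_features
        # Formula 3: residual connection that fuses the previous (temporal) context.
        return attended + prev_features

layer = AttentiveLayer(dim=64)
out = layer(torch.randn(32, 64))   # features F_{i-1} built from (o_i, a_i)
```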
Step 4: the actor network generates a random process according to the current online policy μ and the random noise, and selects an action a_t^i for each agent according to the random process, a_t^i being the action of the i-th agent at time t. The agent then executes a_t^i in the environment in its current state, and the environment returns a reward r_t^i and a new state. The reward function is set as shown in formula 4:
where d_t represents the distance between the agent and the asteroid at time t, and d_{t-1} represents the distance between the agent and the asteroid at time t-1; d_body represents the distance of the agent from the probe body, and d_{agent_i} represents the distance of the i-th agent from the probe body; ω_{agent_t} represents the acceleration of the agent at time t, and ω_{agent_t-1} represents the acceleration of the agent at time t-1; v_{agent_t} represents the speed of the agent at time t, and v_{agent_t-1} represents the speed of the agent at time t-1.
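The action-selection part of step 4 can be sketched as follows, assuming Gaussian exploration noise and a bounded action range; since formula 4 is not reproduced in the text above, no reward expression is shown.

```python
# Minimal sketch of step 4: the actor selects an action from the current online
# policy mu plus exploration noise (assumption: Gaussian noise, actions in [-1, 1]).
import torch

def select_action(actor, obs, noise_std=0.1):
    with torch.no_grad():
        action = actor(obs)                                     # deterministic action mu(o_t)
    action = action + noise_std * torch.randn_like(action)      # random exploration process
    return action.clamp(-1.0, 1.0)                              # keep the action in a valid range

# usage with a stand-in actor network
actor = torch.nn.Sequential(torch.nn.Linear(6, 3), torch.nn.Tanh())
a_t = select_action(actor, torch.randn(1, 6))
```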
Step 5: the actor network stores the transition of each agent into the experience pool D as the data set for training the online policy network, where D = (x, x', a_1, ..., a_N, r_1, ..., r_N) includes the observations, behaviors, and rewards of all agents.
where x represents the observed values of the agents, x' represents the updated observed values, a_N represents the action of the N-th agent, and r_N represents the reward of the N-th agent.
Step 6: each agent randomly samples N pieces of data from the corresponding experience pool D as one mini-batch of training data for the online policy network and the online Q network.
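A minimal sketch of the experience pool D and the mini-batch sampling of steps 5 and 6 follows, assuming a simple bounded buffer; the tuple layout mirrors (x, a, r, x') from the description.

```python
# Minimal sketch of the experience pool and random mini-batch sampling.
import random
from collections import deque

class ExperiencePool:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, x, actions, rewards, x_next):
        # x / x_next: observations of all agents; actions, rewards: one entry per agent.
        self.buffer.append((x, actions, rewards, x_next))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        xs, acts, rews, xs_next = zip(*batch)
        return xs, acts, rews, xs_next

pool = ExperiencePool()   # one pool per agent (or per sub-policy) as described above
```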
Step 7: the gradient of the online Q network is calculated using the mean square error defined by formula 5.
where θ_i represents the parameters of the policy function μ_{θ_i} of the i-th agent; Q_i^μ represents the Q function value of the i-th agent under the policy μ, the observations x and the actions a, with (a_1, ..., a_N) representing the actions of the 1st through N-th agents; y represents the target value; E_{x,a,r,x'} represents the expected value under the observation x, action a, reward r, and new observation x'; L(θ_i) represents the loss function with respect to θ_i; r_i represents the reward obtained by the i-th agent; γ represents the discount factor; Q_i^{μ'} represents the Q function value of the i-th agent under the new policy μ', with (a'_1, ..., a'_N) representing the new actions of the 1st through N-th agents.
Step 8: the online Q network is updated. An Adam optimizer is used to update θ^Q, where θ^Q represents the parameters of the online Q network.
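A minimal sketch of the critic update of steps 7 and 8 is given below. Because the exact expression of formula 5 is not reproduced above, the sketch follows the standard DDPG-style update that the variable definitions describe: the target y is the reward plus the discounted target-Q value, the online Q network is fitted to y with a mean square error loss, and θ^Q is updated with Adam. Network sizes are illustrative.

```python
# Minimal sketch of the critic (online Q network) update with MSE loss and Adam.
import torch
import torch.nn as nn

obs_dim, act_dim, n_agents = 6, 3, 3
critic = nn.Sequential(nn.Linear(n_agents * (obs_dim + act_dim), 64), nn.ReLU(), nn.Linear(64, 1))
target_critic = nn.Sequential(nn.Linear(n_agents * (obs_dim + act_dim), 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.95   # discount factor

def critic_update(x, actions, rewards, x_next, next_actions):
    # x, x_next: (batch, n_agents*obs_dim); actions, next_actions: (batch, n_agents*act_dim)
    # rewards: (batch, 1)
    with torch.no_grad():
        y = rewards + gamma * target_critic(torch.cat([x_next, next_actions], dim=-1))
    q = critic(torch.cat([x, actions], dim=-1))
    loss = nn.functional.mse_loss(q, y)   # mean square error used as the critic loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                      # Adam update of theta_Q (step 8)
    return loss.item()
```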
Step 9: because of the interactions between agents, the policy of each agent may be influenced by the other agents; the policy of each agent is therefore approximated, where φ denotes the parameters of the approximate policy. The approximate policy of the agent is shown in formulas 7 and 8:
where φ_j^i represents the approximate policy parameters of the j-th agent at the i-th iteration, and L(φ_j^i) represents the loss function with respect to φ_j^i; the approximate policy function gives the probability of executing the action a_j conditioned on the observed value o_j of the j-th agent; H denotes the entropy of the approximate policy; λ represents the discount coefficient; E_{o_j,a_j} represents the expected value with respect to the observation o_j and the action a_j; ŷ represents the approximate target value; r_i represents the reward value; Q_i^{μ'} represents the Q function value after the policy update; x' represents the updated observed value; and the approximate joint policy is evaluated at the observations (o_1, ..., o_i, ..., o_N) of the agents.
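A minimal sketch of the policy approximation of step 9 follows, assuming a Gaussian approximate policy trained by maximizing the log-likelihood of the observed actions plus an entropy term weighted by λ. Formulas 7 and 8 are not reproduced above, so this is only an interpretation of the variable definitions, not the disclosed expressions.

```python
# Minimal sketch: fitting an approximate policy of another agent from its
# observed (o_j, a_j) pairs, with an entropy regularizer weighted by lambda.
import torch
import torch.nn as nn

class ApproxPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        # Assumed Gaussian approximate policy over actions.
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

def approx_policy_loss(approx, obs_j, act_j, lam=1e-3):
    dist = approx(obs_j)
    log_prob = dist.log_prob(act_j).sum(-1)    # log-likelihood of the observed action
    entropy = dist.entropy().sum(-1)           # entropy of the approximate policy
    return -(log_prob + lam * entropy).mean()  # minimize the negative objective
```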
Step 10: the maximum expected reward of each agent is shown in formula 9, and the gradient of the policy network is calculated using formula 10:
where μ_i represents the policy function of the i-th agent; R_i(s, a) represents the reward obtained by executing the action a in the state s; the expectation in formula 9 is taken with the index k of the sub-policy obeying the uniform distribution unif(1, K), the state s following the distribution p^μ, and the action a executed under the k-th sub-policy; ∇ indicates that a gradient is calculated; D_i^k represents the experience pool of the sub-policy μ_i^k; J_e(μ_i) represents the expected value of R_i(s, a) when the state s follows the distribution p^μ; K represents the number of sub-policies; the expectation in formula 10 is taken over the observed value x and the action a sampled from the experience pool of the k-th sub-policy; μ_i^k(a_i | o_i) represents the policy function of the i-th agent in the k-th sub-policy, conditioned on the observed value o_i with the action a_i, where o_i represents the observed value of the i-th agent and a_i represents the behavior of the i-th agent; Q^{μ_i}(x, a_1, ..., a_N) represents the value of the Q function when the policy μ_i is executed with the observed value x and the actions (a_1, ..., a_N); and μ_i^k(o_i) represents the policy function of the k-th sub-policy of the i-th agent at the observed value o_i.
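A minimal sketch of the sub-policy ensemble of step 10 is given below, assuming that each agent keeps K sub-policies with one experience pool each and draws one sub-policy uniformly (unif(1, K)) at the start of each episode; the gradient of formula 10 would then be estimated only from that sub-policy's samples.

```python
# Minimal sketch of drawing one of K sub-policies uniformly per episode.
import random

K = 3
sub_policies = [f"mu_i^{k}" for k in range(1, K + 1)]   # placeholder sub-policy objects
sub_pools = {k: [] for k in range(1, K + 1)}            # one experience pool D_i^k per sub-policy

def begin_episode():
    k = random.randint(1, K)                            # k ~ unif(1, K)
    return k, sub_policies[k - 1], sub_pools[k]

k, active_policy, active_pool = begin_episode()
```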
Step 11: the online policy network is updated. An Adam optimizer is used to update θ^μ, where θ^μ represents the parameters of the policy function μ.
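A minimal sketch of the actor update of step 11 follows, assuming the gradient of formula 10 is estimated over the sampled mini-batch (a Monte Carlo estimate) by ascending the critic's value of the actor's own action, with θ^μ updated by Adam; network sizes are illustrative.

```python
# Minimal sketch of the online policy (actor) update with Adam.
import torch
import torch.nn as nn

obs_dim, act_dim = 6, 3
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_update(obs_batch):
    actions = actor(obs_batch)
    # Ascend the expected return: minimize the negative critic value of the actor's action.
    loss = -critic(torch.cat([obs_batch, actions], dim=-1)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()      # Adam update of theta_mu
    return loss.item()
```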
Step 12: the parameters of the target policy network are updated by adopting the soft update of formula 11.
Soft update: θ^{Q'} ← τθ^Q + (1-τ)θ^{Q'}, θ^{μ'} ← τθ^μ + (1-τ)θ^{μ'}, where τ represents the adjustment coefficient, θ^Q represents the parameters of the Q function, θ^{Q'} represents the parameters of the updated Q function, θ^μ represents the parameters of the policy function μ, and θ^{μ'} represents the parameters of the updated policy function μ'.
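A minimal sketch of the soft update of step 12: every target parameter is moved towards the corresponding online parameter by the adjustment coefficient τ.

```python
# Minimal sketch of the soft update of the target network parameters.
import torch

def soft_update(online_net, target_net, tau=0.01):
    with torch.no_grad():
        for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)   # theta' <- tau*theta + (1-tau)*theta'

# usage with any pair of identically shaped networks
online = torch.nn.Linear(6, 3)
target = torch.nn.Linear(6, 3)
soft_update(online, target)
```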
In the experimental tests, the hyperparameter settings of the AMTDRL model are shown in Table 1:
TABLE 1 AMTDRL model hyperparameters
The parameter settings of the probe are shown in Table 2:
TABLE 2 Probe parameters
The MADDPG model is used as the baseline for comparison, and the experimental results are shown in FIG. 4. The algorithm iterates for 30000 episodes, sampling every 100 episodes. As can be seen from FIG. 4, the average rewards obtained by AMTDRL and MADDPG tend to be consistent over the first 10000 iterations, but as the number of iterations increases, the average reward of AMTDRL remains consistently higher than that of MADDPG, which indicates that with the proposed method the probe can better avoid obstacles and obtain a better landing path.