Control method of space robot mechanical arm
1. A control method of a space robot mechanical arm, characterized in that an image collected by a space robot base camera is obtained and a reward function used in the mechanical arm control process is set; a mechanical arm control strategy network, a state action value network and a state value network are constructed, images are input to the control strategy network and action information is output to control the mechanical arm, and multiple interactions accumulate interaction information pairs to form an experience pool; the objective functions required for training each network are constructed according to maximum entropy reinforcement learning, the parameters of each network are trained with a gradient descent algorithm using the interaction information in the experience pool to obtain a trained mechanical arm control strategy network, which takes image information as input and outputs action information to control the mechanical arm.
2. A method of controlling a robotic arm of a space robot as claimed in claim 1, characterized in that the method comprises the steps of:
(1) modeling the control of the mechanical arm as a Markov decision process: obtaining an image collected by the space robot base camera and defining the image at time t as the state information s_t, a matrix of the form W×H×3, wherein the 3 dimensions comprise the three RGB color channels and the image of each channel comprises W×H pixels, W being the width of the image and H being the height of the image; taking the angular velocity a_t of the space robot joints as the action information, wherein t denotes a sampling time;
(2) setting the reward function r_t(s_t, a_t) used in mechanical arm control, completing the modeling of the Markov decision process; the expression of the reward function r_t(s_t, a_t) is as follows:
r_t(s_t, a_t) = -[β_1 d² + ln(β_2 d² + ε_d) + β_3 ‖a_t - a_{t-1}‖²]
wherein p_e is the end position of the space robot mechanical arm, p_t is the target position of the mechanical arm in the target space, d is the Euclidean distance d = ‖p_e - p_t‖, β_1 = 10⁻³, β_2 = 1, β_3 = 10⁻¹, ε_d prevents the ln function from producing a singularity and ε_d = 0.001, and the superscript T denotes matrix transposition;
(3) constructing the mechanical arm control strategy network π_φ, wherein φ represents the parameters of the mechanical arm control strategy network; the image s_t is input to the mechanical arm control strategy network π_φ to obtain the action information, the joint angular velocity a_t, specifically comprising the following steps:
(3-1) the first layer of the mechanical arm control strategy network π_φ is a convolutional layer whose convolution kernel weight W_1 is a G_w1×L_w1×M_w1 matrix, wherein G_w1 is the number of channels of the convolution kernel, L_w1 is the width of the convolution kernel and M_w1 is the height of the convolution kernel;
(3-2) the second layer of the mechanical arm control strategy network π_φ is a batch normalization layer, which normalizes each of the feature maps obtained from the previous layer; the number of normalization parameters is positively and linearly correlated with the number of feature maps;
(3-3) the third layer of the mechanical arm control strategy network π_φ is a max pooling layer whose filter P_1 is a 2×2 matrix;
(3-4) the fourth layer of the mechanical arm control strategy network π_φ is a convolutional layer whose convolution kernel weight W_2 is a G_w2×L_w2×M_w2 matrix, wherein G_w2 is the number of channels of the convolution kernel, L_w2 is the width of the convolution kernel and M_w2 is the height of the convolution kernel;
(3-5) the fifth layer of the mechanical arm control strategy network π_φ is a batch normalization layer, which normalizes each of the feature maps from the previous layer; the number of normalization parameters is positively and linearly correlated with the number of feature maps;
(3-6) the sixth layer of the mechanical arm control strategy network π_φ is a max pooling layer whose filter P_2 is a 2×2 matrix;
(3-7) the seventh layer of the mechanical arm control strategy network π_φ is a convolutional layer whose convolution kernel weight W_3 is a G_w3×L_w3×M_w3 matrix, wherein G_w3 is the number of channels of the convolution kernel, L_w3 is the width of the convolution kernel and M_w3 is the height of the convolution kernel;
(3-8) the eighth layer of the mechanical arm control strategy network π_φ is a batch normalization layer, which normalizes each of the feature maps from the previous layer; the number of normalization parameters is related to the number of feature maps;
(3-9) the ninth layer of the mechanical arm control strategy network π_φ is a max pooling layer whose filter P_3 is a 2×2 matrix;
(3-10) the tenth layer of the mechanical arm control strategy network π_φ is a fully connected layer; the number of input neurons is F_9, the number of flattened features output by the previous feature layer, the number of output neurons is F_10, and the neuron weight is W_10;
(3-11) the eleventh layer of the mechanical arm control strategy network π_φ is a fully connected layer; the number of input neurons is F_10, as output by the previous layer, the number of output neurons is F_11, and the neuron weight is W_11;
(3-12) the twelfth layer of the mechanical arm control strategy network π_φ is a fully connected layer; the number of input neurons is F_11, as output by the previous layer, the output neurons are the mean μ_t and variance Σ_t of a Gaussian distribution, and the neuron weight is W_12;
(3-13) obtaining the mechanical arm control strategy network π_φ according to steps (3-1) to (3-12);
(3-14) inputting the RGB three-channel image s_t collected in step (1) to the mechanical arm control strategy network π_φ of step (3-13); the mechanical arm control strategy network π_φ outputs the mean μ_t and variance Σ_t of a Gaussian distribution; the mean μ_t and variance Σ_t are combined into the probability distribution of the space robot joint angular velocity a_t, and the mechanical arm joint angular velocity a_t is obtained by sampling;
(4) constructing the mechanical arm state value network V_ψ according to the Markov decision process principle, wherein ψ represents the parameters of the mechanical arm state value network; the image s_t is input to obtain the state value v_t, specifically comprising the following steps:
(4-1) repeating step (3-1) to step (3-11) to construct the first to eleventh layers of the mechanical arm state value network V_ψ;
(4-2) the twelfth layer of the mechanical arm state value network V_ψ is a fully connected layer; the number of input neurons is F_11, as output by the previous layer, the output neuron is the state value function v_t, and the neuron weight is W_12;
(4-3) obtaining the mechanical arm state value network V_ψ according to step (4-1) to step (4-2);
(5) constructing the mechanical arm state action value network Q_θ according to the Markov decision process principle, wherein θ represents the parameters of the mechanical arm state action value network; the image s_t and the joint angular velocity a_t are input to Q_θ to obtain the mapping to the state action value q_t, specifically comprising the following steps:
(5-1) repeating step (3-1) to step (3-10) to construct the first to tenth layers of the mechanical arm state action value network Q_θ;
(5-2) the eleventh layer of the mechanical arm state action value network Q_θ is a fully connected layer; the input neurons are the combination of the F_10 outputs of the previous layer and the joint angular velocity a_t of the space robot, the number of output neurons is F_11, and the neuron weight is W_11;
(5-3) the twelfth layer of the mechanical arm state action value network Q_θ is a fully connected layer; the number of input neurons is F_11, as output by the previous layer, the output neuron is the state action value function q_t, and the neuron weight is W_12;
(5-4) obtaining the mechanical arm state action value network Q_θ according to step (5-1) to step (5-3);
(6) inputting the image s_t collected at sampling time t in step (1) to the mechanical arm control strategy network π_φ of step (3), whose output is the mechanical arm joint angular velocity a_t; the joint angular velocity a_t is passed to a proportional-derivative controller C, and the proportional-derivative controller C outputs joint torques to realize control of the robot; the image s_t collected at sampling time t and the desired joint angular velocity a_t tracked by the joints are input to the reward function of step (2) to obtain the reward value r_t, the image s_{t+1} at time t+1 is obtained, and the interaction information pair at time t is obtained as E_t = <s_t, a_t, r_t, s_{t+1}>;
(7) traversing all images s_{t=1:T} collected within a period T and repeating step (6) to obtain multiple groups of interaction information pairs; the multiple groups of interaction information pairs form an experience replay pool D; the images s_t collected at different sampling times are input respectively to the mechanical arm state value network V_ψ of step (4) and the mechanical arm state action value network Q_θ of step (5) to obtain the state value v_ψ(s_t) and the state action value q_t(s_t, a_t);
(8) establishing an optimization objective according to maximum entropy reinforcement learning, such that the accumulated reward and the entropy of the policy are maximized:
wherein H denotes the information entropy;
(9) training the parameters of the mechanical arm state action value network Q_θ by minimizing the Bellman residual, to obtain the expression of the training objective J_Q(θ) for the mechanical arm state action value network Q_θ:
(10) training the parameters of the mechanical arm state value network V_ψ by minimizing a squared loss, to obtain the expression of the training objective J_V(ψ) for the mechanical arm state value network V_ψ:
(11) training the policy function by minimizing the expected KL divergence, to obtain the expression of the policy training objective J_π(φ):
(12) training the mechanical arm control strategy network π_φ, the state action value network Q_θ and the state value network V_ψ of steps (3) to (5) with the training objectives obtained in steps (9) to (11) and a gradient descent method, completing the training of the networks;
(13) acquiring in real time the image s_t obtained by the camera mounted on the space robot base, inputting the image s_t acquired in real time to the mechanical arm control strategy network π_φ trained in step (12), and outputting the mechanical arm joint angular velocity a_t at sampling time t, thereby realizing control of the space robot mechanical arm and accomplishing the trajectory planning task within the time period T.
Background
Trajectory planning is the most common task performed by space robots and has been studied extensively. The Generalized Jacobian Matrix (GJM) of the space manipulator allows trajectory planning of the robot arm without disturbing the attitude of the base. However, in some cases singular points may exist in the GJM, which limits the feasible space of kinematic planning by the GJM inverse method. When the space robot performs path planning, dynamically singular points may exist along the path; finite joint velocities cannot be realized at such points, which increases the length of the planned path. Traditional space robot trajectory planning schemes therefore focus mainly on handling the singularities that arise in the solution. In recent years, some methods based on intelligent optimization have addressed the dynamic singularity problem of space robots. For example, Wu et al. used the DDPG algorithm to accomplish the single-target-point trajectory planning task of a dual-arm space robot, see Wu, Yun-Hua, et al. Reinforcement learning in dual-arm trajectory planning for a free-floating space robot [J]. Aerospace Science and Technology, 2020, 98: 105657.
However, acquiring the pose of the target still requires a separately designed controller, and such a model-based, modularized design has certain disadvantages. First, limited modeling detail restricts the accuracy of the model, and modeling errors and constraints degrade the control performance. Second, the design of a modular controller is very laborious, requiring manual tuning of the control parameters of each module. Therefore, for free-floating space robot control, end-to-end model-free reinforcement learning is adopted, i.e., the controller is learned directly from raw image pixels; this overcomes the singular-solution and modeling-error problems of traditional methods, unifies perception and decision-making in a single method, and avoids the design of a modular controller.
Disclosure of Invention
The invention aims to provide a control method of a space robot mechanical arm, which improves the existing free-floating space robot control methods so that a space robot can capture space debris and defunct satellites.
The invention provides a control method of a space robot mechanical arm, which comprises the steps of first obtaining an image collected by the space robot base camera and setting a reward function used in the mechanical arm control process; constructing a mechanical arm control strategy network, a state action value network and a state value network, inputting images to the control strategy network and outputting action information to control the mechanical arm, and interacting multiple times to accumulate interaction information pairs that form an experience pool; and constructing the objective functions required for training each network according to maximum entropy reinforcement learning, training the parameters of each network with a gradient descent algorithm using the interaction information in the experience pool to obtain a trained mechanical arm control strategy network, which takes image information as input and outputs action information to control the mechanical arm.
The invention provides a control method of a space robot mechanical arm, which has the characteristics and advantages that:
The control method of the space robot mechanical arm according to the invention adopts end-to-end model-free reinforcement learning, i.e., the controller is learned directly from raw image pixels. This overcomes the singular-solution and modeling-error problems of traditional methods, unifies perception and decision-making in a single method, and avoids the manual parameter tuning involved in the design of a modular controller.
Drawings
Fig. 1 is a flow chart of a control method of a space robot manipulator according to the present invention.
Detailed Description
The invention provides a control method of a space robot mechanical arm, which comprises the steps of first obtaining an image collected by the space robot base camera and setting a reward function used in the mechanical arm control process; constructing a mechanical arm control strategy network, a state action value network and a state value network, inputting images to the control strategy network and outputting action information to control the mechanical arm, and interacting multiple times to accumulate interaction information pairs that form an experience pool; and constructing the objective functions required for training each network according to maximum entropy reinforcement learning, training the parameters of each network with a gradient descent algorithm using the interaction information in the experience pool to obtain a trained mechanical arm control strategy network, which takes image information as input and outputs action information to control the mechanical arm.
The flow chart of the mechanical arm control method of the space robot is shown in fig. 1, and the specific steps are as follows:
(1) modeling the control of the mechanical arm as a Markov decision process: obtaining an image collected by the space robot base camera and defining the image at time t as the state information s_t, a matrix of the form W×H×3, wherein the 3 dimensions comprise the three RGB color channels and the image of each channel comprises W×H pixels, W being the width of the image and H being the height of the image; taking the angular velocity a_t of the space robot joints as the action information, wherein t denotes a sampling time;
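As a concrete illustration of the state and action defined in step (1), a minimal Python sketch follows; the image size and number of joints are placeholders chosen for illustration, not values fixed by the method:

```python
import numpy as np

# Assumed illustrative dimensions; the method itself only fixes the W x H x 3 structure.
W, H = 128, 128          # image width and height (placeholders)
N_JOINTS = 6             # number of controlled joints (assumption)

# State s_t: RGB image from the base-mounted camera, a W x H x 3 array.
s_t = np.zeros((W, H, 3), dtype=np.float32)

# Action a_t: joint angular velocities of the space robot arm (rad/s).
a_t = np.zeros(N_JOINTS, dtype=np.float32)
```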
(2) setting the reward function r_t(s_t, a_t) used in mechanical arm control, completing the modeling of the Markov decision process; the expression of the reward function r_t(s_t, a_t) is as follows:
r_t(s_t, a_t) = -[β_1 d² + ln(β_2 d² + ε_d) + β_3 ‖a_t - a_{t-1}‖²]
wherein p_e is the end position of the space robot mechanical arm, p_t is the target position of the mechanical arm in the target space, d is the Euclidean distance d = ‖p_e - p_t‖, β_1 = 10⁻³, β_2 = 1, β_3 = 10⁻¹, ε_d prevents the ln function from producing a singularity and ε_d = 0.001, and the superscript T denotes matrix transposition. The term -ln(β_2 d² + ε_d) is added to the reward so that a smaller distance d yields a higher reward, thereby improving accuracy. Furthermore, when the end effector is far from the target capture point, the term -β_1 d² has a greater influence on the reward, so the mechanical arm performs larger-amplitude actions while the reward value does not change too violently, which is conducive to sufficient exploration. The term -β_3 ‖a_t - a_{t-1}‖² is a penalty introduced to make the control curve smoother. The last term aims to reduce the torque output by the mechanical arm as much as possible, thereby reducing the disturbance to the base.
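A minimal Python sketch of this reward, transcribing the three terms shown above (the torque-related last term is omitted because its exact expression is not given in the text); p_e, p_target, a_t and a_prev are assumed NumPy arrays:

```python
import numpy as np

BETA1, BETA2, BETA3 = 1e-3, 1.0, 1e-1
EPS_D = 0.001  # prevents the ln term from becoming singular at d = 0

def reward(p_e, p_target, a_t, a_prev):
    """Reward r_t(s_t, a_t) built from the terms described above."""
    d = np.linalg.norm(p_e - p_target)                    # end-effector-to-target distance
    return -(BETA1 * d**2
             + np.log(BETA2 * d**2 + EPS_D)               # rewards small distances strongly
             + BETA3 * np.linalg.norm(a_t - a_prev)**2)   # smoothness penalty on actions
```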
(3) constructing the mechanical arm control strategy network π_φ, wherein φ represents the parameters of the mechanical arm control strategy network; the image s_t is input to the mechanical arm control strategy network π_φ to obtain the action information, the joint angular velocity a_t, specifically comprising the following steps:
(3-1) the first layer of the mechanical arm control strategy network π_φ is a convolutional layer whose convolution kernel weight W_1 is a G_w1×L_w1×M_w1 matrix, wherein G_w1 is the number of channels of the convolution kernel, L_w1 is the width of the convolution kernel and M_w1 is the height of the convolution kernel;
(3-2) the second layer of the mechanical arm control strategy network π_φ is a batch normalization layer, which normalizes each of the feature maps obtained from the previous layer; the number of normalization parameters is positively and linearly correlated with the number of feature maps;
(3-3) the third layer of the mechanical arm control strategy network π_φ is a max pooling layer whose filter P_1 is a 2×2 matrix;
(3-4) the fourth layer of the mechanical arm control strategy network π_φ is a convolutional layer whose convolution kernel weight W_2 is a G_w2×L_w2×M_w2 matrix, wherein G_w2 is the number of channels of the convolution kernel, L_w2 is the width of the convolution kernel and M_w2 is the height of the convolution kernel;
(3-5) the fifth layer of the mechanical arm control strategy network π_φ is a batch normalization layer, which normalizes each of the feature maps from the previous layer; the number of normalization parameters is positively and linearly correlated with the number of feature maps;
(3-6) the sixth layer of the mechanical arm control strategy network π_φ is a max pooling layer whose filter P_2 is a 2×2 matrix;
(3-7) the seventh layer of the mechanical arm control strategy network π_φ is a convolutional layer whose convolution kernel weight W_3 is a G_w3×L_w3×M_w3 matrix, wherein G_w3 is the number of channels of the convolution kernel, L_w3 is the width of the convolution kernel and M_w3 is the height of the convolution kernel;
(3-8) the eighth layer of the mechanical arm control strategy network π_φ is a batch normalization layer, which normalizes each of the feature maps from the previous layer; the number of normalization parameters is related to the number of feature maps;
(3-9) the ninth layer of the mechanical arm control strategy network π_φ is a max pooling layer whose filter P_3 is a 2×2 matrix;
(3-10) the tenth layer of the mechanical arm control strategy network π_φ is a fully connected layer; the number of input neurons is F_9, the number of flattened features output by the previous feature layer, the number of output neurons is F_10, and the neuron weight is W_10;
(3-11) the eleventh layer of the mechanical arm control strategy network π_φ is a fully connected layer; the number of input neurons is F_10, as output by the previous layer, the number of output neurons is F_11, and the neuron weight is W_11;
(3-12) the twelfth layer of the mechanical arm control strategy network π_φ is a fully connected layer; the number of input neurons is F_11, as output by the previous layer, the output neurons are the mean μ_t and variance Σ_t of a Gaussian distribution, and the neuron weight is W_12;
(3-13) obtaining the mechanical arm control strategy network π_φ according to steps (3-1) to (3-12);
(3-14) inputting the RGB three-channel image s_t collected in step (1) to the mechanical arm control strategy network π_φ of step (3-13); the mechanical arm control strategy network π_φ outputs the mean μ_t and variance Σ_t of a Gaussian distribution; the mean μ_t and variance Σ_t are combined into the probability distribution of the space robot joint angular velocity a_t, and the mechanical arm joint angular velocity a_t is obtained by sampling;
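Steps (3-1) through (3-14) can be sketched as follows, assuming a PyTorch implementation with channels-first image tensors; the channel counts, kernel sizes and fully connected widths are placeholders, and producing the variance through a log-variance head is an implementation choice of this sketch rather than part of the description:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class PolicyNetwork(nn.Module):
    """Sketch of the control strategy network pi_phi: image s_t -> Gaussian over joint velocities."""

    def __init__(self, in_ch=3, f10=256, f11=128, n_joints=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=3, padding=1),  # layer 1: convolution
            nn.BatchNorm2d(32),                              # layer 2: batch normalization
            nn.MaxPool2d(2),                                 # layer 3: 2x2 max pooling
            nn.Conv2d(32, 64, kernel_size=3, padding=1),     # layer 4
            nn.BatchNorm2d(64),                              # layer 5
            nn.MaxPool2d(2),                                 # layer 6
            nn.Conv2d(64, 64, kernel_size=3, padding=1),     # layer 7
            nn.BatchNorm2d(64),                              # layer 8
            nn.MaxPool2d(2),                                 # layer 9
            nn.Flatten(),                                    # flatten to F_9 features
        )
        self.fc10 = nn.LazyLinear(f10)                 # layer 10: F_9 -> F_10
        self.fc11 = nn.Linear(f10, f11)                # layer 11: F_10 -> F_11
        self.mean_head = nn.Linear(f11, n_joints)      # layer 12: Gaussian mean mu_t
        self.log_var_head = nn.Linear(f11, n_joints)   # layer 12: Gaussian (log-)variance

    def forward(self, s_t):
        # s_t is assumed to be a (batch, 3, H, W) tensor (channels-first).
        x = self.features(s_t)
        x = torch.relu(self.fc10(x))
        x = torch.relu(self.fc11(x))
        mu = self.mean_head(x)
        std = torch.exp(0.5 * self.log_var_head(x))
        dist = Normal(mu, std)            # pi_phi(a_t | s_t)
        a_t = dist.rsample()              # sampled joint angular velocities
        return a_t, dist
```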
(4) constructing the mechanical arm state value network V_ψ according to the Markov decision process principle, wherein ψ represents the parameters of the mechanical arm state value network; the image s_t is input to obtain the state value v_t, specifically comprising the following steps:
(4-1) repeating step (3-1) to step (3-11) to construct the first to eleventh layers of the mechanical arm state value network V_ψ;
(4-2) the twelfth layer of the mechanical arm state value network V_ψ is a fully connected layer; the number of input neurons is F_11, as output by the previous layer, the output neuron is the state value function v_t, and the neuron weight is W_12;
(4-3) obtaining the mechanical arm state value network V_ψ according to step (4-1) to step (4-2);
(5) constructing the mechanical arm state action value network Q_θ according to the Markov decision process principle, wherein θ represents the parameters of the mechanical arm state action value network; the image s_t and the joint angular velocity a_t are input to Q_θ to obtain the mapping to the state action value q_t, specifically comprising the following steps:
(5-1) repeating step (3-1) to step (3-10) to construct the first to tenth layers of the mechanical arm state action value network Q_θ;
(5-2) the eleventh layer of the mechanical arm state action value network Q_θ is a fully connected layer; the input neurons are the combination of the F_10 outputs of the previous layer and the joint angular velocity a_t of the space robot, the number of output neurons is F_11, and the neuron weight is W_11;
(5-3) the twelfth layer of the mechanical arm state action value network Q_θ is a fully connected layer; the number of input neurons is F_11, as output by the previous layer, the output neuron is the state action value function q_t, and the neuron weight is W_12;
(5-4) obtaining the mechanical arm state action value network Q_θ according to step (5-1) to step (5-3);
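Under the same assumptions (PyTorch, placeholder widths), the two value networks differ from the policy sketch above only in their heads: V_ψ maps the F_11 features to a scalar state value, while Q_θ concatenates the joint angular velocities with the F_10 features before its eleventh layer. A sketch:

```python
import torch
import torch.nn as nn

def conv_trunk(in_ch=3):
    """Layers 1-9, structurally the same as in the policy sketch above."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.BatchNorm2d(32), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1),    nn.BatchNorm2d(64), nn.MaxPool2d(2),
        nn.Conv2d(64, 64, 3, padding=1),    nn.BatchNorm2d(64), nn.MaxPool2d(2),
        nn.Flatten(),
    )

class ValueNetwork(nn.Module):
    """State value network V_psi: image s_t -> scalar v_t."""
    def __init__(self, f10=256, f11=128):
        super().__init__()
        self.trunk = conv_trunk()
        self.fc10, self.fc11 = nn.LazyLinear(f10), nn.Linear(f10, f11)
        self.v_head = nn.Linear(f11, 1)                   # twelfth layer: v_t

    def forward(self, s_t):
        x = torch.relu(self.fc10(self.trunk(s_t)))
        return self.v_head(torch.relu(self.fc11(x)))

class QNetwork(nn.Module):
    """State action value network Q_theta: (image s_t, action a_t) -> scalar q_t."""
    def __init__(self, f10=256, f11=128, n_joints=6):
        super().__init__()
        self.trunk = conv_trunk()
        self.fc10 = nn.LazyLinear(f10)
        self.fc11 = nn.Linear(f10 + n_joints, f11)        # eleventh layer: F_10 features + a_t
        self.q_head = nn.Linear(f11, 1)                   # twelfth layer: q_t

    def forward(self, s_t, a_t):
        x = torch.relu(self.fc10(self.trunk(s_t)))
        x = torch.relu(self.fc11(torch.cat([x, a_t], dim=-1)))
        return self.q_head(x)
```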
(6) inputting the image s_t collected at sampling time t in step (1) to the mechanical arm control strategy network π_φ of step (3), whose output is the mechanical arm joint angular velocity a_t; the joint angular velocity a_t is passed to a proportional-derivative controller C, and the proportional-derivative controller C outputs joint torques to realize control of the robot; the image s_t collected at sampling time t and the desired joint angular velocity a_t tracked by the joints are input to the reward function of step (2) to obtain the reward value r_t, the image s_{t+1} at time t+1 is obtained, and the interaction information pair at time t is obtained as E_t = <s_t, a_t, r_t, s_{t+1}>;
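The proportional-derivative controller C that converts the desired joint angular velocity a_t into joint torques is not specified further; a minimal discrete-time sketch, with gains K_p, K_d and sampling period dt chosen purely for illustration, is:

```python
import numpy as np

class PDVelocityController:
    """Sketch of a PD controller tracking desired joint angular velocities (assumed gains)."""

    def __init__(self, n_joints=6, kp=5.0, kd=0.1, dt=0.02):
        self.kp, self.kd, self.dt = kp, kd, dt
        self.prev_error = np.zeros(n_joints)

    def torque(self, a_desired, joint_velocity):
        error = a_desired - joint_velocity               # velocity tracking error
        d_error = (error - self.prev_error) / self.dt    # finite-difference derivative
        self.prev_error = error
        return self.kp * error + self.kd * d_error       # joint torque command
```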
(7) traversing all images s_{t=1:T} collected within a period T and repeating step (6) to obtain multiple groups of interaction information pairs; the multiple groups of interaction information pairs form an experience replay pool D; the images s_t collected at different sampling times are input respectively to the mechanical arm state value network V_ψ of step (4) and the mechanical arm state action value network Q_θ of step (5) to obtain the state value v_ψ(s_t) and the state action value q_t(s_t, a_t);
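The experience replay pool D simply accumulates the interaction pairs E_t = <s_t, a_t, r_t, s_{t+1}>; a minimal sketch (fixed capacity and uniform random sampling are assumptions of this sketch):

```python
import random
from collections import deque

class ReplayPool:
    """Experience replay pool D of interaction tuples (s_t, a_t, r_t, s_next)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```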
(8) establishing an optimization objective according to maximum entropy reinforcement learning, such that the accumulated reward and the entropy of the policy are maximized:
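The expression of this objective is not reproduced in the text; a standard maximum-entropy form consistent with the quantities defined above (a reconstruction, not the original formula; ρ_π denotes the state-action distribution induced by the policy and α the randomness coefficient discussed below) is:

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}
\left[ r_t(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```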
wherein H denotes the information entropy; the maximum information entropy is used to improve the exploration capability of the algorithm and to make the learned policy more random. The randomness coefficient α represents the degree to which the randomness of the policy is maximized during learning. In general, by introducing entropy for the policy and ultimately reaching a higher entropy value, the agent is able to explore the environment more extensively.
(9) According to the principle of policy iteration, under the maximum entropy reinforcement learning framework, policy learning alternates between policy evaluation and policy improvement. The parameters of the mechanical arm state action value network Q_θ are trained by minimizing the Bellman residual, giving the expression of the training objective J_Q(θ) for the mechanical arm state action value network Q_θ:
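The expression of J_Q(θ) is likewise not reproduced in the text; under the same maximum-entropy formulation, a standard soft Bellman-residual form is the following reconstruction, where γ (the discount factor) and V_ψ̄ (a target copy of the state value network) are assumptions not named explicitly above:

```latex
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}
\left[ \tfrac{1}{2} \Big( Q_\theta(s_t, a_t)
- \big( r_t(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}\big[ V_{\bar{\psi}}(s_{t+1}) \big] \big) \Big)^{2} \right]
```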
(10) the mechanical arm state value network V_ψ is trained by minimizing a squared loss, giving the expression of the training objective J_V(ψ) for the mechanical arm state value network V_ψ:
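A standard squared-loss form of J_V(ψ) consistent with this description (again a reconstruction; α is the randomness coefficient of step (8)) is:

```latex
J_V(\psi) = \mathbb{E}_{s_t \sim D}
\left[ \tfrac{1}{2} \Big( V_\psi(s_t)
- \mathbb{E}_{a_t \sim \pi_\phi}\big[ Q_\theta(s_t, a_t) - \alpha \log \pi_\phi(a_t \mid s_t) \big] \Big)^{2} \right]
```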
(11) the policy function is trained by minimizing the expected KL divergence, giving the expression of the policy training objective J_π(φ):
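A standard form of the policy objective J_π(φ) as an expected KL divergence (a reconstruction; Z_θ(s_t) is the partition function that normalizes the exponentiated state action value and does not affect the gradient with respect to φ) is:

```latex
J_\pi(\phi) = \mathbb{E}_{s_t \sim D}
\left[ \mathrm{D}_{\mathrm{KL}}\!\left( \pi_\phi(\cdot \mid s_t) \,\middle\|\,
\frac{\exp\!\big(\tfrac{1}{\alpha} Q_\theta(s_t, \cdot)\big)}{Z_\theta(s_t)} \right) \right]
```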
(12) training the mechanical arm control strategy network π_φ, the state action value network Q_θ and the state value network V_ψ of steps (3) to (5) with the training objectives obtained in steps (9) to (11) and a gradient descent method, completing the training of the networks;
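A minimal Python sketch of the alternating gradient-descent updates of step (12); the loss callables are assumed to evaluate J_Q, J_V and J_π on a sampled mini-batch as in the reconstructed expressions above, and the optimizers (e.g., torch.optim.Adam instances) are assumptions of this sketch:

```python
def train_step(batch, policy, q_net, v_net, opt_pi, opt_q, opt_v,
               q_loss_fn, v_loss_fn, pi_loss_fn):
    """One gradient-descent update of Q_theta, V_psi and pi_phi on a replay mini-batch.

    Each *_loss_fn(batch, policy, q_net, v_net) is assumed to return the scalar
    objective J_Q, J_V or J_pi for the given mini-batch.
    """
    for loss_fn, net_opt in ((q_loss_fn, opt_q), (v_loss_fn, opt_v), (pi_loss_fn, opt_pi)):
        net_opt.zero_grad()
        loss = loss_fn(batch, policy, q_net, v_net)  # evaluate the objective on this batch
        loss.backward()                              # gradient descent step on that network
        net_opt.step()
```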
(13) acquiring in real time the image s_t obtained by the camera mounted on the space robot base, inputting the image s_t acquired in real time to the mechanical arm control strategy network π_φ trained in step (12), and outputting the mechanical arm joint angular velocity a_t at sampling time t, thereby realizing control of the space robot mechanical arm and accomplishing the trajectory planning task within the time period T.