Rapid imitation learning method, system and device for robot skill learning
1. A rapid imitation learning method for robot skill learning, characterized by comprising the following steps:
s10, when the robot needs to learn new skills, acquiring original teaching data; the original teaching data is the state of an operation environment before and after a demonstrator executes teaching actions;
s20, extracting key frame teaching data from the original teaching data set by a preset key frame extraction method in combination with a pre-trained evaluator;
s30, controlling the robot to reproduce teaching through a trained control strategy model based on the key frame teaching data, thereby completing generalization of robot skill learning;
the training method of the control strategy model and the evaluator comprises the following steps:
a10, collecting exploration data and constructing an exploration data set; the exploration data comprises the states of the operation environment before and after the robot executes the action generated by setting the exploration strategy;
a20, taking the state of the operating environment as the input of supervised learning, taking the generated action as the label of the supervised learning, and training the pre-constructed control strategy model by a behavior cloning method;
a30, collecting evaluation data and constructing an evaluation data set; the evaluation data comprises a test task, the success rate and the operation steps when the test task is executed through the trained control strategy model; the test task consists of an initial state and a target state of an operating environment set by a test;
a40, taking a test task as the input of supervised learning, taking the success rate and the number of operation steps as labels of the supervised learning, and training a pre-constructed evaluator by a supervised learning method;
the control strategy model is constructed based on a deep neural network with a convolution structure and a recursion structure; the evaluator is constructed based on a deep neural network with a convolution structure.
2. The rapid imitation learning method for robot skill learning according to claim 1, wherein in step S20, "extracting key frame teaching data from the original teaching data set by a preset key frame extraction method" is performed by:
s21, constructing an initial graph G by taking the state of the operating environment in the original teaching data as a node;
s22, selecting a node pair that has not been evaluated by the evaluator from the initial graph as a transfer task, and predicting, with the evaluator, the success rate and the number of operation steps of the control strategy model when executing the transfer task;
s23, if the success rate predicted value is 1, executing the step S24, otherwise executing the step S25;
s24, adding a weighted directed edge between the node pairs corresponding to the transfer task and assigning the weight as the predicted operation step number to obtain a weighted directed graph;
s25, judging whether the initial graph has node pairs which are not evaluated by the evaluator, if so, executing a step S22, otherwise, executing a step S26;
and S26, searching the shortest path of the weighted directed graph by using a shortest path algorithm, and using the state corresponding to the node on the shortest path as key frame teaching data.
3. The rapid imitation learning method for robot skill learning according to claim 1, wherein in step S30, "controlling the robot to reproduce the teaching through the trained control strategy model" comprises:
s31, reading the initial state of the operation environment through the sensor of the robot as the current state;
s32, extracting the state of the operating environment in the key frame teaching data as a target state, and deleting the extracted target state in the key frame data;
s33, predicting the action of the robot through the trained control strategy model based on the current state and the target state;
s34, the robot executes the predicted action and updates the current state of the operation environment;
s35, judging whether the current state updated in the step S34 is consistent with the target state extracted in the step S32, if yes, executing a step S36, otherwise, executing a step S32;
and S36, judging whether the state of the operating environment in the key frame teaching data is empty, if so, completing generalization of robot skill learning, and otherwise, skipping to the step S32.
4. The method of claim 1, wherein the set exploration strategy is a random search.
5. The rapid imitation learning method for robot skill learning according to claim 1, wherein the loss function Loss_1 of the control strategy model during training is:

$$\mathrm{Loss}_1(\theta_\pi, \theta_f) = L(\hat{o}_g, o_g) + L(\tilde{o}_g, o_g) + \lambda \sum_{i=1}^{K} L(\hat{a}_i, a_i)$$

wherein f is the forward model, π is the control strategy model, θ_π are the parameters of the control strategy model, θ_f are the parameters of the forward model, o_1 is the initial observation, i.e. the state of the operating environment before the actions are performed, o_g is the true value of the target observation, i.e. the true state of the operating environment after the actions are performed, ô_g is the predicted target observation after performing the predicted actions, õ_g is the predicted target observation after performing the true actions (the actions in the exploration data), a_i is the true action in the i-th step, â_i is the predicted action output by the control strategy model in the i-th step, λ is a hyper-parameter, K is the sampling length of the exploration data, and L is the binary cross-entropy loss function.
6. The rapid imitation learning method for robot skill learning according to claim 1, wherein the loss function Loss_2 of the evaluator model during training is:

$$\mathrm{Loss}_2(\theta_{pr}, \theta_{ps}) = L(\hat{r}, r) + \operatorname{sign}(r)\,(\hat{s} - s)^2$$

wherein θ_pr and θ_ps are the parameters of the success-rate prediction output and the operation-step prediction output of the evaluator model, and r, s, r̂ and ŝ are, respectively, the success rate truth value, the operation step number truth value, the success rate predicted value and the operation step number predicted value.
7. A rapid imitation learning system for robot skill learning, characterized by comprising a teaching data acquisition module, a key frame extraction module and a teaching reproduction module;
the teaching data acquisition module is configured to acquire original teaching data when the robot needs to learn a new skill; the original teaching data is the state of an operation environment before and after a demonstrator executes teaching actions;
the key frame extraction module is configured to extract key frame teaching data from the original teaching data set by combining a pre-trained evaluator through a preset key frame extraction method;
the teaching reproduction module is configured to control the robot to reproduce the teaching through a trained control strategy model based on the key frame teaching data, thereby completing generalization of robot skill learning;
the training method of the control strategy model and the evaluator comprises the following steps:
a10, collecting exploration data and constructing an exploration data set; the exploration data comprises the states of the operation environment before and after the robot executes the action generated by setting the exploration strategy;
a20, taking the state of the operating environment as the input of supervised learning, taking the generated action as the label of the supervised learning, and training the pre-constructed control strategy model by a behavior cloning method;
a30, collecting evaluation data and constructing an evaluation data set; the evaluation data comprises a test task, the success rate and the operation steps when the test task is executed through the trained control strategy model; the test task consists of an initial state and a target state of an operating environment set by a test;
a40, taking a test task as the input of supervised learning, taking the success rate and the number of operation steps as labels of the supervised learning, and training a pre-constructed evaluator by a supervised learning method;
the control strategy model is constructed based on a deep neural network with a convolution structure and a recursion structure; the evaluator is constructed based on a deep neural network with a convolution structure.
8. An apparatus, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the processor, the instructions being executed by the processor to implement the rapid imitation learning method for robot skill learning of any one of claims 1-6.
9. A computer-readable storage medium storing computer instructions for execution by a computer to implement the rapid imitation learning method for robot skill learning of any one of claims 1-6.
Background
Imitation learning is a method in which a robot acquires skills by mimicking expert behavior. An imitation learning method can be divided into two stages: teaching data acquisition and strategy generation. In teaching data acquisition, a demonstrator is required to perform one or more tasks and to provide state and action information during the demonstration. In strategy generation, the robot uses a strategy generation method to learn, from the teaching data, a control strategy that imitates the behavior of the demonstrator.
Existing imitation learning methods still differ from the human imitation process in certain respects, mainly the following: first, a human imitator can learn skills from the state sequence of a task alone, even when the teacher's action information cannot be obtained; second, a human imitator uses his or her own knowledge to optimize the task execution steps when imitating expert behavior. Because of these differences, existing imitation learning methods must use external sensors or action recognition modules to acquire expert actions when teaching robot skills, which makes the teaching data acquisition process time-consuming and complex; moreover, when the demonstrator provides suboptimal teaching, the robot cannot optimize the execution process, which reduces the robot's operating efficiency and lengthens its working time. On this basis, the invention provides a rapid imitation learning method for robot skill learning.
Disclosure of Invention
In order to solve the problems in the prior art that teaching data acquisition for robots is time-consuming and complex, and that the robot cannot optimize the execution process when the demonstrator provides suboptimal teaching, which reduces operating efficiency and lengthens operating time, a first aspect of the invention provides a rapid imitation learning method for robot skill learning, the method comprising the following steps:
s10, when the robot needs to learn new skills, acquiring original teaching data; the original teaching data is the state of an operation environment before and after a demonstrator executes teaching actions;
s20, extracting key frame teaching data from the original teaching data set by a preset key frame extraction method in combination with a pre-trained evaluator;
s30, controlling the robot to reproduce teaching through a trained control strategy model based on the key frame teaching data, thereby completing generalization of robot skill learning;
the training method of the control strategy model and the evaluator comprises the following steps:
a10, collecting exploration data and constructing an exploration data set; the exploration data comprises the states of the operation environment before and after the robot executes the action generated by setting the exploration strategy;
a20, taking the state of the operating environment as the input of supervised learning, taking the generated action as the label of the supervised learning, and training the pre-constructed control strategy model by a behavior cloning method;
a30, collecting evaluation data and constructing an evaluation data set; the evaluation data comprises a test task, the success rate and the operation steps when the test task is executed through the trained control strategy model; the test task consists of an initial state and a target state of an operating environment set by a test;
a40, taking a test task as the input of supervised learning, taking the success rate and the number of operation steps as labels of the supervised learning, and training a pre-constructed evaluator by a supervised learning method;
the control strategy model is constructed based on a deep neural network with a convolution structure and a recursion structure; the evaluator is constructed based on a deep neural network with a convolution structure.
In some preferred embodiments, in step S20, "extracting key frame teaching data from the original teaching data set by a preset key frame extraction method", the method includes:
s21, constructing an initial graph G by taking the state of the operating environment in the original teaching data as a node;
s22, selecting a node pair that has not been evaluated by the evaluator from the initial graph as a transfer task, and predicting, with the evaluator, the success rate and the number of operation steps of the control strategy model when executing the transfer task;
s23, if the success rate predicted value is 1, executing the step S24; otherwise, executing step S25;
s24, adding a weighted directed edge between the node pairs corresponding to the transfer task and assigning the weight as the predicted operation step number to obtain a weighted directed graph;
s25, judging whether the initial graph has node pairs which are not evaluated by the evaluator, if so, executing a step S22, otherwise, executing a step S26;
and S26, searching the shortest path of the weighted directed graph by using a shortest path algorithm, and using the state corresponding to the node on the shortest path as key frame teaching data.
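The graph construction and shortest-path search of steps S21-S26 can be sketched as follows. This is a minimal illustration, not the patented implementation: `evaluator` stands in for the trained evaluator as a callable returning a (success, steps) pair, states are arbitrary values indexed by position, and the shortest path is found with a plain Dijkstra over the weighted digraph.

```python
import heapq

def extract_keyframes(states, evaluator):
    """Steps S21-S26: build a weighted digraph over teaching states and keep
    only the states on the shortest path (fewest predicted operation steps)."""
    n = len(states)
    graph = {i: {} for i in range(n)}           # S21: one node per state
    for i in range(n):                          # S22/S25: visit every node pair
        for j in range(n):
            if i == j:
                continue
            success, steps = evaluator(states[i], states[j])
            if success == 1:                    # S23/S24: add weighted edge
                graph[i][j] = steps
    # S26: Dijkstra from the initial state (node 0) to the final state (n-1)
    dist, prev = {0: 0}, {}
    heap = [(0, 0)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):       # stale queue entry
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [], n - 1                      # walk predecessors back to 0
    while node != 0:
        path.append(node)
        node = prev[node]
    path.append(0)
    return [states[i] for i in reversed(path)]  # key frame teaching data
```

With an evaluator that prices a transfer from state i to state j at (j - i)^2 steps, intermediate key frames are cheaper than one long jump, so every teaching state is kept; with a constant cost of 1 per transfer, only the initial and target states survive.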
In some preferred embodiments, step S30, "controlling the robot to reproduce the teaching through the trained control strategy model", comprises:
s31, reading the initial state of the operation environment through the sensor of the robot as the current state;
s32, extracting the state of the operating environment in the key frame teaching data as a target state, and deleting the extracted target state in the key frame data;
s33, predicting the action of the robot through the trained control strategy model based on the current state and the target state;
s34, the robot executes the predicted action and updates the current state of the operation environment;
s35, judging whether the current state updated in the step S34 is consistent with the target state extracted in the step S32, if yes, executing a step S36, otherwise, executing a step S32;
and S36, judging whether the state of the operating environment in the key frame teaching data is empty, if so, completing generalization of robot skill learning, and otherwise, skipping to the step S32.
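The reproduction loop of steps S31-S36 can be sketched as below. The `robot` and `policy` interfaces are hypothetical stand-ins for the robot's sensor/actuator API and the trained control strategy model; the inner loop re-predicts an action toward the same target until the current state matches it, then moves on to the next key frame.

```python
def reproduce_teaching(robot, keyframes, policy):
    """Steps S31-S36: drive the operating environment through each key frame."""
    current = robot.read_state()             # S31: initial state from sensor
    keyframes = list(keyframes)              # copy so pop() leaves caller's data intact
    while keyframes:                         # S36: stop when no key frames remain
        target = keyframes.pop(0)            # S32: extract and delete next target
        while current != target:             # S35: repeat until target reached
            action = policy(current, target)  # S33: predict action
            current = robot.execute(action)   # S34: execute and update state
    return current
```

A usage example with a toy integer-state robot:

```python
class FakeRobot:
    def __init__(self):
        self.s = 0
    def read_state(self):
        return self.s
    def execute(self, a):
        self.s += a
        return self.s

final = reproduce_teaching(FakeRobot(), [2, 4], lambda c, t: 1 if t > c else -1)
# final is the last key-frame state (4) once every target has been reached
```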
In some preferred embodiments, the set exploration strategy is a random search.
In some preferred embodiments, the loss function Loss_1 of the control strategy model during training is:

$$\mathrm{Loss}_1(\theta_\pi, \theta_f) = L(\hat{o}_g, o_g) + L(\tilde{o}_g, o_g) + \lambda \sum_{i=1}^{K} L(\hat{a}_i, a_i)$$

wherein f is the forward model, π is the control strategy model, θ_π are the parameters of the control strategy model, θ_f are the parameters of the forward model, o_1 is the initial observation, i.e. the state of the operating environment before the actions are performed, o_g is the true value of the target observation, i.e. the true state of the operating environment after the actions are performed, ô_g is the predicted target observation after performing the predicted actions, õ_g is the predicted target observation after performing the true actions (the actions in the exploration data), a_i is the true action in the i-th step, â_i is the predicted action output by the control strategy model in the i-th step, λ is a hyper-parameter, K is the sampling length of the exploration data, and L is the binary cross-entropy loss function.
In some preferred embodiments, the loss function Loss_2 of the evaluator model during training is:

$$\mathrm{Loss}_2(\theta_{pr}, \theta_{ps}) = L(\hat{r}, r) + \operatorname{sign}(r)\,(\hat{s} - s)^2$$

wherein θ_pr and θ_ps are the parameters of the success-rate prediction output and the operation-step prediction output of the evaluator model, and r, s, r̂ and ŝ are, respectively, the success rate truth value, the operation step number truth value, the success rate predicted value and the operation step number predicted value.
A second aspect of the invention provides a rapid imitation learning system for robot skill learning, comprising a teaching data acquisition module, a key frame extraction module and a teaching reproduction module;
the teaching data acquisition module is configured to acquire original teaching data when the robot needs to learn a new skill; the original teaching data is the state of an operation environment before and after a demonstrator executes teaching actions;
the key frame extraction module is configured to extract key frame teaching data from the original teaching data set by combining a pre-trained evaluator through a preset key frame extraction method;
the teaching reproduction module is configured to control the robot to reproduce the teaching through a trained control strategy model based on the key frame teaching data, thereby completing generalization of robot skill learning;
the training method of the control strategy model and the evaluator comprises the following steps:
a10, collecting exploration data and constructing an exploration data set; the exploration data comprises the states of the operation environment before and after the robot executes the action generated by setting the exploration strategy;
a20, taking the state of the operating environment as the input of supervised learning, taking the generated action as the label of the supervised learning, and training the pre-constructed control strategy model by a behavior cloning method;
a30, collecting evaluation data and constructing an evaluation data set; the evaluation data comprises a test task, the success rate and the operation steps when the test task is executed through the trained control strategy model; the test task consists of an initial state and a target state of an operating environment set by a test;
a40, taking a test task as the input of supervised learning, taking the success rate and the number of operation steps as labels of the supervised learning, and training a pre-constructed evaluator by a supervised learning method;
the control strategy model is constructed based on a deep neural network with a convolution structure and a recursion structure; the evaluator is constructed based on a deep neural network with a convolution structure.
In a third aspect of the invention, an apparatus is presented, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the processor, the instructions being executed by the processor to implement the rapid imitation learning method for robot skill learning described above.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, which stores computer instructions for execution by a computer to implement the rapid imitation learning method for robot skill learning described above.
The invention has the beneficial effects that:
the invention can simplify the data acquisition flow during the simulation learning of the robot, optimize the teaching track based on the performance of the robot control strategy model, shorten the data acquisition time and the robot operation time, effectively improve the simulation learning efficiency of the robot and shorten the operation time.
1) The robot can directly learn the operation skill through the state of the observation operation environment during the task execution of the demonstrator, and does not need to acquire the action of the demonstrator, thereby reducing the workload during the acquisition of the teaching data, shortening the acquisition time of the teaching data and reducing the demonstration cost.
2) When the robot executes the teaching task, the track provided by the demonstrator can be optimized according to the performance of the control strategy model, so that the target task can be completed in fewer operation steps, the robot operation efficiency is improved, and the operation time is shortened.
3) The robot can learn the operation skill of a demonstrator from one observation without acquiring multiple groups of teaching data, so that the teaching data acquisition cost is reduced.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of the rapid imitation learning method for robot skill learning according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of extracting keyframe teaching data associated with a teaching task from an original teaching data set using a keyframe extraction method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of controlling the robot to reproduce the teaching following the key frame teaching data through the control strategy model according to an embodiment of the present invention;
FIG. 4 is a block diagram of the rapid imitation learning system for robot skill learning according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention discloses a rapid imitation learning method for robot skill learning, which, as shown in FIG. 1, comprises the following steps:
s10, when the robot needs to learn new skills, acquiring original teaching data; the original teaching data is the state of an operation environment before and after a demonstrator executes teaching actions;
s20, extracting key frame teaching data from the original teaching data set by a preset key frame extraction method in combination with a pre-trained evaluator;
s30, controlling the robot to reproduce teaching through a trained control strategy model based on the key frame teaching data, thereby completing generalization of robot skill learning;
the training method of the control strategy model and the evaluator comprises the following steps:
a10, collecting exploration data and constructing an exploration data set; the exploration data comprises the states of the operation environment before and after the robot executes the action generated by setting the exploration strategy;
a20, taking the state of the operating environment as the input of supervised learning, taking the generated action as the label of the supervised learning, and training the pre-constructed control strategy model by a behavior cloning method;
a30, collecting evaluation data and constructing an evaluation data set; the evaluation data comprises a test task, the success rate and the operation steps when the test task is executed through the trained control strategy model; the test task consists of an initial state and a target state of an operating environment set by a test;
a40, taking a test task as the input of supervised learning, taking the success rate and the operation steps as labels of the supervised learning, and training a pre-constructed evaluating device by a supervised learning method;
the control strategy model is constructed based on a deep neural network with a convolution structure and a recursion structure; the evaluator is constructed based on a deep neural network with a convolution structure.
In order to describe the rapid imitation learning method for robot skill learning of the present invention more clearly, the steps of an embodiment of the method are described in detail below with reference to the drawings.

The following embodiments first detail the training process of the control strategy model and the evaluator, and then detail the generalization process of robot skill learning through the rapid imitation learning method.
Imitation learning (also called learning from demonstration) refers to a method of acquiring robot skills by imitating expert behavior. Rapid imitation learning, also called few-shot imitation learning, refers to an imitation learning method that can teach the robot a skill with only one or a few demonstrations, solving the cumbersome teaching data acquisition of traditional imitation learning. Strategy, teaching data, and behavior cloning are all technical terms involved in imitation learning. A strategy is a function that maps states to actions, used to control the robot to reproduce the demonstrator's behavior. Teaching data are the observations of the environment and the actions performed by the demonstrator during the task, stored in the order in which the demonstrator executed them. Behavior cloning is one of the main strategy generation methods of imitation learning; it treats imitation learning as a supervised learning problem. Specifically, the behavior cloning method takes the states and actions of the teaching data as the inputs and labels of supervised learning, and optimizes the model parameters so that the actions output by the strategy are consistent with the recorded actions of the demonstrator, thereby achieving imitation.
1. Training process of the control strategy model and the evaluator
A10, collecting exploration data and constructing an exploration data set; the exploration data comprises the states of the operation environment before and after the robot executes the action generated by setting the exploration strategy;
In this embodiment, before collecting data, the motion primitives of the robot are first defined, such as grasping and placing an object, and then the parameters of the motion primitives, such as the grasping position and the placing position, are generated by the set exploration strategy. In the invention the exploration strategy is preferably a random strategy (i.e. parameters are generated randomly within the value range of the motion-primitive parameters), but it can also be obtained by manual design or by reinforcement learning. When collecting exploration data, a sensor first records the current state of the operating environment, then the robot executes the action generated by the exploration strategy, and finally the state of the operating environment after the action is recorded. The exploration data are stored in the form {o_1, a_1, o_2, ..., o_t, a_t, o_{t+1}}, where o_t and a_t are, respectively, the observation acquired by the robot (i.e. the state of the operating environment) and the action executed at time t. In a complete data-collection episode, the robot executes a series of actions and stores the corresponding exploration data; when the number of actions reaches t or the operating environment is disrupted, the robot performs a reset operation and the episode ends.
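A minimal sketch of one such collection episode follows, under an assumed toy `env` interface (`reset`, `sample_random_action`, `step`); the real system records sensor images and motion-primitive parameters rather than toy values:

```python
def collect_exploration_episode(env, max_steps=10):
    """Step A10 sketch: roll out a random exploration strategy and store the
    episode as {o_1, a_1, o_2, ..., o_t, a_t, o_{t+1}} for later training."""
    episode = []
    obs = env.reset()                        # sensor reading of the current state
    for _ in range(max_steps):
        action = env.sample_random_action()  # random exploration strategy
        episode.append(obs)
        episode.append(action)
        obs, disrupted = env.step(action)    # state after executing the action
        if disrupted:                        # environment disrupted -> end episode
            break
    episode.append(obs)                      # trailing observation o_{t+1}
    return episode
```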
A20, taking the state of the operating environment as the input of supervised learning, taking the generated action as the label of the supervised learning, and training the pre-constructed control strategy model by a behavior cloning method;
in the present embodiment, a behavior cloning method is used to obtain a control strategy of the robot, i.e., to train a control strategy model. The behavioral cloning approach models the mock learning problem as a supervised learning problem. Firstly, the state-action sequence of the operation environment containing K steps of actions is sampled from the exploration data collected in the previous step, and the operation environment is renumbered to obtain the operation environment with the format of { o1,a1,o2,...,oK,aK,oK+1The training data of. Then, the initial observation (initial state) o1And final observation (final state) oKAs input for supervised learning, willSequence of actions { a1,a2,...,aKAs a label for supervised learning. In the present invention, a high-dimensional color image is observed. In order to process high-dimensional observation information, a deep neural network with a convolution structure and a recursion structure is used as a control strategy model, and a target consistency Loss function is used as a target function Loss during training of the control strategy model1The expression is:
and satisfy
Wherein f is a forward direction model, pi is a control strategy model, and thetaπTo control parameters of the strategic model, θfIs a parameter of the forward positive model, o1For initial observation, i.e. the state of the operating environment before performing an action, ogA true value of the target observation, a true value of the state of the operating environment after the action is performed,to perform the predicted value of the target observation after the predicted action,to predict target observations after performing a truth action, which is an action in the exploration data, aiIs the true value action in the ith step,and (4) the predicted action output by the control strategy model in the ith step, wherein lambda is a hyper-parameter, K is the sampling length of exploration data, and L is a binary cross entropy loss function.
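The arithmetic of the target consistency loss can be illustrated in pure Python (a real implementation would use an autodiff framework such as PyTorch). The element-wise binary cross-entropy and the exact weighting here are assumptions based on the symbol definitions above; observations and actions are flattened into vectors of values in [0, 1]:

```python
import math

def bce(pred, true):
    """Binary cross-entropy L, summed over elements (eps-clipped for stability)."""
    eps = 1e-7
    return -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
                for p, t in zip(pred, true))

def target_consistency_loss(o_hat_g, o_tilde_g, o_g, a_hat, a_true, lam=0.1):
    """Loss1 = L(o_hat_g, o_g) + L(o_tilde_g, o_g) + lam * sum_i L(a_hat_i, a_i).
    o_* are flattened observation vectors; a_hat/a_true are lists of per-step
    action vectors of length K."""
    obs_term = bce(o_hat_g, o_g) + bce(o_tilde_g, o_g)
    act_term = sum(bce(ah, at) for ah, at in zip(a_hat, a_true))
    return obs_term + lam * act_term
```

Perfect predictions drive every term to zero; disturbing either the predicted target observation or any predicted action increases the loss.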
A30, collecting evaluation data and constructing an evaluation data set; the evaluation data comprises a test task, the success rate and the operation steps when the test task is executed through the trained control strategy model; the test task consists of an initial state and a target state of an operating environment set by a test;
in this embodiment, since sufficient exploration data may not be obtained in some scenarios, the robot control strategy model obtained in step a20 may have a generalization problem, i.e., the strategy may not successfully perform tasks in some special cases. In order to quantitatively evaluate the performance of the robot control strategy model, an evaluator of the robot control strategy is obtained by adopting a supervised learning method. The performance of the control strategy on different test tasks needs to be collected as training data of the evaluator. First, a test task is randomly generated in an operating environment, the test task being set by an initial state o of a testsAnd a target state ogAnd (4) defining. Then, the test task is executed by using the control strategy model, and if the control strategy model successfully completes the test task, the success rate p is recordedrIs 1 and records the number of operation steps psOtherwise, the success rate p is recordedrIs 0 and records the number of operation steps psFor null, evaluate data to { os,og,pr,psStore in the form of.
A40, taking the test task as the input of supervised learning, taking the success rate and the operation steps as labels of the supervised learning, and training the pre-constructed evaluation device by a supervised learning method.
In this embodiment, the evaluation problem of the control strategy model is modeled as a supervised learning problem: the initial state o_s and the target state o_g, i.e., the test task, are used as the input of supervised learning, and the success rate p_r and the number of operation steps p_s are used as the labels of supervised learning. A deep neural network with a convolution structure is used as the model of the evaluator, and the two-class cross-entropy loss and the minimum mean-square-error loss are respectively used as the objective function Loss_2 for success-rate prediction and operation-step-number prediction. The expression is:

Loss_2 = L(r, r̂) + sign(r) · (s − ŝ)²
Wherein θ_pr and θ_ps are the parameters of the success-rate prediction output and the operation-step-number prediction output of the evaluator model, and r, s, r̂, ŝ are respectively the success-rate truth value, the operation-step-number truth value, the success-rate predicted value, and the operation-step-number predicted value. If the robot fails to complete the task, the sign function ensures that this set of training data is not used to optimize θ_ps. After training, an evaluator for evaluating the robot control strategy is obtained.
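A sketch of Loss_2 under the stated assumptions (binary cross-entropy on the success rate, sign-gated mean-squared error on the step count) can be written as:

```python
import math

def evaluator_loss(r_true, r_pred, s_true, s_pred, eps=1e-7):
    """Sketch of Loss_2: binary cross-entropy on the success rate plus
    mean-squared error on the step count, gated by sign(r_true) so that
    failed tasks (r_true = 0, null step count) contribute no step term."""
    p = min(max(r_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    bce = -(r_true * math.log(p) + (1.0 - r_true) * math.log(1.0 - p))
    if r_true > 0:                        # sign(r) gate: only successes
        mse = (s_true - s_pred) ** 2      # optimise the step-count output
    else:
        mse = 0.0                         # null step count is ignored
    return bce + mse
```

Note how the gate lets a failed sample (with a null step count) still train the success-rate output without touching the step-count output.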
2. Rapid imitation learning method for robot skill learning
S10, when the robot needs to learn new skills, acquiring original teaching data; the original teaching data is the state of an operation environment before and after a demonstrator executes teaching actions;
In the present embodiment, a demonstrator (i.e., a human) demonstrates a skill in the operating environment; specifically, the demonstrator performs a series of actions so that the operating environment changes from an initial state to a target state. After executing each action, the demonstrator uses a sensor to record the state of the current environment, and after execution is finished all states are stored in sequence to obtain a teaching data sequence, which is expressed as D_d = {o_1^d, o_2^d, …, o_{n+1}^d}, where o_1^d represents the environment observation before the 1st action is executed and n is the number of actions executed by the demonstrator.
S20, extracting key frame teaching data from the original teaching data set by a preset key frame extraction method in combination with a pre-trained evaluator;
In the present embodiment, in order to improve the operation efficiency when the robot simulates under sub-optimal teaching conditions, a key frame extraction method is used to extract key frame teaching data D_k related to the teaching task from the original teaching D_d provided by the demonstrator. The key frame extraction method predicts the performance of the control strategy when following D_d, constructs a weighted directed graph containing all candidate paths, and finally solves the shortest path of the graph to obtain the key frame teaching, which is expressed as D_k = {o_1^k, o_2^k, …, o_m^k}, where m is the number of elements and m < n, as shown in fig. 2. The details are as follows:
s21, constructing an initial graph G by taking the state of the operating environment in the original teaching data as a node;
In this step, all the observed operating-environment states in D_d are used as nodes to obtain an edgeless graph as the initial graph. Each node of the initial graph corresponds to the state of an operating environment in the original teaching data, and the serial number of a node is the same as the serial number of its corresponding state in D_d.
S22, selecting a group of node pairs which are not evaluated by the evaluation device from the initial graph as a transfer task, and predicting the success rate and the operation steps of the control strategy model by using the evaluation device when executing the transfer task;
In this step, a round-robin scheme is used to exhaust each pair of nodes in the graph as a transfer task, which is denoted as {o_i, o_j}, j > i, where o_i is defined as the initial state o_s of the task and o_j is defined as the target state o_g. o_s and o_g are input to the evaluator based on the deep neural network, and the evaluator outputs the success-rate predicted value r̂ and the operation-step-number predicted value ŝ of the control strategy on the task.
S23, if the success rate predicted value is 1, executing the step S24; otherwise, executing step S25;
In this step, if the success-rate predicted value r̂ is 1, the control strategy model is considered able to complete the task, and step S24 is executed; if r̂ is 0, the control strategy model is considered unable to complete the task, and step S25 is executed.
S24, adding a weighted directed edge between the node pairs corresponding to the transfer task and assigning the weight as the predicted operation step number to obtain a weighted directed graph;
In this step, the direction of the directed edge is from o_i to o_j, and the weight of the directed edge is the operation-step-number predicted value ŝ output by the evaluator.
S25, judging whether the initial graph has node pairs which are not evaluated by the evaluator, if so, executing a step S22, otherwise, executing a step S26;
In this step, it is determined whether any node pair remains for which the evaluator has not yet produced a prediction; if so, step S22 is executed, otherwise all node pairs have been exhausted and step S26 is executed.
And S26, searching the shortest path of the weighted directed graph by using a shortest path algorithm, and using the state corresponding to the node on the shortest path as key frame teaching data.
In this step, the shortest-path problem of the weighted directed graph is solved using Dijkstra's algorithm to obtain the shortest path length and trajectory of the directed graph; the states corresponding to all the nodes on the shortest-path trajectory are arranged in sequence to obtain the key frame teaching, which is expressed as D_k = {o_1^k, o_2^k, …, o_m^k}, where o_1^k represents the 1st environment observation, m is the number of observation states contained in the key frame teaching, and m < n.
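Steps S21 to S26 can be sketched end-to-end as follows; `evaluator` is a hypothetical stand-in for the trained evaluator, returning a predicted success flag and step count for a node pair.

```python
import heapq

def extract_keyframes(demo, evaluator):
    """Sketch of S21-S26: nodes are the demo observations; for every pair
    (i, j) with j > i, a hypothetical evaluator(o_i, o_j) -> (success, steps)
    decides whether to add a directed edge weighted by the predicted step
    count; the keyframes are the states on the shortest path from node 0
    (initial state) to node n-1 (target state)."""
    n = len(demo)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            success, steps = evaluator(demo[i], demo[j])
            if success:                      # predicted-feasible transfer
                adj[i].append((j, steps))
    # Dijkstra from the initial state to the final state.
    dist, prev, pq = {0: 0.0}, {}, [(0.0, 0)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Walk back from the final node to recover the keyframe sequence.
    path, node = [], n - 1
    while node != 0:
        path.append(node)
        node = prev[node]
    path.append(0)
    return [demo[k] for k in reversed(path)]
```

For example, if the evaluator judges any transfer spanning at most two demo steps feasible at unit cost, a five-state demonstration collapses to three keyframes.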
And S30, controlling the robot to reproduce teaching through the trained control strategy model based on the key frame teaching data, thereby completing generalization of robot skill learning.
In this embodiment, the control strategy model based on the deep neural network trained in step A20 is used to follow the key frame teaching data D_k obtained in step S20, so as to realize the robot simulation learning process. As shown in fig. 3, the details are as follows:
s31, reading the initial state of the operation environment through the sensor of the robot as the current state;
In this step, a sensor is used to read the initial observation (initial state) of the operating environment as the current state o_c used in the subsequent steps.
S32, extracting the state of the operating environment in the key frame teaching data as a target state, and deleting the extracted target state in the key frame data;
In this step, the first element of the key frame teaching data D_k is extracted as the target state o_g, and this element is deleted from D_k.
S33, predicting the action of the robot through the trained control strategy model based on the current state and the target state;
In this step, o_c and o_g obtained in the above steps are input to the control strategy model based on the deep neural network, and the output of the control strategy model is used as the parameters of a robot action primitive to obtain the predicted action â.
S34, the robot executes the predicted action and updates the current state of the operation environment;
In this step, the robot executes the parameterized action primitive â in the operating environment, and after execution is finished, the sensor is used to acquire an observation of the operating environment to update the current state o_c.
S35, judging whether the current state updated in step S34 is consistent with the target state extracted in step S32; if yes, executing step S36, otherwise executing step S33;
In this step, a neural network is first used to map the high-dimensional image observations o_c and o_g to a low-dimensional space to obtain s_c and s_g. Then the Euclidean distance between s_c and s_g is calculated; if the distance is less than a set threshold, o_c and o_g are considered consistent and step S36 is executed; otherwise they are considered inconsistent and step S33 is executed.
And S36, judging whether the state of the operating environment in the key frame teaching data is empty, if so, completing generalization of robot skill learning, and otherwise, skipping to the step S32.
In this step, it is judged whether D_k is an empty set; if it is an empty set, the simulation task is considered complete; otherwise, the simulation is considered incomplete and the process returns to step S32.
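The reproduction loop of S31 to S36 can be sketched as follows; `policy`, `step_env`, and `is_close` are hypothetical stand-ins for the control strategy model, the robot/environment execution step, and the low-dimensional consistency check of S35.

```python
def reproduce(keyframes, initial_state, policy, step_env, is_close,
              max_iters=100):
    """Sketch of S31-S36: pop each keyframe as the goal, let a hypothetical
    policy(current, goal) -> action drive step_env(action) -> new state
    until is_close(current, goal) holds, then move to the next keyframe."""
    current = initial_state                  # S31: read the initial state
    goals = list(keyframes)
    while goals:                             # S36: stop when D_k is empty
        goal = goals.pop(0)                  # S32: extract the next target
        for _ in range(max_iters):           # bounded retry loop
            if is_close(current, goal):      # S35: consistency check
                break
            action = policy(current, goal)   # S33: predict the action
            current = step_env(action)       # S34: execute and re-observe
    return current
```

In a toy one-dimensional environment where the policy steps the state by ±1 toward the goal, following the keyframes [2, 5] from state 0 ends at state 5.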
A second embodiment of the present invention provides a rapid simulation learning system for robot skill learning, as shown in fig. 4, including: a teaching data acquisition module 100, a key frame extraction module 200, and a teaching reproduction module 300;
the teaching data acquisition module 100 is configured to acquire original teaching data first when the robot needs to learn a new skill; the original teaching data is the state of an operation environment before and after a demonstrator executes teaching actions;
the key frame extraction module 200 is configured to extract key frame teaching data from the original teaching data set by a preset key frame extraction method in combination with a pre-trained evaluator;
the teaching reproduction module 300 is configured to control the robot to reproduce the teaching through a trained control strategy model based on the key frame teaching data, so as to complete the generalization of robot skill learning;
the training method of the control strategy model and the evaluator comprises the following steps:
a10, collecting exploration data and constructing an exploration data set; the exploration data comprises the states of the operation environment before and after the robot executes the action generated by setting the exploration strategy;
a20, taking the state of the operating environment as the input of supervised learning, taking the generated action as the label of the supervised learning, and training the pre-constructed control strategy model by a behavior cloning method;
a30, collecting evaluation data and constructing an evaluation data set; the evaluation data comprises a test task, the success rate and the operation steps when the test task is executed through the trained control strategy model; the test task consists of an initial state and a target state of an operating environment set by a test;
a40, taking a test task as the input of supervised learning, taking the success rate and the operation steps as labels of the supervised learning, and training a pre-constructed evaluating device by a supervised learning method;
the control strategy model is constructed based on a deep neural network with a convolution structure and a recursion structure; the evaluator is constructed based on a deep neural network with a convolution structure.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that, the robot skill learning-oriented fast simulation learning system provided in the foregoing embodiment is only illustrated by the division of the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
An apparatus of a third embodiment of the invention includes: at least one processor; and a memory communicatively coupled to at least one of the processors; wherein the memory stores instructions executable by the processor, the instructions being executed by the processor to implement the above rapid simulation learning method for robot skill learning.
A computer-readable storage medium of a fourth embodiment of the present invention stores computer instructions for execution by a computer to implement the above rapid simulation learning method for robot skill learning.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the above-described apparatus and computer-readable storage medium may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Referring now to FIG. 5, there is illustrated a block diagram of a computer system suitable for use as a server in implementing embodiments of the method, system, and apparatus of the present application. The server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the computer system includes a Central Processing Unit (CPU) 501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. Various programs and data necessary for system operation are also stored in the RAM 503. The CPU 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An Input/Output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output section 507 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage portion 508 including a hard disk and the like; and a communication section 509 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.