Method for designing an ethical agent based on reinforcement learning
1. A method for designing an ethical agent based on reinforcement learning, characterized in that
the method comprises the following steps: S1, generalizing and extracting meta-ethical behaviors from behavior specifications;
S2, grading the meta-ethical behaviors using crowdsourcing to obtain meta-ethical behavior grades;
S3, designing a reward mechanism based on a trajectory tree, the meta-ethical behavior grading, and a reinforcement learning algorithm;
S4, selecting a life scenario and training the ethical agent using the reward mechanism of S3.
2. The method for designing an ethical agent based on reinforcement learning as claimed in claim 1, wherein
the behavior specification is the code of daily behavior for primary and secondary school students.
3. The method for designing an ethical agent based on reinforcement learning as claimed in claim 2, wherein
the specific steps of generalizing and extracting the meta-ethical behaviors from the behavior specification are as follows:
classifying the daily behaviors and extracting the meta-ethical behaviors;
and extracting the opposite of each meta-ethical behavior.
4. The method for designing an ethical agent based on reinforcement learning as claimed in claim 1, wherein
the specific steps of grading the meta-ethical behaviors using crowdsourcing are as follows:
dividing the ethical attributes of the meta-ethical behaviors into 7 levels: compliance with law (L, label 1), violation of law (L, label -1), compliance with regulations (G, label 2), violation of regulations (G, label -2), compliance with morality (V, label 3), violation of morality (V, label -3), and ethics-irrelevant (N, label 0);
designing a task and publishing it to a crowdsourcing platform to obtain data labels;
and integrating the crowdsourcing results by majority voting.
5. The method for designing an ethical agent based on reinforcement learning as claimed in claim 1, wherein
the specific steps of designing the reward mechanism based on the trajectory tree, the meta-ethical behavior grading, and the reinforcement learning algorithm are as follows:
obtaining all trajectory sequences by traversing the trajectory tree from the root node to the leaf nodes, the agent tracking its own progress through the trajectory tree;
mapping the agent's set of executable actions in the environment to the trajectory tree nodes to obtain the mapping relation between them;
if the currently executed action follows the trajectory tree, that is, the node corresponding to the executed action is a successor of the current node, giving a reward and moving the current-node marker to that successor node; otherwise, giving a penalty;
and designing graded rewards based on the meta-ethical behavior grading and superposing them on the trajectory tree reward.
6. The method for designing an ethical agent based on reinforcement learning as claimed in claim 1, wherein
the life scenario is a medicine-buying scenario.
7. The method for designing an ethical agent based on reinforcement learning as claimed in claim 6, wherein
the specific steps of selecting a life scenario and training the ethical agent using the reward mechanism of S3 are as follows:
selecting the Q-learning reinforcement learning algorithm to train the agent, and setting the learning rate alpha, the discount factor gamma, and the greedy factor epsilon all to 0.9;
ending the simulation of an episode and recording the number of steps taken when the agent returns home, with or without the medicine;
and, to prevent the agent from exploring in cycles, setting a per-episode step limit: when the number of steps reaches the limit, the training episode is forcibly ended and the next episode begins.
Background
With the rapid development of science and technology, artificial intelligence has been widely applied in fields such as medical care, transportation, and finance, and agents in various forms, such as intelligent nursing robots and self-driving cars, play increasingly important roles in human life. However, while people enjoy the convenience brought by artificial intelligence, the ethical problems it raises must also be addressed: for example, a factory robot mistaking a worker for a steel plate to be handled, a smart speaker suggesting suicide to its user, or an out-of-control driverless car causing deaths. How to ensure that an agent can comply with basic human ethical norms and interact with humans appropriately and amicably is therefore an urgent problem in the field of artificial intelligence.
Disclosure of Invention
The object of the invention is to provide a method for designing an ethical agent based on reinforcement learning, so that the agent can flexibly and efficiently cope with the human behaviors it may encounter and has stronger ethical judgment capability.
In order to achieve the above object, the present invention provides a method for designing an ethical agent based on reinforcement learning, comprising: S1, generalizing and extracting meta-ethical behaviors from behavior specifications;
S2, grading the meta-ethical behaviors using crowdsourcing to obtain meta-ethical behavior grades;
S3, designing a reward mechanism based on a trajectory tree, the meta-ethical behavior grading, and a reinforcement learning algorithm;
S4, selecting a life scenario and training the ethical agent using the reward mechanism of S3.
The behavior specification is the code of daily behavior for primary and secondary school students.
The specific steps of generalizing and extracting the meta-ethical behaviors from the behavior specification are as follows:
classifying the daily behaviors and extracting the meta-ethical behaviors;
and extracting the opposite of each meta-ethical behavior.
The specific steps of grading the meta-ethical behaviors using crowdsourcing to obtain the meta-ethical behavior grades are as follows:
dividing the ethical attributes of the meta-ethical behaviors into 7 levels: compliance with law (L, label 1), violation of law (L, label -1), compliance with regulations (G, label 2), violation of regulations (G, label -2), compliance with morality (V, label 3), violation of morality (V, label -3), and ethics-irrelevant (N, label 0);
designing a task and publishing it to a crowdsourcing platform to obtain data labels;
and integrating the crowdsourcing results by majority voting.
The specific steps of designing the reward mechanism based on the trajectory tree, the meta-ethical behavior grading, and the reinforcement learning algorithm are as follows:
obtaining all trajectory sequences by traversing the trajectory tree from the root node to the leaf nodes, the agent tracking its own progress through the trajectory tree;
mapping the agent's set of executable actions in the environment to the trajectory tree nodes to obtain the mapping relation between them;
if the currently executed action follows the trajectory tree, that is, the node corresponding to the executed action is a successor of the current node, giving a reward and moving the current-node marker to that successor node; otherwise, giving a penalty;
and designing graded rewards based on the meta-ethical behavior grading and superposing them on the trajectory tree reward.
The life scenario is a medicine-buying scenario.
The specific steps of selecting a life scenario and training the ethical agent using the reward mechanism of S3 are as follows:
selecting the Q-learning reinforcement learning algorithm to train the agent, and setting the learning rate alpha, the discount factor gamma, and the greedy factor epsilon all to 0.9;
ending the simulation of an episode and recording the number of steps taken when the agent returns home, with or without the medicine;
and, to prevent the agent from exploring in cycles, setting a per-episode step limit: when the number of steps reaches the limit, the training episode is forcibly ended and the next episode begins.
The invention provides a method for designing an ethical agent based on reinforcement learning, comprising the following steps: S1, generalizing and extracting meta-ethical behaviors from behavior specifications; S2, grading the meta-ethical behaviors using crowdsourcing to obtain meta-ethical behavior grades; S3, designing a reward mechanism based on a trajectory tree, the meta-ethical behavior grading, and a reinforcement learning algorithm; and S4, selecting a life scenario and training the ethical agent using the reward mechanism of S3. The invention has the following advantages:
1. The method extracts meta-ethical behaviors from the code of daily behavior for primary and secondary school students, generalizing similar behaviors across different scenarios. It can generalize a wide range of behaviors in people's daily lives, ensures the generality of the environment, and alleviates the problem of limited scenarios to a certain extent.
2. Grading the meta-ethical behaviors through crowdsourcing saves time and collects judgments of the meta-ethical behaviors from people with different cultural backgrounds.
3. Combining the meta-ethical behavior grading with the trajectory tree perfects the reward and punishment mechanism of reinforcement learning, so that the agent can flexibly and efficiently cope with the human behaviors it may encounter and has stronger ethical judgment capability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of the method for designing an ethical agent based on reinforcement learning of the present invention;
FIG. 2 is a flow chart of generalizing and extracting meta-ethical behaviors from the behavior specifications;
FIG. 3 is a flow chart of grading the meta-ethical behaviors using crowdsourcing to obtain the meta-ethical behavior grades;
FIG. 4 is a flow chart of designing the reward mechanism based on the trajectory tree, the meta-ethical behavior grading, and the reinforcement learning algorithm;
FIG. 5 is a flow chart of selecting a life scenario and training the ethical agent using the reward mechanism of S3;
FIG. 6 is a diagram of the overall scenario and the pharmacy environment;
FIG. 7 is a chart of action execution rates when the agent obtains a prescription;
FIG. 8 is a chart of action execution rates when the agent does not obtain a prescription;
FIG. 9 is a diagram of the overall architecture of training the agent by reinforcement learning.
1: buy-medicine curve; 2: rob-money curve; 3: queue-jumping curve; 4: abnormal-termination curve; 5: return-excess-change curve; 6: help-the-elderly curve; 7: steal-medicine curve; 8: attack-pharmacy curve; 9: pharmacy-refuses-to-sell curve.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.
Referring to FIGS. 1 to 9, the present invention provides a method for designing an ethical agent based on reinforcement learning, comprising:
S1, generalizing and extracting meta-ethical behaviors from the behavior specifications;
The behavior specification is the code of daily behavior for primary and secondary school students. This code, issued by the education department, plays an important role in forming good behavior habits among primary and secondary school students and in building a good school ethos and teaching ethos. It reflects, in concentrated form, the basic requirements on students' ideology and daily behavior, helps students establish correct ideals and beliefs and develop good behavior habits, and promotes healthy physical and mental development. The specific steps are as follows:
S11, classifying the daily behaviors and extracting the meta-ethical behaviors;
Specifically: behaviors such as throwing scraps of paper, bottles, or plastic bags on the road generalize to T(x), littering in any setting. Similarly, O(x) denotes attacking or threatening others in any setting; M(x) denotes passing off counterfeit goods as genuine; P(x) denotes defacing or damaging public property; I(x) denotes complying with relevant regulations; C(x) denotes cutting in line in any setting; D(x) denotes failing to help others in difficulty; and so on.
S12, extracting the opposite of each meta-ethical behavior.
Each meta-ethical behavior is paired with its opposite; for example, the opposite of the failure-to-help behavior D(x) is helping those in difficulty.
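For illustration, the predicates above can be represented directly in code. The following is a minimal Python sketch of S11 and S12; the behavior names mirror the predicates T(x), O(x), M(x), P(x), I(x), C(x), and D(x) above, while the particular opposite pairings shown are assumptions for illustration, not taken from the specification itself.

```python
# Meta-ethical behaviors as named predicates (S11). The descriptions follow
# the text above; the opposite pairings (S12) are illustrative assumptions.
META_ETHICAL_BEHAVIORS = {
    "T(x)": "litter in any setting",
    "O(x)": "attack or threaten others",
    "M(x)": "pass off counterfeit goods as genuine",
    "P(x)": "deface or damage public property",
    "I(x)": "comply with relevant regulations",
    "C(x)": "cut in line",
    "D(x)": "fail to help others in difficulty",
}

# Opposite-face extraction: each behavior paired with its negation.
OPPOSITES = {
    "D(x)": "help others in difficulty",
    "C(x)": "queue in an orderly manner",
    "T(x)": "dispose of garbage properly",
}
```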
S2, grading the meta-ethical behaviors using crowdsourcing to obtain the meta-ethical behavior grades;
The specific steps are as follows:
S21, dividing the ethical attributes of the meta-ethical behaviors into 7 levels;
Comprehensively considering laws, regulations, morality, and other factors, the ethical attributes of the meta-ethical behaviors are divided into 7 levels: compliance with law (L, label 1), violation of law (L, label -1), compliance with regulations (G, label 2), violation of regulations (G, label -2), compliance with morality (V, label 3), violation of morality (V, label -3), and ethics-irrelevant (N, label 0).
S22, designing a task and publishing it to a crowdsourcing platform to obtain annotation data;
A task is designed and published to the platform; the task design layout is shown in Table 1.
TABLE 1 Crowdsourcing task design layout for meta-ethical behavior grading
S23, integrating the crowdsourcing results by majority voting;
The annotation data obtained by crowdsourcing are integrated by majority voting, and the grade of each meta-ethical behavior is determined as follows:
for any task t, each worker independently casts a vote v in {1, -1, 2, -2, 3, -3, 0}; v_j (j in {1, -1, 2, -2, 3, -3, 0}) denotes the total number of votes for label j, and the grade of t is the label j that maximizes v_j.
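A minimal Python sketch of this majority-voting integration, assuming each worker's vote is one of the seven labels defined in S21; ties are broken arbitrarily here:

```python
from collections import Counter

def integrate_votes(votes):
    """Return the winning label j maximizing the vote count v_j."""
    counts = Counter(votes)              # v_j: total votes per label
    label, _ = counts.most_common(1)[0]  # majority vote (ties arbitrary)
    return label

# Example: seven workers grade one task; label -1 (law violation) wins.
print(integrate_votes([-1, -1, -1, 3, 0, -1, -3]))  # -> -1
```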
Each behavior has a corresponding label, and different labels represent different levels; the final results are shown in Table 2. The meta-ethical behaviors can be summarized as V = {B(x), T(x), D(x), C(x)}, G = {I(x)}, L = {S(x), O(x), M(x), P(x)}, L = {S(x), B(x), O(x), M(x), P(x)}, N = {D(x)}.
TABLE 2
S3, designing the reward mechanism based on the trajectory tree, the meta-ethical behavior grading, and the reinforcement learning algorithm;
The specific steps are as follows:
S31, obtaining all trajectory sequences by traversing the trajectory tree from the root node to the leaf nodes, the agent tracking its own progress through the trajectory tree;
The trajectory tree is a tree that constrains the order in which actions may occur; it encodes the conventional ways in which humans carry out a task, covering all of its possible executions. All trajectory sequences are obtained by traversing downward from the root, and the agent can track its own progress along a trajectory of the tree.
S32, mapping the agent's set of executable actions in the environment to the trajectory tree nodes to obtain the mapping relation between them;
S33, if the currently executed action follows the trajectory tree, that is, the node corresponding to the executed action is a successor of the current node, giving a reward and moving the current-node marker to that successor node; otherwise, giving a penalty;
S34, designing graded rewards based on the meta-ethical behavior grading and superposing them on the trajectory tree reward.
The graded behavior reward is designed according to the following formula and superposed on the trajectory tree reward. The trajectory tree reward follows the root-to-leaf traversal order of the tree: the agent receives a reward of 10 for following it and a reward of -10 otherwise:

R(s_t, a_t) = R_tree(s_t, a_t) + F(s_t, a_t)

where F(s_t, a_t) is the reward function based on the meta-ethical behaviors, whose value depends on the grade of the behavior, and a_t|s_t denotes the action a_t executed by the agent in state s_t.
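A sketch of this superposition in Python, under stated assumptions: the trajectory-tree reward is +10 when the executed action's node is a successor of the current node and -10 otherwise, as described above, while the grade-to-reward scaling (label times 5) and the example action grades are hypothetical choices for illustration:

```python
# Hypothetical meta-ethical grades (labels from S21) for a few actions.
GRADE_OF_ACTION = {
    "buy_medicine": 0,       # ethics-irrelevant task action
    "cut_in_line": -3,       # morality violation
    "help_elderly": 3,       # morality compliance
    "steal_medicine": -1,    # law violation
}

def ethical_reward(action):
    """F(s_t, a_t): graded reward from the behavior's label (assumed scaling)."""
    return GRADE_OF_ACTION.get(action, 0) * 5

def total_reward(current_node, action, tree):
    """R(s_t, a_t): trajectory-tree reward superposed with the graded reward."""
    follows_tree = action in tree.get(current_node, [])
    tree_reward = 10 if follows_tree else -10
    return tree_reward + ethical_reward(action)
```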
S4, selecting a life scenario and training the ethical agent using the reward mechanism of S3.
The life scenario is a medicine-buying scenario, a common situation in real life. In this environment, the agent's main objective is to return home with the prescription medicine while violating ethical and moral norms as little as possible, and its behavior states are mapped into the reinforcement learning environment. The pharmacy scenario is a 7x7 grid containing three roles: the agent, the pharmacy clerk, and a customer. The agent, located near exit E, is the character being trained; the cells marked X form the pharmacy counter, and the character at the counter is the clerk, whose responsibility is to check prescriptions and sell medicine. To buy the prescription medicine, the agent must present a prescription, which it obtains by going to the hospital and consulting a doctor; without a prescription, the clerk will not sell the medicine. The character below the counter is another customer transacting with the clerk, which means the agent must wait in line before interacting with the clerk. The agent may choose to skip the queue and communicate directly with the clerk to reduce the number of steps taken, but such queue-jumping is considered unethical.
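For illustration only, the pharmacy environment described above might be laid out as follows; the exact layout, symbols, and action set are assumptions, since the patent's figures are not reproduced here:

```python
# Hypothetical 7x7 pharmacy grid: 'A' agent, 'E' exit, 'X' counter cells,
# 'P' pharmacy clerk, 'C' a queuing customer, '.' free cell.
PHARMACY_GRID = [list(row) for row in (
    ".......",
    "..XXX..",
    "..XPX..",
    "...C...",
    ".......",
    ".......",
    "E.A....",
)]

# Assumed executable action set to be mapped onto trajectory-tree nodes (S32).
ACTIONS = ["up", "down", "left", "right", "show_prescription",
           "buy_medicine", "cut_in_line", "steal_medicine"]
```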
The specific steps are as follows:
S41, selecting the Q-learning reinforcement learning algorithm to train the agent, and setting the learning rate alpha, the discount factor gamma, and the greedy factor epsilon all to 0.9;
To train reasonably and effectively, the Q-learning algorithm is selected to train the agent, with the learning rate alpha, the discount factor gamma, and the greedy factor epsilon all set to 0.9.
S42, ending the simulation of an episode and recording the number of steps taken when the agent returns home with or without the medicine;
The simulation of an episode ends when the agent returns home, whether or not it has purchased the medicine.
S43, setting a per-episode step limit: when the number of steps reaches the limit, the training episode is forcibly ended and the next episode begins.
To prevent the agent from exploring in cycles, the per-episode step limit is set to 200; an episode that reaches this limit is forcibly ended and the next round of training begins.
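A minimal sketch of the training loop in S41 to S43, assuming a gym-like environment with reset() and step() methods (not defined in the patent) and reading the greedy factor epsilon = 0.9 as the probability of exploiting the greedy action, a common convention:

```python
import random
from collections import defaultdict

ALPHA = GAMMA = EPSILON = 0.9   # learning rate, discount factor, greedy factor (S41)
STEP_LIMIT = 200                # per-episode step limit (S43)

def train(env, actions, episodes=1000):
    """Tabular Q-learning over the medicine-buying environment."""
    Q = defaultdict(float)                        # Q[(state, action)] -> value
    for _ in range(episodes):
        state, done, steps = env.reset(), False, 0
        while not done and steps < STEP_LIMIT:    # force-end looping episodes (S43)
            if random.random() < EPSILON:         # exploit with probability epsilon
                action = max(actions, key=lambda a: Q[(state, a)])
            else:                                 # otherwise explore at random
                action = random.choice(actions)
            next_state, reward, done = env.step(action)   # assumed 3-tuple API
            best_next = max(Q[(next_state, a)] for a in actions)
            # Standard Q-learning update with alpha = gamma = 0.9
            Q[(state, action)] += ALPHA * (reward + GAMMA * best_next
                                           - Q[(state, action)])
            state, steps = next_state, steps + 1
    return Q
```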
The experimental results are then analyzed. Because whether the agent obtains a prescription at the hospital is random, the results are split into the with-prescription and without-prescription cases. To study how each action's execution rate (the ratio of the number of times the action is executed to the number of episodes) varies with the number of training episodes, a test is run every 10 episodes over 1000 episodes of training, and 100 results are averaged. The final results are shown in FIGS. 7 and 8, which plot the agent's performance with and without a prescription; the abscissa is the number of training episodes and the ordinate is the action execution rate.

As can be seen from FIG. 7, the buy-medicine curve rises gradually with the number of training episodes and levels off, indicating that the agent has learned to buy the medicine; the rob-money and queue-jumping curves fall immediately after a small initial rise, indicating that the agent first tries these unethical behaviors and then avoids them after being punished; the abnormal-termination curve is close to 1 at the start of training, indicating that the agent keeps trying various actions and exceeds the per-episode step limit; the return-excess-change curve has an execution rate of 0 before the agent learns to reach the pharmacy, since the agent cannot return excess change it never receives; and the help-the-elderly curve converges fastest, because the agent encounters the elderly person in need of help as soon as it leaves home and receives the corresponding reward under the ethical grading mechanism. The rise and convergence of this curve illustrate the effectiveness of the ethical grading mechanism.

As can be seen from FIG. 8, when no prescription is obtained at the hospital, the pharmacy-refuses-to-sell rate gradually increases: early in training, after being refused, the agent attacks the pharmacy and eventually steals the medicine, but as training progresses it learns to obey the ethical norms and not to attack the pharmacy or steal the medicine, so these behaviors gradually decrease and stabilize, and the episode finally ends with the pharmacy refusing to sell. The proportions of robbing money and jumping the queue likewise decrease with training and eventually stabilize.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.