The invention discloses a reward self-learning method for reinforcement learning in a discrete manufacturing scenario. The method comprises the following steps: refining the processes of the current production line, wherein each target g belongs to G = {g1, g2, ..., gN}, and each run in which the agent reaches a preset target g is recorded as an interaction sequence (an episode); according to the initial parameters, obtaining multiple episodes with g1 as the target, and inputting the state-action pairs in those episodes, together with the state differences Δ, into a GPR module as a training data set, thereby obtaining a system state transition model based on state differences; letting the agent continue to interact with the environment to obtain a new state st, whereupon a Reward network outputs r(st), an Actor network outputs a(st), a Critic network outputs V(st), and the GPR module outputs a value function Vg that serves as the overall update direction; when |Vg - V(st)| < ε, considering reward function learning for the current process complete and saving the parameters of the Reward network; continuing the interaction to generate episodes with the next sub-target g_{n+1} as the update direction, which are used to update the GPR module; and, when all targets in the set G = {g1, g2, ..., gN} have been achieved in sequence, finishing the process learning of the production line.
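
The two mechanical parts of the steps above, building the state-difference training set and advancing through the sub-targets once |Vg - V(st)| < ε, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names, the callables gpr_value, critic_value and save_reward_params, and the max_steps limit are all hypothetical, and the GPR fitting and the Reward, Actor and Critic network updates are deliberately left out.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]
Transition = Tuple[Vector, Vector, Vector]  # (s_t, a_t, s_next)


def build_delta_dataset(
    episode: Sequence[Transition],
) -> Tuple[List[Vector], List[Vector]]:
    """Turn recorded transitions into training pairs for a state-difference model.

    Each input is the concatenated state-action vector; each regression
    target is the state difference delta = s_next - s_t.
    """
    inputs: List[Vector] = []
    deltas: List[Vector] = []
    for s, a, s_next in episode:
        inputs.append(tuple(s) + tuple(a))
        deltas.append(tuple(nxt - cur for nxt, cur in zip(s_next, s)))
    return inputs, deltas


def learn_rewards_in_sequence(
    targets: Sequence[str],
    gpr_value: Callable[[str], float],          # hypothetical: returns Vg for a sub-target
    critic_value: Callable[[str, int], float],  # hypothetical: returns V(st) at a given step
    save_reward_params: Callable[[str], None],  # hypothetical: stores the Reward network
    epsilon: float = 1e-2,
    max_steps: int = 1000,                      # assumption: safety limit, not in the source
) -> Dict[str, int]:
    """Visit sub-targets in order; move on once |Vg - V(st)| < epsilon."""
    steps_needed: Dict[str, int] = {}
    for g in targets:
        for step in range(max_steps):
            if abs(gpr_value(g) - critic_value(g, step)) < epsilon:
                save_reward_params(g)  # reward learning for this process is done
                steps_needed[g] = step
                break
        else:
            raise RuntimeError(f"no convergence for sub-target {g!r}")
    return steps_needed


if __name__ == "__main__":
    # Toy stand-ins: the critic estimate approaches Vg geometrically.
    vg = {"g1": 1.0, "g2": 2.0}
    stored: List[str] = []
    result = learn_rewards_in_sequence(
        ["g1", "g2"],
        gpr_value=lambda g: vg[g],
        critic_value=lambda g, k: vg[g] * (1 - 0.5 ** k),
        save_reward_params=stored.append,
    )
    print(result, stored)
```

In a real system the inputs and deltas returned by build_delta_dataset would be passed to whatever regression model plays the role of the GPR module, and critic_value would query a trained Critic network rather than a closed-form stand-in.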