Multi-agent flexible workshop production scheduling method and system based on conditional attribution mechanism

By adopting a multi-agent hierarchical decision-making framework and conditional attribution mechanism in flexible workshop production scheduling, the problem of low human-machine collaboration efficiency in flexible workshops is solved, achieving efficient and robust production scheduling, reducing computational overhead and improving real-time response capabilities.

CN122284552A · Pending · Publication Date: 2026-06-26 · HOHAI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: HOHAI UNIV
Filing Date: 2026-05-12
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing technologies do not effectively incorporate human-machine collaboration into flexible workshop production scheduling, which leads to insufficient system flexibility and low human-machine collaboration efficiency. Traditional methods also incur high computational overhead and have poor real-time performance, making it difficult to respond dynamically to newly arriving jobs.

Method used

A multi-agent flexible workshop production scheduling method based on conditional attribution mechanism is adopted. The scheduling problem is decomposed into two sub-problems: job selection and resource allocation. Job agents and resource agents make decisions separately. A conditional expectation mapping network and a heterogeneous value network are constructed by using a proximal policy optimization method and a heterogeneous gradient decoupling mechanism to achieve refined and differentiated policy training.

Benefits of technology

It effectively decouples multi-objective optimization tasks, enables refined collaboration among multiple agents, improves dynamic response capabilities and scheduling quality, and enhances the scheduling robustness and efficiency of the flexible workshop.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a multi-agent flexible workshop production scheduling method and system based on a conditional attribution mechanism. For flexible workshop production scheduling, two agents are established: a job agent and a resource agent. The job agent selects a job action based on the global state of the workshop, which includes selecting the job to be processed from the set of unassigned jobs. The resource agent selects a resource action based on the global state and the local state of the job to be processed, which includes selecting the processing resources and processing mode for that job. The flexible workshop then processes the job with the selected processing resources and processing mode. During the policy update phase, the outputs of the job-side conditional expectation mapping network and the resource-side conditional expectation mapping network are used to calculate the individual attribution advantages of the job agent and the resource agent, respectively. This invention can improve the dynamic response capability, scheduling quality, and robustness of flexible workshop production scheduling under dynamic human-machine collaboration.

Description

Technical Field

[0001] This invention relates to the field of workshop production scheduling technology, and in particular to a multi-agent scheduling method and system for flexible workshops.

Background Technology

[0002] Intelligent manufacturing, as a core path for the transformation and upgrading of the manufacturing industry, aims to build a highly interconnected and intelligent production system. However, while emphasizing technological integration and automation, the development model of intelligent manufacturing often neglects the core role and initiative of "humans" in the production system, leading to increasingly prominent problems such as insufficient system flexibility, low efficiency of human-machine collaboration, and uneven worker workload. With increasingly personalized market demands and the shift of production models towards small-batch, multi-variety production, modern manufacturing workshops have gradually evolved into complex systems with features such as flexible resource scheduling and dynamic task response. Against this backdrop, human-machine collaborative operation models, by integrating the high efficiency and precision of machines with the flexible adaptability of humans, provide an effective path for the transformation and upgrading of manufacturing systems. Therefore, designing intelligent scheduling methods that comprehensively consider multiple factors in the workshop—humans, machines, and materials—and collaboratively optimize production efficiency and worker well-being has become crucial for the manufacturing industry to enhance its competitiveness.

[0003] Shop floor scheduling is considered a complex resource optimization problem in operations research, involving multiple aspects such as job selection strategies, resource allocation, processing mode decisions, and multi-objective optimization. With the expansion of shop floor size, the increased randomness of job arrivals, and the diversification of human-machine collaboration modes, the complexity of solving this problem is further exacerbated. Traditional shop floor scheduling employs metaheuristic algorithms; however, metaheuristic algorithms assume that all job information is completely known in advance, making them offline static scheduling methods. They cannot respond to the arrival of new jobs in real time online. Even with a rolling window strategy, a complete iterative search must be re-executed at each decision point, resulting in high computational overhead and poor real-time performance.

[0004] The rapid development of deep reinforcement learning technology has provided new ideas and methods for solving these problems, including single-agent deep reinforcement learning and multi-agent deep reinforcement learning. However, single-agent reinforcement learning adopts a centralized decision-making model, which requires handling complex multi-dimensional relationships simultaneously, easily leading to parameter explosion and the curse of dimensionality in the policy network. Moreover, learning from scratch lacks prior knowledge guidance, resulting in low training efficiency and slow convergence. Although multi-agent deep reinforcement learning assigns different decision dimensions to different agents, it is still relatively crude in terms of multi-objective optimization signal processing and agent responsibility attribution, easily leading to optimization interference between objectives and confusion of policy update signals, affecting training stability and scheduling performance.

Summary of the Invention

[0005] Purpose of the invention: The purpose of this invention is to provide a multi-agent scheduling method and system for flexible workshops, which dynamically optimizes collaborative decision-making in job selection and resource allocation.

[0006] Technical solution: The present invention provides a multi-agent flexible shop floor production scheduling method based on conditional attribution mechanism, comprising the following steps:

[0007] For flexible workshop production scheduling, a job agent and a resource agent are established. The job agent selects a job action based on the global state of the workshop; the job action includes selecting the job to be processed from the set of unassigned jobs. The resource agent selects a resource action based on the global state and the local state of the job to be processed; the resource action includes selecting the processing resources and processing mode of the job to be processed.

[0008] The flexible workshop processes the job to be processed using its selected processing resources and processing mode.

[0009] In the policy update phase of the agents, the policy network corresponding to each agent is updated using a proximal policy optimization method. Specifically, the individual attribution advantages of the job agent and the resource agent are calculated using the outputs of the job-side conditional expectation mapping network and the resource-side conditional expectation mapping network, respectively. The job-side conditional expectation mapping network is used to calculate the average expected return over all possible job actions with the resource action held fixed. The resource-side conditional expectation mapping network is used to calculate the average expected return over all possible resource actions with the job action held fixed.

[0010] Furthermore, the optimization objectives include minimizing job delay time and minimizing the standard deviation of worker workload;

[0011] Two independent value networks are used to calculate the first value estimate, for the objective of minimizing job delay time, and the second value estimate, for the objective of minimizing the standard deviation of worker workload.

[0012] The first GAE advantage is calculated based on the first value estimate using the generalized advantage estimation method, and the second GAE advantage is calculated based on the second value estimate.

[0013] The joint advantage is obtained by weighted summation of the first GAE advantage and the second GAE advantage.
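The two-path advantage computation in paragraphs [0010] to [0013] can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the discount `gamma`, the GAE parameter `lam`, and the equal weights are assumptions, since the text does not state them.

```python
from typing import List, Sequence


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation for one objective.

    `values` must hold len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state after the final decision point.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error for this objective.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def joint_advantage(adv_delay: Sequence[float], adv_load: Sequence[float],
                    w_delay: float = 0.5, w_load: float = 0.5) -> List[float]:
    """Weighted sum of the per-objective GAE advantages (weights assumed)."""
    return [w_delay * a + w_load * b for a, b in zip(adv_delay, adv_load)]
```

Each objective keeps its own reward stream and its own value estimates, and the two are only combined at the advantage level, which is how the text describes the weighted fusion.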

[0014] Furthermore, the job-side conditional expectation mapping network predicts the first value and first reward obtained by executing a job-side fixed joint action. After a specified number of such rollouts, the average expected return on the job side is calculated from the first value and the first reward. The resource-side conditional expectation mapping network predicts the second value and second reward obtained by executing a resource-side fixed joint action. After a specified number of such rollouts, the average expected return on the resource side is calculated from the second value and the second reward.

[0015] The job-side fixed joint action consists of a fixed resource action and an arbitrary job action;

[0016] The resource-side fixed joint action consists of a fixed job action and an arbitrary resource action.

[0017] Furthermore, the individual attribution advantage of the job agent for taking the joint action is calculated based on the first value estimate, the second value estimate, the joint advantage, and the average expected return on the job side; the individual attribution advantage of the resource agent for taking the joint action is calculated based on the first value estimate, the second value estimate, the joint advantage, and the average expected return on the resource side.

[0018] The joint action comprises the job action and the resource action.
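The conditional baseline described in paragraphs [0014] to [0016] can be illustrated with a small sketch. Everything named here is hypothetical: `predict` stands in for a conditional expectation mapping network, the rollouts simply cycle through the agent's own actions, and the difference form in `individual_advantage` is an assumed counterfactual-baseline reading, because this section does not give the exact attribution formula.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical stand-in for a conditional expectation mapping network:
# (job_action, resource_action) -> (predicted reward, predicted next value).
Predictor = Callable[[int, int], Tuple[float, float]]


def conditional_expected_return(predict: Predictor, fixed_other_action: int,
                                own_actions: Sequence[int], vary_job: bool,
                                num_rollouts: int, gamma: float = 0.99) -> float:
    """Average predicted return while the other agent's action stays fixed."""
    total = 0.0
    for k in range(num_rollouts):
        own = own_actions[k % len(own_actions)]  # cycle through own actions
        if vary_job:
            job_a, res_a = own, fixed_other_action   # job side: resource fixed
        else:
            job_a, res_a = fixed_other_action, own   # resource side: job fixed
        reward, value = predict(job_a, res_a)
        total += reward + gamma * value
    return total / num_rollouts


def individual_advantage(joint_return: float, baseline: float) -> float:
    """Assumed form: the joint estimate minus the agent's conditional baseline."""
    return joint_return - baseline
```

Subtracting a baseline that averages over an agent's own alternatives, with the partner's action held fixed, isolates how much that agent's specific choice contributed. This is the general idea behind counterfactual baselines in multi-agent credit assignment.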

[0019] Furthermore, the global state includes the average late delivery time of completed jobs, the standard deviation of late delivery time of completed jobs, the size of the unassigned job set, the average slack (relaxation) time of unassigned jobs, the standard deviation of slack time of unassigned jobs, the average workload of all workers, the standard deviation of workload of all workers, the average utilization rate of all robots, and the standard deviation of utilization rate of all robots.

[0020] The local state includes the basic processing time and delivery deadline of the job to be processed.
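As an illustration of paragraphs [0019] and [0020], the following sketch assembles the nine global features and the two local features into state vectors. The function names, the feature ordering, the use of the population standard deviation, and the zero defaults for empty collections are assumptions; the text lists the features but not how they are computed or normalized.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def _mean_std(xs: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation; (0, 0) for an empty input."""
    if not xs:
        return 0.0, 0.0
    return float(mean(xs)), float(pstdev(xs))


def global_state(completed_tardiness: Sequence[float],
                 unassigned_slack: Sequence[float],
                 worker_loads: Sequence[float],
                 robot_utilization: Sequence[float]) -> List[float]:
    """Nine-dimensional global state in the order listed in the text."""
    t_mean, t_std = _mean_std(completed_tardiness)
    s_mean, s_std = _mean_std(unassigned_slack)
    w_mean, w_std = _mean_std(worker_loads)
    r_mean, r_std = _mean_std(robot_utilization)
    return [t_mean, t_std, float(len(unassigned_slack)),
            s_mean, s_std, w_mean, w_std, r_mean, r_std]


def enhanced_state(g: Sequence[float], base_processing_time: float,
                   due_date: float) -> List[float]:
    """Global state plus the local state of the selected job (11 values)."""
    return list(g) + [float(base_processing_time), float(due_date)]
```

The job agent would consume the output of `global_state`, and the resource agent the output of `enhanced_state`.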

[0021] Furthermore, the job agent selecting a job action based on the global state of the workshop includes:

[0022] The job agent selects a job rule according to the job rule probability distribution output by the job policy network, and selects the job to be processed from the set of unassigned jobs according to that job rule; the input to the job policy network is the global state.

[0023] The job rules include selecting the job with the shortest basic processing time, selecting the job with the longest basic processing time, selecting the job with the shortest slack (relaxation) time, and selecting the job with the longest slack time.

[0024] The job policy network adopts a multilayer perceptron structure.
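The four job rules in paragraph [0023] map naturally onto a small dispatch table. The sketch below is illustrative only: the `Job` fields, the rule indices, and the definition of slack time as due date minus current time minus basic processing time are assumptions, since the text does not define the slack (relaxation) time formula.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Job:
    job_id: int
    base_time: float   # basic processing time
    due_date: float    # delivery deadline


def slack(job: Job, now: float) -> float:
    """Assumed slack definition: time to spare before the deadline."""
    return job.due_date - now - job.base_time


# Rule index -> function choosing one job from the unassigned set.
JOB_RULES: Dict[int, Callable[[List[Job], float], Job]] = {
    0: lambda jobs, now: min(jobs, key=lambda j: j.base_time),    # shortest processing
    1: lambda jobs, now: max(jobs, key=lambda j: j.base_time),    # longest processing
    2: lambda jobs, now: min(jobs, key=lambda j: slack(j, now)),  # shortest slack
    3: lambda jobs, now: max(jobs, key=lambda j: slack(j, now)),  # longest slack
}
```

The job agent's action is then just an index into `JOB_RULES`, sampled from the policy network's probability distribution.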

[0025] Furthermore, the resource agent selecting a resource action based on the global state and the local state of the job to be processed includes:

[0026] The resource agent selects a resource rule according to the resource rule probability distribution output by the resource policy network, and determines the processing resources and processing mode of the job to be processed according to that resource rule; the input of the resource policy network is the global state and the local state.

[0027] The resource rules include selecting the worker with the shortest current cumulative working time, selecting the worker with the longest current cumulative working time, selecting the worker with the shortest waiting time, selecting the worker with the longest waiting time, selecting the robot with the shortest waiting time, selecting the robot with the longest waiting time, selecting the worker with the shortest current cumulative working time and the robot with the shortest waiting time, selecting the worker with the longest current cumulative working time and the robot with the longest waiting time, selecting the worker with the shortest waiting time and the robot with the shortest waiting time, and selecting the worker with the longest waiting time and the robot with the longest waiting time.

[0028] The resource policy network adopts a multilayer perceptron structure.
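The ten resource rules in paragraph [0027] can be encoded compactly as combinations of a worker criterion and a robot criterion. In this sketch the `Resource` fields, the rule ordering, and the mode labels are assumptions. In particular, reading worker-only rules as independent worker processing, robot-only rules as independent robot processing, and paired rules as human-robot collaboration is an interpretation of the text, not something it states rule by rule.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Resource:
    name: str
    cumulative_work: float = 0.0   # cumulative working time (workers)
    waiting_time: float = 0.0      # time spent idle so far

# (worker criterion, worker picks longest?, robot picks longest?)
# None means that resource type is not used by the rule.
RESOURCE_RULES = [
    ("cumulative_work", False, None), ("cumulative_work", True, None),
    ("waiting_time", False, None), ("waiting_time", True, None),
    (None, None, False), (None, None, True),
    ("cumulative_work", False, False), ("cumulative_work", True, True),
    ("waiting_time", False, False), ("waiting_time", True, True),
]


def apply_resource_rule(rule_index: int, workers: List[Resource],
                        robots: List[Resource]
                        ) -> Tuple[Optional[Resource], Optional[Resource], str]:
    """Return (worker, robot, mode) for a rule index in 0..9."""
    worker_key, worker_longest, robot_longest = RESOURCE_RULES[rule_index]
    worker = robot = None
    if worker_key is not None:
        pick = max if worker_longest else min
        worker = pick(workers, key=lambda w: getattr(w, worker_key))
    if robot_longest is not None:
        pick = max if robot_longest else min
        robot = pick(robots, key=lambda r: r.waiting_time)
    if worker is not None and robot is not None:
        mode = "collaborative"
    else:
        mode = "worker" if worker is not None else "robot"
    return worker, robot, mode
```

Because the rule itself fixes which resource types are involved, choosing a rule index determines both the processing mode and the concrete worker and robot assignment.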

[0029] Furthermore, when a new job arrives in the workshop or a job is completed, the current moment is a scheduling decision point if there are both idle workstations and unassigned jobs at that moment.

[0030] At the scheduling decision point, the job agent and the resource agent are invoked to select the job to be processed, its processing resources, and its processing mode. If the processing resources are idle, processing of the job starts immediately; otherwise, processing begins once the processing resources become idle.
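The trigger condition in paragraphs [0029] and [0030] is simple enough to state as code. This is a sketch under assumed names: the event labels and the representation of resource availability as "free at" timestamps are not from the source.

```python
from typing import Optional


def is_decision_point(event: str, idle_workstations: int,
                      unassigned_jobs: int) -> bool:
    """True when an arrival or completion leaves both capacity and work."""
    triggered = event in ("job_arrival", "job_completion")
    return triggered and idle_workstations > 0 and unassigned_jobs > 0


def start_time(now: float, worker_free_at: Optional[float] = None,
               robot_free_at: Optional[float] = None) -> float:
    """Start immediately if the chosen resources are idle, else wait for them."""
    ready = [t for t in (worker_free_at, robot_free_at) if t is not None]
    return max([now] + ready)
```

For example, a job assigned at time 5 to a worker who is busy until time 7 would start at time 7, while the same job assigned to an idle worker would start at time 5.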

[0031] The present invention discloses a multi-agent flexible workshop production scheduling system based on conditional attribution mechanism, comprising an environment interaction layer and an agent decision-making layer;

[0032] The environment interaction layer is used to maintain the global state of the workshop and the local state of the job to be processed;

[0033] The agent decision-making layer is used to establish a job agent and a resource agent for flexible workshop production scheduling. The job agent selects a job action based on the global state of the workshop; the job action includes selecting the job to be processed from the set of unassigned jobs. The resource agent selects a resource action based on the global state and the local state of the job to be processed; the resource action includes selecting the processing resources and processing mode of the job to be processed.

[0034] In the policy update phase of the agents, the policy network corresponding to each agent is updated using a proximal policy optimization method. Specifically, the individual attribution advantages of the job agent and the resource agent are calculated using the outputs of the job-side conditional expectation mapping network and the resource-side conditional expectation mapping network, respectively. The job-side conditional expectation mapping network is used to calculate the average expected return over all possible job actions with the resource action held fixed. The resource-side conditional expectation mapping network is used to calculate the average expected return over all possible resource actions with the job action held fixed.

[0035] The flexible workshop processes the job to be processed using its selected processing resources and processing mode.

[0036] The computer program product of the present invention includes a computer program that, when executed by a processor, implements the multi-agent flexible workshop production scheduling method based on conditional attribution mechanism.

[0037] Beneficial effects: Compared with the prior art, the advantages of the present invention are as follows:

[0038] (1) This invention decomposes the complex scheduling problem of flexible workshop into two sub-problems: job selection and resource allocation. It adopts a dual-agent hierarchical decision-making framework. The job agent is responsible for selecting jobs from the set of unassigned jobs, and the resource agent is responsible for allocating processing resources for the selected jobs and determining the processing mode. This not only reduces the learning difficulty of a single agent directly processing the high-dimensional joint action space, but also preserves the conditional dependency between the two decision-making links.

[0039] (2) The present invention adopts a conditional attribution mechanism to construct an expectation mapping network conditional on the actions of the other party, calculates the average expected return using a finite inference method, separates the independent contributions of each agent from the joint advantage, generates the individual attribution advantages of the task agent and the resource agent, so that each agent is responsible for its own decision and achieves refined and differentiated strategy training.

[0040] (3) The present invention adopts a heterogeneous gradient decoupling mechanism to construct a heterogeneous value network to independently estimate the value functions of the two objectives, and then forms a unified advantage function through weighted fusion, so that the heterogeneous gradients can propagate independently in their respective networks, overcoming gradient conflicts and optimization difficulties.

[0041] In summary, this invention can effectively decouple multi-objective optimization tasks, realize refined collaboration among multiple agents, and has strong dynamic response capabilities, high scheduling quality, and high robustness.

Description of the Drawings

[0042] Figure 1 is a schematic diagram of a flexible workshop according to an embodiment of the present invention.

[0043] Figure 2 is a network structure diagram of an embodiment of the present invention.

[0044] Figure 3 is a flowchart of the algorithm training process in an embodiment of the present invention.

[0045] Figure 4 is a flowchart of the flexible workshop production scheduling according to an embodiment of the present invention.

[0046] Figure 5 is a scheduling Gantt chart according to an embodiment of the present invention.

[0047] Figure 6 is a cumulative reward curve of an embodiment of the present invention.

[0048] Figure 7 is an optimization objective curve of an embodiment of the present invention.

Detailed Implementation

[0049] The technical solution of the present invention will be further described below with reference to the accompanying drawings.

[0050] The flexible workshop shown in Figure 1 initially contains a batch of jobs awaiting scheduling. During subsequent operation, new jobs arrive randomly and dynamically, are added to the existing set of unassigned jobs, and take part in subsequent scheduling decisions alongside the jobs already in that set. Each job has a known delivery date and basic processing time. The workshop is equipped with homogeneous workstations, production workers, and collaborative robots. For any job to be processed, the invention selects one of three processing modes based on the current workshop state: independent processing by a worker, independent processing by a robot, or human-robot collaborative processing. The processing time varies with the mode.

[0051] For the flexible workshop scheduling problem shown in Figure 1, the present invention designs a multi-agent proximal policy optimization algorithm that incorporates a conditional attribution mechanism and heterogeneous gradient decoupling. The multi-agent flexible workshop production scheduling method of the present invention is implemented on the basis of this algorithm.

[0052] The policy design framework of the multi-agent proximal policy optimization algorithm consists of an environment interaction layer and an agent decision layer. The environment interaction layer simulates the workshop operation process, maintaining the state changes of jobs, workstations, workers, and collaborative robots. When idle workstations and unassigned jobs exist in the system at the same time, the current moment is determined to be a scheduling decision point, and the environment interaction layer calls the agent decision layer to make a scheduling decision. The agent decision layer consists of a job agent and a resource agent, which complete a full scheduling decision sequentially. First, the job agent receives the current global state provided by the environment interaction layer and selects a job to be processed from the set of unassigned jobs according to a job selection rule. Subsequently, the environment interaction layer extracts the local state of the selected job and combines this local state with the current global state to form the input of the resource agent. The resource agent allocates processing resources to the selected job and determines its processing mode according to a resource selection rule.

[0053] In practical implementation, each agent uses a policy network as its core decision-making module, with a one-to-one correspondence between agents and policy networks. The job agent corresponds to the job policy network, whose input is the current global state and whose output is the probability distribution over job rules, used to determine the job to be processed. The resource agent corresponds to the resource policy network, whose input is the local state of the selected job together with the current global state, and whose output is the probability distribution over resource rules, used to determine the processing resources and processing mode for the job. The policy network is therefore the direct carrier through which an agent selects actions. The two agents generate job actions and resource actions through their respective policy networks, and together these constitute a complete joint scheduling action. In addition, to evaluate the long-term scheduling value of the current global state and guide the updating of the two policy networks, this invention constructs two independent heterogeneous value networks. The first estimates the state value corresponding to the delayed delivery penalty objective, and the second estimates the state value corresponding to the worker load balancing objective. Unlike the policy networks, the two value networks do not belong to a single agent but are shared by the job agent and the resource agent during training. The job agent and the resource agent execute decisions through their respective policy networks, while the two value networks evaluate the overall scheduling effect from the global state. They learn the value functions of the different optimization objectives through a heterogeneous gradient decoupling mechanism, providing the basis for subsequent advantage function calculation and policy update.

[0054] Through the aforementioned dual-agent hierarchical decision-making framework, this invention decomposes the complex decision-making process in dynamic human-machine collaborative workshop scheduling into job selection and resource allocation sub-decision-making. This reduces the learning difficulty of a single agent directly processing a high-dimensional joint action space while preserving the conditional dependencies between the two decision-making stages. This framework can achieve collaborative optimization of job selection and resource allocation under the constraints of dynamic job arrival and heterogeneous human-machine resources, while simultaneously considering the optimization objectives of time delay and worker load balancing. The multi-agent flexible workshop production scheduling method based on conditional attribution mechanism described in this invention includes the following steps:

[0055] For flexible workshop production scheduling, a job agent and a resource agent are established. The job agent selects a job action based on the global state of the workshop; the job action includes selecting the job to be processed from the set of unassigned jobs. The resource agent selects a resource action based on the global state and the local state of the job to be processed; the resource action includes selecting the processing resources and processing mode of the job to be processed.

[0056] The flexible workshop processes the job to be processed using its selected processing resources and processing mode.

[0057] In the policy update phase of the agents, the policy network corresponding to each agent is updated using a proximal policy optimization method. Specifically, the individual attribution advantages of the job agent and the resource agent are calculated using the outputs of the job-side conditional expectation mapping network and the resource-side conditional expectation mapping network, respectively. The job-side conditional expectation mapping network is used to calculate the average expected return over all possible job actions with the resource action held fixed. The resource-side conditional expectation mapping network is used to calculate the average expected return over all possible resource actions with the job action held fixed.

[0058] Specifically, the design of the objective function and multi-objective reward function in this embodiment includes the following:

[0059] In this embodiment, to comprehensively evaluate the performance of the scheduling scheme in terms of delivery delay and load balancing, the following objective function is constructed:

[0060]

[0061]

[0062]

[0063] Here, the symbols denote the following: the overall optimization objective; the average delay time objective; the worker load balancing objective; the weighting coefficient of the load balancing objective; the set of completed jobs and the number of completed jobs; the completion time and the delivery time of each job; the set of workers and the number of workers; the set of assigned jobs; the processing time of a job under a given processing mode, where the modes involving a worker are independent processing by a worker and human-machine collaborative processing; a 0-1 decision variable that equals 1 if a job is processed by a given worker in a given mode and 0 otherwise; and the average load of all workers.
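The original formulas for this objective did not survive in this text. The following is one plausible reconstruction that is consistent with the quantities defined above, offered only as a reading aid. All symbols are assumptions: $F$ for the overall objective, $f_1$ and $f_2$ for the two sub-objectives, $\lambda$ for the load balancing weight, $J_c$ and $J_a$ for the completed and assigned job sets, $W$ for the worker set, $C_j$ and $D_j$ for completion and delivery times, $m_h$ and $m_c$ for the worker-only and collaborative modes, $p_{j,m}$ for processing time, and $x_{j,w,m}$ for the 0-1 assignment variable. The additive weighting is also assumed; a convex combination would fit the description equally well.

```latex
\begin{align}
\min F &= f_1 + \lambda f_2, \\
f_1 &= \frac{1}{|J_c|} \sum_{j \in J_c} \max\{C_j - D_j,\ 0\}, \\
f_2 &= \sqrt{\frac{1}{|W|} \sum_{w \in W} \bigl(L_w - \bar{L}\bigr)^2}, \\
L_w &= \sum_{j \in J_a} \sum_{m \in \{m_h,\, m_c\}} p_{j,m}\, x_{j,w,m},
\qquad
\bar{L} = \frac{1}{|W|} \sum_{w \in W} L_w .
\end{align}
```

Under this reading, only the modes that involve a worker contribute to that worker's load, which matches the statement that the processing-time term covers worker processing and human-machine collaborative processing.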

[0064] Since the flexible workshop production scheduling problem in this embodiment contains two heterogeneous optimization objectives (minimizing job delay time and minimizing worker load standard deviation) with heterogeneous gradient characteristics, this embodiment designs a dual-path reward function, corresponding to the two optimization objectives respectively, to provide a foundation for subsequent value network training.

[0065] At each scheduling decision point, this embodiment defines the single-step reward of the delayed delivery objective as the negative of the delay time of the currently scheduled job, and the single-step reward of the load balancing objective as the change in the standard deviation of worker load before and after the decision:

[0066]

[0067]

[0068] Here, the quantities involved are the completion time and the delivery time of the scheduled job, the standard deviation of worker load before the decision, and the standard deviation of worker load after the decision.
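The two reward formulas are missing from this text. A form consistent with the verbal definitions is sketched below. The symbols are assumptions: $t$ indexes the scheduling decision point, $j$ is the job scheduled at that point, $C_j$ and $D_j$ are its completion and delivery times, and $\sigma_t^{\text{before}}$ and $\sigma_t^{\text{after}}$ are the worker load standard deviations around the decision. The sign of the second reward is also assumed, chosen so that reducing the imbalance is rewarded, and the delay time is assumed to be floored at zero.

```latex
\begin{align}
r_t^{\text{delay}} &= -\max\{C_j - D_j,\ 0\}, \\
r_t^{\text{load}}  &= \sigma_t^{\text{before}} - \sigma_t^{\text{after}} .
\end{align}
```

With these definitions, a job finished on time yields a delay reward of zero, and a decision that leaves worker loads more even yields a positive load reward.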

[0069] Specifically, the state space in this embodiment includes the following:

[0070] Workshop scheduling information covers three kinds of objects: jobs, robots, and workers. This embodiment designs a hierarchical state space, as shown in Table 1. At each scheduling decision point, the current workshop state observed by the system is recorded as the global state. The job agent takes the global state as input, while the resource agent additionally takes the local state of the currently selected job, forming an enhanced state.

[0071] Table 1 State Space

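[0072] As set out in paragraphs [0019] and [0020], the state space comprises:
Global state:
(1) average late delivery time of completed jobs
(2) standard deviation of late delivery time of completed jobs
(3) size of the unassigned job set
(4) average slack (relaxation) time of unassigned jobs
(5) standard deviation of slack time of unassigned jobs
(6) average workload of all workers
(7) standard deviation of workload of all workers
(8) average utilization rate of all robots
(9) standard deviation of utilization rate of all robots
Local state of the job to be processed:
(1) basic processing time
(2) delivery deadline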

[0073] Specifically, the action space in this embodiment includes the following.

[0074] In this embodiment, the action space consists of scheduling rules. When making scheduling decisions, the agents determine the execution order of jobs and the resource allocation based on predefined scheduling rules. Following the actual decision-making logic of flexible workshop production scheduling, this invention decomposes a scheduling decision into two sequentially executed sub-decisions, job selection and resource allocation, handled by the job agent and the resource agent respectively. The decision of the job agent directly affects the input state of the resource agent. As shown in Table 2, the action space of the job agent consists of four job selection rules, each selecting a job from the unassigned job set according to a different priority criterion. As shown in Table 3, the action space of the resource agent consists of ten resource allocation rules, covering three modes: independent worker processing, independent robot processing, and human-robot collaborative processing. The output of a resource allocation rule determines the processing mode and the specific worker and robot assignment.

[0075] Table 2. Job Selection Rules

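[0076] As set out in paragraph [0023], the job selection rules are:
(1) select the job with the shortest basic processing time
(2) select the job with the longest basic processing time
(3) select the job with the shortest slack (relaxation) time
(4) select the job with the longest slack time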

[0077] Table 3 Resource Selection Rules

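[0078] As set out in paragraph [0027], the resource selection rules are:
(1) select the worker with the shortest current cumulative working time
(2) select the worker with the longest current cumulative working time
(3) select the worker with the shortest waiting time
(4) select the worker with the longest waiting time
(5) select the robot with the shortest waiting time
(6) select the robot with the longest waiting time
(7) select the worker with the shortest current cumulative working time and the robot with the shortest waiting time
(8) select the worker with the longest current cumulative working time and the robot with the longest waiting time
(9) select the worker with the shortest waiting time and the robot with the shortest waiting time
(10) select the worker with the longest waiting time and the robot with the longest waiting time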

[0079] Specifically, the policy network in this embodiment includes the following.

[0080] This embodiment constructs corresponding policy networks for the task agent and the resource agent, respectively. Both policy networks are parameterized neural networks, with network parameters consisting of weight matrices and bias vectors for each layer, used to determine the output result obtained after the input state is processed by the network. The policy network in this embodiment adopts a multilayer perceptron structure, consisting of an input layer, four hidden layers, a linear output layer, and a SoftMax normalization layer, used to map the current scheduling state to the selection probability of the corresponding scheduling rule.

[0081] (1) Job policy network: the input is the global state. Referring to Figure 2(a), the global state is first fed into hidden layer 1, a fully connected layer with output dimension 256 and the ReLU activation function, and then passes through hidden layers 2, 3, and 4 in sequence. Hidden layers 2 and 3 both have an output dimension of 64 and use the ReLU activation function; hidden layer 4 has an output dimension of 32 and also uses the ReLU activation function. The network then maps the hidden features to a 4-dimensional output through a linear layer, corresponding to the four job selection rules. Finally, normalization with the SoftMax function yields the probability distribution over job rules. The job agent selects a job rule according to this probability distribution and uses it to determine the current job to be processed from the unassigned job set. The job policy network has its own set of network parameters, and its output can be represented as:

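[0082] $$\pi_{\theta_J}(a_t^J \mid s_t) = \frac{\exp\left(z^J_{a_t^J}\right)}{\sum_{a' \in \mathcal{A}_J} \exp\left(z^J_{a'}\right)}$$

[0083] where $z^J_{a}$ is the score of job rule $a$ produced by the linear output layer of the job policy network, $\mathcal{A}_J$ is the set of job selection rules, and $\pi_{\theta_J}(a_t^J \mid s_t)$ is the probability of selecting job action $a_t^J$ in global state $s_t$.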

[0084] (2) Resource policy network: the input is the augmented state $\tilde{s}_t$, formed by combining the global state with the local state of the currently selected job. Referring to Figure 2(b), the augmented state likewise passes through four hidden layers in sequence: hidden layer 1 has an output dimension of 256, hidden layers 2 and 3 each have an output dimension of 64, and hidden layer 4 has an output dimension of 32. Each hidden layer uses ReLU activation for non-linear feature extraction. A linear layer then maps the hidden features to a 10-dimensional output corresponding to the ten resource allocation rules. After SoftMax normalization, the probability distribution over the resource rules is obtained. The resource agent selects a resource allocation rule according to this distribution, which determines the processing resources and processing mode of the selected job. The resource policy network parameters are denoted $\theta_R$, and the output of the resource policy network can be expressed as:

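[0085] $$\pi_{\theta_R}(a_t^R \mid \tilde{s}_t) = \frac{\exp\left(z^R_{a_t^R}\right)}{\sum_{a' \in \mathcal{A}_R} \exp\left(z^R_{a'}\right)}$$

[0086] where $z^R_{a}$ is the score of resource rule $a$ produced by the linear output layer of the resource policy network, $\mathcal{A}_R$ is the set of resource allocation rules, and $\pi_{\theta_R}(a_t^R \mid \tilde{s}_t)$ is the probability of selecting resource action $a_t^R$ in augmented state $\tilde{s}_t$. The policy network parameters $\theta_J$ and $\theta_R$ therefore directly determine the probability distributions over the job selection rules and the resource allocation rules.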

[0087] Thus, the job policy network completes the mapping from the global state to the job rule probability distribution, and the resource policy network completes the mapping from the augmented state to the resource rule probability distribution. The two policy networks serve as the direct decision-making modules for the job agent and the resource agent, respectively, and their output job actions and resource actions together constitute a complete joint scheduling action.
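The two policy networks described above can be illustrated with a minimal NumPy sketch. The layer widths (256, 64, 64, 32) and output sizes (4 and 10) follow the text; the state dimensions `GLOBAL_DIM` and `LOCAL_DIM`, the weight initialization, and all function names are assumptions made only for illustration, since the source does not specify them.

```python
import numpy as np

HIDDEN_SIZES = (256, 64, 64, 32)  # hidden layers 1-4 as described in the text

def init_mlp(in_dim, out_dim, rng):
    """Create (weight, bias) pairs for input -> 4 hidden layers -> linear output."""
    sizes = (in_dim, *HIDDEN_SIZES, out_dim)
    # He-style initialization is an assumption; the source does not name one.
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_probs(params, state):
    """Map a state vector to a probability distribution over scheduling rules."""
    h = state
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)   # fully connected layer + ReLU
    w, b = params[-1]
    return softmax(h @ w + b)            # linear output layer + SoftMax

rng = np.random.default_rng(0)
GLOBAL_DIM, LOCAL_DIM = 12, 5            # hypothetical state sizes
job_params = init_mlp(GLOBAL_DIM, 4, rng)                # 4 job selection rules
res_params = init_mlp(GLOBAL_DIM + LOCAL_DIM, 10, rng)   # 10 resource allocation rules

s = rng.random(GLOBAL_DIM)                       # placeholder global state
p_job = policy_probs(job_params, s)
a_job = rng.choice(4, p=p_job)                   # sample a job rule
local = rng.random(LOCAL_DIM)                    # placeholder local state of the chosen job
p_res = policy_probs(res_params, np.concatenate([s, local]))
a_res = rng.choice(10, p=p_res)                  # sample a resource rule
```

The resource network receives the concatenation of the global and local states, which mirrors the sequential dependence of the resource decision on the job decision.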

[0088] Specifically, the value network of this embodiment includes the following.

[0089] To avoid the gradient conflicts that arise when the delayed delivery penalty objective and the load balancing objective are fitted together in one value network, this embodiment constructs two independent value networks, denoted $V_{\phi_1}$ and $V_{\phi_2}$. Both are parameterized neural networks, used to decouple the value estimation of the two optimization objectives.

[0090] (1) Network $V_{\phi_1}$: takes the global state $s_t$ as input and outputs the expected cumulative delayed delivery penalty in the current state, denoted as the first value estimate $V_{\phi_1}(s_t)$. Referring to Figure 2(c), this network is dedicated to fitting the value function of the delayed delivery objective, and its training objective is to minimize the mean squared error between the predicted value and the delayed delivery return. The global state passes through four hidden layers for feature extraction, with output dimensions of 256, 64, 64, and 32 for hidden layers 1 to 4; each hidden layer uses ReLU activation. A linear output layer then maps the 32-dimensional features from hidden layer 4 to a 1-dimensional state value. The network parameters are denoted $\phi_1$ and the output is $V_{\phi_1}(s_t)$.

[0091] (2) Network $V_{\phi_2}$: takes the global state $s_t$ as input and outputs the expected cumulative load balancing improvement in the current state, denoted as the second value estimate $V_{\phi_2}(s_t)$. Referring to Figure 2(d), this network is dedicated to fitting the value function of the load balancing objective, and its training objective is to minimize the mean squared error between the predicted value and the load balancing return. It uses the same multilayer perceptron structure as $V_{\phi_1}$: four hidden layers with output dimensions of 256, 64, 64, and 32, each with ReLU activation, followed by a linear output layer that produces the 1-dimensional state value. The network parameters are denoted $\phi_2$ and the output is $V_{\phi_2}(s_t)$.

[0092] The parameters of the two networks are initialized, trained, and updated independently, without interfering with each other. This heterogeneous gradient decoupling design lets each critic network focus on fitting the value function of a single optimization objective, so that the gradients of the delayed delivery objective and the load balancing objective propagate independently in their own networks. This avoids the gradient conflicts that occur when a single value network fits multiple objectives simultaneously.

[0093] Specifically, the advantage functions in this embodiment include the following.

[0094] First, the advantage functions for the two objectives are calculated using generalized advantage estimation.

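[0095] For the delayed delivery objective, the temporal-difference residual is defined as:

[0096] $$\delta_t^{1} = r_t^{1} + \gamma V_{\phi_1}(s_{t+1}) - V_{\phi_1}(s_t)$$

[0097] where $r_t^{1}$ is the delayed delivery reward at decision point $t$, $V_{\phi_1}(s_t)$ is the first value estimate of network $V_{\phi_1}$ for state $s_t$, and $\gamma$ is the discount factor.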

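[0098] The first GAE advantage, for the delayed delivery objective, is computed recursively:

[0099] $$A_t^{1} = \delta_t^{1} + \gamma \lambda A_{t+1}^{1}$$

[0100] where $\delta_t^{1}$ is the temporal-difference residual of the delayed delivery objective, $\gamma$ is the discount factor, and $\lambda$ is the GAE parameter.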

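[0101] Similarly, for the load balancing objective, the second GAE advantage is:

[0102] $$\delta_t^{2} = r_t^{2} + \gamma V_{\phi_2}(s_{t+1}) - V_{\phi_2}(s_t)$$

[0103] $$A_t^{2} = \delta_t^{2} + \gamma \lambda A_{t+1}^{2}$$

[0104] where $r_t^{2}$ is the load balancing reward at decision point $t$ and $V_{\phi_2}(s_t)$ is the second value estimate of network $V_{\phi_2}$ for state $s_t$.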

[0105] After calculating the two advantage functions separately, they are weighted and merged according to preset weights to form a joint advantage:

[0106]

[0107] where the weighting factor applies to the load balancing advantage. It can be adjusted to the application scenario: reduced when order delivery efficiency matters more, and increased when worker load balancing matters more.

[0108] The joint advantage obtained by this weighted fusion carries optimization signals from both dimensions, delayed delivery penalty and load balancing, and provides the basis for the subsequent conditional attribution mechanism.
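The advantage calculation above can be sketched as follows. The backward recursion is standard generalized advantage estimation. The fusion form `adv_delay + w_load * adv_load` is an assumption: the text states only that a single weighting factor applies to the load balancing advantage, and the exact formula is not recoverable here. The function names and default hyperparameters are likewise hypothetical, and the sketch treats the samples as one trajectory without intermediate episode ends.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation by backward recursion.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    rewards = np.asarray(rewards, dtype=float)
    # Append the bootstrap value of the state after the last sample.
    values = np.append(np.asarray(values, dtype=float), last_value)
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def joint_advantage(adv_delay, adv_load, w_load):
    # ASSUMPTION: the load balancing advantage is scaled by w_load and added
    # to the delayed delivery advantage; the source does not give the formula.
    return adv_delay + w_load * adv_load

# Each objective uses its own reward stream and its own value estimates.
adv_delay = gae([1.0, 1.0], [0.0, 0.0], 0.0, gamma=1.0, lam=1.0)
adv_load = gae([1.0, 1.0], [0.0, 0.0], 0.0, gamma=0.5, lam=1.0)
fused = joint_advantage(adv_delay, adv_load, w_load=0.5)
```

Keeping the two GAE passes separate is what allows each value network to be trained against its own return before the advantages are fused.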

[0109] Specifically, the conditional attribution mechanism in this embodiment includes the following:

[0110] This embodiment implements a conditional attribution mechanism by constructing two independent conditional expectation mapping networks to conditionally attribute joint advantages. The two networks are denoted as the job-side conditional expectation mapping network and the resource-side conditional expectation mapping network, respectively. Both adopt a multilayer perceptron structure to estimate the average expected return of all possible actions on this side under the condition of fixing the action on the other side.

[0111] (1) Job-side conditional expectation mapping network: this network estimates the average expected return over all possible job-side actions given the resource action. Its inputs are the global state $s_t$ and the resource agent's action $a_t^R$, and its output is the average expected return $\bar{Q}_J(s_t, a_t^R)$. Referring to Figure 2(e), the input first enters hidden layer 1, a 128-dimensional fully connected layer with ReLU activation, and then passes through hidden layers 2 and 3 in sequence. Hidden layer 2 has an output dimension of 128 and hidden layer 3 has an output dimension of 64; each hidden layer uses ReLU activation for non-linear feature extraction. The network then produces the conditional expected return through a multi-head output layer. Each head represents the expected return under one resource action, so the layer outputs the average expected return over all possible job-side actions under the given resource action, $\bar{Q}_J(s_t, a_t^R)$.

[0112] (2) Resource-side conditional expectation mapping network: this network estimates the average expected return over all possible resource-side actions given the job action. Its inputs are the global state $s_t$ and the job agent's action $a_t^J$, and its output is the average expected return $\bar{Q}_R(s_t, a_t^J)$. It uses the same structure as the job-side conditional expectation mapping network. Referring to Figure 2(f), the input passes through three hidden layers for feature extraction: hidden layers 1 and 2 each have an output dimension of 128 and hidden layer 3 has an output dimension of 64, each with ReLU activation. Finally, a multi-head output layer outputs the average expected return over all possible resource-side actions under the given job action, $\bar{Q}_R(s_t, a_t^J)$.

[0113] The two conditional expectation mapping networks thus yield $\bar{Q}_J(s_t, a_t^R)$ and $\bar{Q}_R(s_t, a_t^J)$ respectively, which form the basis for the subsequent calculation of individual attribution advantages.

[0114] Furthermore, to train the conditional expectation mapping networks, a training label must be generated for each decision point: the average cumulative return over all possible actions on one side, with the action on the other side fixed.

[0115] The job-side returns are generated as follows:

[0116] Step 1: save the environment state. The complete state at the current decision point $t$ (all jobs, resource states, the current time, and so on) is saved as a snapshot.

[0117] Step 2: fix the resource action and enumerate the job actions. The resource action is fixed to the actually selected $a_t^R$, and all possible job actions $a^J \in \mathcal{A}_J$ are enumerated.

[0118] Step 3: truncated rollout. For each enumerated combination $(a^J, a_t^R)$, the following is performed on a copy of the environment. The cumulative reward is initialized to zero. For each of the $K$ rollout steps, the fixed joint action $(a^J, a_t^R)$ is executed and the instant reward is received, which combines the delayed delivery reward and the load balancing reward of that rollout decision step using the load balancing objective weight. After the rollout, the fused value of the state reached is obtained. The final cumulative return $G(s_t, a^J, a_t^R)$ is the discounted sum of the instant rewards plus the discounted fused value, where the discount factor $\gamma$ measures the influence of future rewards on the current return. Finally, the cumulative returns of the four enumerated job actions are averaged to obtain the job-side average expected return $\bar{G}_J(s_t, a_t^R)$:

[0119] $$\bar{G}_J(s_t, a_t^R) = \frac{1}{|\mathcal{A}_J|} \sum_{a^J \in \mathcal{A}_J} G(s_t, a^J, a_t^R)$$

[0120] where $\mathcal{A}_J$ is the action space of the job agent, that is, the set of all available job actions.

[0121] Step 4: store the training sample. The sample, consisting of the global state, the fixed resource action, and the job-side average expected return, is stored in the training buffer of the job-side conditional expectation mapping network. In use, given the currently selected resource action $a_t^R$, the output of the job-side conditional expectation mapping network is $\bar{Q}_J(s_t, a_t^R)$, that is, the value of the head corresponding to $a_t^R$ among the 10 output heads. This value is the average expected return over all possible job-side actions with the resource action fixed to $a_t^R$.

[0122] The resource-side return generation is symmetrical: the actually selected job action $a_t^J$ is fixed, all resource actions are enumerated, a $K$-step truncated rollout is performed for each combination to compute its cumulative return $G(s_t, a_t^J, a^R)$, and the results are averaged to obtain the resource-side average expected return $\bar{G}_R(s_t, a_t^J)$:

[0123] $$\bar{G}_R(s_t, a_t^J) = \frac{1}{|\mathcal{A}_R|} \sum_{a^R \in \mathcal{A}_R} G(s_t, a_t^J, a^R)$$

[0124] where $\mathcal{A}_R$ is the action space of the resource agent, that is, the set of all available resource actions.

[0125] The training sample, consisting of the global state, the fixed job action, and the resource-side average expected return, is stored in the training buffer of the resource-side conditional expectation mapping network. In use, given the currently selected job action $a_t^J$, the output of the resource-side conditional expectation mapping network is $\bar{Q}_R(s_t, a_t^J)$, that is, the value of the head corresponding to $a_t^J$ among the 4 output heads. This value is the average expected return over all possible resource-side actions with the job action fixed to $a_t^J$.
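The label generation procedure above (snapshot, enumerate, truncated rollout, average) can be sketched with a toy environment. `ToyEnv`, its reward values, `fused_value`, and the reward mixing form `r_delay + w_load * r_load` are all hypothetical stand-ins: the real simulator, the value networks, and the exact mixing formula are not given in recoverable form in the source.

```python
import copy

class ToyEnv:
    """Hypothetical stand-in for the workshop simulator."""
    def __init__(self):
        self.t = 0

    def step(self, a_job, a_res):
        self.t += 1
        r_delay = -float(a_job)        # toy delayed delivery reward
        r_load = -0.1 * float(a_res)   # toy load balancing reward
        return r_delay, r_load

def fused_value(env):
    # Placeholder for the weighted sum of the two value network outputs.
    return 0.0

def truncated_return(env, a_job, a_res, steps, gamma, w_load):
    sim = copy.deepcopy(env)           # roll out on a copy; the real env is untouched
    g = 0.0
    for k in range(steps):
        r_delay, r_load = sim.step(a_job, a_res)   # same joint action at every step
        g += gamma ** k * (r_delay + w_load * r_load)  # ASSUMED mixing form
    return g + gamma ** steps * fused_value(sim)   # bootstrap with the fused value

def job_side_label(env, a_res, n_job_actions=4, steps=3, gamma=0.9, w_load=0.5):
    """Average return over all job actions with the resource action fixed."""
    returns = [truncated_return(env, a, a_res, steps, gamma, w_load)
               for a in range(n_job_actions)]
    return sum(returns) / len(returns)

def resource_side_label(env, a_job, n_res_actions=10, steps=3, gamma=0.9, w_load=0.5):
    """Average return over all resource actions with the job action fixed."""
    returns = [truncated_return(env, a_job, a, steps, gamma, w_load)
               for a in range(n_res_actions)]
    return sum(returns) / len(returns)

env = ToyEnv()
label_job = job_side_label(env, a_res=0, steps=1, gamma=1.0, w_load=0.5)
label_res = resource_side_label(env, a_job=0, steps=1, gamma=1.0, w_load=1.0)
```

Because every rollout runs on a deep copy, label generation costs one short simulation per enumerated action (4 on the job side, 10 on the resource side) and leaves the training trajectory unchanged.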

[0126] Furthermore, before calculating the individual attribution advantage of each agent, the value of the actual joint action must first be derived. By the definition of the advantage function:

[0127] $$Q(s_t, a_t^J, a_t^R) = A_t + V(s_t)$$

[0128] where $Q(s_t, a_t^J, a_t^R)$ is the actual expected return of taking joint action $(a_t^J, a_t^R)$ in state $s_t$ and serves as the benchmark value for the individual attribution advantages, $A_t$ is the joint advantage, and $V(s_t)$ is the fused state value, obtained by summing the state value estimates output by the two heterogeneous value networks with the same weights as the mixed reward:

[0129]

[0130] The job individual attribution advantage is the joint value $Q$ minus the output of the job-side conditional expectation mapping network:

[0131] $$A_t^J = Q(s_t, a_t^J, a_t^R) - \bar{Q}_J(s_t, a_t^R)$$

[0132] where $\bar{Q}_J(s_t, a_t^R)$ is the average expected return over all possible job-side actions in state $s_t$ with the resource action fixed to $a_t^R$.

[0133] This individual attribution advantage measures the additional advantage of the actually selected job action relative to the other possible job actions. It removes the influence of the resource action choice, so that the optimization signal of the job agent depends only on the quality of its own decision.

[0134] Similarly, the resource individual attribution advantage is the joint value $Q$ minus the output of the resource-side conditional expectation mapping network:

[0135] $$A_t^R = Q(s_t, a_t^J, a_t^R) - \bar{Q}_R(s_t, a_t^J)$$

[0136] where $\bar{Q}_R(s_t, a_t^J)$ is the average expected return over all possible resource-side actions with the job action fixed to $a_t^J$.

[0137] Through this conditional attribution mechanism, each agent is responsible only for the quality of its own decisions, which decouples responsibility attribution. When the reward produced by a joint action is unsatisfactory, it is possible to tell clearly whether the cause lies in job selection or in resource allocation. This avoids the mutual interference of gradient signals and the mutual cancellation of policy updates between the two agents that occur in the traditional shared advantage model.
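The attribution step can be sketched in a few lines. The function and argument names are hypothetical; the head layout follows the text, where the job-side network has one output head per resource action (10 heads) and the resource-side network has one head per job action (4 heads).

```python
import numpy as np

def individual_advantages(joint_adv, fused_value, qbar_job_heads, qbar_res_heads,
                          a_job, a_res):
    """Split a joint advantage into per-agent attribution advantages."""
    q_joint = joint_adv + fused_value            # Q = A + V
    # Job side: subtract the job-action average under the chosen resource action.
    adv_job = q_joint - qbar_job_heads[a_res]
    # Resource side: subtract the resource-action average under the chosen job action.
    adv_res = q_joint - qbar_res_heads[a_job]
    return adv_job, adv_res

qbar_job_heads = np.arange(10.0)                 # toy outputs of the 10 job-side heads
qbar_res_heads = np.array([0.5, 1.0, 1.5, 2.0])  # toy outputs of the 4 resource-side heads
adv_job, adv_res = individual_advantages(
    joint_adv=1.0, fused_value=2.0,
    qbar_job_heads=qbar_job_heads, qbar_res_heads=qbar_res_heads,
    a_job=1, a_res=3)
```

In this toy case the joint value is 3.0. The job agent receives an advantage of 0.0 because its choice matched the average under resource action 3, while the resource agent receives 2.0, so only the resource policy is pushed toward the sampled action.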

[0138] Specifically, the training process of the multi-agent proximal policy optimization algorithm in this embodiment mainly includes five stages: environment interaction, experience storage, advantage calculation, conditional attribution, and policy update.

[0139] As shown in Figure 3, at each scheduling decision point the environment first provides the current global state to the job policy network. The job policy network outputs the job rule probability distribution via SoftMax, the job action is determined by sampling, and the system selects a job from the unassigned job set accordingly. The environment then generates the local state of the selected job and feeds the local and global states together into the resource policy network. The resource policy network outputs the resource rule probability distribution via SoftMax, the resource action is determined by sampling, and the system allocates processing resources and determines the processing mode for the selected job. The job action and the resource action together form the joint action. On receiving the joint action, the environment executes the scheduling decision and returns the next state, the delayed delivery reward, and the load balancing reward. This interaction information is stored in the experience pool to form trajectory samples for subsequent training.

[0140] After a certain number of samples (e.g., 2048) have been collected in the experience pool, the system uses the two value networks to estimate the value of the global state. Each sample records the interaction information described above. Network $V_{\phi_1}$ outputs the state value $V_{\phi_1}(s_t)$ for the delayed delivery objective, and network $V_{\phi_2}$ outputs the state value $V_{\phi_2}(s_t)$ for the load balancing objective.

[0141] The system then uses GAE to calculate the advantage functions $A_t^{1}$ and $A_t^{2}$ from the two reward streams and the two value estimates. The two advantages are weighted and combined according to the preset weights to obtain the joint advantage $A_t$, which serves as the basis for subsequent individual attribution. At the same time, the corresponding returns, the delayed delivery return $R_t^{1}$ and the load balancing return $R_t^{2}$, are obtained from the advantage functions $A_t^{1}$ and $A_t^{2}$. These returns are the training targets of the two value networks and are used to compute the value losses:

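[0142] $$L_{V_1} = \frac{1}{N} \sum_{i=1}^{N} \left( V_{\phi_1}(s_i) - R_i^{1} \right)^2$$

[0143] $$L_{V_2} = \frac{1}{N} \sum_{i=1}^{N} \left( V_{\phi_2}(s_i) - R_i^{2} \right)^2$$

[0144] where $N$ is the batch sample size, $R_i^{1}$ and $R_i^{2}$ are the delayed delivery return and the load balancing return of sample $i$, and $L_{V_1}$ and $L_{V_2}$ are the delayed delivery value loss and the load balancing value loss, respectively. By minimizing these losses, the system updates the value network parameters $\phi_1$ and $\phi_2$ so that the two value networks approximate the state value function under their respective optimization objectives.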

[0145] The value network parameters are updated as:

[0146] $$\phi_1 \leftarrow \phi_1 - \alpha_V \nabla_{\phi_1} L_{V_1}$$

[0147] $$\phi_2 \leftarrow \phi_2 - \alpha_V \nabla_{\phi_2} L_{V_2}$$

[0148] where $\alpha_V$ is the value network learning rate, $\nabla_{\phi_1} L_{V_1}$ is the gradient of the delayed delivery value loss with respect to $\phi_1$, and $\nabla_{\phi_2} L_{V_2}$ is the gradient of the load balancing value loss with respect to $\phi_2$.

[0149] The job-side conditional expectation mapping network takes the global state and the resource action and outputs the average expected return over all possible job-side actions under the fixed resource action. The resource-side conditional expectation mapping network takes the global state and the job action and outputs the average expected return over all possible resource-side actions under the fixed job action. The average influence of the other agent's action is then removed from the joint advantage to produce the job individual attribution advantage and the resource individual attribution advantage, so that the job policy network and the resource policy network are each updated according to their own decision contribution.

[0150] During the policy update phase, the PPO-Clip method is used to update the job policy network and the resource policy network separately. For the job policy network, the probability ratio between the old and new policies is constructed as follows:

[0151] $$\rho_t^J(\theta_J) = \frac{\pi_{\theta_J}(a_t^J \mid s_t)}{\pi_{\theta_J^{\text{old}}}(a_t^J \mid s_t)}$$

[0152] where $\theta_J$ and $\theta_J^{\text{old}}$ are the current and the old job policy network parameters, respectively. The PPO-Clip policy loss of the job policy network is defined as:

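[0153] $$L_J^{\text{clip}} = -\hat{\mathbb{E}}_t \left[ \min\left( \rho_t^J A_t^J,\ \operatorname{clip}\left(\rho_t^J, 1-\epsilon, 1+\epsilon\right) A_t^J \right) \right]$$

[0154] where $\rho_t^J$ is the probability ratio between the new and old job policies, $A_t^J$ is the job individual attribution advantage, and $\epsilon$ is the PPO clipping factor, which limits the probability ratio to the range $[1-\epsilon, 1+\epsilon]$.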

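[0155] Similarly, for the resource policy network, the probability ratio between the new and old policies is constructed as:

[0156] $$\rho_t^R(\theta_R) = \frac{\pi_{\theta_R}(a_t^R \mid \tilde{s}_t)}{\pi_{\theta_R^{\text{old}}}(a_t^R \mid \tilde{s}_t)}$$

[0157] where $\theta_R$ and $\theta_R^{\text{old}}$ are the current and the old resource policy network parameters, respectively, and $\tilde{s}_t$ is the augmented state.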

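[0158] The PPO-Clip policy loss of the resource policy network, with $\rho_t^R$ the resource-side probability ratio, $A_t^R$ the resource individual attribution advantage, and $\epsilon$ the PPO clipping factor, is defined as:

[0159] $$L_R^{\text{clip}} = -\hat{\mathbb{E}}_t \left[ \min\left( \rho_t^R A_t^R,\ \operatorname{clip}\left(\rho_t^R, 1-\epsilon, 1+\epsilon\right) A_t^R \right) \right]$$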

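[0160] To strengthen policy exploration and prevent premature convergence, the invention adds an entropy regularization term to the policy update. The policy entropy of the job policy network is defined as:

[0161] $$H_J = -\sum_{a \in \mathcal{A}_J} \pi_{\theta_J}(a \mid s_t) \log \pi_{\theta_J}(a \mid s_t)$$

[0162] The policy entropy of the resource policy network is defined as:

[0163] $$H_R = -\sum_{a \in \mathcal{A}_R} \pi_{\theta_R}(a \mid \tilde{s}_t) \log \pi_{\theta_R}(a \mid \tilde{s}_t)$$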

[0164] The larger the entropy value, the more dispersed the action selection distribution and the stronger the strategy exploration.

[0165] Ultimately, the total loss of the policy network can be expressed as:

[0166]

[0167] where the entropy regularization weight coefficient scales the entropy terms. The system minimizes the total training loss $L_{\text{total}}$ to update the job policy network parameters $\theta_J$ and the resource policy network parameters $\theta_R$. The update can be expressed as:

[0168] $$\theta_J \leftarrow \theta_J - \alpha_\pi \nabla_{\theta_J} L_{\text{total}}$$

[0169] $$\theta_R \leftarrow \theta_R - \alpha_\pi \nabla_{\theta_R} L_{\text{total}}$$

[0170] where $\alpha_\pi$ is the policy network learning rate, and $\nabla_{\theta_J}$ and $\nabla_{\theta_R}$ denote the gradients with respect to the job policy network parameters and the resource policy network parameters, respectively.
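The policy loss terms described above can be sketched numerically. The clipped surrogate and the entropy follow the standard PPO-Clip definitions. The way the terms are combined in `total_policy_loss` (one shared entropy coefficient `c_ent`, entropy subtracted from the summed clip losses) is an assumption, since the total loss formula is not recoverable from the text; all names and default values are hypothetical.

```python
import numpy as np

def ppo_clip_loss(new_prob, old_prob, adv, eps=0.2):
    """Negative clipped surrogate objective, averaged over the batch."""
    ratio = new_prob / old_prob                      # new/old policy probability ratio
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)   # keep the ratio in [1-eps, 1+eps]
    return -np.mean(np.minimum(ratio * adv, clipped * adv))

def entropy(probs):
    """Shannon entropy of an action distribution (last axis)."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)  # small constant avoids log(0)

def total_policy_loss(loss_job, loss_res, ent_job, ent_res, c_ent=0.01):
    # ASSUMPTION: a single coefficient weights both entropies, and the entropy
    # bonus is subtracted so that minimizing the loss encourages exploration.
    return loss_job + loss_res - c_ent * (np.mean(ent_job) + np.mean(ent_res))

# Positive advantage: the ratio 2.0 is clipped to 1.2, limiting the update size.
loss_pos = ppo_clip_loss(np.array([0.6]), np.array([0.3]), np.array([1.0]))
# Negative advantage: the unclipped term is smaller, so it is the one kept.
loss_neg = ppo_clip_loss(np.array([0.6]), np.array([0.3]), np.array([-1.0]))
h_uniform = entropy([0.25, 0.25, 0.25, 0.25])
```

Each agent's loss uses its own individual attribution advantage, so a poor resource choice does not penalize a good job choice within the same joint action.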

[0171] Through the above training process, the algorithm can learn job selection and resource allocation strategies while continuously interacting with the dynamic workshop environment, and improve worker load balancing while reducing job delay time.

[0172] Specifically, the multi-agent flexible workshop production scheduling method based on the multi-agent proximal policy optimization algorithm in this embodiment includes the following steps.

[0173] As shown in Figure 4, during workshop operation the current state is updated in an event-driven manner: instead of scanning at fixed time intervals, time advances directly to the next event moment, that is, when a new job arrives or a job is completed. After an event occurs, the system first checks whether any workstation is vacant at the current moment. A currently occupied job identifier is maintained for each workstation, and a workstation is considered vacant when this identifier is empty. If no workstation is vacant, the conditions for job allocation are not met and the process continues to the next event moment. If a workstation is vacant, the system then checks whether the unassigned job set contains any jobs. Unassigned jobs are managed through this set: a job is added to it on arrival at the workshop, and is removed from it and added to the assigned job set once it has been selected and its resources allocated. Unassigned jobs therefore exist whenever the size of the unassigned job set is greater than 0. The current moment is treated as a scheduling decision point only when a vacant workstation and an unassigned job exist at the same time; otherwise the process continues to the next event moment.
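The event-driven detection of scheduling decision points can be sketched with a priority queue. This is a simplified illustration under stated assumptions: a fixed processing time `proc_time`, first-in-first-out job selection, and first-vacant-workstation assignment stand in for the job agent and resource agent, and worker and robot availability is ignored. The function name and event tuple layout are hypothetical.

```python
import heapq

def run_event_loop(arrivals, n_stations, proc_time):
    """arrivals: list of (time, job_id). Returns (time, job_id, station) assignments."""
    # Event tuples: (time, tie_break, kind, job); arrivals sort before finishes.
    events = [(t, 0, "arrive", j) for t, j in arrivals]
    heapq.heapify(events)
    station_job = [None] * n_stations   # occupied-job identifier per workstation
    unassigned = []                     # unassigned job set
    log = []
    while events:
        now, _, kind, job = heapq.heappop(events)   # jump to the next event moment
        if kind == "arrive":
            unassigned.append(job)
        else:                                       # "finish": release the workstation
            station_job[station_job.index(job)] = None
        # Decision point: a vacant workstation AND an unassigned job both exist.
        while unassigned and None in station_job:
            chosen = unassigned.pop(0)              # placeholder for the job agent
            station = station_job.index(None)       # placeholder for the resource agent
            station_job[station] = chosen
            log.append((now, chosen, station))
            heapq.heappush(events, (now + proc_time, 1, "finish", chosen))
    return log

log = run_event_loop([(0, "A"), (1, "B"), (2, "C")], n_stations=2, proc_time=5)
```

In this run job C arrives at time 2 while both workstations are occupied, so no decision is made until job A finishes at time 5.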

[0174] Upon reaching a scheduling decision point, the job agent first selects a job to be processed from the unassigned job set using job selection rules. Subsequently, the resource agent selects appropriate processing resources for the job and determines the processing mode using resource selection rules. After job selection and resource allocation, the availability of the selected resources is further assessed. If the selected resources are currently available, the job begins processing immediately; if the selected resources are temporarily unavailable, processing begins only after the corresponding resources become available. After job processing is completed, the process continues into the next round of scheduling decisions until all jobs are completed or the termination conditions are met.

[0175] The method described in this invention is verified through simulation examples below.

[0176] (I) Simulation Case Setup

[0177] This experiment was conducted using a self-built dynamic workshop scheduling environment. Workshop jobs entered the system using a random dynamic arrival method; that is, based on the initial job, subsequent jobs entered the workshop sequentially according to a Poisson arrival process, with the arrival interval following an exponential distribution to simulate the random release and dynamic insertion of orders in actual production. Each arriving job had a basic processing time and delivery date attribute. The basic processing time followed a log-normal distribution, with a value range of [15, 90] minutes, a mean of 45, and a standard deviation of 15. The processing modes included three forms: independent worker processing, independent robot processing, and human-robot collaborative processing, with corresponding time coefficients of 1.0, 2.0, and 0.7, respectively. The delivery date coefficient was randomly sampled within the range of [1.2, 1.7].
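The job generation described above can be sketched as follows. The distribution parameters come from the text; several details are assumptions made for illustration: the log-normal parameters are derived so that the untruncated distribution has mean 45 and standard deviation 15, the range [15, 90] is enforced by clipping, the initial jobs arrive at time 0, and the due date is taken as arrival time plus the delivery date coefficient times the basic processing time. The function name and the time unit of the arrival rate are also hypothetical.

```python
import numpy as np

# Time coefficients of the three processing modes, as stated in the text.
MODE_TIME_COEF = {"worker": 1.0, "robot": 2.0, "collaborative": 0.7}

def generate_jobs(n_initial, n_dynamic, arrival_rate, rng):
    """Sample arrival times, basic processing times, and due dates."""
    mean, std = 45.0, 15.0
    # Convert the target mean/std into log-normal parameters.
    sigma2 = np.log(1.0 + (std / mean) ** 2)
    mu = np.log(mean) - sigma2 / 2.0
    n = n_initial + n_dynamic
    # Poisson arrival process: exponential inter-arrival times with mean 1/rate.
    gaps = rng.exponential(1.0 / arrival_rate, size=n_dynamic)
    arrivals = np.concatenate([np.zeros(n_initial), np.cumsum(gaps)])
    # ASSUMPTION: the [15, 90] minute range is enforced by clipping.
    base = np.clip(rng.lognormal(mu, np.sqrt(sigma2), size=n), 15.0, 90.0)
    due_coef = rng.uniform(1.2, 1.7, size=n)
    due = arrivals + due_coef * base   # ASSUMED due-date rule
    return arrivals, base, due

rng = np.random.default_rng(0)
arrivals, base, due = generate_jobs(n_initial=5, n_dynamic=45, arrival_rate=0.4, rng=rng)
# Processing time of the first job in human-robot collaborative mode.
collab_time = base[0] * MODE_TIME_COEF["collaborative"]
```

The sizes used here (5 initial jobs, 45 dynamic jobs, arrival rate 0.4) match the convergence verification case reported in the text.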

[0178] (II) Algorithm convergence verification

[0179] A simulation case with 10 workstations, 10 workers, 4 robots, 5 initial jobs, 45 randomly arriving jobs, and a job arrival rate of 0.4 was used for training and testing. The final scheduling result is shown in Figure 5, the scheduling Gantt chart for this case, in which the horizontal axis is time (minutes) and the vertical axis lists the workstations. The legend distinguishes the three processing modes: independent worker processing, independent robot processing, and human-robot collaborative processing. Each colored block is a processing period assigned to the corresponding workstation. The start and end of a block on the time axis are the start and end times of the job, and the length of the block is its processing duration.

[0180] Figure 6 shows the weighted reward curve during training. One training iteration is performed each time the experience pool has collected a certain number of samples. As the number of iterations increases, the average reward rises and quickly stabilizes. Figure 7 shows the curve of the objective function, which decreases and likewise stabilizes. This indicates that the model has converged and achieves good performance.

[0181] (III) Comparison of Composite Scheduling Rules

[0182] Seven well-performing composite scheduling rules (Rule 1 to Rule 7) were selected for comparison: (1) JR1+WR1, (2) JR1+WR3, (3) JR1+WR3_RR1, (4) JR2+WR1, (5) JR2+WR1_RR1, (6) JR3+WR1_RR1, and (7) JR3+WR3_RR1. Each case was repeated 10 times, and the weighted objective value (the weighted sum of delay penalty and load balancing) was used as the evaluation index. The comparison results are shown in Table 4, which identifies each simulation case by its scale and number of jobs; the best result in each case is highlighted in bold. The invention performs well both on small-scale or relatively stable cases and on large-scale or highly volatile cases, with little fluctuation in the weighted objective value, which demonstrates good robustness. The composite scheduling rules, by contrast, are limited by their fixed rule combinations: their weighted objective values are higher than those of the proposed method in all cases, and their overall scheduling performance is weaker.

[0183] Table 4. Comparison of Simulation Data Examples and Composite Scheduling Rules

[0184]

[0185] (IV) Comparison of this invention with mainstream deep reinforcement learning algorithms

[0186] Four mainstream deep reinforcement learning algorithms—A2C, DDQN, Dueling DQN, and SAC—were selected as benchmarks for comparison. The experiment employed multiple simulation cases of varying scales, totaling 16 cases, with each case undergoing 10 repeated experiments. The weighted objective value was used as the comprehensive performance evaluation index. The experimental results are shown in Table 5, with the optimal solution for each case highlighted in bold. The data in the table shows that the weighted objective value of this invention is reduced by an average of 26.77%, 22.69%, 19.94%, and 44.69% compared to the other four algorithms, respectively. This effectively balances job delay time with worker load balancing, demonstrating its superior scheduling capability and broad adaptability.

[0187] Table 5. Comparison of Simulation Data Cases and Mainstream Deep Reinforcement Learning Algorithms

[0188]

[0189] (V) Comparison of this invention with metaheuristic algorithms

[0190] Genetic Algorithm (GA), Simulated Annealing (SA), and Particle Swarm Optimization (PSO) were selected as benchmarks. Traditional metaheuristic algorithms are offline static scheduling methods: they assume that all job information is known in advance and cannot respond online to the random dynamic arrival of new jobs. To ensure a fair comparison, the experiments in this section were conducted in a static environment in which all job information was known. The experiments used 16 groups of simulation cases of different scales, varying the number of workstations, workers, robots, and jobs, and each group was repeated 10 times. The weighted objective value was used as the evaluation index, and the results are shown in Table 6, with the best result in each case highlighted in bold. The average weighted objective value of the invention is 45.63, which is on average 85.99%, 82.00%, and 83.59% lower than that of GA, SA, and PSO, respectively. The advantage grows with problem scale, with the weighted objective value far below that of the second-best method. This trend indicates that on large-scale, highly complex scheduling problems the search space of traditional metaheuristic algorithms expands rapidly, so they are prone to becoming trapped in local optima or to computational explosion, and their scheduling quality declines markedly. The invention, in contrast, effectively decouples the multi-objective optimization task and achieves refined collaboration among agents while maintaining stable scheduling performance in large-scale scenarios.

[0191] Table 6. Comparison of Simulation Data Cases and Static Metaheuristic Algorithms

[0192]

[0193] Based on three types of comparative experiments, this invention can effectively decouple multi-objective optimization tasks and realize refined collaboration among multiple agents. It significantly outperforms existing methods in terms of dynamic response capability, scheduling quality, and robustness, providing an efficient and reliable solution for dynamic human-machine collaborative workshop scheduling problems.

[0194] Embodiment 2

[0195] The present invention discloses a multi-agent flexible workshop production scheduling system based on conditional attribution mechanism, comprising an environment interaction layer and an agent decision-making layer;

[0196] The environment interaction layer is used to maintain the global state of the workshop and the local state of the operations to be processed;

[0197] The agent decision-making layer is used to establish a job agent and a resource agent for flexible workshop production scheduling. The job agent selects a job action based on the global state of the workshop; the job action includes selecting the job to be processed from the set of unassigned jobs. The resource agent selects a resource action based on the global state and the local state of the job to be processed; the resource action includes selecting the processing resources and processing mode of the job to be processed.

[0198] In the policy update phase of the agents, the policy network of each agent is updated using a proximal policy optimization method. Specifically, the individual attribution advantages of the job agent and the resource agent are calculated using the outputs of the job-side conditional expectation mapping network and the resource-side conditional expectation mapping network, respectively. The job-side conditional expectation mapping network calculates the average expected return of all possible job actions under a fixed resource action. The resource-side conditional expectation mapping network calculates the average expected return of all possible resource actions under a fixed job action.

[0199] The flexible workshop performs processing according to the jobs to be processed, their processing resources, and their processing modes.

[0200] Embodiment 3

[0201] The computer program product of the present invention includes a computer program that, when executed by a processor, implements the multi-agent flexible workshop production scheduling method based on conditional attribution mechanism.

Claims

1. A multi-agent flexible workshop production scheduling method based on a conditional attribution mechanism, characterized in that it includes the following steps: For flexible workshop production scheduling, a job agent and a resource agent are established respectively. The job agent selects a job action based on the global state of the workshop. The job action includes selecting the job to be processed from the set of unassigned jobs. The resource agent selects a resource action based on the global state and the local state of the job to be processed. The resource action includes selecting the processing resources and the processing mode of the job to be processed. The flexible workshop executes processing based on the job to be processed, its processing resources, and its processing mode. In the policy update phase of the agents, the policy network corresponding to each agent is updated using a proximal policy optimization method. Specifically, the individual attribution advantages of the job agent and the resource agent are calculated from the outputs of the job-side conditional expectation mapping network and the resource-side conditional expectation mapping network, respectively. The job-side conditional expectation mapping network is used to calculate the average expected return over all possible job actions with the resource action held fixed. The resource-side conditional expectation mapping network is used to calculate the average expected return over all possible resource actions with the job action held fixed.

2. The multi-agent flexible workshop production scheduling method based on a conditional attribution mechanism according to claim 1, characterized in that the optimization objectives include minimizing the job delay time and minimizing the standard deviation of the worker workload; independent value networks are used to calculate a first value estimate for the objective of minimizing the job delay time and a second value estimate for the objective of minimizing the standard deviation of the worker workload. A first GAE advantage is calculated from the first value estimate using the generalized advantage estimation method, and a second GAE advantage is calculated from the second value estimate. The combined advantage is obtained as the weighted sum of the first GAE advantage and the second GAE advantage.
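
As an illustration of the computation recited in this claim, the following sketch applies generalized advantage estimation once per objective and then forms the weighted sum. The discount factor `gamma`, the GAE parameter `lam`, the weights `w1` and `w2`, and the sample reward and value lists are all assumptions chosen for the example; the claim does not fix them.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def combined_advantage(adv_first, adv_second, w1=0.5, w2=0.5):
    """Weighted sum of the two per-objective GAE advantages (weights assumed)."""
    return [w1 * a + w2 * b for a, b in zip(adv_first, adv_second)]

# Hypothetical per-step rewards and value estimates for the two objectives.
delay_adv = gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
load_adv = gae([0.0, 0.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
mixed = combined_advantage(delay_adv, load_adv)
```

Keeping one value estimate per objective means each advantage is computed against a baseline that tracks only that objective, and the mixing happens afterwards.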

3. The multi-agent flexible workshop production scheduling method based on a conditional attribution mechanism according to claim 2, characterized in that the job-side conditional expectation mapping network deduces the first value and the first reward for executing a job-side fixed joint action. After a specified number of deductions, the average expected return of the job is calculated based on the first value and the first reward. The resource-side conditional expectation mapping network deduces the second value and the second reward for executing a resource-side fixed joint action. After a specified number of deductions, the average expected return of the resource is calculated based on the second value and the second reward. The job-side fixed joint action includes a fixed resource action and any job action; the resource-side fixed joint action includes a fixed job action and any resource action.

4. The multi-agent flexible workshop production scheduling method based on a conditional attribution mechanism according to claim 3, characterized in that the individual job attribution advantage for taking a joint action is calculated based on the first value estimate, the second value estimate, the combined advantage, and the average expected return of the job; the individual resource attribution advantage for taking the joint action is calculated based on the first value estimate, the second value estimate, the combined advantage, and the average expected return of the resource. The joint action includes a job action and a resource action.

5. The multi-agent flexible workshop production scheduling method based on a conditional attribution mechanism according to claim 1, characterized in that the global state includes the average tardiness of completed jobs, the standard deviation of the tardiness of completed jobs, the size of the unassigned job set, the average slack time of unassigned jobs, the standard deviation of the slack time of unassigned jobs, the average workload of all workers, the standard deviation of the workload of all workers, the average utilization rate of all robots, and the standard deviation of the utilization rate of all robots. The local state includes the basic processing time and the due date of the job to be processed.
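
A minimal sketch of how the nine global-state features listed in this claim could be assembled into a vector follows. The function names, the use of the population standard deviation, the zero default for empty sets, and the convention that one slack (relaxation) value is supplied per unassigned job are assumptions made for illustration.

```python
from statistics import fmean, pstdev

def mean_std(values):
    """Return (mean, population standard deviation); (0.0, 0.0) if empty."""
    if not values:
        return 0.0, 0.0
    return fmean(values), pstdev(values)

def global_state(tardiness, slack, worker_load, robot_util):
    """Build the 9-element global state vector.

    tardiness   -- lateness of each completed job
    slack       -- slack time of each unassigned job (one entry per job)
    worker_load -- workload of each worker
    robot_util  -- utilization rate of each robot
    """
    t_mean, t_std = mean_std(tardiness)
    s_mean, s_std = mean_std(slack)
    w_mean, w_std = mean_std(worker_load)
    r_mean, r_std = mean_std(robot_util)
    return [t_mean, t_std, float(len(slack)), s_mean, s_std,
            w_mean, w_std, r_mean, r_std]

# Hypothetical workshop snapshot: two completed jobs, one unassigned job,
# two workers, and two robots.
state = global_state([0.0, 2.0], [4.0], [1.0, 3.0], [0.5, 0.5])
```

Because every feature is a summary statistic, the vector keeps a fixed length no matter how many jobs, workers, or robots are present.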

6. The multi-agent flexible workshop production scheduling method based on a conditional attribution mechanism according to claim 1, characterized in that the job agent selecting a job action based on the global state of the workshop includes: the job agent selects a job rule based on the job probability distribution output by the job policy network, and selects the job to be processed from the set of unassigned jobs according to the job rule; the input to the job policy network is the global state. The job rules include selecting the job with the shortest basic processing time, selecting the job with the longest basic processing time, selecting the job with the shortest slack time, and selecting the job with the longest slack time. The job policy network adopts a multilayer perceptron structure.
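
The four rules named in this claim can be sketched as plain dispatching functions. The `Job` record, the rule names, and the definition of slack (relaxation) time as due date minus current time minus basic processing time are assumptions for illustration; the rule probabilities, which would come from the policy network, are supplied here by hand.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    name: str
    processing_time: float  # basic processing time
    due_date: float         # delivery deadline

RULES = ["shortest_processing", "longest_processing",
         "shortest_slack", "longest_slack"]

def slack(job, now):
    # Assumed definition: time remaining before the due date after processing.
    return job.due_date - now - job.processing_time

def sample_rule(probs, rng):
    """Draw one rule according to the given probability distribution."""
    return rng.choices(RULES, weights=probs)[0]

def select_job(rule, jobs, now):
    """Apply a rule to the set of unassigned jobs and return the chosen job."""
    if rule == "shortest_processing":
        return min(jobs, key=lambda j: j.processing_time)
    if rule == "longest_processing":
        return max(jobs, key=lambda j: j.processing_time)
    if rule == "shortest_slack":
        return min(jobs, key=lambda j: slack(j, now))
    if rule == "longest_slack":
        return max(jobs, key=lambda j: slack(j, now))
    raise ValueError(f"unknown rule: {rule}")

# Hypothetical unassigned job set and hand-written rule probabilities.
jobs = [Job("a", 3.0, 20.0), Job("b", 5.0, 10.0), Job("c", 1.0, 30.0)]
rule = sample_rule([0.1, 0.1, 0.7, 0.1], random.Random(0))
chosen = select_job(rule, jobs, now=0.0)
```

Choosing among a fixed set of rules rather than among individual jobs keeps the size of the action space independent of how many jobs are waiting.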

7. The multi-agent flexible workshop production scheduling method based on a conditional attribution mechanism according to claim 1, characterized in that the resource agent selecting a resource action based on the global state and the local state of the job to be processed includes: the resource agent selects a resource rule based on the resource probability distribution output by the resource policy network, and determines the processing resources and the processing mode of the job to be processed according to the resource rule; the input to the resource policy network is the global state and the local state. The resource rules include: selecting the worker with the shortest current cumulative working time; selecting the worker with the longest current cumulative working time; selecting the worker with the shortest waiting time; selecting the worker with the longest waiting time; selecting the robot with the shortest waiting time; selecting the robot with the longest waiting time; selecting the worker with the shortest current cumulative working time and the robot with the shortest waiting time; selecting the worker with the longest current cumulative working time and the robot with the longest waiting time; selecting the worker with the shortest waiting time and the robot with the shortest waiting time; and selecting the worker with the longest waiting time and the robot with the longest waiting time. The resource policy network adopts a multilayer perceptron structure.

8. The multi-agent flexible workshop production scheduling method based on a conditional attribution mechanism according to claim 1, characterized in that, when a new job arrives in the workshop or a job is completed, if there are vacant workstations and unassigned jobs at the current moment, the current moment is a scheduling decision point. At the scheduling decision point, the job agent and the resource agent are invoked to select the job to be processed, its processing resources, and its processing mode. If the processing resources are idle, processing of the job starts immediately; otherwise, processing begins once the processing resources become idle.
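
The trigger condition and the start-time rule in this claim can be written as two small functions. The event names and the representation of resource availability as a list of times at which each selected resource becomes free are assumptions for illustration.

```python
TRIGGER_EVENTS = ("job_arrival", "job_completion")

def is_decision_point(event, vacant_workstations, unassigned_jobs):
    """A decision point needs a triggering event, a vacant workstation,
    and at least one unassigned job."""
    return (event in TRIGGER_EVENTS
            and vacant_workstations > 0
            and unassigned_jobs > 0)

def start_time(now, resource_free_times):
    """Start immediately if every selected resource is already free,
    otherwise wait until the last of them becomes free."""
    return max([now, *resource_free_times])

# Hypothetical situation at time 10: a job has just arrived, one workstation
# is vacant, and the selected worker and robot free up at times 8 and 12.
decide = is_decision_point("job_arrival", vacant_workstations=1, unassigned_jobs=3)
begin = start_time(10.0, [8.0, 12.0])
```

In this example the worker is already free but the robot is not, so processing begins at time 12 rather than at the decision point itself.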

9. A multi-agent flexible workshop production scheduling system based on a conditional attribution mechanism, characterized in that it includes an environment interaction layer and an agent decision-making layer; the environment interaction layer is used to maintain the global state of the workshop and the local state of the job to be processed; the agent decision-making layer is used to establish a job agent and a resource agent for flexible workshop production scheduling. The job agent selects a job action based on the global state of the workshop. The job action includes selecting the job to be processed from the set of unassigned jobs. The resource agent selects a resource action based on the global state and the local state of the job to be processed. The resource action includes selecting the processing resources and the processing mode of the job to be processed. In the policy update phase of the agents, a proximal policy optimization method is used to update the policy network corresponding to each agent. The outputs of the job-side conditional expectation mapping network and the resource-side conditional expectation mapping network are used to calculate the individual attribution advantages of the job agent and the resource agent, respectively. The job-side conditional expectation mapping network is used to calculate the average expected return over all possible job actions with the resource action held fixed. The resource-side conditional expectation mapping network is used to calculate the average expected return over all possible resource actions with the job action held fixed. The flexible workshop performs processing based on the job to be processed, its processing resources, and its processing mode.

10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the multi-agent flexible workshop production scheduling method based on a conditional attribution mechanism according to any one of claims 1 to 8.