Quality inspection task scheduling method, device and medium
By constructing a quality inspection task scheduling model based on reinforcement learning, and utilizing the characteristics of task processing time and equipment occupancy rate, combined with action selection rules and reward functions, the problem of the inability of existing technologies to describe the degrees of freedom and nonlinear relationships in quality inspection task scheduling is solved, thus achieving efficient scheduling of quality inspection tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- STATE GRID ZHEJIANG ELECTRIC POWER CO MARKETING SERVICE CENT
- Filing Date
- 2022-12-08
- Publication Date
- 2026-06-23
Abstract
Description
Technical Field
[0001] This invention relates to the field of quantity transmission quality inspection technology, and more particularly to a quality inspection task scheduling method, device and medium based on reinforcement learning.
Background Technology
[0002] Quality inspection of meter readings is a crucial task in power metering. To improve the efficiency and accuracy of testing various metering devices, automated quality inspection task scheduling has become a natural choice. However, unlike the existing flexible job shop scheduling problem where the required processes for each workpiece are fixed, the quality inspection task scheduling problem does not have fixed quality inspection items for each sample. A batch of quality inspection tasks can be completed on a batch of samples, resulting in a larger optimization space. At the same time, quality inspection tasks have nonlinear relationships such as serial, parallel, and mutually exclusive relationships, making the constraints more complex.
[0003] In existing reinforcement learning methods for flexible job shop scheduling problems, the state features cannot fully describe the scheduling degrees of freedom and nonlinear task relationships of the quality inspection task scheduling problem, and the reward function cannot reflect the influence of multiple factors such as task order, sample scheduling, and equipment scheduling. Therefore, existing scheduling algorithms cannot be directly applied.
Summary of the Invention
[0004] In order to overcome the shortcomings of the prior art, one of the objectives of this invention is to provide a quality inspection task scheduling method, which is based on reinforcement learning to construct a quality inspection task scheduling model, thereby improving the sample detection efficiency in the quality inspection task scheduling process.
[0005] One of the objectives of this invention is achieved through the following technical solution:
[0006] A quality inspection task scheduling method, characterized by the following steps:
[0007] S1. Initialize the model training parameters, where the model is a reinforcement learning model;
[0008] S2. Construct scheduling state features, which are obtained by concatenating the task processing time channel, the sample-device occupancy rate channel, and the sample-device available time channel;
[0009] S3. Output the corresponding action according to the current scheduling state, and decode the scheduling state to obtain the sample and device corresponding to the action;
[0010] S4. Calculate the reward value and update the training parameters based on the action and decoding result;
[0011] S5. Determine if the scheduled task has been completed:
[0012] When the scheduling task is completed and the number of training steps is reached, the training ends; otherwise, return to step S2.
[0013] If the scheduled task is not completed, proceed to the next scheduling state and return to step S2.
[0014] Furthermore, the training parameters include batch size, number of training steps, replay buffer, replay time, and empirical hyperparameters.
[0015] Furthermore, after calculating the reward value and updating the training parameters based on the action and decoding result, the method further includes:
[0016] The scheduling status, action, decoding result and reward value are stored in a cache pool, which is used for experience replay during training.
[0017] Furthermore, when the scheduling task is not completed, entering the next scheduling state and returning to step S2 also includes:
[0018] Determine whether experience replay is needed. If so, replay the experience; otherwise, proceed to the next scheduling state and return to step S2.
[0019] Furthermore, the task processing time channel is a three-dimensional matrix of size $(n+1)\times(m+1)\times j$, where $n$ is the number of samples, $m$ is the number of devices, and $j$ is the number of quality inspection items. The three-dimensional matrix of the processing time channel includes matrix elements $p_{a,b,c}$, $p_{a,m,c}$ and $p_{n,b,c}$, where $p_{a,b,c}$ represents the processing time required for quality inspection task $c$ to be completed by sample $a$ on equipment $b$, and $p_{a,m,c}$ and $p_{n,b,c}$ indicate the feasibility of processing quality inspection task $c$ on sample $a$ and on equipment $b$, respectively;
[0020] The sample-device occupancy rate channel is a two-dimensional matrix of size $(n+1)\times(m+1)$, and the sample-device occupancy rate channel matrix includes matrix elements $u_{a,b}$, $u_{a,m}$ and $u_{n,b}$, where $u_{a,b}$ represents the cumulative time spent on quality inspection tasks performed by sample $a$ on device $b$, and $u_{a,m}$ and $u_{n,b}$ are the cumulative processing times of sample $a$ and device $b$, respectively.
[0021] The sample-device available time channel is a two-dimensional matrix of size $(n+1)\times(m+1)$, and the sample-device available time channel matrix includes matrix elements $l_{a,b}$, $l_{a,m}$ and $l_{n,b}$, where $l_{a,b}$ indicates the end time of the last task executed by sample $a$ on device $b$, and $l_{a,m}$ and $l_{n,b}$ are the last occupancy release times of sample $a$ and device $b$, respectively.
[0022] The scheduling state feature obtained by concatenation has dimension $(n+1)\times(m+1)\times(j+2)$.
[0023] Furthermore, the corresponding action is output based on the current scheduling state, satisfying the scheduling state representation principles: $a_i = \pi(S_i)$, $a_i = S_{i+1} - S_i$, and $r_i = R(a_i, S_i, S_{i+1})$, where $a_i$ is the current action, $S_i$ is the current state, $r_i$ is the reward for the current action, $R$ is the reward function, and $\pi$ is the action selection policy.
[0024] Furthermore, in S3, action selection rules replace direct action output. These action selection rules include:
[0025] (1) Select the task with the shortest processing time;
[0026] (2) Select the task with the longest processing time;
[0027] (3) Select the task with the fewest available samples;
[0028] (4) Select serial task;
[0029] (5) Select parallel tasks;
[0030] (6) Select the preceding task in the mutually exclusive task pair;
[0031] (7) Select the subsequent task in the mutually exclusive task pair;
[0032] (8) Select an unconstrained task;
[0033] The heuristic rules for decoding include:
[0034] Rule 1: Heuristic sample selection is performed on the chromosomes of each individual. The sample with the shortest completion time for each experiment is selected in the order of the experiments from front to back. If multiple samples meet the selection criteria, the sample with the shortest completed experiment time is selected.
[0035] Rule 2: Perform heuristic device selection for each individual's chromosomes. Select the device with the shortest completion time based on the selected sample. If multiple devices meet the selection criteria, select the device with the lowest workload.
[0036] Furthermore, the calculation of the reward value satisfies:
[0037] R = αU - βE,
[0038] Where R is the reward value, α and β are empirical parameters, and U and E are the scheduling environment utilization and idle time, respectively;
[0039] The calculation of the scheduling environment utilization rate satisfies:
[0040]
[0041] where $U_N$ and $U_M$ are the sample and equipment utilization, $u_{n,M}$ and $u_{N,m}$ are the cumulative processing times of sample $n$ and device $m$ in the sample-device occupancy rate channel, and $C_{max}$ is the current longest processing time;
[0042] The calculation of the idle time satisfies:
[0043]
[0044] where $E_N$ and $E_M$ are the sample and equipment idle times, $l_{n,M}$ and $l_{N,m}$ are the last occupancy release times of sample $n$ and device $m$ in the sample-device available time channel, and $u_{n,M}$ and $u_{N,m}$ are the cumulative processing times of sample $n$ and device $m$ in the sample-device occupancy rate channel.
[0045] The second objective of this invention is to provide an electronic device for carrying out the first objective, comprising a processor, a storage medium, and a computer program, wherein the computer program is stored in the storage medium and, when executed by the processor, implements the aforementioned quality inspection task scheduling method.
[0046] The third objective of this invention is to provide a computer-readable storage medium for carrying out the first objective, on which a computer program is stored; when the computer program is executed by a processor, it implements the above-mentioned quality inspection task scheduling method.
[0047] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0048] This invention constructs a quality inspection task scheduling model based on reinforcement learning and enhances the algorithm's ability to learn scheduling states. It replaces the agent's direct learning of action decisions with action selection rules, improves the algorithm's convergence speed using heuristic rules, and enhances the interpretability of the model's action selection. The model can fully describe the scheduling degrees of freedom and nonlinear task relationships of the quality inspection task scheduling problem, and can be applied to quantity transmission quality inspection.
Attached Figure Description
[0049] Figure 1 is a flowchart of the quality inspection task scheduling method in Example 1;
[0050] Figure 2 is a flowchart of the quality inspection task scheduling method in Example 1 after incorporating experience replay;
[0051] Figure 3 is a structural block diagram of the electronic device in Example 3.
Detailed Implementation
[0052] The present invention will now be described in more detail with reference to the accompanying drawings. It should be noted that the following description of the present invention with reference to the accompanying drawings is merely illustrative and not restrictive. Various embodiments can be combined with each other to form other embodiments not shown in the following description.
[0053] Example 1
[0054] Example 1 provides a quality inspection task scheduling method, which aims to construct a quality inspection task scheduling model by using reinforcement learning. It addresses the serial, parallel and mutually exclusive characteristics between quality inspection tasks by using a reinforcement learning action selection rule mechanism to replace the agent's direct learning of action decisions.
[0055] Quality inspection task scheduling algorithms need to determine the processing order of quality inspection tasks and allocate samples and equipment to each task so as to obtain the shortest possible processing time. Essentially, this is a finite sequence of decisions. The core of applying reinforcement learning to quality inspection task scheduling is to transform the scheduling problem into a Markov decision process or a semi-Markov decision process, i.e., to define the state, action, transition probability, and reward function.
[0056] This embodiment addresses the challenge of scheduling quality inspection tasks across a batch of samples and machines. It proposes a scheduling state representation method suitable for reinforcement learning, enhancing the algorithm's ability to learn scheduling states. As shown in Figure 1, the quality inspection task scheduling method includes the following steps:
[0057] S1. Initialize the model training parameters, where the model is a reinforcement learning model;
[0058] The training parameters mentioned above include batch size, number of training steps, replay buffer, replay time, and empirical hyperparameters.
[0059] S2. Construct scheduling state features, which are obtained by concatenating the task processing time channel, the sample-device occupancy rate channel, and the sample-device available time channel.
[0060] To enhance model performance and increase the policy diversity of training data, a cache pool is added during model training. Specifically, the scheduling state, action, decoding result, and reward value are stored one-to-one in the cache pool, which is used for experience replay during training. Cache pool experience replay involves storing the state, action, and feedback as a set of experience samples, and then rereading them as training data after a certain number of training steps to improve the policy diversity of the training data.
[0061] The process after adding the cache pool is shown in Figure 2. When the scheduling task is not completed, proceeding to the next scheduling state and returning to step S2 also includes:
[0062] Determine whether experience replay is needed. If so, replay the experience; otherwise, proceed to the next scheduling state and return to step S2.
[0063] The next scheduling state is the new state that the current state transitions to after an action selection, including: the queue of tasks to be processed, the tasks already processed, the sample device load, and idle time.
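The cache pool and the replay decision described above can be sketched as follows. This is a minimal illustration rather than the implementation of this embodiment: the names `Experience`, `ReplayPool` and `should_replay`, the fixed replay interval, and the uniform random sampling are all assumptions introduced for the example.

```python
import random
from collections import deque
from typing import Any, NamedTuple


class Experience(NamedTuple):
    """One stored record: scheduling state, action, decoding result, reward."""
    state: Any
    action: int
    decoded: tuple  # (sample, device) chosen by the decoding heuristics
    reward: float


class ReplayPool:
    """Fixed-capacity cache pool; the oldest records are dropped first."""

    def __init__(self, capacity: int):
        self._buffer = deque(maxlen=capacity)

    def __len__(self) -> int:
        return len(self._buffer)

    def store(self, exp: Experience) -> None:
        self._buffer.append(exp)

    def sample(self, batch_size: int, rng: random.Random) -> list:
        # Uniform sampling without replacement (an assumption of this sketch).
        size = min(batch_size, len(self._buffer))
        return rng.sample(list(self._buffer), size)


def should_replay(step: int, replay_interval: int, pool: ReplayPool,
                  batch_size: int) -> bool:
    """Replay every `replay_interval` steps once the pool holds a full batch."""
    return step > 0 and step % replay_interval == 0 and len(pool) >= batch_size
```

A training loop would call `store` after every action and call `sample` whenever `should_replay` returns true, feeding the sampled records back to the learner as additional training data.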
[0064] When selecting scheduling state features in S2, the following principles should be followed:
[0065] (1) The state features should contain all the information required for action decision-making so as to fully describe the scheduling environment, that is, satisfy $a_i = \pi(S_i)$, where $a_i$ is the current action and $S_i$ is the current state.
[0066] $\pi$ represents the action policy, which takes the scheduling state as input and outputs the target action;
[0067] (2) Between adjacent scheduling states, the action corresponding to the transition should be unique, that is, satisfy $a_i = S_{i+1} - S_i$;
[0068] (3) The reward for an action depends only on the preceding and following states, that is, it satisfies $r_i = R(a_i, S_i, S_{i+1})$, where
[0069] $r_i$ is the reward for the current action and $R$ is the reward function.
[0070] The selection of state features should be related to the scheduling objective to reduce redundancy of feature information.
[0071] Specifically, the task processing time channel is a three-dimensional matrix of size $(n+1)\times(m+1)\times j$, where $n$ is the number of samples, $m$ is the number of devices, and $j$ is the number of quality inspection items. The three-dimensional matrix of the processing time channel includes matrix elements $p_{a,b,c}$, $p_{a,m,c}$ and $p_{n,b,c}$, where $p_{a,b,c}$ represents the processing time required for quality inspection task $c$ to be completed by sample $a$ on equipment $b$, and $p_{a,m,c}$ and $p_{n,b,c}$ indicate the feasibility of processing quality inspection task $c$ on sample $a$ and on equipment $b$, respectively. A value of 0 indicates that quality inspection task $c$ cannot be processed on this sample (equipment) or does not need to be repeated, and a value of 1 indicates that quality inspection task $c$ has not been executed and can be processed on this sample (equipment).
[0072] The sample-device occupancy rate channel is a two-dimensional matrix of size $(n+1)\times(m+1)$, and the sample-device occupancy rate channel matrix includes matrix elements $u_{a,b}$, $u_{a,m}$ and $u_{n,b}$, where $u_{a,b}$ represents the cumulative time spent on quality inspection tasks performed by sample $a$ on device $b$, and $u_{a,m}$ and $u_{n,b}$ are the cumulative processing times of sample $a$ and device $b$, respectively.
[0073] The sample-device available time channel is a two-dimensional matrix of size $(n+1)\times(m+1)$, and the sample-device available time channel matrix includes matrix elements $l_{a,b}$, $l_{a,m}$ and $l_{n,b}$, where $l_{a,b}$ indicates the end time of the last task executed by sample $a$ on device $b$, and $l_{a,m}$ and $l_{n,b}$ are the last occupancy release times of sample $a$ and device $b$, respectively.
[0074] By concatenating the task processing time channel, the sample-device occupancy rate channel, and the sample-device available time channel, a scheduling state feature representation of dimension $(n+1)\times(m+1)\times(j+2)$ is finally obtained.
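The concatenation of the three channels can be illustrated with a short NumPy sketch. The function names `occupancy_channel` and `build_state` are hypothetical, and the way the extra row and column are filled (per-sample totals in column index m, per-device totals in row index n, corner cell left at zero) is an assumption made for the example, not a detail stated in this embodiment.

```python
import numpy as np


def occupancy_channel(pair_time: np.ndarray) -> np.ndarray:
    """Pad an n x m sample-device time matrix to (n+1) x (m+1).

    Assumption: column m holds each sample's cumulative time, row n holds
    each device's cumulative time, and the corner cell stays 0.
    """
    n, m = pair_time.shape
    channel = np.zeros((n + 1, m + 1), dtype=float)
    channel[:n, :m] = pair_time
    channel[:n, m] = pair_time.sum(axis=1)  # per-sample cumulative time
    channel[n, :m] = pair_time.sum(axis=0)  # per-device cumulative time
    return channel


def build_state(proc_time: np.ndarray, occupancy: np.ndarray,
                available: np.ndarray) -> np.ndarray:
    """Stack the three channels into one (n+1) x (m+1) x (j+2) tensor."""
    if (proc_time.ndim != 3 or proc_time.shape[:2] != occupancy.shape
            or occupancy.shape != available.shape):
        raise ValueError("channels must share the (n+1) x (m+1) leading shape")
    return np.concatenate(
        [proc_time, occupancy[:, :, None], available[:, :, None]], axis=2
    )
```

With n = 2 samples, m = 3 devices and j = 4 inspection items, `build_state` returns an array of shape (3, 4, 6): the first j slices along the last axis carry processing times, and the last two carry occupancy and available time.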
[0075] S3. Output the corresponding action according to the current scheduling state, and decode the scheduling state to obtain the sample and device corresponding to the action;
[0076] Since the quality inspection task scheduling problem involves a large number of tasks and virtually no need for repeated selection, to reduce the difficulty of reinforcement learning training and improve algorithm stability, step S3 replaces the direct output of actions with action selection rules. Step S3 outputs one of the following action selection rules:
[0077] (1) Select the task with the shortest processing time;
[0078] (2) Select the task with the longest processing time;
[0079] (3) Select the task with the fewest available samples;
[0080] (4) Select serial task;
[0081] (5) Select parallel tasks;
[0082] (6) Select the preceding task in the mutually exclusive task pair;
[0083] (7) Select the subsequent task in the mutually exclusive task pair;
[0084] (8) Select an unconstrained task;
[0085] The task to be processed is obtained according to the selected rule, and the action is then decoded according to the following heuristic rules:
[0086] Rule 1: Heuristic sample selection is performed on the chromosomes of each individual. The sample with the shortest completion time for each experiment is selected in the order of the experiments from front to back. If multiple samples meet the selection criteria, the sample with the shortest completed experiment time is selected.
[0087] Rule 2: Perform heuristic device selection for each individual's chromosomes. Select the device with the shortest completion time based on the selected sample. If multiple devices meet the selection criteria, select the device with the lowest workload.
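The eight action selection rules and the two decoding rules can be sketched as below. This is an illustrative reading, not the implementation of this embodiment: the `Task` fields, the `kind` labels, the choice of the first matching task for rules (4) to (8), and the bookkeeping dictionaries passed to `decode` are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: int
    duration: float
    kind: str  # "serial", "parallel", "mutex_first", "mutex_second" or "free"
    eligible_samples: tuple
    eligible_devices: tuple


# Rules (4)-(8) pick a task by its relationship type (labels are hypothetical).
_KIND_BY_RULE = {4: "serial", 5: "parallel", 6: "mutex_first",
                 7: "mutex_second", 8: "free"}


def select_task(rule: int, pending: list):
    """Apply action selection rule 1-8; return None if no task matches."""
    if not pending:
        return None
    if rule == 1:
        return min(pending, key=lambda t: t.duration)
    if rule == 2:
        return max(pending, key=lambda t: t.duration)
    if rule == 3:
        return min(pending, key=lambda t: len(t.eligible_samples))
    kind = _KIND_BY_RULE[rule]
    matches = [t for t in pending if t.kind == kind]
    return matches[0] if matches else None


def decode(task, sample_ready, device_ready, sample_busy, device_busy):
    """Decode a task into (sample, device, start, end).

    Rule 1: earliest-finishing sample, ties broken by least completed time.
    Rule 2: earliest-finishing device for that sample, ties broken by workload.
    """
    sample = min(
        task.eligible_samples,
        key=lambda s: (sample_ready[s] + task.duration, sample_busy[s]),
    )
    device = min(
        task.eligible_devices,
        key=lambda d: (max(sample_ready[sample], device_ready[d]) + task.duration,
                       device_busy[d]),
    )
    start = max(sample_ready[sample], device_ready[device])
    return sample, device, start, start + task.duration
```

Because the agent only chooses among eight rules, the action space stays fixed regardless of how many tasks are pending, which is the motivation given above for replacing direct action output.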
[0088] S4. Calculate the reward value and update the training parameters based on the action and decoding result;
[0089] The calculation of the above reward value satisfies:
[0090] R = αU - βE,
[0091] Where R is the reward value, α and β are empirical parameters, and U and E are the scheduling environment utilization and idle time, respectively;
[0092] The calculation of the scheduling environment utilization rate satisfies:
[0093]
[0094] where $U_N$ and $U_M$ are the sample and equipment utilization, $u_{n,M}$ and $u_{N,m}$ are the cumulative processing times of sample $n$ and device $m$ in the sample-device occupancy rate channel, and $C_{max}$ is the current longest processing time;
[0095] The calculation of the idle time satisfies:
[0096]
[0097] where $E_N$ and $E_M$ are the sample and equipment idle times, $l_{n,M}$ and $l_{N,m}$ are the last occupancy release times of sample $n$ and device $m$ in the sample-device available time channel, and $u_{n,M}$ and $u_{N,m}$ are the cumulative processing times of sample $n$ and device $m$ in the sample-device occupancy rate channel; the difference between them is the idle time.
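As a rough illustration of the reward R = αU - βE, the sketch below computes one possible version of it. Only the top-level form and the symbol meanings come from the text; the exact formulas for U and E are not reproduced here, so the averaging over samples and devices, the division by the longest processing time, and the unweighted sums U_N + U_M and E_N + E_M are assumptions of this example. The function names are hypothetical.

```python
import numpy as np


def utilization(u_sample, u_device, c_max: float) -> float:
    """Assumed form: mean cumulative time divided by c_max, samples plus devices."""
    if c_max <= 0:
        return 0.0
    u_n = float(np.mean(u_sample)) / c_max
    u_m = float(np.mean(u_device)) / c_max
    return u_n + u_m


def idle_time(u_sample, u_device, l_sample, l_device) -> float:
    """Assumed form: mean of (last release time - cumulative processing time)."""
    e_n = float(np.mean(np.asarray(l_sample, dtype=float)
                        - np.asarray(u_sample, dtype=float)))
    e_m = float(np.mean(np.asarray(l_device, dtype=float)
                        - np.asarray(u_device, dtype=float)))
    return e_n + e_m


def reward(u_sample, u_device, l_sample, l_device, c_max: float,
           alpha: float = 0.8, beta: float = 1.0) -> float:
    """R = alpha * U - beta * E, with U and E as assumed above."""
    return (alpha * utilization(u_sample, u_device, c_max)
            - beta * idle_time(u_sample, u_device, l_sample, l_device))
```

Because the reward is available after every action instead of only at the end of a schedule, it gives the agent a denser learning signal than a makespan-only reward would.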
[0098] S5. Determine if the scheduled task has been completed:
[0099] When the scheduling task is completed and the number of training steps is reached, the training ends; otherwise, return to step S2.
[0100] If the scheduled task is not completed, proceed to the next scheduling state and return to step S2.
[0101] In S5, the scheduling task is completed when the queue of tasks to be processed is empty.
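The control flow of steps S1 to S5, including the replay check, can be sketched as a small loop. Everything concrete here is a stand-in: `ToyEnv` (one sample, one device, two rules), `TabularPolicy` (a simple running-average learner, not the model of this embodiment) and the reset of the environment after each completed schedule are assumptions used only to make the loop runnable.

```python
import random
from collections import defaultdict


class ToyEnv:
    """Hypothetical environment: tasks run back to back on one device."""

    def __init__(self, durations):
        self.pending = sorted(durations)

    def state(self):
        return len(self.pending)

    def decode(self, action):
        # action 0: shortest processing time; action 1: longest processing time
        return self.pending[0] if action == 0 else self.pending[-1]

    def apply(self, duration):
        self.pending.remove(duration)
        return -duration * len(self.pending)  # delay imposed on waiting tasks

    def done(self):
        return not self.pending  # queue of tasks to be processed is empty


class TabularPolicy:
    def __init__(self, n_actions, epsilon=0.1, lr=0.1):
        self.n_actions, self.epsilon, self.lr = n_actions, epsilon, lr
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def act(self, state, rng):
        if rng.random() < self.epsilon:
            return rng.randrange(self.n_actions)
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, action, reward):
        self.q[state][action] += self.lr * (reward - self.q[state][action])


def train(make_env, policy, num_steps, replay_interval, batch_size, seed=0):
    rng = random.Random(seed)
    pool = []                                 # S1: initialise parameters
    env, step = make_env(), 0
    while True:
        state = env.state()                   # S2: build state features
        action = policy.act(state, rng)       # S3: choose rule and decode
        decoded = env.decode(action)
        reward = env.apply(decoded)           # S4: reward and update
        policy.update(state, action, reward)
        pool.append((state, action, decoded, reward))
        step += 1
        if step % replay_interval == 0 and len(pool) >= batch_size:
            for s, a, _, r in rng.sample(pool, batch_size):
                policy.update(s, a, r)        # experience replay
        if env.done():                        # S5: schedule finished?
            if step >= num_steps:
                return step
            env = make_env()                  # assumed: start a new schedule
```

Note that training stops only when a schedule is complete and the step budget has been reached, so the returned step count can exceed `num_steps` by up to one schedule length.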
[0102] In summary, this embodiment addresses the problem of sparse rewards for quality inspection task scheduling by proposing a reward function that integrates sample utilization and equipment voids. Considering the serial, parallel, and mutually exclusive characteristics among quality inspection tasks, a set of action selection rules is proposed to replace the agent's direct learning of action decisions. Heuristic rules are used to improve the algorithm's convergence speed and enhance the interpretability of the model's action selection. To improve the policy diversity of the model's training samples, scheduling states, action decisions, and other results are added to a replay cache pool, and experience replay is performed based on the number of training steps.
[0103] When the trained model is used for quality inspection task scheduling strategies, the inputs are samples, equipment, and the number and types of test items, and the outputs are the test item processing sequence and the corresponding samples and equipment.
[0104] Example 2
[0105] Example 2 presents the experimental results of the quality inspection task scheduling method described in Example 1, to demonstrate the effectiveness of the method.
[0106] This embodiment uses real data from a power grid company's quality inspection laboratory for experimental verification, including data from 56 experimental items, data from 21 types (26 units) of equipment, and sample data. Considering the confidentiality requirements of relevant information, the test and equipment names are represented by serial numbers, as shown in Table 1.
[0107] Table 1. Test-Time-Equipment Information Table
[0108]
[0109]
[0110] The nonlinear relationships of serial, parallel, mutual exclusion, and device mutual exclusion in the 56 experimental items are also represented using symbols, as shown in Table 2:
[0111] Table 2 Nonlinear Relationship Table
[0112]
| Nonlinear relationship | Experiment items |
| --- | --- |
| Task serial | [53,60] |
| Task parallel | [[9],[3]] |
| Task mutual exclusion | [[33,34],[35,36,37]] |
| Device mutual exclusion | [6,15],[16,17] |
[0113] Among them, tasks 53 and 60 must be executed sequentially on the same sample; task 9 must be executed simultaneously on 3 samples; after executing tasks 33 and 34 on any sample, tasks 35, 36, and 37 cannot be executed, but the reverse is not restricted; devices 6 and 15, and devices 16 and 17 cannot be operated simultaneously.
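The relationships in Table 2 can be encoded as data and checked before a task is dispatched. The sketch below is one possible encoding: the `CONSTRAINTS` layout and the function names are hypothetical, and two readings are assumed, namely that task 53 must precede task 60, and that having run either of tasks 33 and 34 on a sample is enough to block tasks 35, 36 and 37.

```python
CONSTRAINTS = {
    "serial": [(53, 60)],                      # 53 before 60 on the same sample
    "parallel": {9: 3},                        # task 9 runs on 3 samples at once
    "task_mutex": [((33, 34), (35, 36, 37))],  # blockers -> blocked (one-way)
    "device_mutex": [(6, 15), (16, 17)],       # pairs that cannot run together
}


def task_allowed(task: int, done_on_sample: set) -> bool:
    """Check serial and one-way mutual exclusion rules for one sample."""
    for first, second in CONSTRAINTS["serial"]:
        if task == second and first not in done_on_sample:
            return False
    for blockers, blocked in CONSTRAINTS["task_mutex"]:
        if task in blocked and any(b in done_on_sample for b in blockers):
            return False
    return True


def device_allowed(device: int, running_devices: set) -> bool:
    """Check that no mutually exclusive device is currently running."""
    for a, b in CONSTRAINTS["device_mutex"]:
        if device == a and b in running_devices:
            return False
        if device == b and a in running_devices:
            return False
    return True


def samples_required(task: int) -> int:
    """Number of samples a task must occupy simultaneously (default 1)."""
    return CONSTRAINTS["parallel"].get(task, 1)
```

Keeping the relationships in a single table-like structure makes it straightforward to derive the task categories used by action selection rules (4) to (8).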
[0114] The experimental parameters are set as follows:
[0115] Processor: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz; memory: 128GB RAM; graphics card: GTX1080Ti; operating system: Ubuntu.
[0116] The parameters of the quality inspection task scheduling algorithm based on reinforcement learning are shown in Table 3. The training scale is 8000, the experience replay cache pool is 100000, the target network parameter update steps are 200, the reward function parameter α is set to 0.8, and the parameter β is set to 1.0.
[0117] Experimental Results Analysis
[0118] The difficulty of solving the quality inspection task scheduling problem is directly related to the number of quality inspection tasks. Based on the existing data, this embodiment divides the dataset into different sizes for testing, setting the number of test items to 10, 20, 30, 40, and 50 respectively, keeping the number of samples at 5, and determining the equipment data according to the test items used. To verify the effectiveness of the algorithm, this embodiment selects three algorithms for comparison: the classic greedy rule MWKR (most work remaining), which selects the job with the longest remaining processing time; the classic genetic algorithm (GA); and the particle swarm optimization (PSO) algorithm. The results are shown in Table 3 below.
[0119] Table 3 Single Batch Algorithm Validation Table
[0120]
[0121] Table 3 shows the results of quality inspection task scheduling for a single batch of samples with 10, 20, 30, 40, and 50 experimental items, respectively. Each case was run 10 times; the shortest and longest completion times were recorded as the objective function values, and the average running time was recorded as the algorithm time. The algorithm time does not include model loading time. In terms of objective function value, the OURS algorithm improves solution quality by an average of 12.10% over the MWKR algorithm, 2.07% over the GA algorithm, and 3.40% over the PSO algorithm. In terms of algorithm time, the OURS and MWKR algorithms are more than 99% more efficient than the GA and PSO algorithms, because searching over randomly generated chromosomes and populations is much slower than model inference without repeated iteration. The experimental results show that the OURS algorithm is superior to the existing algorithms in both scheduling quality and solution time, verifying its effectiveness in solving the quality inspection scheduling problem.
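For reference, the MWKR baseline used in the comparison can be sketched in its generic job shop form. The data layout (a mapping from job identifier to the list of remaining operation durations) and the tie-break on the smaller identifier are assumptions; how the rule was adapted to samples and inspection items in the experiment is not described in the text.

```python
def mwkr_pick(remaining_ops: dict):
    """Most Work Remaining: pick the job with the largest remaining total time.

    `remaining_ops` maps a job identifier to the durations of its unfinished
    operations. Jobs with nothing left are skipped; ties go to the smaller id.
    """
    work = {job: sum(ops) for job, ops in remaining_ops.items() if ops}
    if not work:
        return None
    return min(work, key=lambda job: (-work[job], job))
```

For example, `mwkr_pick({1: [3, 2], 2: [4, 4], 3: []})` selects job 2, whose remaining work of 8 is the largest.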
[0122] Example 3
[0123] Figure 3 is a schematic diagram of the structure of an electronic device provided in Embodiment 3 of the present invention. As shown in Figure 3, the electronic device includes a processor 210, a memory 220, an input device 230, and an output device 240. The number of processors 210 in the electronic device can be one or more; Figure 3 takes one processor 210 as an example. The processor 210, memory 220, input device 230, and output device 240 in the electronic device can be connected via a bus or other means; Figure 3 takes connection via a bus as an example.
[0124] The memory 220, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules. The processor 210 executes various functional applications and data processing of the electronic device by running the software programs, instructions, and modules stored in the memory 220, thereby implementing the quality inspection task scheduling method of Embodiments 1 and 2 described above.
[0125] The memory 220 may primarily include a program storage area and a data storage area. The program storage area may store the operating system and at least one application program required for a given function; the data storage area may store data created based on terminal usage. Furthermore, the memory 220 may include high-speed random access memory and non-volatile memory, such as at least one disk storage device, flash memory, or other non-volatile solid-state storage device. In some instances, the memory 220 may further include memory remotely located relative to the processor 210, which can be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0126] The input device 230 can be used to receive input user identity information, sample data, and training parameters, etc. The output device 240 may include a display screen or other display device.
[0127] Example 4
[0128] Embodiment 4 of the present invention also provides a storage medium containing computer-executable instructions, which can be used by a computer to execute a quality inspection task scheduling method, the method comprising:
[0129] S1. Initialize the model training parameters, where the model is a reinforcement learning model;
[0130] S2. Construct scheduling state features, which are obtained by concatenating the task processing time channel, the sample-device occupancy rate channel, and the sample-device available time channel;
[0131] S3. Output the corresponding action according to the current scheduling state, and decode the scheduling state to obtain the sample and device corresponding to the action;
[0132] S4. Calculate the reward value and update the training parameters based on the action and decoding result;
[0133] S5. Determine if the scheduled task has been completed:
[0134] When the scheduling task is completed and the number of training steps is reached, the training ends; otherwise, return to step S2.
[0135] If the scheduled task is not completed, proceed to the next scheduling state and return to step S2.
[0136] Of course, the computer-executable instructions provided in the embodiments of the present invention are not limited to the method operations described above, but can also execute related operations in the quality inspection task scheduling method provided in any embodiment of the present invention.
[0137] Based on the above description of the implementation methods, those skilled in the art can clearly understand that the present invention can be implemented using software and necessary general-purpose hardware, and of course, it can also be implemented using hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory, hard disk, or optical disk, etc., including several instructions to cause an electronic device (which may be a mobile phone, personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present invention.
[0138] For those skilled in the art, various other corresponding changes and modifications can be made based on the technical solutions and concepts described above, and all such changes and modifications should fall within the protection scope of the claims of this invention.
Claims
1. A quality inspection task scheduling method, characterized in that it includes the following steps:
S1. Initialize the model training parameters, where the model is a reinforcement learning model;
S2. Construct scheduling state features, which are obtained by concatenating a task processing time channel, a sample-device occupancy rate channel, and a sample-device available time channel; the task processing time channel is a three-dimensional matrix of size $(n+1)\times(m+1)\times j$, where $n$ is the number of samples, $m$ is the number of devices, and $j$ is the number of quality inspection items; the three-dimensional matrix of the task processing time channel includes matrix elements $p_{a,b,c}$, $p_{a,m,c}$ and $p_{n,b,c}$, where $p_{a,b,c}$ indicates the processing time required for quality inspection task $c$ to be completed by sample $a$ on equipment $b$, and $p_{a,m,c}$ and $p_{n,b,c}$ indicate the feasibility of processing quality inspection task $c$ on sample $a$ and on equipment $b$; the sample-device occupancy rate channel is a two-dimensional matrix of size $(n+1)\times(m+1)$, the sample-device occupancy rate channel matrix including matrix elements $u_{a,b}$, $u_{a,m}$ and $u_{n,b}$, where $u_{a,b}$ indicates the cumulative time spent on quality inspection tasks performed by sample $a$ on equipment $b$, and $u_{a,m}$ and $u_{n,b}$ are the cumulative processing times of sample $a$ and equipment $b$; the sample-device available time channel is a two-dimensional matrix of size $(n+1)\times(m+1)$, the sample-device available time channel matrix including matrix elements $l_{a,b}$, $l_{a,m}$ and $l_{n,b}$, where $l_{a,b}$ indicates the end time of the last task executed by sample $a$ on equipment $b$, and $l_{a,m}$ and $l_{n,b}$ are the last occupancy release times of sample $a$ and equipment $b$; the scheduling state feature obtained by concatenation has dimension $(n+1)\times(m+1)\times(j+2)$;
S3. Output the corresponding action according to the current scheduling state, and decode the scheduling state to obtain the sample and device corresponding to the action; replace the direct output of the action with action selection rules, the action selection rules including: (1) Select the task with the shortest processing time; (2) Select the task with the longest processing time; (3) Select the task with the fewest available samples; (4) Select serial task; (5) Select parallel tasks; (6) Select the preceding task in the mutually exclusive task pair; (7) Select the subsequent task in the mutually exclusive task pair; (8) Select an unconstrained task; the heuristic rules for decoding include: Rule 1: Heuristic sample selection is performed on the chromosomes of each individual. The sample with the shortest completion time for each experiment is selected in the order of the experiments from front to back. If multiple samples meet the selection criteria, the sample with the shortest completed experiment time is selected. Rule 2: Heuristic device selection is performed on the chromosomes of each individual. Based on the selected sample, the device with the shortest completion time is selected. If multiple devices meet the selection criteria, the device with the lowest workload is selected;
S4. Calculate the reward value and update the training parameters based on the action and decoding result;
S5. Determine if the scheduled task has been completed: When the scheduling task is completed and the number of training steps is reached, the training ends; otherwise, return to step S2. If the scheduled task is not completed, proceed to the next scheduling state and return to step S2.
2. The quality inspection task scheduling method as described in claim 1, characterized in that, The training parameters include batch setting, training steps, replay buffer, replay time, and empirical hyperparameters.
3. The quality inspection task scheduling method as described in claim 2, characterized in that, After calculating the reward value and updating the training parameters based on the action and decoding result, the method further includes: The scheduling status, action, decoding result and reward value are stored in a cache pool, which is used for experience replay during training.
4. The quality inspection task scheduling method as described in claim 3, characterized in that, When the scheduled task is not completed, the process proceeds to the next scheduling state and returns to step S2, further including: Determine whether experience replay is needed. If so, replay the experience; otherwise, proceed to the next scheduling state and return to step S2.
5. The quality inspection task scheduling method as described in claim 1, characterized in that the corresponding action is output based on the current scheduling state, satisfying the scheduling state representation principles: $a_i = \pi(S_i)$, $a_i = S_{i+1} - S_i$, $r_i = R(a_i, S_i, S_{i+1})$, where $a_i$ is the current action, $S_i$ is the current state, $r_i$ is the reward for the current action, $R$ is the reward function, and $\pi$ is the action selection policy.
6. The quality inspection task scheduling method as described in claim 1, characterized in that the calculation of the reward value satisfies: $R = \alpha U - \beta E$, where $R$ is the reward value, $\alpha$ and $\beta$ are empirical parameters, and $U$ and $E$ are the scheduling environment utilization and idle time, respectively; the calculation of the scheduling environment utilization satisfies: , where $U_N$ and $U_M$ are the sample and equipment utilization, $u_{n,M}$ and $u_{N,m}$ are the cumulative processing times of sample $n$ and equipment $m$ in the sample-device occupancy rate channel, and $C_{max}$ is the current longest processing time; the calculation of the idle time satisfies: , where $E_N$ and $E_M$ are the sample and equipment idle times, $l_{n,M}$ and $l_{N,m}$ are the last occupancy release times of sample $n$ and equipment $m$ in the sample-device available time channel, and $u_{n,M}$ and $u_{N,m}$ are the cumulative processing times of sample $n$ and equipment $m$ in the sample-device occupancy rate channel.
7. An electronic device comprising a processor, a storage medium, and a computer program, wherein the computer program is stored in the storage medium, characterized in that, When the computer program is executed by the processor, it implements the quality inspection task scheduling method according to any one of claims 1 to 6.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the quality inspection task scheduling method according to any one of claims 1 to 6.