A Multi-Parameter Collaborative Control Method and System for Feed Processing Based on Reinforcement Learning

By adopting a multi-parameter collaborative control method based on reinforcement learning, the invention addresses the complex multi-parameter coupling relationships in feed processing and realizes joint dynamic regulation of multiple parameters, improving the system's robustness and response speed as well as production efficiency.

CN121900194B · Active · Publication Date: 2026-06-30 · SICHUAN XINTE AGRI & ANIMAL HUSBANDRY TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SICHUAN XINTE AGRI & ANIMAL HUSBANDRY TECH CO LTD
Filing Date
2026-03-23
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

The existing feed processing process involves complex multi-parameter coupling relationships, making it difficult to achieve coordinated control and real-time optimization. Traditional scheduling methods are also unable to cope with equipment status fluctuations and disturbance events.

Method used

A multi-parameter collaborative control method based on reinforcement learning is constructed. By acquiring real-time operating data, a state space, action space, and reward function are constructed. The deep reinforcement learning agent outputs adjustment strategies to achieve joint dynamic regulation of multiple parameters. The control strategy is optimized through an experience replay pool.

Benefits of technology

It achieves efficient collaborative optimization of multiple parameters, improves the robustness and response speed of the system in complex dynamic environments, and enhances production efficiency and product quality stability.

Figure CN121900194B_ABST

Abstract

This invention provides a multi-parameter collaborative control method and system for feed processing based on reinforcement learning, belonging to the field of feed processing automation and intelligent process control technology. The method includes: acquiring real-time operating data, including equipment operating status, material characteristic parameters, and current process control parameters; constructing the state space, action space, and reward function of a reinforcement learning model; inputting the state into a deep reinforcement learning agent, whose policy network outputs the original action; correcting the original action based on preset process safety constraints and parameter coupling relationships, generating action commands and issuing them to the actuator; acquiring the state and reward value at the next moment; storing each state, action, reward, and next state as an experience tuple in an experience replay pool, and randomly sampling to iteratively optimize network parameters. This solves the problem of complex multi-parameter coupling relationships and difficulty in achieving collaborative control and real-time optimization in existing feed processing processes.

Description

Technical Field

[0001] This invention relates to the field of automation and intelligent process control technology in feed processing production, and more specifically, to a multi-parameter collaborative control method and system for feed processing based on reinforcement learning.

Background Technology

[0002] In feed processing, collaborative scheduling of multiple equipment and control of process parameters are key to achieving efficient and stable production. Traditional scheduling methods often rely on fixed rules or static scheduling systems, which are ill-suited to handle complex situations in actual production, such as equipment state fluctuations, frequent task switching, and frequent disturbances. Patent document CN120355201A proposes a deep learning-based method for collaborative scheduling and operation optimization of feed production equipment. By constructing equipment state sequences and disturbance sample datasets, and combining a disturbance-resistant residual fusion network with an equipment conflict inference graph, it achieves structured constraints and dynamic corrections to the scheduling path. Simultaneously, this method introduces an improved NGBoost model to model the uncertainty of scheduling behavior, outputting predicted expected values and risk scores, and selecting the optimal scheduling scheme based on a multi-objective evaluation mechanism. This method improves the scheduling system's responsiveness to disturbances and conflicts to a certain extent, exhibiting strong robustness and interpretability.

[0003] Existing methods primarily focus on optimizing scheduling behavior at the equipment level, without deeply modeling the coupling relationships and dynamic control mechanisms among multiple parameters (such as temperature, humidity, pressure, and mixing time) during processing. They also have limitations in handling multi-objective collaborative control, real-time adaptive parameter adjustment, and nonlinear interactions between parameters in complex process environments, making it difficult to achieve efficient collaborative optimization of multiple parameters throughout the entire process. Therefore, how to construct a reinforcement learning framework capable of sensing dynamic changes in multiple parameters and possessing self-learning and adaptive capabilities to achieve collaborative control and real-time optimization of multiple parameters in feed processing remains a pressing technical problem to be solved.

Summary of the Invention

[0004] The purpose of this invention is to provide a multi-parameter collaborative control method and system for feed processing based on reinforcement learning, which aims to solve the problems of complex multi-parameter coupling relationships and difficulty in achieving collaborative control and real-time optimization in existing feed processing processes.

[0005] This invention is achieved through the following technical solution:

[0006] A multi-parameter collaborative control method for feed processing based on reinforcement learning includes the following steps:

[0007] Acquire real-time operating data of the feed processing process; the real-time operating data includes equipment operating status, material characteristic parameters, and process control parameters at the current moment;

[0008] The state space, action space, and reward function of the reinforcement learning model are constructed; the state space is constructed based on the real-time operating data; the action space is defined as the adjustment amount of the process control parameters; and the reward function is constructed based on a preset cooperative control objective.

[0009] The current state space is input into the deep reinforcement learning agent, and the policy network in the deep reinforcement learning agent outputs the original action; the original action includes the adjustment direction and magnitude of each process control parameter;

[0010] The original action is modified based on preset process safety constraints and parameter coupling relationships, and action instructions are generated.

[0011] The action command is sent to the actuator in the feed processing process, and the state space at the next moment after the action command is executed and the reward value calculated according to the reward function are obtained.

[0012] The current state space, executed instructions, reward value, and next state space are stored as an experience tuple in the experience replay pool. Multiple experience tuples are randomly sampled from the experience replay pool to iteratively optimize the network parameters of the deep reinforcement learning agent.
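The experience replay step in [0012] can be illustrated with a minimal sketch. The names `Experience` and `ReplayPool`, the fixed capacity, and the eviction of the oldest tuple when the pool is full are assumptions made for illustration; the patent text only specifies storing the tuple and sampling it at random.

```python
import random
from collections import deque
from typing import List, NamedTuple, Optional, Sequence


class Experience(NamedTuple):
    """One experience tuple as listed in [0012] (field names are hypothetical)."""
    state: Sequence[float]
    action: Sequence[float]      # the executed (corrected) instruction
    reward: float
    next_state: Sequence[float]


class ReplayPool:
    """Bounded experience replay pool with uniform random sampling."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        # Assumption: when the pool is full, the oldest tuple is dropped.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, experience: Experience) -> None:
        self._buffer.append(experience)

    def sample(self, batch_size: int) -> List[Experience]:
        # Sample without replacement; never ask for more than is stored.
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self) -> int:
        return len(self._buffer)
```

A training loop would call `store` once per control step and pass each `sample` batch to whichever network update rule the agent uses; that update rule is not specified in the source and is left out here.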

[0013] Optionally, the specific process of acquiring real-time operating data of the feed processing process is as follows:

[0014] By deploying sensor arrays on the feed processing production line, operating status data, material characteristic data, and environmental condition data of the equipment layer are collected synchronously according to a preset high-frequency sampling cycle.

[0015] The collected multi-source heterogeneous data is time-stamped and cleaned to remove outliers and noise points. Interpolation is used to fill in missing data caused by sensor interruptions, forming a standardized time-series data stream.

[0016] Extract the current equipment operating status, material characteristic parameters, and process control parameters from the time-series data stream;

[0017] The current equipment operating status, material characteristic parameters, and process control parameters are normalized and combined to construct a multi-dimensional state vector representing the current processing condition. This multi-dimensional state vector is then used as the input state for the reinforcement learning model.
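As a rough illustration of the preprocessing in [0015] to [0017], the sketch below fills sensor gaps by linear interpolation and applies min-max normalization. Both choices are assumptions: the source says only "interpolation" and "normalized". The function names and the use of `None` to mark a missing sample are hypothetical.

```python
from typing import List, Optional


def fill_missing(series: List[Optional[float]]) -> List[float]:
    """Fill None gaps by linear interpolation between the nearest valid samples.

    Gaps at either end of the series copy the nearest valid sample.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no valid samples")
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            frac = (i - left) / (right - left)
            out[i] = series[left] + frac * (series[right] - series[left])
    return out


def min_max(value: float, low: float, high: float) -> float:
    """Scale value into [0, 1] using a known parameter range, clipping outliers."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))


# Example: a conditioning-temperature channel with one dropped sample.
temperature = fill_missing([80.0, None, 84.0])
state_vector = [min_max(t, 60.0, 100.0) for t in temperature]
```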

[0018] Optionally, the specific process of constructing the state space of the reinforcement learning model is as follows:

[0019] The multidimensional state vector is decomposed into three mutually exclusive feature subsets: equipment state feature subset, material characteristic feature subset, and process control parameter feature subset. The equipment state feature subset includes equipment operating power, load rate, continuous operating time, and rated operating margin parameters. The material characteristic feature subset includes material moisture content, particle size distribution, proportion of core components in the formulation, and material bulk density parameters. The process control parameter feature subset includes current conditioning temperature, conditioning humidity, granulation pressure, mixing time, and cooling air temperature parameters.

[0020] Based on the feed processing mechanism, the coupling correlation metric calculation is performed on all parameters in the three feature subsets. The linear correlation coefficient and nonlinear mutual information value between any two parameters are obtained respectively. Parameter pairs with coupling correlation degree exceeding the preset coupling threshold are screened out, and a parameter coupling correlation feature matrix is constructed. Principal component analysis is then performed on the parameter coupling correlation feature matrix to reduce the dimensionality and obtain a one-dimensional coupling correlation feature vector.
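The pairwise metrics in [0020] can be sketched as follows, using the Pearson coefficient for the linear part and a simple histogram estimate for mutual information. The source does not say how the two metrics are combined into a single coupling correlation degree, so this sketch screens on the absolute Pearson value only and reports the mutual information alongside it. That rule, the bin count, and all function names are assumptions. The principal component analysis step is omitted.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Linear correlation coefficient; returns 0.0 for a constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def mutual_information(x: List[float], y: List[float], bins: int = 4) -> float:
    """Histogram estimate of mutual information, in nats."""
    def digitize(v: List[float]) -> List[int]:
        lo, hi = min(v), max(v)
        width = (hi - lo) / bins or 1.0   # avoid zero width for constant data
        return [min(int((a - lo) / width), bins - 1) for a in v]

    bx, by = digitize(x), digitize(y)
    n = len(x)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in pxy.items())


def coupled_pairs(series: Dict[str, List[float]],
                  threshold: float) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Keep parameter pairs whose |Pearson r| exceeds the coupling threshold."""
    pairs = {}
    for a, b in combinations(sorted(series), 2):
        r = pearson(series[a], series[b])
        if abs(r) > threshold:
            pairs[(a, b)] = (r, mutual_information(series[a], series[b]))
    return pairs
```

With short windows the histogram estimate is coarse, so a real deployment would likely need longer series or a different estimator.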

[0021] Using the current moment as the end point, a standardized time-series data stream of a preset sliding window length is extracted, and the time-series change characteristics of each parameter within the time-series window are extracted to construct a time-series dynamic feature vector. The time-series change characteristics include the parameter change rate, fluctuation amplitude, and cumulative deviation from the process reference value.
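The three time-series characteristics named in [0021] could be computed per parameter as in the sketch below. The exact formulas are not given in the source, so the definitions used here (average slope over the window, max minus min, and the summed signed deviation from the reference) are assumptions, as is the function name.

```python
from typing import List


def temporal_features(window: List[float], reference: float,
                      dt: float = 1.0) -> List[float]:
    """Return [change rate, fluctuation amplitude, cumulative deviation].

    window    : samples of one parameter, oldest first, ending at the current moment
    reference : process reference value for that parameter
    dt        : sampling period (assumed constant)
    """
    if len(window) < 2:
        raise ValueError("window needs at least two samples")
    change_rate = (window[-1] - window[0]) / (dt * (len(window) - 1))
    amplitude = max(window) - min(window)
    cumulative_deviation = sum(v - reference for v in window) * dt
    return [change_rate, amplitude, cumulative_deviation]


# Example: conditioning temperature over a four-sample window, reference 80.
features = temporal_features([80.0, 82.0, 81.0, 84.0], reference=80.0)
```

Concatenating the output for every parameter gives one possible form of the time-series dynamic feature vector.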

[0022] Match the feed formula and production line equipment model corresponding to the current processing batch, extract the upper and lower limits of safe operation, the boundary of rated operating conditions of equipment and the threshold range of material adaptation parameters for each process control parameter, and construct the constraint boundary feature vector;

[0023] The one-dimensional vectorized data of the three feature subsets are concatenated with the one-dimensional coupled feature vector, the temporal dynamic feature vector and the constraint boundary feature vector in the same dimension to generate a full-dimensional state vector.

[0024] The state space of the reinforcement learning model is constructed by linearly mapping the full-dimensional state vector to the standardized numerical range required for the input of the reinforcement learning model.

[0025] Optionally, the specific process of constructing the action space of the reinforcement learning model is as follows:

[0026] Match the feed formula, production line equipment model, and process control parameter feature subset in the state space corresponding to the current processing batch, and screen out the process control parameters with real-time online control capabilities to form a complete set of controllable parameters;

[0027] Based on the parameter coupling correlation feature matrix, the coupling correlation degree between any two parameters in the set of controllable parameters is calculated. Parameters with a coupling correlation degree exceeding the preset coupling threshold are divided into the same coupling control group, and the remaining parameters are divided into independent control groups, thus completing the grouping of the coupling characteristics of controllable parameters.
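One way to realize the grouping in [0027] is a union-find pass over the screened pairs, sketched below. Treating coupling as transitive (if A couples with B and B with C, all three share a group) is an assumption, as are the function name and the dictionary layout of the coupling degrees.

```python
from typing import Dict, List, Tuple


def group_parameters(params: List[str],
                     coupling: Dict[Tuple[str, str], float],
                     threshold: float) -> Tuple[List[List[str]], List[str]]:
    """Split controllable parameters into coupled groups and independent ones."""
    parent = {p: p for p in params}

    def find(p: str) -> str:
        # Find the group representative, compressing the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), degree in coupling.items():
        if degree > threshold and a in parent and b in parent:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for p in params:
        groups.setdefault(find(p), []).append(p)

    coupled = [g for g in groups.values() if len(g) > 1]
    independent = [g[0] for g in groups.values() if len(g) == 1]
    return coupled, independent


coupled, independent = group_parameters(
    ["temp", "humidity", "pressure", "cooling"],
    {("temp", "humidity"): 0.9, ("humidity", "pressure"): 0.85,
     ("pressure", "cooling"): 0.2},
    threshold=0.8,
)
```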

[0028] Based on the constraint boundary feature vector and the feed processing mechanism, the following are set for each process control parameter in the set of controllable parameters: a per-adjustment direction constraint, a minimum adjustment step size, a maximum positive amplitude for a single adjustment, a maximum negative amplitude for a single adjustment, and a steady-state response time threshold after parameter adjustment. Together these form the single-parameter adjustment boundary constraint set.

[0029] For each coupled control group, based on the coupling correlation of parameters within the group and the feed processing mechanism, the linkage adjustment ratio range, adjustment direction matching rules, and parameter adjustment conflict avoidance constraints within the group are set to form a set of linkage adjustment rules for the coupled group.

[0030] The single adjustment amount of each process control parameter in the set of controllable parameters is defined as an independent dimension of the action space. Parameter dimensions in a coupled control group are subject to the corresponding coupled group linkage adjustment rule set, while parameter dimensions in the independent control group carry no linkage constraint.

[0031] Based on the single-parameter adjustment boundary constraint set, the legal value range of each action dimension is determined, and the legal value range of all action dimensions is linearly mapped to the standardized numerical range required by the output of the reinforcement learning policy network, thus completing the construction of the action space of the reinforcement learning model.
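The linear mapping in [0031] and its inverse (needed later to turn a policy output back into a physical adjustment) can be sketched as below. The normalized range [-1, 1] is an assumption, since the source says only "standardized numerical range", and `AdjustBounds`, `to_normalized`, and `to_physical` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdjustBounds:
    """Legal single-adjustment range of one action dimension."""
    max_negative: float  # largest allowed decrease, given as a positive magnitude
    max_positive: float  # largest allowed increase


def to_normalized(delta: float, b: AdjustBounds) -> float:
    """Map a physical adjustment in [-max_negative, max_positive] to [-1, 1]."""
    span = b.max_positive + b.max_negative
    return 2.0 * (delta + b.max_negative) / span - 1.0


def to_physical(action: float, b: AdjustBounds) -> float:
    """Inverse mapping: policy output in [-1, 1] back to a physical adjustment."""
    action = max(-1.0, min(1.0, action))
    span = b.max_positive + b.max_negative
    return -b.max_negative + (action + 1.0) / 2.0 * span


# Example: granulation pressure may drop by at most 2 units or rise by at most 4.
bounds = AdjustBounds(max_negative=2.0, max_positive=4.0)
```

Note that with asymmetric bounds a normalized action of 0 does not correspond to "no adjustment"; in this example it maps to +1.0. A design that needs 0 to mean "hold" would scale the positive and negative halves separately.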

[0032] Optionally, the specific process of constructing the reward function of the reinforcement learning model is as follows:

[0033] Match the feed formula, preset production target and rated operating parameters of production line equipment corresponding to the current processing batch, and combine the constraint boundary feature vector to determine the four core sub-items of the reward function, and calibrate the benchmark quantization threshold and initial weight coefficient corresponding to each core sub-item;

[0034] The four core sub-items are the output revenue sub-item, the energy consumption cost sub-item, the product consistency reward sub-item, and the process constraint violation penalty sub-item. The output revenue sub-item is calculated based on the qualified feed product output per unit time after the current action instruction is executed, with the preset rated qualified output per unit time of the current batch used as the benchmark quantization threshold. The ratio of the actual output to the benchmark quantization threshold is calculated to obtain the positive vectorized value of the output revenue sub-item. This value grows as the actual output rises above the benchmark quantization threshold and shrinks as the actual output falls below it.

[0035] The energy consumption cost sub-item is calculated based on the total energy consumption of the entire production line equipment cluster per unit time after the current action command is executed. The rated energy consumption per unit time of the current batch is used as the benchmark quantization threshold. The ratio of the actual energy consumption to the benchmark quantization threshold is calculated to obtain the negative vectorized value of the energy consumption cost sub-item. The magnitude of this negative value grows as the actual energy consumption rises above the benchmark quantization threshold and shrinks as the actual energy consumption falls below it.

[0036] The product consistency reward sub-item is calculated based on the degree of deviation between the core quality indicators of the finished feed and the preset standard indicator range after the current action command is executed. The cumulative deviation of each core quality indicator from the preset standard indicator range is calculated, and the positive vectorization value of the product consistency reward sub-item is obtained based on the reciprocal normalization result of the cumulative deviation. The positive vectorization value of the product consistency reward sub-item increases positively as the cumulative deviation decreases. The core quality indicators include finished product moisture content, gelatinization degree, pellet hardness, and powdering rate.

[0037] The process constraint violation penalty sub-item is judged based on whether each process control parameter and equipment operating parameter exceeds the safe operating range calibrated by the constraint boundary feature vector after the current action command is executed. For parameters that exceed the safe operating range, the single parameter penalty value is calculated by multiplying the excess range by the preset unit amplitude penalty coefficient. The single parameter penalty values of all out-of-bounds parameters are accumulated to obtain the negative vectorized value of the process constraint violation penalty sub-item.

[0038] Combining the parameter coupling correlation feature matrix, the coupling control group, and the coupling group linkage adjustment rule set, a coupling control linkage compliance penalty sub-item is constructed. For each coupling control group, the coupling control linkage compliance penalty sub-item verifies whether the adjustment amount of the parameters executed within the group conforms to the linkage adjustment ratio range, adjustment direction matching rules, and adjustment conflict avoidance constraints specified in the coupling group linkage adjustment rule set. For coupling control groups that violate the linkage rules, the single-group penalty value is calculated by multiplying the weighted sum of the parameter coupling correlation degree within the group with the degree of rule violation. The single-group penalty values of all violating coupling control groups are accumulated to obtain the negative vectorized value of the coupling control linkage compliance penalty sub-item.

[0039] The four core sub-items and the coupled regulation and linkage compliance penalty sub-item are weighted and summed. Based on the state space data at the current moment, the weight coefficients of the four core sub-items and the coupled regulation and linkage compliance penalty sub-item are dynamically adjusted in real time to obtain the instant reward value at the current moment. The instant reward value is linearly mapped to the standardized numerical range required by the reinforcement learning model to complete the construction of the reward function.
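The reward construction in [0034] to [0039] can be sketched as a weighted sum of signed sub-items. The sub-item formulas follow the text where it is explicit (ratios to the rated values, excess times a penalty coefficient); the reciprocal form `1 / (1 + deviation)` for product consistency is an assumption, and the linkage compliance penalty is passed in as a precomputed value. All function and key names are hypothetical, and the final mapping to a standardized range is omitted.

```python
from typing import Dict, List, Tuple


def output_term(actual_output: float, rated_output: float) -> float:
    """Positive sub-item: ratio of actual to rated qualified output."""
    return actual_output / rated_output


def energy_term(actual_energy: float, rated_energy: float) -> float:
    """Negative sub-item: ratio of actual to rated energy consumption."""
    return -actual_energy / rated_energy


def consistency_term(deviations: List[float]) -> float:
    """Positive sub-item that grows as the cumulative quality deviation shrinks."""
    return 1.0 / (1.0 + sum(abs(d) for d in deviations))


def violation_term(values: Dict[str, float],
                   limits: Dict[str, Tuple[float, float]],
                   unit_penalty: float) -> float:
    """Negative sub-item: excess beyond the safe range times a unit penalty."""
    penalty = 0.0
    for name, value in values.items():
        low, high = limits[name]
        excess = max(low - value, value - high, 0.0)
        penalty += unit_penalty * excess
    return -penalty


def instant_reward(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of all sub-items."""
    return sum(weights[name] * value for name, value in terms.items())


terms = {
    "output": output_term(110.0, 100.0),
    "energy": energy_term(90.0, 100.0),
    "consistency": consistency_term([0.5, 0.5]),
    "violation": violation_term({"temp": 92.0}, {"temp": (60.0, 90.0)}, 0.1),
    "linkage": 0.0,  # coupling linkage compliance penalty, computed elsewhere
}
reward = instant_reward(terms, {name: 1.0 for name in terms})
```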

[0040] Optionally, the specific process of dynamically adjusting the weight coefficients of the four core sub-items and the coupled control linkage compliance penalty sub-item based on the current state space data is as follows:

[0041] Extract the equipment load rate, the deviation of the material characteristic parameters from the formula baseline values, the production progress completion rate, the disturbance event occurrence level, the grid peak and valley electricity price time period identifier, and the equipment energy efficiency attenuation coefficient from the current state space, and adjust the weight coefficients of the process constraint violation penalty sub-item, product consistency reward sub-item, output revenue sub-item, coupled control linkage compliance penalty sub-item, and energy consumption cost sub-item accordingly.

[0042] When the equipment load rate exceeds the preset load threshold, the weight coefficient of the process constraint violation penalty item is increased synchronously according to the excess ratio;

[0043] When the deviation of material characteristic parameters from the formula baseline value exceeds the preset deviation threshold, the weight coefficient of the product consistency reward sub-item will be increased synchronously according to the deviation magnitude.

[0044] When the production progress completion rate is lower than the preset progress threshold, the weight coefficient of the output revenue sub-item will be increased synchronously according to the lag ratio.

[0045] When a disturbance event is detected, the weight coefficient of the coupled control linkage compliance penalty sub-item is increased synchronously according to the disturbance event level;

[0046] When the power grid is in a peak electricity price period or the equipment energy efficiency degradation coefficient exceeds the preset degradation threshold, the weight coefficient of the energy consumption cost sub-item is increased synchronously according to the electricity price increase ratio or the energy efficiency degradation magnitude; when the power grid is in a low electricity price period and the equipment energy efficiency degradation coefficient is lower than the preset degradation threshold, the weight coefficient of the energy consumption cost sub-item is decreased synchronously.
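The five adjustment rules in [0042] to [0046] can be sketched as a single function over the base weights. The source states the direction of each adjustment but not its functional form, so the multiplicative scaling, the `disturbance_gain` factor, the representation of the electricity price as a ratio to a flat rate, and all key names are assumptions.

```python
from typing import Dict


def adjust_weights(base: Dict[str, float], s: Dict[str, float],
                   thr: Dict[str, float],
                   disturbance_gain: float = 0.1) -> Dict[str, float]:
    """Return a copy of the base weights, scaled according to the current state."""
    w = dict(base)
    if s["load_rate"] > thr["load_rate"]:
        # Raise the violation penalty weight by the excess ratio.
        w["violation"] *= 1 + (s["load_rate"] - thr["load_rate"]) / thr["load_rate"]
    if s["material_deviation"] > thr["material_deviation"]:
        w["consistency"] *= 1 + ((s["material_deviation"] - thr["material_deviation"])
                                 / thr["material_deviation"])
    if s["progress"] < thr["progress"]:
        # Raise the output weight by the lag ratio.
        w["output"] *= 1 + (thr["progress"] - s["progress"]) / thr["progress"]
    if s["disturbance_level"] > 0:
        w["linkage"] *= 1 + disturbance_gain * s["disturbance_level"]

    peak = s["price_ratio"] > 1.0
    degraded = s["efficiency_decay"] > thr["efficiency_decay"]
    if peak or degraded:
        rise = max(s["price_ratio"] - 1.0, 0.0)
        decay = max(s["efficiency_decay"] - thr["efficiency_decay"], 0.0)
        w["energy"] *= 1 + max(rise, decay)
    elif s["price_ratio"] < 1.0:
        # Valley price period and no degradation: lower the energy weight.
        w["energy"] *= s["price_ratio"]
    return w


base = {k: 1.0 for k in ("violation", "consistency", "output", "linkage", "energy")}
thresholds = {"load_rate": 0.8, "material_deviation": 0.1,
              "progress": 0.5, "efficiency_decay": 0.2}
state = {"load_rate": 1.0, "material_deviation": 0.05, "progress": 0.25,
         "disturbance_level": 2, "price_ratio": 0.5, "efficiency_decay": 0.1}
weights = adjust_weights(base, state, thresholds)
```

In this example the overloaded equipment and lagging schedule raise the violation and output weights, while the valley electricity price halves the energy weight.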

[0047] Optionally, the specific process of modifying the original action based on preset process safety constraints and parameter coupling relationships, and generating action instructions, is as follows:

[0048] The original action is inversely mapped to the actual adjustment amount corresponding to each adjustable parameter according to the mapping relationship used when the action space was constructed, and an initial adjustment amount set is generated.

[0049] Based on the grouping results of the coupling characteristics of the adjustable parameters, the actual adjustment of each parameter in the initial adjustment set is split into independent control groups and each coupled control group, forming independent parameter adjustment subsets and multiple coupled group adjustment subsets respectively;

[0050] For each process control parameter in the independent parameter adjustment subset, based on the single parameter adjustment boundary constraint set, the adjustment direction, adjustment step size, and adjustment magnitude of its actual adjustment amount are verified sequentially. If the adjustment direction does not conform to the preset direction constraint, or the absolute value of the adjustment amount is less than the preset minimum adjustment step size, the corresponding adjustment amount is set to zero. If the adjustment amount exceeds the maximum positive adjustment magnitude or the maximum negative adjustment magnitude in a single operation, the corresponding adjustment amount is clamped to the corresponding magnitude boundary value. After the correction is completed, an independent parameter compliant adjustment set is generated.

[0051] For each coupled group adjustment subset, the actual adjustment amount of each parameter within the group is initially verified in terms of magnitude and direction based on the single-parameter adjustment boundary constraint set. Adjustments exceeding the single-parameter boundary are clamped, resulting in a set of initial calibration adjustment amounts within the group. Based on the coupled group linkage adjustment rule set corresponding to the coupled control group, the adjustment direction matching relationship of the initial calibration adjustment amount set within the group, whether the linkage adjustment ratio between parameters is within a preset range, and whether there are adjustment conflicts are verified. If the verification fails, the coupling correlation degree of the parameters within the group is used as a weighting coefficient, with the goal of minimizing the deviation between the adjustment amount and the initial value. Within the limits of the linkage adjustment ratio range, adjustment direction matching rules, and adjustment conflict avoidance constraints, the adjustment amounts of each process control parameter within the group are iteratively corrected until the corrected adjustment amounts simultaneously conform to both the coupled group linkage adjustment rule set and the single-parameter adjustment boundary constraint set, generating a compliant adjustment set for each coupled control group. The compliant adjustment sets of all coupled control groups are then summarized to obtain the compliant adjustment set of the coupled parameters.
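For the simplest case of a coupled group with two parameters, the correction in [0051] has a closed form, sketched below as a simplification of the iterative procedure the text describes. It assumes the linkage rule is a positive ratio range on `b / a` (so directions must match), uses weighted least squares as the deviation measure, and treats opposite-direction or zero adjustments as a conflict that cancels both. These rules and the name `correct_pair` are assumptions.

```python
from typing import Tuple


def correct_pair(a0: float, b0: float, ratio_low: float, ratio_high: float,
                 w_a: float = 1.0, w_b: float = 1.0) -> Tuple[float, float]:
    """Correct two coupled adjustments so that ratio_low <= b / a <= ratio_high.

    Assumes 0 < ratio_low <= ratio_high. When the ratio is out of range, the
    pair is projected onto the nearest boundary line b = r * a, minimizing
    w_a * (a - a0)**2 + w_b * (b - b0)**2.
    """
    if a0 * b0 <= 0:
        # Opposite directions or a zero member: treated as a conflict here.
        return 0.0, 0.0
    ratio = b0 / a0
    if ratio_low <= ratio <= ratio_high:
        return a0, b0
    r = ratio_high if ratio > ratio_high else ratio_low
    a = (w_a * a0 + w_b * r * b0) / (w_a + w_b * r * r)
    return a, r * a


# Example: humidity adjustment must stay between 1x and 2x the temperature adjustment.
temp_delta, humidity_delta = correct_pair(1.0, 4.0, ratio_low=1.0, ratio_high=2.0)
```

The corrected pair would still need to be re-checked against the single-parameter boundary constraint set, which is why the source describes the process as iterative.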

[0052] The independent parameter compliance adjustment set and the coupled parameter compliance adjustment set are summarized to obtain the full parameter adjustment set to be executed; based on the safe operating upper and lower limits of each process control parameter, the predicted value of each process control parameter after the adjustment is calculated. The predicted value is the sum of the current value of the process control parameter and the corresponding adjustment amount; if the predicted value exceeds the safe operating range, the adjustment amount of the parameter is clamped and corrected a second time with the constraint of not exceeding the safe operating range. After the correction is completed, the final full parameter adjustment set is generated.

[0053] Based on the final adjustment set of all parameters, and combined with the control protocol of the actuator corresponding to each process control parameter, the adjustment amount of each process control parameter is converted into the digital control instruction of the corresponding actuator. The steady-state response time threshold corresponding to the process control parameter is added to each digital control instruction to complete the generation of the action instruction.
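The independent-parameter correction of [0050] and the secondary clamp of [0052] can be sketched together. The field layout of `ParamLimits` and the encoding of the direction constraint as +1, -1, or 0 are assumptions made for illustration; conversion to actuator instructions is protocol-specific and not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamLimits:
    direction: int       # +1: increase only, -1: decrease only, 0: either way
    min_step: float      # smallest adjustment worth executing
    max_positive: float  # largest single increase
    max_negative: float  # largest single decrease, as a positive magnitude
    safe_low: float      # lower limit of safe operation
    safe_high: float     # upper limit of safe operation


def correct_independent(delta: float, lim: ParamLimits) -> float:
    """Apply the direction, minimum-step, and amplitude checks in order."""
    if lim.direction != 0 and delta * lim.direction < 0:
        return 0.0                       # wrong direction: drop the adjustment
    if abs(delta) < lim.min_step:
        return 0.0                       # below the minimum step: drop it
    return max(-lim.max_negative, min(lim.max_positive, delta))


def secondary_clamp(current: float, delta: float, lim: ParamLimits) -> float:
    """Shrink the adjustment so current + delta stays inside the safe range."""
    predicted = current + delta
    clamped = max(lim.safe_low, min(lim.safe_high, predicted))
    return clamped - current


# Example: conditioning temperature currently at 88 with a safe range of 60 to 90.
limits = ParamLimits(direction=0, min_step=0.5, max_positive=3.0,
                     max_negative=2.0, safe_low=60.0, safe_high=90.0)
first_pass = correct_independent(5.0, limits)          # clamped to +3.0
final_delta = secondary_clamp(88.0, first_pass, limits)  # reduced to +2.0
```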

[0054] Based on the same inventive concept, this invention also provides a multi-parameter collaborative control system for feed processing based on reinforcement learning, used to implement the aforementioned multi-parameter collaborative control method for feed processing based on reinforcement learning, comprising:

[0055] The multi-source data acquisition and preprocessing module is used to synchronously collect equipment operation status data, material characteristic data, and environmental condition data at the equipment layer through a sensor array deployed on the feed processing production line, according to a preset high-frequency sampling period. It performs timestamp alignment and data cleaning on the collected multi-source heterogeneous data, removing outliers and noise points, and uses interpolation to fill in missing data caused by sensor interruptions, forming a standardized time-series data stream. From the time-series data stream, it extracts the current equipment operation status, material characteristic parameters, and process control parameters, performs normalization processing, and combines them to construct a multi-dimensional state vector representing the current processing condition. This multi-dimensional state vector is then used as the input state for a reinforcement learning model.

[0056] The reinforcement learning model construction and configuration module, which is communicatively connected to the multi-source data acquisition and preprocessing module, is used to construct the state space, action space, and reward function of the reinforcement learning model. The state space is constructed based on the multi-dimensional state vector and combined with parameter coupling correlation features, temporal dynamic features, and constraint boundary features. The action space is defined as the adjustment amount of process control parameters with real-time online control capabilities. The controllable parameters are grouped according to the coupling correlation between parameters, and single-parameter adjustment boundary constraint sets and coupled group linkage adjustment rule sets are set for independent control groups and coupled control groups, respectively. The reward function is constructed based on preset collaborative control objectives, which include maximizing output, minimizing energy consumption, product consistency index, process constraint violation penalty items, and coupled control linkage compliance penalty items. The weight coefficients of each sub-item are dynamically corrected in real time based on the state space data at the current moment.

[0057] The deep reinforcement learning decision module, which internally deploys a policy network, is communicatively connected to the multi-source data acquisition and preprocessing module and the reinforcement learning model construction and configuration module. It is used to input the current state space into the policy network, which then outputs the original actions. These original actions include the adjustment direction and magnitude for each process control parameter.

[0058] The action correction and instruction generation module is communicatively connected to the deep reinforcement learning decision module. It is used to correct the original action based on preset process safety constraints and parameter coupling relationships, and generate action instructions. Specifically, it includes: inversely mapping the original action to the actual adjustment amount of each adjustable parameter, and splitting it into an independent control group and adjustment subsets of each coupled control group according to the coupling characteristics; sequentially performing compliance verification and iterative correction on the adjustment amounts of the independent control group and each coupled control group based on the single-parameter adjustment boundary constraint set and the coupled group linkage adjustment rule set; summarizing the corrected full-parameter adjustment set to be executed, and performing secondary clamping correction on the predicted values of the adjusted parameters based on the safe operating upper and lower limits of each process control parameter to generate the final full-parameter adjustment set; finally, converting the adjustment amount into digital control instructions for the corresponding actuator, and attaching a steady-state response time threshold corresponding to the process control parameter to each digital control instruction.

[0059] The instruction execution and feedback acquisition module is communicatively connected to the multi-source data acquisition and preprocessing module and the action correction and instruction generation module; it is used to send action instructions to the actuators in the feed processing process, and to obtain the state space of the next moment after the action instruction is executed and the reward value calculated according to the reward function.

[0060] The experience replay and network training module is communicatively connected to the reinforcement learning model construction and configuration module, the deep reinforcement learning decision module, and the instruction execution and feedback acquisition module. It is used to store the current state space, executed instruction, reward value, and next state space as an experience tuple in the experience replay pool, and randomly sample multiple experience tuples from the experience replay pool to iteratively optimize the network parameters of the deep reinforcement learning agent.

[0061] Based on the same inventive concept, the present invention also provides an electronic device, including a memory and a processor, wherein the memory is used to store a computer program, and the processor runs the computer program to enable the electronic device to perform the above-described reinforcement learning-based multi-parameter collaborative control method for feed processing.

[0062] Based on the same inventive concept, the present invention also provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the above-described multi-parameter collaborative control method for feed processing based on reinforcement learning.

[0063] The technical solution of the present invention has at least the following advantages and beneficial effects:

[0064] By constructing a state space that includes equipment operating status, material characteristics, and current process parameters, the adjustment amount of process parameters is defined as the action space. By using a deep reinforcement learning agent to output a collaborative adjustment strategy, the nonlinear coupling relationship between parameters is explored, and joint dynamic control of multiple parameters is achieved. This overcomes the control conflicts and inefficiencies caused by the isolated adjustment of parameters in traditional methods.

[0065] By leveraging the trial-and-error learning mechanism of deep reinforcement learning, the control strategy can be continuously optimized based on real-time operating data, and process parameters can be adaptively adjusted to respond to equipment status fluctuations, material changes and disturbance events, thereby improving the robustness and response speed of the system in complex dynamic environments.

[0066] By designing a reward function that comprehensively considers product quality, energy consumption, production efficiency, and safety constraints, the agent can be guided to seek the optimal balance among multiple control objectives, thereby achieving multi-objective collaborative optimization of the feed processing process and effectively improving overall economic benefits and product quality stability.

[0067] By using an experience replay pool to store historical interaction data and updating network parameters iteratively through random sampling, the agent can continuously learn from historical experience, thereby improving decision-making accuracy and control effectiveness, and ensuring the stability and optimization capability of the system in long-term operation.

Description of the Drawings

[0068] Figure 1 is a flowchart of the multi-parameter collaborative control method for feed processing based on reinforcement learning, according to an embodiment of the present invention.

[0069] Figure 2 is a schematic structural diagram of the multi-parameter collaborative control system for feed processing based on reinforcement learning, according to an embodiment of the present invention.

Detailed Description of the Embodiments

[0070] The following is a detailed description of the embodiments, in conjunction with the accompanying drawings.

[0071] Referring to Figure 1, a multi-parameter collaborative control method for feed processing based on reinforcement learning includes the following steps:

[0072] Step 1: Obtain real-time operating data of the feed processing process; the real-time operating data includes equipment operating status, material characteristic parameters, and process control parameters at the current moment.

[0073] In some embodiments, the specific process of acquiring real-time operating data of the feed processing process is as follows:

[0074] By deploying sensor arrays on the feed processing production line, operating status data, material characteristic data, and environmental condition data of the equipment layer are collected synchronously according to a preset high-frequency sampling cycle.

[0075] The collected multi-source heterogeneous data is time-stamped and cleaned to remove outliers and noise points. Interpolation is used to fill in missing data caused by sensor interruptions, forming a standardized time-series data stream.
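The gap-filling part of this cleaning step might look like the sketch below. It assumes missing samples are marked as `None` and uses linear interpolation between the nearest valid neighbours; the source only says "interpolation", so the linear choice, the edge handling, and the function name `fill_missing` are assumptions.

```python
from typing import List, Optional


def fill_missing(values: List[Optional[float]]) -> List[float]:
    """Linearly interpolate None gaps; edge gaps copy the nearest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no valid samples to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]      # leading gap: copy first valid sample
        elif right is None:
            filled[i] = values[left]       # trailing gap: copy last valid sample
        else:
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled
```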

[0076] Extract the current equipment operating status, material characteristic parameters, and process control parameters from the time-series data stream;

[0077] The current equipment operating status, material characteristic parameters, and process control parameters are normalized and combined to construct a multi-dimensional state vector representing the current processing condition. This multi-dimensional state vector is then used as the input state for the reinforcement learning model.

[0078] Step 2: Construct the state space, action space, and reward function of the reinforcement learning model; the state space is constructed based on the real-time operating data; the action space is defined as the adjustment amount of the process control parameters; the reward function is constructed based on a preset cooperative control objective.

[0079] In some embodiments, the specific process of constructing the state space of the reinforcement learning model is as follows:

[0080] The multidimensional state vector is decomposed into three mutually exclusive feature subsets: equipment state feature subset, material characteristic feature subset, and process control parameter feature subset. The equipment state feature subset includes equipment operating power, load rate, continuous operating time, and rated operating margin parameters. The material characteristic feature subset includes material moisture content, particle size distribution, proportion of core components in the formulation, and material bulk density parameters. The process control parameter feature subset includes current conditioning temperature, conditioning humidity, granulation pressure, mixing time, and cooling air temperature parameters.

[0081] Based on the feed processing mechanism, a coupling correlation metric is calculated for all parameters in the three feature subsets: the linear correlation coefficient and the nonlinear mutual information value between every pair of parameters are obtained. Parameter pairs whose coupling correlation degree exceeds the preset coupling threshold are selected, and a parameter coupling correlation feature matrix is constructed from them. Principal component analysis is then performed on the parameter coupling correlation feature matrix to reduce its dimensionality and obtain a one-dimensional coupling correlation feature vector.
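The screening of coupled parameter pairs can be illustrated with the linear part of the metric alone. The sketch below uses only the Pearson correlation coefficient; the mutual information term and the way the two measures are combined into a single coupling correlation degree are not specified in the text and are left out. The function names and the sample data are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def coupled_pairs(
    series: Dict[str, Sequence[float]], threshold: float
) -> List[Tuple[str, str, float]]:
    """Return parameter pairs whose |Pearson r| exceeds the coupling threshold."""
    pairs = []
    for a, b in combinations(sorted(series), 2):
        r = pearson(series[a], series[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    return pairs
```

The selected pairs would then populate the parameter coupling correlation feature matrix before dimensionality reduction.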

[0082] Using the current moment as the end point, a standardized time-series data stream of a preset sliding window length is extracted, and the time-series change characteristics of each parameter within the time-series window are extracted to construct a time-series dynamic feature vector. The time-series change characteristics include the parameter change rate, fluctuation amplitude, and cumulative deviation from the process reference value.
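The three time-series change characteristics named here can be computed per parameter roughly as follows. The exact definitions (average slope over the window, peak-to-peak amplitude, signed sum of deviations) are assumptions chosen for illustration, as is the function name `window_features`.

```python
from typing import Dict, Sequence


def window_features(
    window: Sequence[float], reference: float, dt: float = 1.0
) -> Dict[str, float]:
    """Features of one parameter over a sliding window ending at the current sample."""
    if len(window) < 2:
        raise ValueError("window needs at least two samples")
    return {
        # Average rate of change across the window.
        "change_rate": (window[-1] - window[0]) / (dt * (len(window) - 1)),
        # Peak-to-peak fluctuation amplitude.
        "amplitude": max(window) - min(window),
        # Signed cumulative deviation from the process reference value.
        "cum_deviation": sum(v - reference for v in window) * dt,
    }
```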

[0083] Match the feed formula and production line equipment model corresponding to the current processing batch, extract the upper and lower limits of safe operation, the boundary of rated operating conditions of equipment and the threshold range of material adaptation parameters for each process control parameter, and construct the constraint boundary feature vector;

[0084] The one-dimensional vectorized data of the three feature subsets are concatenated along the same dimension with the one-dimensional coupling correlation feature vector, the time-series dynamic feature vector, and the constraint boundary feature vector to generate a full-dimensional state vector.

[0085] The state space of the reinforcement learning model is constructed by linearly mapping the full-dimensional state vector to the standardized numerical range required for the input of the reinforcement learning model.

[0086] In some embodiments, the specific process of constructing the action space of the reinforcement learning model is as follows:

[0087] Match the feed formula, production line equipment model, and process control parameter feature subset in the state space corresponding to the current processing batch, and select the process control parameters that can be adjusted online in real time to form the complete set of adjustable parameters;

[0088] Based on the parameter coupling correlation feature matrix, the coupling correlation degree between every pair of parameters in the set of adjustable parameters is calculated. Parameters whose coupling correlation degree exceeds the preset coupling threshold are assigned to the same coupled control group, and the remaining parameters are assigned to the independent control group, which completes the coupling-characteristic grouping of the adjustable parameters.
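One way to realise this grouping is a union-find pass over the above-threshold pairs, sketched below. Merging groups transitively (A coupled to B and B coupled to C puts all three in one group) is an assumption; the text does not state how chains of coupled parameters are handled. The function name and parameter names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def group_parameters(
    params: Iterable[str], coupled: Iterable[Tuple[str, str]]
) -> Tuple[List[List[str]], List[str]]:
    """Split parameters into coupled control groups and an independent group."""
    parent: Dict[str, str] = {p: p for p in params}

    def find(p: str) -> str:
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in coupled:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for p in parent:
        clusters.setdefault(find(p), []).append(p)
    groups = [sorted(m) for m in clusters.values() if len(m) > 1]
    independent = sorted(m[0] for m in clusters.values() if len(m) == 1)
    return sorted(groups), independent
```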

[0089] Based on the constraint boundary feature vector and combined with the feed processing mechanism, for each process control parameter in the set of adjustable parameters, a single adjustment direction constraint, minimum adjustment step size, single maximum positive adjustment amplitude, single maximum negative adjustment amplitude, and steady-state response time threshold after parameter adjustment are set respectively, forming a single parameter adjustment boundary constraint set;

[0090] For each coupled control group, based on the coupling correlation of parameters within the group and the feed processing mechanism, the linkage adjustment ratio range, adjustment direction matching rules, and parameter adjustment conflict avoidance constraints within the group are set to form a set of linkage adjustment rules for the coupled group.

[0091] The single adjustment amount of each process control parameter in the set of adjustable parameters is defined as an independent dimension of the action space; among them, the parameter dimension in the coupled control group is subject to the constraint of the corresponding coupled group linkage adjustment rule set, and the parameter dimension in the independent control group is an independent dimension without linkage constraint.

[0092] Based on the single-parameter adjustment boundary constraint set, the legal value range of each action dimension is determined, and the legal value range of all action dimensions is linearly mapped to the standardized numerical range required by the output of the reinforcement learning policy network, thus completing the construction of the action space of the reinforcement learning model.
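The linear mapping between a legal adjustment range and the standardized policy output range can be sketched as a pair of functions. The sketch assumes the standardized range is [-1, 1]; the function names are hypothetical.

```python
def to_normalized(delta: float, low: float, high: float) -> float:
    """Map an adjustment in [low, high] linearly onto [-1, 1]."""
    return 2.0 * (delta - low) / (high - low) - 1.0


def from_normalized(a: float, low: float, high: float) -> float:
    """Inverse map: policy output in [-1, 1] back to an adjustment in [low, high]."""
    a = max(-1.0, min(1.0, a))  # guard against outputs slightly outside the range
    return low + (a + 1.0) * (high - low) / 2.0
```

Note that with an asymmetric legal range, for example a maximum negative adjustment of 2 and a maximum positive adjustment of 3, a normalized output of 0 maps to an adjustment of 0.5 rather than 0. If a zero output must mean "no adjustment", the positive and negative halves would need to be scaled separately.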

[0093] In some embodiments, the specific process of constructing the reward function of the reinforcement learning model is as follows:

[0094] Match the feed formula, preset production target and rated operating parameters of production line equipment corresponding to the current processing batch, and combine the constraint boundary feature vector to determine the four core sub-items of the reward function, and calibrate the benchmark quantization threshold and initial weight coefficient corresponding to each core sub-item;

[0095] The four core sub-items are the output revenue sub-item, the energy consumption cost sub-item, the product consistency reward sub-item, and the process constraint violation penalty sub-item. The output revenue sub-item is calculated from the output of qualified feed products per unit time after the current action command is executed. Taking the preset rated qualified output per unit time for the current batch as the benchmark quantification threshold, the ratio of the actual output to this threshold is calculated to obtain the positive quantified value of the output revenue sub-item. This value increases as the actual output rises above the benchmark quantification threshold and decreases as the actual output falls below it. The output revenue sub-item is shown in the following formula:

[0096] $R^{\mathrm{out}}_t = \dfrac{Q_t}{Q_{\mathrm{ref}}}$

[0097] where $R^{\mathrm{out}}_t$ denotes the positive quantified value of the output revenue at time $t$; $Q_t$ denotes the output of qualified feed products per unit time after the action is executed; and $Q_{\mathrm{ref}}$ denotes the preset rated qualified output per unit time for the current batch;

[0098] The energy consumption cost sub-item is calculated from the total energy consumption of the entire production line's equipment cluster per unit time after the current action command is executed. Taking the preset rated energy consumption per unit time for the current batch as the benchmark quantification threshold, the ratio of the actual energy consumption to this threshold is calculated to obtain the negative quantified value of the energy consumption cost sub-item. The magnitude of this negative value grows as the actual energy consumption rises above the benchmark quantification threshold and shrinks as the actual energy consumption falls below it. The energy consumption cost sub-item is shown in the following formula:

[0099] $R^{\mathrm{en}}_t = -\dfrac{E_t}{E_{\mathrm{ref}}}$

[0100] where $R^{\mathrm{en}}_t$ denotes the negative quantified value of the energy consumption cost at time $t$; $E_t$ denotes the total energy consumption of the entire production line's equipment cluster per unit time after the action is executed; and $E_{\mathrm{ref}}$ denotes the preset rated energy consumption per unit time for the current batch.

[0101] The product consistency reward sub-item is calculated from the deviation of the core quality indicators of the finished feed from their preset standard indicator ranges after the current action command is executed. The cumulative deviation of the core quality indicators from their preset standard indicator ranges is calculated, and the positive quantified value of the product consistency reward sub-item is obtained from the reciprocal normalization of this cumulative deviation. This value increases as the cumulative deviation decreases. The core quality indicators include finished product moisture content, gelatinization degree, pellet hardness, and powdering rate. The product consistency reward sub-item is shown in the following formula:

[0102] $R^{\mathrm{cons}}_t = \dfrac{1}{1 + D_t}$

[0103] where $R^{\mathrm{cons}}_t$ denotes the positive quantified value of the product consistency reward at time $t$; and $D_t$ denotes the cumulative deviation, which is calculated using the following formula:

[0104] $D_t = \sum_{i=1}^{N} d_i, \qquad d_i = \max\left(0,\ q_i^{\min} - q_i,\ q_i - q_i^{\max}\right)$

[0105] where $d_i$ denotes the deviation of the $i$-th indicator; $q_i$ denotes the measured value of the $i$-th core quality indicator ($i = 1, \dots, N$), such as the moisture content of the finished product or the degree of gelatinization; and $[q_i^{\min}, q_i^{\max}]$ is the preset standard range of the $i$-th core quality indicator;

[0106] The process constraint violation penalty sub-item is determined by whether each process control parameter and equipment operating parameter exceeds the safe operating range calibrated by the constraint boundary feature vector after the current action command is executed. For each parameter that exceeds its safe operating range, a single-parameter penalty value is calculated by multiplying the excess magnitude by a preset unit-magnitude penalty coefficient. The single-parameter penalty values of all out-of-bounds parameters are accumulated to obtain the negative quantified value of the process constraint violation penalty sub-item, as shown in the following formula:

[0107] $R^{\mathrm{pen}}_t = -\sum_{j} \lambda_j\, \delta_j$

[0108] where $R^{\mathrm{pen}}_t$ denotes the negative quantified value of the process constraint violation penalty at time $t$; $\lambda_j$ denotes the unit-magnitude penalty coefficient of the $j$-th parameter; and $\delta_j$ denotes the magnitude by which the $j$-th parameter exceeds its safe range, as shown in the following formula:

[0109] $\delta_j = \max\left(0,\ x_j - x_j^{\max},\ x_j^{\min} - x_j\right)$

[0110] where $x_j$ denotes the actual value of the $j$-th process control parameter or equipment operating parameter; and $[x_j^{\min}, x_j^{\max}]$ is the safe operating range of the $j$-th parameter;
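The penalty described in this passage (excess magnitude times a unit-magnitude coefficient, accumulated over all out-of-bounds parameters and negated) can be sketched as below. The function name, the dictionary layout, and the numeric values in the usage note are hypothetical.

```python
from typing import Dict, Tuple


def violation_penalty(
    values: Dict[str, float],
    safe_ranges: Dict[str, Tuple[float, float]],
    coeffs: Dict[str, float],
) -> float:
    """Negative penalty: sum of coefficient * excess over all out-of-range parameters."""
    total = 0.0
    for name, x in values.items():
        low, high = safe_ranges[name]
        excess = max(0.0, x - high, low - x)  # zero when x is inside the safe range
        total += coeffs[name] * excess
    return -total
```

For instance, a conditioning temperature of 95 against a safe range of 70 to 90 with a coefficient of 0.1 would contribute a penalty of 0.5, while an in-range parameter contributes nothing.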

[0111] Combining the parameter coupling correlation feature matrix, the coupled control groups, and the coupled group linkage adjustment rule set, a coupling control linkage compliance penalty sub-item is constructed. For each coupled control group, this sub-item verifies whether the adjustment amounts of the parameters within the group conform to the linkage adjustment ratio range, the adjustment direction matching rules, and the adjustment conflict avoidance constraints specified in the coupled group linkage adjustment rule set. For each coupled control group that violates the linkage rules, a single-group penalty value is calculated by multiplying the weighted sum of the parameter coupling correlation degrees within the group by the degree of rule violation. The single-group penalty values of all violating coupled control groups are accumulated to obtain the negative quantified value of the coupling control linkage compliance penalty sub-item, as shown in the following formula:

[0112] $R^{\mathrm{coup}}_t = -\sum_{g \in \mathcal{G}} \Big( \sum_{i, j \in P_g,\ i < j} c_{ij} \Big)\, v_g$

[0113] where $R^{\mathrm{coup}}_t$ denotes the negative quantified value of the coupling control linkage compliance penalty at time $t$; $\mathcal{G}$ denotes the set of coupled control groups; $P_g$ denotes the set of parameter indexes of the $g$-th coupled control group; $c_{ij}$ is the coupling correlation degree of the parameter pair $(i, j)$; and $v_g$ denotes the degree of rule violation of the $g$-th coupled control group, which quantifies how far the actual adjustment amounts of the parameters within the group violate the linkage adjustment rule set (linkage ratio range, direction matching rules, conflict avoidance constraints). The specific quantification method can be defined according to the process mechanism, for example as shown in the following formula:

[0114] $v_g = \max\left(0,\ \left| \dfrac{\Delta a_i}{\Delta a_j} - \rho_g \right| - \varepsilon_g \right)$

[0115] where $\Delta a_i$ and $\Delta a_j$ are the actual adjustment amounts of parameters $i$ and $j$; $\rho_g$ is the median of the allowed ratio range; and $\varepsilon_g$ is the tolerance.

[0116] The four core sub-items and the coupling control linkage compliance penalty sub-item are weighted and summed. Based on the state space data at the current moment, the weight coefficients of the four core sub-items and of the coupling control linkage compliance penalty sub-item are dynamically adjusted in real time to obtain the instant reward value at the current moment. The instant reward value is then linearly mapped to the standardized numerical range required by the reinforcement learning model, which completes the construction of the reward function.

[0117] In some embodiments, the specific process of dynamically adjusting the weight coefficients of the four core sub-items and the coupling control linkage compliance penalty sub-item based on the state space data at the current moment is as follows:

[0118] Extract the equipment load rate, deviation of material characteristic parameters from the formula baseline value, production progress completion rate, disturbance event occurrence level, grid peak and valley electricity price period identifier, and equipment energy efficiency attenuation coefficient within the current state space. Adjust the weight coefficients of the following sub-items accordingly: process constraint violation penalty, product consistency reward, output revenue, coupled control linkage compliance penalty, and energy cost. The initial weight coefficients for these sub-items can be set as follows: , , , and .

[0119] When the equipment load rate exceeds the preset load threshold, the weight coefficient of the process constraint violation penalty sub-item is increased in proportion to the excess, as shown in the following formula:

[0120] $w'_{\mathrm{pen}} = w_{\mathrm{pen}} \left( 1 + k_{\mathrm{pen}}\, \dfrac{L_t - L_{\mathrm{th}}}{L_{\mathrm{th}}} \right)$

[0121] where $w'_{\mathrm{pen}}$ denotes the corrected weight coefficient of the process constraint violation penalty sub-item; $w_{\mathrm{pen}}$ is the weight coefficient of that sub-item before correction; $k_{\mathrm{pen}}$ is the preset correction coefficient of the process constraint violation penalty sub-item, which controls how sensitive the weight correction is to the portion of the equipment load rate that exceeds the threshold (the larger its value, the greater the weight increase for the same proportional excess); $L_t$ denotes the equipment load rate; and $L_{\mathrm{th}}$ denotes the load threshold.

[0122] When the deviation of the material characteristic parameters from the formula baseline value exceeds a preset deviation threshold, the weight coefficient of the product consistency reward sub-item is increased according to the deviation magnitude, as shown in the following formula:

[0123] $w'_{\mathrm{cons}} = w_{\mathrm{cons}} \left( 1 + k_{\mathrm{cons}}\, \dfrac{\Delta m_t - \Delta m_{\mathrm{th}}}{\Delta m_{\mathrm{th}}} \right)$

[0124] where $w'_{\mathrm{cons}}$ denotes the corrected weight coefficient of the product consistency reward sub-item; $w_{\mathrm{cons}}$ is the weight coefficient of that sub-item before correction; $k_{\mathrm{cons}}$ is the preset correction coefficient of the product consistency reward sub-item, which controls how sensitive the weight correction is to the material characteristic deviation (the larger its value, the greater the weight increase for the same deviation); $\Delta m_t$ denotes the deviation of the material characteristic parameters from the formula baseline value; and $\Delta m_{\mathrm{th}}$ denotes the deviation threshold.

[0125] When the production progress completion rate is lower than the preset progress threshold, the weight coefficient of the output revenue sub-item is increased according to the lag ratio, as shown in the following formula:

[0126] $w'_{\mathrm{out}} = w_{\mathrm{out}} \left( 1 + k_{\mathrm{out}}\, \dfrac{P_{\mathrm{th}} - P_t}{P_{\mathrm{th}}} \right)$

[0127] where $w'_{\mathrm{out}}$ denotes the corrected weight coefficient of the output revenue sub-item; $w_{\mathrm{out}}$ is the weight coefficient of that sub-item before correction; $k_{\mathrm{out}}$ is the preset correction coefficient of the output revenue sub-item, which controls how sensitive the weight correction is to the production progress lag (the larger its value, the greater the weight increase for the same lag ratio); $P_t$ denotes the production progress completion rate; and $P_{\mathrm{th}}$ denotes the progress threshold.

[0128] When a disturbance event is detected, the weight coefficient of the coupling control linkage compliance penalty sub-item is increased according to the disturbance event level, as shown in the following formula:

[0129] $w'_{\mathrm{coup}} = w_{\mathrm{coup}} \left( 1 + k_{\mathrm{coup}}\, \ell_t \right)$

[0130] where $w'_{\mathrm{coup}}$ denotes the corrected weight coefficient of the coupling control linkage compliance penalty sub-item; $w_{\mathrm{coup}}$ is the weight coefficient of that sub-item before correction; $k_{\mathrm{coup}}$ is the preset correction coefficient of the coupling control linkage compliance penalty sub-item, which controls how sensitive the weight correction is to the disturbance event level (the larger its value, the greater the weight increase for the same disturbance level); and $\ell_t$ denotes the disturbance event level, which is extracted from the state space data and quantifies the severity of the current disturbance event (such as equipment failure, material blockage, or material fluctuation). Its value can be predefined according to factors such as disturbance type, duration, and scope of impact (for example, level 1 minor, level 2 moderate, level 3 severe).

[0131] When the power grid is in a peak electricity price period, or the equipment's energy efficiency attenuation coefficient exceeds a preset attenuation threshold, the weight coefficient of the energy consumption cost sub-item is increased according to the electricity price increase ratio or the magnitude of the energy efficiency attenuation. When the power grid is in an off-peak electricity price period and the equipment's energy efficiency attenuation coefficient does not exceed the preset attenuation threshold, the weight coefficient of the energy consumption cost sub-item is decreased. This is shown in the following formula:

[0132] $w'_{\mathrm{en}} = w_{\mathrm{en}} \left[ 1 + k_1 \left( I_{\mathrm{peak}}\, r_p + \max\left( 0,\ \dfrac{\eta - \eta_{\mathrm{th}}}{\eta_{\mathrm{th}}} \right) \right) - k_2\, I_{\mathrm{valley}}\, \mathbb{1}(\eta \le \eta_{\mathrm{th}}) \right]$

[0133] where $w'_{\mathrm{en}}$ denotes the corrected weight coefficient of the energy consumption cost sub-item; $w_{\mathrm{en}}$ is the weight coefficient of that sub-item before correction; $k_1$ is the preset first adjustment coefficient of the energy consumption cost sub-item, which controls how sensitive the weight increase is to peak electricity prices or energy efficiency attenuation (the larger its value, the greater the weight increase under the same conditions); $k_2$ is the preset second adjustment coefficient of the energy consumption cost sub-item, which controls how sensitive the weight reduction is during off-peak electricity price periods when the energy efficiency is good (the larger its value, the greater the weight reduction under the same conditions); $I_{\mathrm{peak}}$ indicates the peak electricity price period of the power grid and is 1 during a peak period and 0 otherwise; $I_{\mathrm{valley}}$ indicates the off-peak electricity price period of the power grid and is 1 during an off-peak period and 0 otherwise; $r_p$ denotes the electricity price increase ratio; $\eta$ denotes the energy efficiency attenuation coefficient of the equipment; $\eta_{\mathrm{th}}$ denotes the attenuation threshold; $(\eta - \eta_{\mathrm{th}}) / \eta_{\mathrm{th}}$ is the ratio by which the attenuation exceeds the allowable value; and $\mathbb{1}(\eta \le \eta_{\mathrm{th}})$ is an indicator function that equals 1 when the energy efficiency attenuation coefficient $\eta$ is less than or equal to the preset attenuation threshold $\eta_{\mathrm{th}}$ (the equipment is in good energy efficiency condition) and 0 otherwise.

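[0134] The weighted summation yields the original instant reward, as shown in the following formula:

[0135] $R^{\mathrm{raw}}_t = w'_{\mathrm{out}} R^{\mathrm{out}}_t + w'_{\mathrm{en}} R^{\mathrm{en}}_t + w'_{\mathrm{cons}} R^{\mathrm{cons}}_t + w'_{\mathrm{pen}} R^{\mathrm{pen}}_t + w'_{\mathrm{coup}} R^{\mathrm{coup}}_t$

[0136] where $R^{\mathrm{raw}}_t$ denotes the original instant reward value; $R^{\mathrm{out}}_t$, $R^{\mathrm{en}}_t$, $R^{\mathrm{cons}}_t$, $R^{\mathrm{pen}}_t$, and $R^{\mathrm{coup}}_t$ are the quantified values of the output revenue, energy consumption cost, product consistency reward, process constraint violation penalty, and coupling control linkage compliance penalty sub-items, respectively; and $w'_{\mathrm{out}}$, $w'_{\mathrm{en}}$, $w'_{\mathrm{cons}}$, $w'_{\mathrm{pen}}$, and $w'_{\mathrm{coup}}$ are their dynamically adjusted weight coefficients.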

[0137] The original instant reward value $R^{\mathrm{raw}}_t$ is linearly mapped to the standardized numerical range $[a, b]$ required by the reinforcement learning model. Let $R_{\min}$ and $R_{\max}$ be the estimated minimum and maximum possible values of $R^{\mathrm{raw}}_t$ (which can be determined through empirical or theoretical analysis); the mapped reward is then calculated as shown in the following formula:

[0138] $\tilde{R}_t = a + \dfrac{\left( R^{\mathrm{raw}}_t - R_{\min} \right) (b - a)}{R_{\max} - R_{\min}}$

[0139] If $R_{\max} = R_{\min}$, then:

[0140] $\tilde{R}_t = \dfrac{a + b}{2}$

[0141] where $\tilde{R}_t$ denotes the standardized reward value after mapping; $a$ denotes the lower limit of the standardized numerical range; and $b$ denotes the upper limit of the standardized numerical range.
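The weighted summation and the linear mapping of the instant reward can be sketched together. The default target range of [-1, 1], the clipping of values that fall outside the estimated bounds, and the midpoint fallback for a degenerate bound estimate are assumptions made for this illustration; the sub-item names and weights in the usage note are hypothetical.

```python
from typing import Dict


def raw_reward(sub_items: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of reward sub-items; penalty sub-items are already negative."""
    return sum(weights[name] * value for name, value in sub_items.items())


def normalize_reward(
    r_raw: float, r_min: float, r_max: float, lo: float = -1.0, hi: float = 1.0
) -> float:
    """Linearly map r_raw from [r_min, r_max] to [lo, hi], clipping out-of-range values."""
    if r_max <= r_min:
        return (lo + hi) / 2.0  # degenerate estimate range: fall back to the midpoint
    scaled = lo + (r_raw - r_min) * (hi - lo) / (r_max - r_min)
    return max(lo, min(hi, scaled))
```

With sub-item values of 1.1 (output), -0.9 (energy), 0.8 (consistency), -0.2 (violation), and 0.0 (coupling) and weights of 0.3, 0.2, 0.3, 0.1, and 0.1, the raw reward is 0.37. Because the bounds are only estimates, clipping keeps the reward inside the range the learner expects.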

[0142] Step 3: Input the current state space into the deep reinforcement learning agent, and output the original action through the policy network in the deep reinforcement learning agent; the original action includes the adjustment direction and magnitude of each process control parameter.

[0143] In some embodiments, the specific process is as follows:

[0144] A policy network for a deep reinforcement learning agent is constructed. This policy network employs a multi-layer fully connected neural network structure. Its input layer dimension matches the dimension of the full-dimensional state vector generated in step two, and its output layer dimension matches the dimension of the action space constructed in step two. The policy network contains two hidden layers with 256 and 128 neurons respectively. The hidden layer activation function is ReLU to introduce non-linear mapping capabilities. The output layer activation function is tanh, which restricts the original action output value range to the standardized interval [-1, 1], thus matching the standardized numerical ranges of each dimension of the action space in step two.
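The forward pass of the described policy network (two hidden layers of 256 and 128 ReLU units and a tanh output layer) can be sketched with NumPy as below. Only inference is shown; the weight initialisation, the random seed, and the class name are assumptions, and training is out of scope for this sketch.

```python
import numpy as np


class PolicyNetwork:
    """state -> 256 (ReLU) -> 128 (ReLU) -> action_dim (tanh)."""

    def __init__(self, state_dim: int, action_dim: int, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        sizes = [state_dim, 256, 128, action_dim]
        # He-style initialisation; the source does not specify one (assumption).
        self.weights = [
            rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))
            for m, n in zip(sizes[:-1], sizes[1:])
        ]
        self.biases = [np.zeros(n) for n in sizes[1:]]

    def forward(self, state: np.ndarray) -> np.ndarray:
        h = np.asarray(state, dtype=float)
        for w, b in zip(self.weights[:-1], self.biases[:-1]):
            h = np.maximum(0.0, h @ w + b)  # ReLU hidden layers
        # tanh bounds every action dimension to [-1, 1].
        return np.tanh(h @ self.weights[-1] + self.biases[-1])
```

The tanh output guarantees that every element of the original action vector lies in the standardized interval before any safety correction is applied.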

[0145] The current full-dimensional state vector is used as input to the policy network. After layer-by-layer forward propagation, the original action vector output by the policy network is obtained. Each element in this original action vector corresponds to a standardized adjustment of an adjustable parameter, where positive values represent positive adjustments (increasing the parameter value) and negative values represent negative adjustments (decreasing the parameter value). The absolute value of the element represents the proportion of the adjustment magnitude relative to the standardized interval. The original action vector fully reflects the agent's initial coordinated adjustment intention for all adjustable parameters under the current operating condition.

[0146] The original action vector is used as the initial decision result of the reinforcement learning agent and output to step four for subsequent process safety constraints and parameter coupling relationship correction, so as to ensure that the final issued action command conforms to the safety boundary and parameter linkage rules of actual production.

[0147] Step 4: Modify the original action based on the preset process safety constraints and parameter coupling relationship, and generate action instructions.

[0148] In some embodiments, the specific process of modifying the original action based on preset process safety constraints and parameter coupling relationships, and generating action instructions, is as follows:

[0149] The original action is inversely mapped to the actual adjustment amount of each adjustable parameter, according to the mapping relationship established when the action space was constructed, and an initial adjustment amount set is generated.

[0150] Based on the grouping results of the coupling characteristics of the adjustable parameters, the actual adjustment of each parameter in the initial adjustment set is split into independent control groups and each coupled control group, forming independent parameter adjustment subsets and multiple coupled group adjustment subsets respectively;

[0151] For each process control parameter in the independent parameter adjustment subset, based on the single parameter adjustment boundary constraint set, the adjustment direction, adjustment step size, and adjustment magnitude of its actual adjustment amount are verified sequentially. If the adjustment direction does not conform to the preset direction constraint, or the absolute value of the adjustment amount is less than the preset minimum adjustment step size, the corresponding adjustment amount is set to zero. If the adjustment amount exceeds the maximum positive adjustment magnitude or the maximum negative adjustment magnitude in a single operation, the corresponding adjustment amount is clamped to the corresponding magnitude boundary value. After the correction is completed, an independent parameter compliant adjustment set is generated.
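The verification order for an independent parameter (direction, then minimum step, then amplitude clamping) can be sketched as follows. The `ParamBounds` container and its field names are hypothetical; the amplitudes are assumed to be stored as non-negative magnitudes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamBounds:
    min_step: float          # minimum adjustment step size
    max_up: float            # single maximum positive adjustment amplitude (>= 0)
    max_down: float          # single maximum negative adjustment amplitude (>= 0)
    allow_up: bool = True    # direction constraint: increases permitted
    allow_down: bool = True  # direction constraint: decreases permitted


def correct_independent(delta: float, b: ParamBounds) -> float:
    """Zero out non-compliant adjustments, then clamp to the amplitude bounds."""
    if (delta > 0 and not b.allow_up) or (delta < 0 and not b.allow_down):
        return 0.0  # direction not permitted
    if abs(delta) < b.min_step:
        return 0.0  # below the minimum adjustment step
    return max(-b.max_down, min(b.max_up, delta))
```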

[0152] For each coupled group adjustment subset, the actual adjustment amount of each parameter within the group is initially verified in terms of magnitude and direction based on the single-parameter adjustment boundary constraint set. Adjustments exceeding the single-parameter boundary are clamped, resulting in a set of initial calibration adjustment amounts within the group. Based on the coupled group linkage adjustment rule set corresponding to the coupled control group, the adjustment direction matching relationship of the initial calibration adjustment amount set within the group, whether the linkage adjustment ratio between parameters is within a preset range, and whether there are adjustment conflicts are verified. If the verification fails, the coupling correlation degree of the parameters within the group is used as a weighting coefficient, with the goal of minimizing the deviation between the adjustment amount and the initial value. Within the limits of the linkage adjustment ratio range, adjustment direction matching rules, and adjustment conflict avoidance constraints, the adjustment amounts of each process control parameter within the group are iteratively corrected until the corrected adjustment amounts simultaneously conform to both the coupled group linkage adjustment rule set and the single-parameter adjustment boundary constraint set, generating a compliant adjustment set for each coupled control group. The compliant adjustment sets of all coupled control groups are then summarized to obtain the compliant adjustment set of the coupled parameters.

[0153] The independent parameter compliance adjustment set and the coupled parameter compliance adjustment set are summarized to obtain the full parameter adjustment set to be executed; based on the safe operating upper and lower limits of each process control parameter, the predicted value of each process control parameter after the adjustment is calculated. The predicted value is the sum of the current value of the process control parameter and the corresponding adjustment amount; if the predicted value exceeds the safe operating range, the adjustment amount of the parameter is clamped and corrected a second time with the constraint of not exceeding the safe operating range. After the correction is completed, the final full parameter adjustment set is generated.
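The secondary clamping step reduces to a small check on the predicted value, sketched below with a hypothetical function name and hypothetical numbers in the usage note.

```python
def clamp_to_safe_range(
    current: float, delta: float, safe_low: float, safe_high: float
) -> float:
    """Shrink delta so that current + delta stays inside [safe_low, safe_high]."""
    predicted = current + delta
    if predicted > safe_high:
        return safe_high - current
    if predicted < safe_low:
        return safe_low - current
    return delta
```

For example, with a current conditioning temperature of 88 and a safe range of 70 to 90, a requested increase of 5 would be reduced to 2.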

[0154] Based on the final adjustment set of all parameters, and combined with the control protocol of the actuator corresponding to each process control parameter, the adjustment amount of each process control parameter is converted into the digital control instruction of the corresponding actuator. The steady-state response time threshold corresponding to the process control parameter is added to each digital control instruction to complete the generation of the action instruction.

[0155] Step 5: Send the action command to the actuator in the feed processing process, and obtain the state space of the next moment after the action command is executed and the reward value calculated according to the reward function.

[0156] In some embodiments, action commands are issued to the actuators in the feed processing process, and the state space at the next moment after the action command is executed and the reward value calculated according to the reward function are obtained. The specific process is as follows:

[0157] The final full-parameter adjustment set generated in step four is transmitted, in the form of action commands, to the actuator controllers of the corresponding process control parameters via the production line industrial control network (for example, using the OPC UA, Modbus/TCP, or PROFINET protocols). Each action command includes: a parameter identifier, an adjustment amount (which can be absolute or relative), an execution mode (immediate execution or execution at a specified ramp rate), and the steady-state response time threshold added in step four. Before issuance, the command undergoes a cyclic redundancy check to ensure transmission integrity; the timestamp of the command issuance is recorded (with millisecond precision) and synchronized with the production line's unified clock source to provide a reference for subsequent time-series data alignment.
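As a rough illustration of the command structure described above, the following sketch packs the listed fields into a frame, stamps it with a millisecond timestamp, and appends a CRC-32 checksum. The field names, the JSON encoding, and the frame layout are assumptions made for this example; an actual deployment would rely on the framing and integrity mechanisms of the chosen industrial protocol.

```python
import json
import time
import zlib
from collections import namedtuple

# Hypothetical field names mirroring the command contents listed in the text.
ActionCommand = namedtuple(
    "ActionCommand",
    "parameter_id adjustment relative execution_mode settle_timeout_s",
)


def encode_command(cmd, issued_at_ms=None):
    """Serialize a command, add a millisecond timestamp, append a CRC-32."""
    if issued_at_ms is None:
        issued_at_ms = time.time_ns() // 1_000_000
    payload = dict(cmd._asdict(), issued_at_ms=issued_at_ms)
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return body + zlib.crc32(body).to_bytes(4, "big")


def decode_command(frame):
    """Verify the trailing CRC-32 and return the decoded payload."""
    body, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch: command frame is corrupted")
    return json.loads(body)
```

A receiver that calls `decode_command` rejects any frame whose payload was altered in transit, which is the role the cyclic redundancy check plays in the paragraph above.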

[0158] Upon receiving instructions, each actuator controller drives the corresponding actuator (such as regulating valve, frequency converter, heater, cooling fan, etc.) to change process control parameters according to the instructions. For parameters involving coupled control groups, the controller coordinates actions according to the linkage ratio or timing relationship set in the coupled group linkage adjustment rule set, ensuring that the coupling relationship between parameters is maintained at the physical execution level. During execution, sensors deployed near the actuators provide real-time feedback on the parameter change rate. If the actual change rate deviates from the expected rate by more than a preset threshold, an anomaly alarm is triggered and subsequent actions are suspended, awaiting system diagnostics.

[0159] Based on the steady-state response time threshold attached to the instruction, the system starts a timer and waits for the transient following the parameter adjustment to end, allowing the process parameters to reach a new steady state. During the waiting period, real-time values of the relevant parameters are continuously collected at a high-frequency sampling period (e.g., 100 ms) to monitor whether they enter the preset steady-state error band (e.g., within ±2% of the target value) within the threshold time. If the steady state is not reached within the time limit, a transient anomaly flag is recorded, and the system transfers to the anomaly handling process (e.g., reissuing the instruction or maintaining the current value). At the same time, the anomaly information is appended to the subsequent experience tuple for robust training of the reinforcement learning model. After the steady-state waiting period ends, the system immediately triggers a new round of real-time operating data acquisition, performing the same operations as in step one. To ensure data quality, an online data verification mechanism based on residual statistics can be introduced to mark data points that significantly deviate from physical limits or have abnormal rates of change and replace them with the previous valid value. Following the state space construction method in step two, the preprocessed time-series data stream for the next moment is converted into a full-dimensional state vector.
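The steady-state check in this paragraph might be sketched as follows. The function scans samples collected during the waiting period and reports when the signal has remained inside the error band. The `hold` parameter (number of consecutive in-band samples required) is an assumption added for this example; the text only requires that the value enter the band within the threshold time.

```python
def reached_steady_state(samples, target, band=0.02, hold=5):
    """Return the index at which the signal has stayed inside the error
    band for `hold` consecutive samples, or None if that never happens.

    samples: values collected during the waiting period, oldest first
    band:    relative half-width of the error band (0.02 means +/-2%)
    hold:    assumed number of consecutive in-band samples required
    """
    tolerance = abs(target) * band
    run = 0
    for i, value in enumerate(samples):
        if abs(value - target) <= tolerance:
            run += 1
            if run >= hold:
                return i
        else:
            run = 0  # left the band, start counting again
    return None


# Hypothetical readings at a 100 ms sampling period, target value 80.0.
readings = [70.0, 76.0, 79.0, 79.5, 80.2, 80.1, 79.9, 80.0]
settled_at = reached_steady_state(readings, target=80.0)
```

A return value of `None` corresponds to the case in which the anomaly flag is recorded and the anomaly handling process takes over.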

[0160] Based on the stable operating condition data at the next moment and the monitoring data during the action execution process, the immediate reward value of the current action is calculated according to the reward function constructed in step two. The specific calculation process includes:

[0161] After the action is executed, collect the per-unit-time qualified product output and total energy consumption, together with the core quality indicators of the finished product (such as moisture content and gelatinization degree, obtained through an online near-infrared analyzer or through rapid feedback from offline sampling and testing);

[0162] Based on the preset benchmark thresholds of the formula batch, calculate the production revenue sub-item, energy consumption cost sub-item, and product consistency reward sub-item;

[0163] Check whether all process parameters and equipment operating parameters exceed the safe operating range specified by the constraint boundary feature vector, and calculate the process constraint violation penalty item;

[0164] Verify whether the actual adjustment amount of the parameters within the coupled control group conforms to the linkage adjustment rule set, and calculate the linkage compliance penalty sub-item of coupled control.

[0165] The weights of each sub-item are dynamically adjusted in real time based on the equipment load rate, material characteristic deviation, production progress completion rate, disturbance event level, grid electricity price period and equipment energy efficiency attenuation coefficient in the current state space.

[0166] The corrected sub-items are weighted and summed, and then linearly mapped to a standardized numerical range (e.g., [-1, 1]) to obtain the final instant reward value.
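The final aggregation step can be shown with a small sketch: the sub-items are weighted, summed, and linearly mapped to [-1, 1]. The sub-item names, the weights, the raw range `[raw_low, raw_high]`, and the clipping before the mapping are all assumptions for this example; the dynamic weight adjustment described above is represented only by passing the current weights in as an argument.

```python
def instant_reward(sub_items, weights, raw_low, raw_high):
    """Weighted sum of reward sub-items, linearly mapped to [-1, 1].

    raw_low / raw_high: assumed bounds of the weighted sum; values outside
    them are clipped so that the result never leaves [-1, 1].
    """
    raw = sum(weights[name] * value for name, value in sub_items.items())
    raw = min(max(raw, raw_low), raw_high)
    return 2.0 * (raw - raw_low) / (raw_high - raw_low) - 1.0


# Hypothetical sub-item values and weights (penalties are negative).
sub_items = {
    "output": 1.05,
    "energy": -0.95,
    "consistency": 0.8,
    "constraint_penalty": 0.0,
    "linkage_penalty": -0.1,
}
weights = {
    "output": 0.4,
    "energy": 0.2,
    "consistency": 0.2,
    "constraint_penalty": 0.1,
    "linkage_penalty": 0.1,
}
reward = instant_reward(sub_items, weights, raw_low=-2.0, raw_high=2.0)
```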

[0167] The full-dimensional state vector at the current moment before the action is executed, the final action command after the step four correction, the calculated instant reward value, the full-dimensional state vector at the next moment, and the round termination flag are combined into an experience tuple. The round termination flag is calibrated based on whether the current processing batch is completed or whether the production line has triggered a shutdown protection condition: it indicates termination if the batch is completed or a shutdown occurs, and non-termination otherwise. The experience tuples are stored in the experience replay pool for use in the subsequent step six, which involves iterative optimization of the network parameters.
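A minimal sketch of such an experience replay pool is given below. The class and field names are hypothetical, the termination flag is encoded as a boolean by assumption, and the fixed capacity with oldest-first eviction is a common design choice that the text does not specify.

```python
import random
from collections import deque, namedtuple

# Hypothetical field names for the five-element experience tuple.
Experience = namedtuple("Experience", "state action reward next_state done")


class ReplayPool:
    """Fixed-capacity pool; the oldest tuples are dropped first (assumption)."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self._items.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Uniform random sampling without replacement."""
        return rng.sample(list(self._items), batch_size)

    def __len__(self):
        return len(self._items)
```

Uniform sampling breaks the temporal correlation between consecutive tuples, which is the usual motivation for training from a replay pool rather than from the most recent transition.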

[0168] Step 6: Store the current state space, executed instructions, reward value, and next state space as an experience tuple in the experience replay pool, and randomly sample multiple experience tuples from the experience replay pool to iteratively optimize the network parameters of the deep reinforcement learning agent.

[0169] In some embodiments, the specific process of iteratively optimizing the network parameters of the deep reinforcement learning agent is as follows:

[0170] Initialize the network architecture and training hyperparameters of the deep reinforcement learning agent. The agent adopts a twin delayed deep deterministic policy gradient (TD3) architecture with two critic networks and a single policy network. The network parameters of the main policy network, target policy network, first main critic network, second main critic network, first target critic network, and second target critic network are initialized together, so that the initial parameters of each target network are identical to those of the corresponding main network. At the same time, the training hyperparameters are calibrated, including the experience sampling batch size, policy network delayed update frequency, target network soft update coefficient, learning rate, discount factor, iteration termination threshold, and maximum number of iterations.

[0171] According to the preset sampling batch size, a corresponding number of independent experience tuples are randomly sampled from the experience replay pool. Each experience tuple contains the full-dimensional state vector at the current moment, the final action instruction after correction by process safety constraints and parameter coupling relationship, the instant reward value calculated after the corresponding action is executed, the full-dimensional state vector at the next moment, and the round termination flag. The round termination flag is calibrated according to whether the feed processing batch is completed and whether the production line triggers the shutdown protection condition.

[0172] The sampled full-dimensional state vector for the next time step is input into the target policy network, which outputs the original target action for the next time step. Based on the preset process safety constraints, parameter coupling correlation feature matrix, and coupling group linkage adjustment rule set, the original target action is constrained and corrected to obtain the corrected target action instruction. The full-dimensional state vector for the next time step and the corrected target action instruction are input into the first target critic network and the second target critic network, respectively, and two sets of target state-action values are output. The smaller of the two sets of target state-action values is taken as the baseline target value. Combined with the immediate reward value, the preset discount factor, and the round termination flag, the target Q value corresponding to each experience tuple is calculated.

[0173] The sampled current full-dimensional state vector and the corrected final action command are input into the first main critic network and the second main critic network, respectively, and two sets of current state-action values are output. The mean squared error loss between each set of current state-action values and the corresponding target Q value is calculated to obtain the first critic loss value and the second critic loss value. Based on the preset learning rate, the parameters of the two main critic networks are updated by gradient descent through the backpropagation algorithm to minimize the corresponding loss value.
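The target value and critic loss from the two preceding paragraphs can be sketched numerically. The text does not spell out the target formula, so the form y = r + γ(1 − d)·min(Q1′, Q2′) used here is an assumption based on the standard twin delayed DDPG formulation; the network outputs are represented by plain lists of numbers rather than real networks.

```python
def td3_targets(rewards, dones, q1_next, q2_next, gamma):
    """y = r + gamma * (1 - done) * min(Q1', Q2') for each sampled tuple."""
    return [
        r + gamma * (0.0 if d else 1.0) * min(q1, q2)
        for r, d, q1, q2 in zip(rewards, dones, q1_next, q2_next)
    ]


def mse_loss(predictions, targets):
    """Mean squared error between critic outputs and target Q values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical batch of two experience tuples.
targets = td3_targets(
    rewards=[0.5, -0.2],
    dones=[False, True],
    q1_next=[2.0, 3.0],
    q2_next=[1.0, 4.0],
    gamma=0.9,
)
critic1_loss = mse_loss([1.0, 0.0], targets)
```

Taking the smaller of the two target critic outputs counteracts the overestimation that a single critic tends to accumulate, and the termination flag removes the bootstrapped term for the last tuple of a batch or a shutdown.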

[0174] The number of updates to the main critic networks is accumulated. When the number of updates reaches the preset policy network delayed update frequency, the policy network parameters are updated. The sampled full-dimensional state vector at the current time is input into the main policy network, and the original action to be optimized at the current time is output. Based on the parameter coupling correlation feature matrix and the coupling group linkage adjustment rule set, the degree to which the original action to be optimized violates the coupling control rules is calculated, and a coupling constraint penalty term is generated. The full-dimensional state vector at the current time and the original action to be optimized are input into the first main critic network, and the expected value of the current state-action pair is output. The policy network loss function is constructed with the optimization objective of maximizing the expected value of the current state-action pair and minimizing the coupling constraint penalty term. Based on the preset learning rate, the parameters of the main policy network are updated by gradient descent through the backpropagation algorithm.
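One way to read the policy objective above is as a single scalar loss that combines the negated critic value with a weighted penalty. The sketch below only evaluates that scalar for given numbers; the penalty weight `penalty_weight` and the batch averaging are assumptions, and in actual training an automatic differentiation framework would propagate the gradient of this loss into the policy network.

```python
def should_update_policy(critic_updates, policy_delay):
    """Delayed update: refresh the policy once every `policy_delay` critic updates."""
    return critic_updates > 0 and critic_updates % policy_delay == 0


def policy_loss(q1_values, coupling_penalties, penalty_weight=1.0):
    """Batch mean of -Q1(s, pi(s)) + penalty_weight * coupling penalty.

    Minimizing this loss maximizes the expected state-action value while
    pushing the coupling constraint penalty toward zero.
    """
    n = len(q1_values)
    return (-sum(q1_values) + penalty_weight * sum(coupling_penalties)) / n


# Hypothetical critic outputs and rule-violation penalties for two samples.
loss = policy_loss([1.0, 3.0], [0.0, 0.5], penalty_weight=2.0)
```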

[0175] After each parameter update of the main policy network is completed, the parameters of the target policy network, the first target critic network, and the second target critic network are softly updated as a weighted combination with the parameters of the main policy network, the first main critic network, and the second main critic network, respectively, according to the preset soft update coefficient, so that the parameters of the target networks smoothly follow the iteration of the main network parameters.
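The soft update can be sketched on flat parameter lists. The convention assumed here is the common one in which the soft update coefficient `tau` weights the main network parameters and `1 - tau` weights the existing target parameters; the text does not state which side the coefficient applies to.

```python
def soft_update(target_params, main_params, tau):
    """theta_target <- tau * theta_main + (1 - tau) * theta_target, in place."""
    for i, (t, m) in enumerate(zip(target_params, main_params)):
        target_params[i] = tau * m + (1.0 - tau) * t


# Hypothetical flattened parameters of one target/main network pair.
target = [0.0, 1.0]
main = [1.0, 1.0]
soft_update(target, main, tau=0.25)
```

With a small `tau`, the target networks move only a fraction of the way toward the main networks on each update, which keeps the target Q values from shifting abruptly.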

[0176] The cumulative number of iterations is counted. When the preset maximum number of iterations is reached, or when the changes in the policy network loss value and the critic network loss values remain below the preset iteration termination threshold for a preset number of consecutive iterations, the iterative optimization of the network parameters is terminated, and the parameters of the current main policy network and main critic networks are fixed as the inference and operation parameters of the agent; otherwise, the iterative update process is repeated.
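The termination rule might be expressed as the following check. The argument names are hypothetical, a single critic loss per iteration is assumed for brevity, and "change" is taken to mean the absolute difference between consecutive iterations.

```python
def should_stop(iteration, max_iterations, loss_history, threshold, patience):
    """Decide whether iterative optimization should terminate.

    loss_history: list of (policy_loss, critic_loss) pairs, oldest first
    patience:     number of consecutive iterations whose loss changes must
                  all stay below `threshold`
    """
    if iteration >= max_iterations:
        return True
    if len(loss_history) < patience + 1:
        return False
    recent = loss_history[-(patience + 1):]
    return all(
        abs(cur[0] - prev[0]) < threshold and abs(cur[1] - prev[1]) < threshold
        for prev, cur in zip(recent, recent[1:])
    )


# Hypothetical loss trace that flattens out after the second iteration.
history = [(1.0, 2.0), (0.5, 1.0), (0.5001, 1.0001), (0.5002, 1.0001)]
```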

[0177] Based on the same inventive concept, and corresponding to any of the above embodiments, with reference to Figure 2, this invention provides a multi-parameter collaborative control system for feed processing based on reinforcement learning, used to implement the aforementioned multi-parameter collaborative control method for feed processing based on reinforcement learning, comprising:

[0178] The multi-source data acquisition and preprocessing module is used to synchronously collect equipment operation status data, material characteristic data, and environmental condition data at the equipment layer through a sensor array deployed on the feed processing production line, according to a preset high-frequency sampling period. It performs timestamp alignment and data cleaning on the collected multi-source heterogeneous data, removing outliers and noise points, and uses interpolation to fill in missing data caused by sensor interruptions, forming a standardized time-series data stream. From the time-series data stream, it extracts the current equipment operation status, material characteristic parameters, and process control parameters, performs normalization processing, and combines them to construct a multi-dimensional state vector representing the current processing condition. This multi-dimensional state vector is then used as the input state for a reinforcement learning model.

[0179] The reinforcement learning model construction and configuration module, which is communicatively connected to the multi-source data acquisition and preprocessing module, is used to construct the state space, action space, and reward function of the reinforcement learning model. The state space is constructed based on the multi-dimensional state vector and combined with parameter coupling correlation features, temporal dynamic features, and constraint boundary features. The action space is defined as the adjustment amount of process control parameters with real-time online control capabilities. The controllable parameters are grouped according to the coupling correlation between parameters, and single-parameter adjustment boundary constraint sets and coupled group linkage adjustment rule sets are set for independent control groups and coupled control groups, respectively. The reward function is constructed based on preset collaborative control objectives, which include maximizing output, minimizing energy consumption, product consistency index, process constraint violation penalty items, and coupled control linkage compliance penalty items. The weight coefficients of each sub-item are dynamically corrected in real time based on the state space data at the current moment.

[0180] The deep reinforcement learning decision module, which internally deploys a policy network, is communicatively connected to the multi-source data acquisition and preprocessing module and the reinforcement learning model construction and configuration module. It is used to input the current state space into the policy network, which then outputs the original actions. These original actions include the adjustment direction and magnitude for each process control parameter.

[0181] The action correction and instruction generation module is communicatively connected to the deep reinforcement learning decision module. It is used to correct the original action based on preset process safety constraints and parameter coupling relationships, and generate action instructions. Specifically, it includes: inversely mapping the original action to the actual adjustment amount of each adjustable parameter, and splitting it into an independent control group and adjustment subsets of each coupled control group according to the coupling characteristics; sequentially performing compliance verification and iterative correction on the adjustment amounts of the independent control group and each coupled control group based on the single-parameter adjustment boundary constraint set and the coupled group linkage adjustment rule set; summarizing the corrected full-parameter adjustment set to be executed, and performing secondary clamping correction on the predicted values of the adjusted parameters based on the safe operating upper and lower limits of each process control parameter to generate the final full-parameter adjustment set; finally, converting the adjustment amount into digital control instructions for the corresponding actuator, and attaching a steady-state response time threshold corresponding to the process control parameter to each digital control instruction.

[0182] The instruction execution and feedback acquisition module is communicatively connected to the multi-source data acquisition and preprocessing module and the action correction and instruction generation module; it is used to send action instructions to the actuators in the feed processing process, and to obtain the state space of the next moment after the action instruction is executed and the reward value calculated according to the reward function.

[0183] The experience replay and network training module is communicatively connected to the reinforcement learning model construction and configuration module, the deep reinforcement learning decision module, and the instruction execution and feedback acquisition module. It is used to store the current state space, executed instruction, reward value, and next state space as an experience tuple in the experience replay pool, and randomly sample multiple experience tuples from the experience replay pool to iteratively optimize the network parameters of the deep reinforcement learning agent.
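As an illustration of the gap filling and normalization performed by the multi-source data acquisition and preprocessing module described above, the sketch below fills missing samples by linear interpolation and scales a value to [0, 1]. The choice of linear interpolation, the nearest-value handling of gaps at the edges, and min-max scaling are assumptions; the text only states that interpolation and normalization are applied.

```python
def fill_gaps(series):
    """Linearly interpolate None entries; edge gaps copy the nearest valid value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no valid samples")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled


def min_max_normalize(value, low, high):
    """Scale a value from the assumed range [low, high] to [0, 1]."""
    return (value - low) / (high - low)


# Hypothetical sensor trace with three dropped samples.
trace = fill_gaps([None, 10.0, None, 14.0, None])
```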

[0184] Based on the same inventive concept, corresponding to any of the above embodiments, the present invention provides an electronic device, including a memory and a processor. The memory is used to store a computer program, and the processor runs the computer program to enable the electronic device to execute the reinforcement learning-based multi-parameter collaborative control method for feed processing according to the embodiments.

[0185] Optionally, the aforementioned electronic device may be a server.

[0186] In addition, this embodiment also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the reinforcement learning-based multi-parameter collaborative control method for feed processing according to the embodiment.

[0187] It is understood that the processor in the embodiments of the present invention may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.

[0188] The method steps in the embodiments of the present invention can be implemented in hardware or by a processor executing software instructions. The software instructions can consist of corresponding software modules, which can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, portable hard disks, CD-ROMs, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, enabling the processor to read information from and write information to the storage medium. Of course, the storage medium can also be a component of the processor. The processor and the storage medium can reside in an ASIC.

[0189] The above embodiments can be implemented, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, they can be implemented, in whole or in part, as a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions according to the embodiments of the present invention are carried out. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a storage medium or transmitted from one storage medium to another. The computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive (SSD)).

Claims

1. A multi-parameter collaborative control method for feed processing based on reinforcement learning, characterized in that it includes the following steps: acquiring real-time operating data of the feed processing process, including: synchronously collecting equipment operating status data, material characteristic data, and environmental condition data at the equipment layer through sensor groups deployed on the feed processing production line according to a preset high-frequency sampling period; performing timestamp alignment and data cleaning on the collected multi-source heterogeneous data to remove outliers and noise points, and using interpolation to fill in missing data caused by sensor interruptions, forming a standardized time-series data stream; extracting the current equipment operating status, material characteristic parameters, and process control parameters from the time-series data stream; normalizing the current equipment operating status, material characteristic parameters, and process control parameters, combining them to construct a multi-dimensional state vector representing the current processing condition, and using the multi-dimensional state vector as the input state of the reinforcement learning model; constructing the state space, action space, and reward function of the reinforcement learning model, wherein the state space is built based on the real-time operating data, including the following steps: The multi-dimensional state vector is decomposed into three mutually exclusive feature subsets: an equipment state feature subset, a material characteristic feature subset, and a process control parameter feature subset. The equipment state feature subset includes equipment operating power, load rate, continuous operating time, and rated operating margin parameters. The material characteristic feature subset includes material moisture content, particle size distribution, proportion of core components in the formula, and material bulk density parameters. 
The process control parameter feature subset includes current conditioning temperature, conditioning humidity, pelleting pressure, mixing time, and cooling air temperature parameters. Based on the feed processing mechanism, the coupling correlation metric is calculated for all parameters within the three feature subsets, and the linear correlation coefficient and nonlinear mutual information value between any two parameters are obtained. Parameter pairs with a coupling correlation exceeding a preset coupling threshold are selected to construct a parameter coupling correlation feature matrix. Principal component analysis is performed on the parameter coupling correlation feature matrix to reduce its dimensionality, yielding a one-dimensional coupled feature vector. Using the current time point as the endpoint, a standardized time-series data stream of a preset sliding window length is extracted, and the temporal change characteristics of each parameter within this window are extracted to construct a temporal dynamic feature vector. These temporal change characteristics include the parameter change rate, fluctuation amplitude, and cumulative deviation from the process baseline value. The feed formula and production line equipment model corresponding to the current processing batch are matched, and the upper and lower limits of safe operation, the equipment rated operating condition boundary, and the threshold range of material adaptation parameters corresponding to each process control parameter are extracted to construct a constraint boundary feature vector. The one-dimensional vectorized data of the three feature subsets are concatenated with the one-dimensional coupled feature vector, the temporal dynamic feature vector, and the constraint boundary feature vector in the same dimension to generate a full-dimensional state vector. 
This full-dimensional state vector is then linearly mapped to the standardized numerical range required by the reinforcement learning model input, completing the construction of the reinforcement learning model's state space. The action space is defined as the adjustment amount of the process control parameters, including the following steps: matching the feed formula, production line equipment model, and process control parameter feature subset in the state space corresponding to the current processing batch, filtering out process control parameters with real-time online control capabilities, and forming a complete set of controllable parameters; based on the parameter coupling correlation feature matrix, calculating the coupling correlation degree between any two parameters in the complete set of controllable parameters, classifying parameters with a coupling correlation degree exceeding a preset coupling threshold into the same coupling control group, and classifying the remaining parameters into independent control groups, thus completing the grouping of the coupling characteristics of the controllable parameters; according to the constraint boundary feature vector and combined with the feed processing mechanism, setting, for each process control parameter in the complete set of controllable parameters, the adjustment direction constraint, minimum adjustment step size, maximum positive adjustment amplitude, maximum negative adjustment amplitude, and steady-state response time threshold after parameter adjustment, to form a single-parameter adjustment boundary constraint set; for each coupled control group, based on the coupling correlation of parameters within the group and the feed processing mechanism, setting the linkage adjustment ratio range, adjustment direction matching rules, and parameter adjustment conflict avoidance constraints within the group, to form a coupled group linkage adjustment rule set. 
The single adjustment amount of each process control parameter in the set of all adjustable parameters is defined as an independent dimension of the action space. Among them, the parameter dimension within the coupled control group is subject to the constraints of the corresponding linkage adjustment rule set of the coupled group, while the parameter dimension within the independent control group is an independent dimension without linkage constraints. Based on the set of single-parameter adjustment boundary constraints, the legal value range of each action dimension is determined, and the legal value ranges of all action dimensions are linearly mapped to the standardized numerical range required by the output of the reinforcement learning policy network, thus completing the construction of the action space of the reinforcement learning model. The reward function is constructed based on a preset collaborative control objective, including the following steps: matching the feed formula, preset production target, and rated operating parameters of the production line equipment corresponding to the current processing batch; combining the constraint boundary feature vector to determine the four core sub-items of the reward function; and calibrating the benchmark quantization threshold and initial weight coefficient corresponding to each core sub-item. The four core sub-items are the output revenue sub-item, energy consumption cost sub-item, product consistency reward sub-item, and process constraint violation penalty sub-item. Among them, the output revenue sub-item is calculated based on the qualified feed product output per unit time after the current action instruction is executed, and the preset rated qualified output per unit time for the current batch is used as the benchmark quantization threshold. The ratio of the actual output to the benchmark quantization threshold is calculated to obtain the output revenue sub-item. 
The positive quantized value of the output revenue sub-item increases as the actual output rises above the benchmark quantization threshold, and decreases as the actual output falls below the benchmark quantization threshold. The energy consumption cost sub-item is calculated based on the total energy consumption of the entire production line cluster per unit time after the execution of the current action command, using the preset rated energy consumption per unit time for the current batch as the benchmark quantization threshold. The ratio of actual energy consumption to the benchmark quantization threshold is calculated to obtain the negative quantized value of the energy consumption cost sub-item. The negative quantized value of the energy consumption cost sub-item grows in magnitude as the actual energy consumption rises above the benchmark quantization threshold, and shrinks in magnitude as the actual energy consumption falls below the benchmark quantization threshold. The product consistency reward sub-item uses the deviation between the core quality indicators of the finished feed after the execution of the current action command and the preset standard indicator range as the calculation basis: the cumulative deviation of each core quality indicator from the preset standard indicator range is calculated, and based on the reciprocal normalization of the cumulative deviation, the positive quantized value of the product consistency reward sub-item is obtained. The positive quantized value of the product consistency reward sub-item increases as the cumulative deviation decreases. The core quality indicators include finished product moisture content, gelatinization degree, particle hardness, and powdering rate. 
The process constraint violation penalty sub-item is judged based on whether each process control parameter and equipment operating parameter exceeds the safe operating range calibrated by the constraint boundary feature vector after the current action command is executed. For parameters exceeding the safe operating range, a single-parameter penalty value is calculated by multiplying the excess amount by a preset unit penalty coefficient, and the single-parameter penalty values of all out-of-range parameters are accumulated to obtain the negative quantized value of the process constraint violation penalty sub-item; combined with the parameter coupling correlation feature matrix, coupling control group, and coupling group linkage adjustment rule set, a coupling control linkage compliance penalty sub-item is constructed; for each coupling control group, the coupling control linkage compliance penalty sub-item verifies whether the adjustment amount of the parameters executed within the group conforms to the linkage adjustment ratio range, adjustment direction matching rule, and adjustment conflict avoidance constraint specified in the coupling group linkage adjustment rule set; for coupling control groups that violate the linkage rules, the single-group penalty value is calculated by multiplying the weighted sum of the parameter coupling correlation degree within the group by the rule violation degree; the single-group penalty values of all violating coupling control groups are accumulated to obtain the negative quantized value of the coupling control linkage compliance penalty sub-item; the four core sub-items and the coupling control linkage compliance penalty sub-item are weighted and summed. 
In this weighted sum, the weight coefficients of the four core sub-items and the coupled control linkage compliance penalty sub-item are dynamically adjusted in real time based on the state space data at the current moment, yielding the instant reward value at the current moment. The instant reward value is linearly mapped to the standardized numerical range required by the reinforcement learning model to complete the construction of the reward function. The current state space is input into the deep reinforcement learning agent of the reinforcement learning model, and the original action is output by the policy network in the deep reinforcement learning agent. The original action includes the adjustment direction and magnitude of each process control parameter. The policy network adopts a multi-layer fully connected neural network structure, with the input layer dimension consistent with the state space dimension and the output layer dimension consistent with the action space dimension. The hidden layer activation function is ReLU, and the output layer activation function is tanh.
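As a non-limiting illustration, the forward pass of a policy network with the structure described above (fully connected layers, ReLU hidden activations, tanh output) may be sketched in plain Python as follows. The layer sizes, the uniform weight initialization, and the names `build_policy` and `policy_forward` are assumptions of this sketch; training of the network is not shown.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create one fully connected layer as (weights, biases)."""
    scale = 1.0 / math.sqrt(n_in)  # hypothetical initialization range
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine transform: one output per weight row."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def build_policy(state_dim, hidden_dims, action_dim, seed=0):
    """Input width equals the state dimension, output width the action dimension."""
    rng = random.Random(seed)
    dims = [state_dim, *hidden_dims, action_dim]
    return [init_layer(a, b, rng) for a, b in zip(dims[:-1], dims[1:])]


def policy_forward(state, layers):
    """Return the original action, each component bounded to [-1, 1] by tanh."""
    h = state
    for layer in layers[:-1]:
        h = [max(0.0, v) for v in dense(h, layer)]  # ReLU on hidden layers
    return [math.tanh(v) for v in dense(h, layers[-1])]
```

In this sketch the sign of each output component can be read as the adjustment direction and its absolute value as the normalized adjustment magnitude.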
Based on preset process safety constraints and parameter coupling relationships, the original action is corrected and action instructions are generated, including the following steps: the original action is inversely mapped to the actual adjustment amount corresponding to each adjustable parameter according to the mapping relationship constructed in the action space, generating an initial adjustment amount set; according to the grouping results of the coupling characteristics of the adjustable parameters, the actual adjustment amounts in the initial adjustment amount set are split between the independent control group and each coupled control group, forming an independent parameter adjustment subset and multiple coupled group adjustment subsets respectively; for each process control parameter in the independent parameter adjustment subset, the adjustment direction, adjustment step size, and adjustment amplitude of its actual adjustment amount are verified in sequence based on the single-parameter adjustment boundary constraint set; if the adjustment direction does not conform to the preset direction constraint, or the absolute value of the adjustment amount is less than the preset minimum adjustment step, the corresponding adjustment amount is set to zero; if the adjustment amount exceeds the maximum positive or negative adjustment amplitude for a single operation, the corresponding adjustment amount is clamped to the corresponding amplitude boundary value; after correction, an independent parameter compliant adjustment set is generated. For each coupled group adjustment subset, the amplitude and direction of the actual adjustment amount of each parameter within the group are first verified based on the single-parameter adjustment boundary constraint set, and adjustment amounts exceeding the single-parameter boundary are clamped to obtain the initial calibration adjustment set within the group.
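As a non-limiting illustration, the inverse mapping and the single-parameter verification for the independent control group may be sketched as follows. The `Bounds` fields, the linear scaling used for the inverse mapping, and the function names are assumptions of this sketch; the claim does not fix the form of the mapping relationship.

```python
from dataclasses import dataclass


@dataclass
class Bounds:
    """Hypothetical single-parameter adjustment boundary constraints."""
    allowed_direction: int  # +1 increase only, -1 decrease only, 0 either
    min_step: float         # minimum adjustment step
    max_increase: float     # maximum positive amplitude per operation
    max_decrease: float     # maximum negative amplitude, as a magnitude


def to_adjustment(raw_action, max_amplitude):
    """Assumed linear inverse mapping from [-1, 1] to an actual adjustment."""
    return raw_action * max_amplitude


def correct_independent(delta, bounds):
    """Verify direction, step size, and amplitude in sequence."""
    # Direction check: a disallowed direction zeroes the adjustment.
    if bounds.allowed_direction != 0 and delta * bounds.allowed_direction < 0:
        return 0.0
    # Step check: adjustments below the minimum step are zeroed.
    if abs(delta) < bounds.min_step:
        return 0.0
    # Amplitude check: clamp to the positive or negative boundary value.
    return max(-bounds.max_decrease, min(bounds.max_increase, delta))
```

For example, with `Bounds(allowed_direction=1, min_step=0.5, max_increase=5.0, max_decrease=3.0)`, a requested adjustment of 8.0 is clamped to 5.0 and a requested adjustment of -2.0 is set to zero.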
Based on the coupled group linkage adjustment rule set corresponding to the coupled control group, the initial calibration adjustment set within the group is verified as to whether the adjustment directions satisfy the matching relationship, whether the parameter linkage adjustment ratio is within the preset range, and whether there are adjustment conflicts. If the verification fails, the coupling correlation degree of the parameters within the group is used as the weighting coefficient and, with the objective of minimizing the deviation between the corrected adjustment amount and its initial value, the adjustment amount of each process control parameter within the group is iteratively corrected within the limits of the linkage adjustment ratio range, the adjustment direction matching rule, and the adjustment conflict avoidance constraint, until the corrected adjustment amounts simultaneously satisfy the coupled group linkage adjustment rule set and the single-parameter adjustment boundary constraint set, generating a compliant adjustment set for each coupled control group. The compliant adjustment sets of all coupled control groups are summarized to obtain the coupled parameter compliant adjustment set. Finally, the independent parameter compliant adjustment set and the coupled parameter compliant adjustment set are summarized to obtain the full-parameter adjustment set to be executed;
Based on the safe operating upper and lower limits of each process control parameter, the predicted value of each process control parameter after adjustment is calculated, where the predicted value is the sum of the current value of the process control parameter and the corresponding adjustment amount; if the predicted value exceeds the safe operating range, a secondary clamping correction is performed on the adjustment amount of that parameter so that the safe operating range is not exceeded, and the final full-parameter adjustment set is generated after correction; based on the final full-parameter adjustment set, and combined with the control protocol of the actuator corresponding to each process control parameter, the adjustment amount of each process control parameter is converted into the digital control instruction of the corresponding actuator, and the steady-state response time threshold corresponding to the process control parameter is attached to each digital control instruction to complete the generation of action instructions; the action instructions are sent to the actuators in the feed processing process, and the state space at the next moment after the action instructions are executed and the reward value calculated according to the reward function are obtained. The current state space, the executed instructions, the reward value, and the next state space are stored as an experience tuple in the experience replay pool. Multiple experience tuples are randomly sampled from the experience replay pool to iteratively optimize the network parameters of the deep reinforcement learning agent.
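As a non-limiting illustration, the secondary clamping correction and the experience replay pool may be sketched as follows. The class name `ReplayPool`, the fixed capacity with oldest-first eviction, and the uniform sampling without replacement are assumptions of this sketch; the claim only requires that experience tuples be stored and randomly sampled.

```python
import random
from collections import deque


def secondary_clamp(current, delta, low, high):
    """Shrink delta so that current + delta stays within [low, high]."""
    predicted = current + delta            # predicted value after adjustment
    clamped = max(low, min(high, predicted))
    return clamped - current               # corrected adjustment amount


class ReplayPool:
    """Fixed-capacity pool of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=None):
        # deque with maxlen discards the oldest tuple once full (assumption).
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self._buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample up to batch_size distinct tuples."""
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self):
        return len(self._buffer)
```

For example, a parameter currently at 80.0 with a safe operating range of 60.0 to 90.0 and a requested adjustment of 15.0 would have its adjustment corrected to 10.0 in this sketch.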

2. The multi-parameter collaborative control method for feed processing based on reinforcement learning as described in claim 1, characterized in that the specific process of dynamically adjusting the weight coefficients of the four core sub-items and the coupled control linkage compliance penalty sub-item based on the current state space data is as follows: the equipment load rate, the deviation of the material characteristic parameters from the formula baseline values, the production progress completion rate, the disturbance event occurrence level, the grid peak and valley electricity price period identifier, and the equipment energy efficiency degradation coefficient are extracted from the current state space, and the weight coefficients of the process constraint violation penalty sub-item, the product consistency reward sub-item, the output revenue sub-item, the coupled control linkage compliance penalty sub-item, and the energy consumption cost sub-item are adjusted accordingly. When the equipment load rate exceeds the preset load threshold, the weight coefficient of the process constraint violation penalty sub-item is increased in proportion to the excess ratio; when the deviation of the material characteristic parameters from the formula baseline values exceeds the preset deviation threshold, the weight coefficient of the product consistency reward sub-item is increased according to the deviation magnitude; when the production progress completion rate is lower than the preset progress threshold, the weight coefficient of the output revenue sub-item is increased according to the lag ratio.
When a disturbance event is detected, the weight coefficient of the coupled control linkage compliance penalty sub-item is increased according to the disturbance event level; when the power grid is in a peak electricity price period or the equipment energy efficiency degradation coefficient exceeds the preset degradation threshold, the weight coefficient of the energy consumption cost sub-item is increased according to the electricity price increase ratio or the energy efficiency degradation magnitude; when the power grid is in a valley electricity price period and the equipment energy efficiency degradation coefficient is lower than the preset degradation threshold, the weight coefficient of the energy consumption cost sub-item is decreased.
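As a non-limiting illustration, the weight adjustment rules of claim 2 may be sketched as follows. The dictionary keys, the multiplicative scaling by relative excess, and the `disturbance_gain` and `valley_factor` values are all assumptions of this sketch; the claim specifies only the direction of each adjustment and the quantity it depends on.

```python
def adjust_weights(base, state, thr, disturbance_gain=0.5, valley_factor=0.8):
    """Return a copy of the base weights adjusted for the current state."""
    w = dict(base)
    # Load rate above threshold: raise the constraint violation penalty weight.
    if state["load_rate"] > thr["load_rate"]:
        excess = (state["load_rate"] - thr["load_rate"]) / thr["load_rate"]
        w["constraint_penalty"] *= 1.0 + excess
    # Material deviation above threshold: raise the consistency reward weight.
    if state["material_deviation"] > thr["material_deviation"]:
        excess = state["material_deviation"] - thr["material_deviation"]
        w["consistency"] *= 1.0 + excess / thr["material_deviation"]
    # Production progress lagging: raise the output revenue weight.
    if state["progress"] < thr["progress"]:
        lag = (thr["progress"] - state["progress"]) / thr["progress"]
        w["output"] *= 1.0 + lag
    # Disturbance detected: raise the coupled linkage compliance penalty weight.
    if state["disturbance_level"] > 0:
        w["coupling_penalty"] *= 1.0 + disturbance_gain * state["disturbance_level"]
    # Peak tariff or degraded efficiency: raise the energy cost weight.
    peak = state["tariff_period"] == "peak"
    degraded = state["degradation"] > thr["degradation"]
    if peak or degraded:
        price_term = state["price_increase_ratio"] if peak else 0.0
        degradation_term = (state["degradation"] - thr["degradation"]) if degraded else 0.0
        w["energy_cost"] *= 1.0 + max(price_term, degradation_term)
    # Valley tariff with healthy efficiency: lower the energy cost weight.
    elif state["tariff_period"] == "valley" and state["degradation"] < thr["degradation"]:
        w["energy_cost"] *= valley_factor
    return w
```

The sketch leaves the adjusted weights unnormalized; whether the weights are renormalized after adjustment is not stated in the claim.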

3. A multi-parameter collaborative control system for feed processing based on reinforcement learning, used to implement the multi-parameter collaborative control method for feed processing based on reinforcement learning as described in any one of claims 1-2, characterized in that it comprises:
A multi-source data acquisition and preprocessing module, used to synchronously collect equipment operation status data, material characteristic data, and environmental condition data at the equipment layer through a sensor array deployed on the feed processing production line, according to a preset high-frequency sampling period. It performs timestamp alignment and data cleaning on the collected multi-source heterogeneous data, removing outliers and noise points, and uses interpolation to fill in missing data caused by sensor interruptions, forming a standardized time-series data stream. From the time-series data stream, it extracts the current equipment operation status, material characteristic parameters, and process control parameters, performs normalization processing, and combines them to construct a multi-dimensional state vector representing the current processing condition. This multi-dimensional state vector is used as the input state of the reinforcement learning model.
A reinforcement learning model construction and configuration module, communicatively connected to the multi-source data acquisition and preprocessing module and used to construct the state space, action space, and reward function of the reinforcement learning model. The state space is constructed based on the multi-dimensional state vector, combined with parameter coupling correlation features, temporal dynamic features, and constraint boundary features. The action space is defined as the adjustment amounts of the process control parameters that can be controlled online in real time. The controllable parameters are grouped according to the coupling correlation between parameters, and a single-parameter adjustment boundary constraint set and a coupled group linkage adjustment rule set are set for the independent control group and the coupled control groups, respectively. The reward function is constructed based on preset collaborative control objectives, which include maximizing output, minimizing energy consumption, the product consistency index, the process constraint violation penalty sub-item, and the coupled control linkage compliance penalty sub-item. The weight coefficient of each sub-item is dynamically corrected in real time based on the state space data at the current moment.
A deep reinforcement learning decision module, in which a policy network is deployed and which is communicatively connected to the multi-source data acquisition and preprocessing module and the reinforcement learning model construction and configuration module. It is used to input the current state space into the policy network, which then outputs the original action; the original action includes the adjustment direction and magnitude of each process control parameter.
An action correction and instruction generation module, communicatively connected to the deep reinforcement learning decision module. It is used to correct the original action based on preset process safety constraints and parameter coupling relationships, and to generate action instructions. Specifically, this includes: inversely mapping the original action to the actual adjustment amount of each adjustable parameter, and splitting the adjustment amounts into the adjustment subsets of the independent control group and of each coupled control group according to the grouping results of the coupling characteristics; performing compliance verification and linkage iterative correction, in sequence, on the adjustment amounts of the independent control group and of each coupled control group based on the single-parameter adjustment boundary constraint set and the coupled group linkage adjustment rule set; summarizing the corrected adjustment amounts into the full-parameter adjustment set to be executed, and performing secondary clamping correction on the predicted values of the adjusted parameters based on the safe operating upper and lower limits of each process control parameter to generate the final full-parameter adjustment set; and finally, converting the adjustment amounts into the digital control instructions of the corresponding actuators, and attaching the steady-state response time threshold corresponding to the process control parameter to each digital control instruction.
An instruction execution and feedback acquisition module, communicatively connected to the multi-source data acquisition and preprocessing module and the action correction and instruction generation module. It is used to send the action instructions to the actuators in the feed processing process, and to obtain the state space at the next moment after the action instructions are executed and the reward value calculated according to the reward function.
An experience replay and network training module, communicatively connected to the reinforcement learning model construction and configuration module, the deep reinforcement learning decision module, and the instruction execution and feedback acquisition module. It is used to store the current state space, the executed instructions, the reward value, and the next state space as an experience tuple in the experience replay pool, and to randomly sample multiple experience tuples from the experience replay pool to iteratively optimize the network parameters of the deep reinforcement learning agent.

4. An electronic device, characterized in that the device includes a memory and a processor, the memory being used to store a computer program, and the processor running the computer program to cause the electronic device to perform the multi-parameter collaborative control method for feed processing based on reinforcement learning as described in any one of claims 1-2.

5. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the multi-parameter collaborative control method for feed processing based on reinforcement learning as described in any one of claims 1-2.