An end-to-end autonomous driving method based on a physical information fusion model

By employing an end-to-end autonomous driving method based on a physical information fusion model, a physical prediction head and a state transition network are used to construct an objective function that integrates physical penalties and optimize policy parameters. This addresses the problem of insufficient physical constraints in existing technologies and improves the safety and stability of end-to-end systems in complex scenarios.

CN122300550A · Pending · Publication Date: 2026-06-30 · SHENZHEN AUTOMOTIVE RES INST BEIJING INST OF TECH (SHENZHEN RES INST OF NAT ENG LAB FOR ELECTRIC VEHICLES) +1


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHENZHEN AUTOMOTIVE RES INST BEIJING INST OF TECH (SHENZHEN RES INST OF NAT ENG LAB FOR ELECTRIC VEHICLES)
Filing Date: 2026-04-22
Publication Date: 2026-06-30

Application Information

Patent Timeline

22 Apr 2026: Application
30 Jun 2026: Publication (CN122300550A)

IPC: B60W60/00; G06N3/045; G06N3/042; G06N3/082; G06N3/098; G05B13/04

AI Tagging

Technology Topics

Ground truth Engineering

Similar Technology Patents

Federative learning methods and systems that can reduce domain skew between clients
JP2026110542A · Ground truth · Feature vector
Methods and systems for determining contribution value of content used to train machine learning models
US20260178904A1 · Neural learning methods · Ground truth · Data class
Pose estimation model training methods, devices, terminal equipment, and storage media
CN122313221A · Pattern recognition · Color image
A parachute three-dimensional reconstruction and parameter identification method
CN122312894A · Ground truth · Physical space
Model training method, image processing method, device, and program product
CN122243816A · Improve realism · rich diversity · Image enhancement · Image analysis · Ground truth · Imaging processing

AI Technical Summary

Technical Problem

Existing end-to-end autonomous driving methods give insufficient consideration to physical constraints. This mainly manifests as low sample utilization efficiency, limited policy generalization ability, and physical constraints that do not participate in environmental state modeling, which makes it difficult for control policies to guarantee safety and stability under complex operating conditions.

Method used

An end-to-end autonomous driving approach based on a physical information fusion model is adopted. By constructing an environmental perception and experience replay pool, a world model with physical information is trained to learn latent space policies constrained by physical indicators. A physical prediction head and a state transition network are used to construct an objective function that incorporates physical penalties, optimize policy parameters, and ensure the safety and stability of the policy in the latent space.

Benefits of technology

Explicitly mapping the latent state to physical dynamics indicators improves the decision-making safety and control stability of end-to-end systems in complex scenarios, enables predictive risk avoidance, and addresses the reaction lag and control instability of traditional methods under extreme conditions.

✦ Generated by Eureka AI based on patent content.

[Figure: abstract drawing, CN122300550A_ABST]

Patent Text Reader

Abstract

This invention discloses an end-to-end autonomous driving method based on a physical information fusion model, comprising: S1, collecting environmental observations, vehicle control actions, and calculated ground truth values of physical features, and storing them in an experience replay pool; S2, training a world model that fuses physical information, which jointly learns latent state evolution and physical indicator prediction through an encoder, a state transition network, and a physical prediction head, and optimizes by minimizing the joint loss incorporating physical prediction errors; S3, based on the trained world model, performing policy learning in the latent space that considers physical indicator constraints to achieve preventative risk avoidance and smooth control; S4, deploying the trained policy to the vehicle controller for online control. This invention significantly improves the safety, stability, and physical consistency of autonomous driving systems in complex scenarios by explicitly modeling and predicting physical indicators in the world model and using them as forward-looking constraints in policy learning.


Description

Technical Field

[0001] This invention relates to the field of autonomous driving technology, and in particular to an end-to-end autonomous driving method based on a physical information fusion model.

Background Technology

[0002] With the development of autonomous driving technology, end-to-end autonomous driving methods have gradually become a research hotspot. These methods utilize deep learning models to directly map perception inputs to control outputs, which can reduce the accumulation of errors between perception, prediction, and planning modules in traditional modular autonomous driving systems, thereby improving the overall system response speed and performance.

[0003] In recent years, with the development of model-based reinforcement learning, world models have been widely used in continuous control tasks such as autonomous driving. A world model learns the state transition patterns of the environment in a latent space and can therefore predict future states, enabling "learning from imagination." Compared with traditional data-driven methods that learn only from real interactions, world models can significantly improve data utilization efficiency and offer better generalization and environmental adaptability.

[0004] However, existing end-to-end autonomous driving methods have shortcomings in considering physical constraints, mainly manifested in the following two types of technical solutions and their drawbacks:

[0005] The first category comprises end-to-end autonomous driving methods that use reinforcement learning without a world model (model-free) but still consider physical constraints. These methods construct end-to-end policy models that directly map perception inputs to vehicle control outputs. To satisfy constraints, penalty terms are typically introduced into the reward function, or post-processing safety filtering modules are used to modify control commands. The disadvantages of this type of method are: 1) because a dynamic environment model is not explicitly constructed, learning relies on actual interactions with the environment, resulting in low sample utilization efficiency; 2) the reward function relies on manual design, making it difficult to fully characterize complex constraints and thus limiting policy generalization ability; 3) physical constraints are introduced mainly through the reward function or post-processing, without participating in the environmental state modeling process; 4) in long-horizon decision-making, it is difficult to guarantee that constraints remain satisfied, leading to safety and stability problems.

[0006] The second category comprises end-to-end autonomous driving methods based on world models. These methods typically use perception modules to acquire environmental information, map it to a latent space via encoders, construct a state transition model to predict future states, and optimize the control policy based on the predictions. The drawbacks of this type of method are: 1) physical constraints related to vehicle operation are not explicitly introduced when constructing the state transition model; 2) the resulting state predictions may be inconsistent with the actual physical process; 3) the control policy learned from these predictions carries safety uncertainties; and 4) it is difficult to guarantee, at the model level, that physical constraints remain consistent and effective during prediction.

[0007] In summary, when modeling the environment with world models, existing technologies mainly focus on predicting the spatiotemporal evolution of the external environment and lack explicit characterization and modeling of the key physical and dynamic characteristics of vehicle operation. This makes it difficult for the generated control policies to balance safety, stability, and physical consistency under complex operating conditions.

Summary of the Invention

[0008] To address the shortcomings of existing technologies, this invention provides an end-to-end autonomous driving method based on a physical information fusion model.

[0009] To achieve the above-mentioned objectives, the technical solution adopted by the present invention is as follows:

[0010] An end-to-end autonomous driving method based on a physical information fusion model includes the following steps:

[0011] S1. Environmental perception and experience replay pool construction: the environmental perception module acquires the environmental observation information $o_t$ of the vehicle's operation together with the vehicle control action $a_t$; the ground truth values of the physical features $p_t$ are calculated from state observation data; and the sequence data $(o_t, a_t, p_t)$ are stored in the experience replay pool;

[0012] S2. Construction and training of a world model integrating physical information: historical trajectory sequences are extracted from the experience replay pool to train the world model; the world model includes an encoder $E_\phi$, a state transition network $f_\theta$, and a physical prediction head $g_\psi$. The training includes: mapping the environmental observation sequence to a latent state sequence using the encoder; recursively predicting the current latent state using the state transition network based on the previous latent state and action; mapping the latent state sequence to a physical indicator prediction sequence using the physical prediction head; and optimizing the model parameters by minimizing the joint loss function $\mathcal{L}$. The joint loss function includes the world model base loss $\mathcal{L}_{\mathrm{wm}}$ and the physical prediction loss $\mathcal{L}_{\mathrm{phy}}$, where $\mathcal{L}_{\mathrm{phy}}$ measures the error between the physical indicator prediction sequence $\hat{p}_{1:T}$ and the physical feature ground truth sequence $p_{1:T}$;

[0013] S3. Latent space policy learning constrained by physical indicators: based on the trained world model, policy optimization is performed within the latent space; the policy optimization includes: using the world model to perform multi-step imagination under the current policy $\pi_\omega$, generating a latent state sequence and the corresponding physical indicator prediction sequence over the future horizon; constructing an objective function $J(\omega)$ that incorporates physical penalties, in which a penalty term based on the predicted future physical indicators is included when calculating the cumulative return; and updating the policy parameters by optimizing the objective function;

[0014] S4. Policy deployment and online control execution: the trained policy model is deployed to the vehicle controller and generates vehicle control commands from real-time environmental observations.

[0015] Further, in step S1, the ground truth values of the physical features $p_t$ include a safety indicator $p_t^{\mathrm{safe}}$ and a control stability indicator $p_t^{\mathrm{stab}}$;

[0016] The safety indicator $p_t^{\mathrm{safe}}$ is calculated from the time to collision $\mathrm{TTC}_t$, which in turn is calculated from the relative distance $d_t$ and the relative velocity $\Delta v_t$ between the ego vehicle and the vehicle in front;

[0017] The control stability indicator $p_t^{\mathrm{stab}}$ is calculated from the rate of change of the vehicle control variables.

[0018] Furthermore, in step S2, the world model base loss $\mathcal{L}_{\mathrm{wm}}$ includes a reconstruction loss, which ensures that the latent state can reconstruct the environmental observation information, and a KL divergence loss, which regularizes the distribution of the latent state.

[0019] Further, in step S2, the joint loss function is $\mathcal{L} = \mathcal{L}_{\mathrm{wm}} + \lambda_{\mathrm{phy}}\,\mathcal{L}_{\mathrm{phy}}$, where $\lambda_{\mathrm{phy}}$ is the weighting coefficient of the physical prediction loss.

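[0020] Furthermore, in step S3, the objective function $J(\omega)$ incorporating the physical penalties is expressed as:

[0021] $$J(\omega) = \mathbb{E}_{\pi_\omega}\left[\sum_{k=1}^{H} \gamma^{k-1}\left(r_{t+k} - \lambda_{\mathrm{safe}}\,\hat{p}^{\mathrm{safe}}_{t+k} - \lambda_{\mathrm{stab}}\,\hat{p}^{\mathrm{stab}}_{t+k}\right)\right]$$

[0022] where $r_{t+k}$ is the basic task reward, $\gamma$ is the discount factor, $\lambda_{\mathrm{safe}}$ and $\lambda_{\mathrm{stab}}$ are the penalty weights, and $\hat{p}^{\mathrm{safe}}_{t+k}$ and $\hat{p}^{\mathrm{stab}}_{t+k}$ are the safety and stability indicators at future moments predicted by the world model, with the sum taken over the $H$ imagined future steps.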

[0023] Furthermore, in step S2, the state transition network adopts a recurrent neural network or a temporal model based on an attention mechanism.

[0024] Furthermore, in step S2, the physical prediction head outputs a deterministic sequence of physical index values, or outputs a probability distribution of physical indices.
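The two output options for the physical prediction head imply different training losses. The following is a minimal sketch, assuming a squared error for the deterministic case and a Gaussian negative log-likelihood for the probabilistic case; the Gaussian form and the function names are illustrative assumptions and are not stated in the source.

```python
import math


def squared_error(prediction: float, target: float) -> float:
    """Loss for a deterministic head that outputs a single indicator value."""
    return (prediction - target) ** 2


def gaussian_nll(mean: float, log_var: float, target: float) -> float:
    """Negative log-likelihood of `target` under N(mean, exp(log_var)).

    Assumed loss for a probabilistic head that outputs a mean and a
    log-variance for each physical indicator.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (target - mean) ** 2 / var)
```

With the probabilistic variant, a head that reports a larger variance pays less for a large error, which is one way to express uncertainty about a predicted risk.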

[0025] Furthermore, in step S3, the constraint implementation method for policy optimization is as follows: introducing the predicted value of physical index as a penalty term into the reward function, or using the Lagrange multiplier method to dynamically adjust the constraint weights, or adding a physical rule-based safety projection layer at the output of the policy network.
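Of the three constraint options, the Lagrange multiplier variant is the least self-explanatory. The sketch below shows one common form, dual ascent on a single multiplier, under the assumption that the constraint is "predicted physical cost must not exceed a limit". The limit `cost_limit`, the step size, and the function names are hypothetical; the source does not specify them.

```python
def update_multiplier(multiplier: float, predicted_cost: float,
                      cost_limit: float, step_size: float) -> float:
    """Dual ascent step: raise the weight when the constraint is violated.

    The multiplier is projected back to zero so it never becomes negative.
    """
    violation = predicted_cost - cost_limit
    return max(0.0, multiplier + step_size * violation)


def lagrangian_return(task_return: float, predicted_cost: float,
                      cost_limit: float, multiplier: float) -> float:
    """Return that the policy maximizes for the current multiplier value."""
    return task_return - multiplier * (predicted_cost - cost_limit)
```

In this form the penalty weight grows while the predicted indicator stays above the limit and decays toward zero once the constraint is comfortably met, instead of being fixed by hand.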

[0026] The present invention also discloses a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the above-described end-to-end autonomous driving method.

[0027] The present invention also discloses a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the above-described end-to-end autonomous driving method.

[0028] Compared with the prior art, the advantages of the present invention are as follows:

[0029] 1. This invention explicitly maps the recursively evolving latent states into specific physical dynamics indices (such as safety and stability) within the world model through independent physical prediction heads. This mechanism internalizes physical constraints into the environmental model, enabling the abstract latent space to describe physical boundaries. This provides physically meaningful supervisory signals for downstream policy learning, ensuring that the policy update process is effectively guided by physics, thereby significantly improving the decision-making safety, control stability, and consistency with real physical laws of the end-to-end system in complex scenarios.

[0030] 2. This invention introduces multi-step physical indicators (such as collision risk and abrupt control changes) predicted by the world model as penalty terms into the reward function of policy learning. This enables the agent to proactively assess potential risks over a long horizon during the "imagined" trajectory rollout stage and to avoid them in advance during policy optimization. Compared with the delayed feedback mechanism in existing technologies, which imposes penalties only after a danger occurs, this invention prompts the policy network to generate predictive and preventative driving actions, effectively solving the technical bottlenecks of traditional methods such as delayed response and unstable control under extreme conditions, and achieving a deep integration of task completion efficiency and driving quality.

Brief Description of the Drawings

[0031] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0032] Figure 1 is a flowchart of an end-to-end autonomous driving method based on a physical information fusion model in an embodiment of the present invention.

Detailed Description

[0033] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0034] I. Hardware and Software Environment Configuration

[0035] The present invention can be implemented in an autonomous vehicle equipped with corresponding sensors or on its simulation platform.

[0036] Hardware environment: The vehicle needs to be equipped with environmental perception sensors (such as forward-facing cameras, LiDAR, millimeter-wave radar), positioning and attitude determination units (such as GNSS / IMU integrated navigation systems), and vehicle bus interfaces (such as CAN bus) to acquire environmental observation and vehicle status data. The core computing unit can adopt a high-performance onboard computing platform (such as an industrial control computer equipped with a GPU).

[0037] Software environment: The operating system can be Linux (such as Ubuntu), the programming language is Python, the deep learning framework can be PyTorch or TensorFlow, and the corresponding autonomous driving simulation environment (such as CARLA, LGSVL) needs to be installed for algorithm training and verification.

[0038] II. Specific Implementation Steps

[0039] As shown in Figure 1, this invention proposes an end-to-end autonomous driving method based on a physical information fusion model. The steps are: environmental perception and experience replay pool construction; training of a world model that fuses physical information; latent space policy learning constrained by physical indicators; and policy deployment and online control execution.

[0040] S1. Environmental perception and experience replay pool construction: the environmental perception module acquires the environmental observation information $o_t$ of the vehicle's operation and the vehicle control action $a_t$; the ground truth values of the physical features $p_t$ are calculated from state observation data (including radar ranging, vehicle speed, inertial acceleration, front wheel steering angle, etc.); and the sequence data $(o_t, a_t, p_t)$ are stored in the experience replay pool for subsequent offline model training.

[0041] The environmental observation information $o_t$ includes: bird's-eye view images, waypoint information for path planning, and the position and speed information of surrounding vehicles.

[0042] The vehicle control action $a_t$ includes: an acceleration command and a front wheel steering angle command.

[0043] The ground truth values of the physical features $p_t$ include: a safety indicator $p_t^{\mathrm{safe}}$ obtained by mapping the time to collision (TTC), and a control stability indicator $p_t^{\mathrm{stab}}$ obtained from the rate of change of the control quantities.
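The stored tuple of observation, action, and physical ground truth can be pictured with a small data structure. This is a minimal sketch, assuming flattened numeric features and a fixed-capacity pool; the class names, field layout, and sampling strategy are illustrative assumptions, and the sketch ignores episode boundaries for brevity.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Sequence


@dataclass(frozen=True)
class Transition:
    """One stored step: observation, control action, physical ground truth."""
    obs: Sequence[float]      # flattened observation features (hypothetical layout)
    action: Sequence[float]   # (acceleration command, front wheel steering angle command)
    phys: Sequence[float]     # (safety indicator, control stability indicator)


class ReplayPool:
    """Fixed-capacity pool that returns contiguous sub-sequences for training."""

    def __init__(self, capacity: int) -> None:
        # A bounded deque drops the oldest step once capacity is reached.
        self._steps: Deque[Transition] = deque(maxlen=capacity)

    def add(self, step: Transition) -> None:
        self._steps.append(step)

    def __len__(self) -> int:
        return len(self._steps)

    def sample_sequence(self, length: int, rng: random.Random) -> List[Transition]:
        """Draw one contiguous trajectory segment of the requested length."""
        if length > len(self._steps):
            raise ValueError("not enough stored steps for the requested length")
        start = rng.randrange(len(self._steps) - length + 1)
        steps = list(self._steps)
        return steps[start:start + length]
```

Sampling contiguous segments rather than independent steps matters here because the state transition network is trained on temporally ordered sequences.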

[0044] The safety indicator $p_t^{\mathrm{safe}}$ characterizes the degree of collision risk at time $t$ and is calculated from the time to collision $\mathrm{TTC}_t$:

[0045]

[0046] where $d_t$ is the relative distance between the ego vehicle and the vehicle in front at time $t$, $\Delta v_t$ is the relative speed between the ego vehicle and the vehicle in front, $\epsilon$ is a small value that prevents the denominator from being zero, and the remaining parameter is a regulating factor.
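The exact mapping from time to collision to the safety indicator is not reproduced in the text above. The following sketch shows one plausible form, assuming that the time to collision is the gap divided by the closing speed plus a small constant, and that an exponential decay with a time scale plays the role of the regulating factor. Both the exponential mapping and the parameter names `scale_s` and `eps` are assumptions made for illustration.

```python
import math


def time_to_collision(distance_m: float, closing_speed_mps: float,
                      eps: float = 1e-3) -> float:
    """Gap divided by closing speed; infinite when the gap is not closing."""
    if closing_speed_mps <= 0.0:
        return math.inf
    return distance_m / (closing_speed_mps + eps)


def safety_indicator(distance_m: float, closing_speed_mps: float,
                     scale_s: float = 2.0, eps: float = 1e-3) -> float:
    """Map TTC to a risk score in [0, 1]; higher means more dangerous.

    The exponential mapping is an assumed choice, not the source's formula.
    """
    ttc = time_to_collision(distance_m, closing_speed_mps, eps)
    return math.exp(-ttc / scale_s)
```

Under this assumed mapping a vehicle that is pulling away yields a risk of zero, and the risk approaches one as the time to collision approaches zero.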

[0047] The control stability indicator $p_t^{\mathrm{stab}}$ characterizes ride comfort and stability at time $t$ and is constructed from changes in the vehicle's motion state:

[0048] where $\dot{v}_t$ is the acceleration at time $t$, $\dot{\delta}_t$ is the front wheel steering angular velocity at time $t$, and $w_1$ and $w_2$ are weight parameters.
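The formula for the control stability indicator is likewise missing from the extracted text. As an illustration only, the sketch below assumes a weighted sum of the squared rate of change of acceleration and the squared steering rate, both estimated by finite differences over one control period. The quadratic form, the finite-difference estimate, and all parameter names are assumptions.

```python
def stability_indicator(acc_prev: float, acc_now: float,
                        steer_prev: float, steer_now: float,
                        dt: float, w_acc: float = 1.0, w_steer: float = 1.0) -> float:
    """Weighted squared rates of change of the two control quantities.

    Larger values mean more abrupt control; zero means both commands are constant.
    """
    if dt <= 0.0:
        raise ValueError("dt must be positive")
    acc_rate = (acc_now - acc_prev) / dt        # change of acceleration per second
    steer_rate = (steer_now - steer_prev) / dt  # steering angular velocity
    return w_acc * acc_rate ** 2 + w_steer * steer_rate ** 2
```

The two weights let the designer decide whether longitudinal jerk or steering activity dominates the stability score.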

[0049] S2. Construction and training of the world model integrating physical information: historical trajectory sequences of length $T$ are randomly sampled from the experience replay pool, and the world model, comprising the encoder $E_\phi$, the state transition network $f_\theta$, and the physical prediction head $g_\psi$, is jointly trained. The specific steps include latent state encoding, recursive prediction modeling, and joint loss optimization.

[0050] S21. Latent state encoding: the encoder $E_\phi$ maps the sampled environmental observation sequence $o_{1:T}$ to a high-dimensional latent space, generating the corresponding historical latent state sequence $z_{1:T}$.

[0051] S22. Recursive prediction modeling: the world model uses a recurrent neural network structure to model the temporal evolution. Using the sequence data sampled from the experience replay pool, the state transition network $f_\theta$ and the physical prediction head $g_\psi$ are trained jointly:

[0052] Recursive latent state evolution: at each time step $t$, the state transition network $f_\theta$ receives the latent state $z_{t-1}$ and the action $a_{t-1}$ of the previous moment as input and recursively generates the prior latent state at the current time step: $\hat{z}_t = f_\theta(z_{t-1}, a_{t-1})$.

[0053] Iterating this step $T$ times generates the corresponding latent state sequence.

[0054] Physical information synchronization mapping: the physical prediction head $g_\psi$ takes the latent state sequence as input and maps it to the corresponding physical indicator prediction sequence $\hat{p}_{1:T}$. The predicted value at each time step is $\hat{p}_t = g_\psi(z_t)$.

[0055] This mapping establishes an explicit relationship between the abstract latent states and the concrete physical dynamics characteristics of the vehicle.
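The recursion and the synchronized physical mapping can be sketched with tiny dense layers in plain Python. This is a simplified illustration, assuming a single tanh layer for the state transition network and a softplus output for the physical prediction head so that predicted indicators stay non-negative; a real implementation would use a deep learning framework, and the layer shapes and activation choices here are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def affine(weights: Matrix, bias: Vector, x: Sequence[float]) -> Vector:
    """Compute W x + b for a small dense layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def transition(z_prev: Sequence[float], a_prev: Sequence[float],
               weights: Matrix, bias: Vector) -> Vector:
    """One recurrent step: next latent from the previous latent and action."""
    return [math.tanh(v) for v in affine(weights, bias, list(z_prev) + list(a_prev))]


def physical_head(z: Sequence[float], weights: Matrix, bias: Vector) -> Vector:
    """Map a latent state to non-negative (safety, stability) predictions."""
    return [math.log1p(math.exp(v)) for v in affine(weights, bias, z)]  # softplus


def rollout(z0: Sequence[float], actions: Sequence[Sequence[float]],
            trans_params: Tuple[Matrix, Vector],
            head_params: Tuple[Matrix, Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Unroll the transition network and read out physical indicators per step."""
    latents: List[Vector] = []
    phys: List[Vector] = []
    z = list(z0)
    for a in actions:
        z = transition(z, a, *trans_params)
        latents.append(z)
        phys.append(physical_head(z, *head_params))
    return latents, phys
```

Each latent state produced by the recursion is passed through the head immediately, so the physical indicator sequence has the same length and time alignment as the latent sequence.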

[0056] S23. Joint loss function optimization: to ensure both the environment modeling accuracy of the world model and its ability to predict physical risks, a joint loss function $\mathcal{L}$ is constructed:

[0057] $$\mathcal{L} = \mathcal{L}_{\mathrm{wm}} + \lambda_{\mathrm{phy}}\,\mathcal{L}_{\mathrm{phy}}$$

[0058] where $\mathcal{L}_{\mathrm{wm}}$ is the world model base loss, which includes the reconstruction loss that ensures latent states can reproduce the environmental information and the KL divergence loss that regularizes the latent state distribution; $\lambda_{\mathrm{phy}}$ is the weighting coefficient of the physical prediction loss; and $\mathcal{L}_{\mathrm{phy}}$ is the physical prediction loss, the cumulative mean square error between the physical indicator prediction sequence $\hat{p}_{1:T}$ and the physical feature ground truth sequence $p_{1:T}$:

[0059] $$\mathcal{L}_{\mathrm{phy}} = \sum_{t=1}^{T} \left\lVert \hat{p}_t - p_t \right\rVert_2^2$$
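A small numeric sketch may help make the joint loss concrete. It assumes that the base loss (reconstruction plus KL divergence) has already been computed elsewhere and is passed in as a number, and that the physical loss is a plain sum of squared errors over the sequence; the function names and the absence of any averaging are assumptions.

```python
from typing import Sequence


def physical_prediction_loss(pred_seq: Sequence[Sequence[float]],
                             true_seq: Sequence[Sequence[float]]) -> float:
    """Sum of squared errors between predicted and true physical indicators."""
    if len(pred_seq) != len(true_seq):
        raise ValueError("prediction and ground truth sequences differ in length")
    return sum((p - y) ** 2
               for pred, true in zip(pred_seq, true_seq)
               for p, y in zip(pred, true))


def joint_loss(base_loss: float, pred_seq: Sequence[Sequence[float]],
               true_seq: Sequence[Sequence[float]], phys_weight: float) -> float:
    """World model base loss plus the weighted physical prediction loss."""
    return base_loss + phys_weight * physical_prediction_loss(pred_seq, true_seq)
```

For example, with a base loss of 1.0, a physical weight of 0.1, and squared errors that add up to 1.25, the joint loss is 1.125.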

[0060] S3. Latent space policy learning constrained by physical indicators: using the imagination capability provided by the world model, policy optimization guided by physical indicators is performed within the latent space:

[0061] S31. Multi-objective imagined trajectory generation: in the latent space environment simulated by the world model, the agent follows the current policy $\pi_\omega$ and executes an action sequence over an imagination horizon of $H$ steps. The world model incorporating physical information simultaneously generates the latent state sequence and the corresponding physical indicator predictions over the imagined future horizon:

[0062] $$\left\{\left(\hat{z}_{t+k},\, \hat{p}_{t+k}\right)\right\}_{k=1}^{H}, \qquad \hat{z}_{t+k} = f_\theta(\hat{z}_{t+k-1}, a_{t+k-1}), \qquad \hat{p}_{t+k} = g_\psi(\hat{z}_{t+k}), \qquad \hat{z}_t = z_t$$

[0063] This process resembles a human driver mentally rehearsing the vehicle's future safety and stability before making a decision.
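The imagination loop itself is short once the policy, the state transition network, and the physical prediction head are treated as callables. The sketch below assumes deterministic callables and a fixed horizon; the function name `imagine` and the tuple layout of each step are illustrative choices, not part of the source.

```python
from typing import Callable, List, Sequence, Tuple

Latent = Sequence[float]
Action = Sequence[float]
Step = Tuple[Latent, Action, Sequence[float]]  # (next latent, action taken, physical prediction)


def imagine(z_start: Latent,
            policy: Callable[[Latent], Action],
            transition: Callable[[Latent, Action], Latent],
            physical_head: Callable[[Latent], Sequence[float]],
            horizon: int) -> List[Step]:
    """Roll the policy forward inside the world model for `horizon` steps."""
    z = z_start
    trajectory: List[Step] = []
    for _ in range(horizon):
        a = policy(z)                 # action chosen from the current latent
        z = transition(z, a)          # imagined next latent, no real interaction
        trajectory.append((z, a, physical_head(z)))
    return trajectory
```

No environment step is taken inside the loop, which is what allows risks several steps ahead to be evaluated before any real action is executed.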

[0064] S32. Physically constrained policy learning: an objective function $J(\omega)$ incorporating physical penalties is constructed. When calculating the total return, the predicted safety indicator $\hat{p}^{\mathrm{safe}}$ and stability indicator $\hat{p}^{\mathrm{stab}}$ are introduced as dynamic soft-constraint penalty terms:

$$J(\omega) = \mathbb{E}_{\pi_\omega}\left[\sum_{k=1}^{H} \gamma^{k-1}\left(r_{t+k} - \lambda_{\mathrm{safe}}\,\hat{p}^{\mathrm{safe}}_{t+k} - \lambda_{\mathrm{stab}}\,\hat{p}^{\mathrm{stab}}_{t+k}\right)\right]$$

[0065] where $\pi_\omega$ is the control policy, $r$ is the basic reward (such as speed maintenance and path tracking), $\gamma$ is the discount factor, and $\lambda_{\mathrm{safe}}$ and $\lambda_{\mathrm{stab}}$ are the penalty weights assigned to the physical constraints. The penalty terms establish a "risk cost" mechanism: when the world model predicts that a collision may occur at some future step (…) or that violent oscillation will occur, the corresponding predicted indicator increases and the total return of the trajectory is significantly reduced.
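The "risk cost" mechanism can be checked numerically on a single imagined trajectory. This is a minimal sketch, assuming the discounted sum starts with exponent zero at the first imagined step and that both indicators are risk-like (larger is worse); the function and parameter names are hypothetical.

```python
from typing import Sequence


def penalized_return(rewards: Sequence[float],
                     safety_pred: Sequence[float],
                     stability_pred: Sequence[float],
                     gamma: float, w_safe: float, w_stab: float) -> float:
    """Discounted return with predicted physical indicators subtracted per step."""
    if not (len(rewards) == len(safety_pred) == len(stability_pred)):
        raise ValueError("all sequences must cover the same imagined steps")
    total = 0.0
    for k, (r, s, c) in enumerate(zip(rewards, safety_pred, stability_pred)):
        total += (gamma ** k) * (r - w_safe * s - w_stab * c)
    return total
```

With rewards of 1.0 at both steps, a discount of 0.5, and weights of 2.0 and 1.0, a trajectory with no predicted risk scores 1.5, while one with a stability cost of 0.2 at the first step and a safety risk of 0.5 at the second scores 0.8.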

[0066] S33. Policy parameter update under foresight constraints: the policy network parameters $\omega$ are updated by a gradient ascent algorithm. The physical constraint mechanism works as follows:

[0067] Preventative risk avoidance: because the penalty is based on the predicted values over the next $H$ steps, the parameter update of the policy network at the current moment is restrained by feedback from potential future risks. This prompts the policy to learn to avoid high-risk action regions during the training phase, rather than passively correcting itself after a collision occurs.

[0068] Physical boundary optimization: by adjusting the weight coefficients $\lambda_{\mathrm{safe}}$ and $\lambda_{\mathrm{stab}}$, the policy is forced to satisfy the driving task (maximizing the basic reward $r$) while also finding the action sequence that minimizes the physical risk indicators. This allows the final control policy to achieve the optimal trade-off between safety, ride comfort, and task performance within the limits of vehicle dynamics.
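The effect of the penalty weights on the optimum can be seen in a deliberately tiny example. The sketch below replaces the policy network with a single scalar parameter and uses a toy objective in which the reward grows linearly with the parameter while the stability penalty grows quadratically; the gradient is estimated by central differences. The toy objective and all names are assumptions for illustration, not the training procedure of the source.

```python
from typing import Callable


def toy_objective(theta: float, w_stab: float) -> float:
    """Reward for progress (theta) minus a quadratic stability penalty."""
    return theta - w_stab * theta ** 2


def gradient_ascent(objective_fn: Callable[[float], float], theta0: float,
                    lr: float = 0.1, steps: int = 200, h: float = 1e-5) -> float:
    """Maximize a scalar objective using a central-difference gradient estimate."""
    theta = theta0
    for _ in range(steps):
        grad = (objective_fn(theta + h) - objective_fn(theta - h)) / (2.0 * h)
        theta += lr * grad  # ascent: move in the direction that raises the objective
    return theta
```

For this toy objective the maximizer is $1/(2 w)$, so doubling the stability weight from 0.5 to 1.0 moves the learned parameter from about 1.0 to about 0.5, a less aggressive setting.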

[0069] S4. Policy deployment and online control execution: the trained policy model is deployed to the vehicle controller and generates vehicle control commands from real-time environmental observations.

[0070] The trained policy model $\pi_\omega$ is deployed to the vehicle controller. During online operation, the environmental perception module acquires the real-time environmental observation $o_t$, and the encoder produces the current latent state $z_t$. Based on $z_t$, the policy model directly outputs the vehicle control action $a_t$ (including the acceleration command and the front wheel steering angle command), which is sent to the vehicle actuators to complete control.
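One online control step then reduces to encoding the observation and querying the policy. The sketch below assumes the encoder and policy are available as callables and adds a clamp to actuator limits before the command is sent; the clamp and its numeric limits are assumptions added for illustration and are not described in the source.

```python
from typing import Callable, Sequence, Tuple


def clamp(value: float, low: float, high: float) -> float:
    """Restrict a command to the closed interval [low, high]."""
    return min(max(value, low), high)


def control_step(observation: Sequence[float],
                 encoder: Callable[[Sequence[float]], Sequence[float]],
                 policy: Callable[[Sequence[float]], Tuple[float, float]],
                 acc_limits: Tuple[float, float] = (-3.0, 2.0),
                 steer_limits: Tuple[float, float] = (-0.5, 0.5)) -> Tuple[float, float]:
    """Encode the observation, query the policy, and clamp the commands."""
    z = encoder(observation)          # current latent state
    acc_cmd, steer_cmd = policy(z)    # acceleration and steering angle commands
    return (clamp(acc_cmd, *acc_limits), clamp(steer_cmd, *steer_limits))
```

Note that the world model's transition network and physical prediction head are not needed in this loop; only the encoder and the policy run on the vehicle controller.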

[0071] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0072] In another embodiment of the present invention, a terminal device is provided, comprising a processor and a memory. The memory stores a computer program, which includes program instructions. The processor executes the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. It is the computing and control core of the terminal, suitable for implementing one or more instructions, specifically suitable for loading and executing one or more instructions to achieve a corresponding method flow or corresponding function. The processor described in this embodiment of the present invention can be used in the operation of an end-to-end autonomous driving method based on a physical information fusion model.

[0073] In another embodiment of the present invention, a storage medium is provided, specifically a computer-readable storage medium (Memory), which is a memory device in a terminal device used to store programs and data. It is understood that the computer-readable storage medium here can include both the built-in storage medium in the terminal device and extended storage media supported by the terminal device. The computer-readable storage medium provides storage space that stores the terminal's operating system. Furthermore, the storage space also stores one or more instructions suitable for loading and execution by a processor; these instructions can be one or more computer programs (including program code). It should be noted that the computer-readable storage medium here can be high-speed RAM or non-volatile memory, such as at least one disk storage device.

[0074] One or more instructions stored in the computer-readable storage medium can be loaded and executed by a processor to implement the corresponding steps of the end-to-end autonomous driving method based on a physical information fusion model in the above embodiments.

[0075] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. This computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0076] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

[0077] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.

Claims

1. An end-to-end autonomous driving method based on a physical information fusion model, characterized in that it includes the following steps:
S1. Environmental perception and experience replay pool construction: the environmental perception module acquires the environmental observation information $o_t$ of the vehicle's operation together with the vehicle control action $a_t$; the ground truth values of the physical features $p_t$ are calculated from state observation data; and the sequence data $(o_t, a_t, p_t)$ are stored in the experience replay pool;
S2. Construction and training of a world model integrating physical information: historical trajectory sequences are extracted from the experience replay pool to train the world model; the world model includes an encoder $E_\phi$, a state transition network $f_\theta$, and a physical prediction head $g_\psi$; the training includes: mapping the environmental observation sequence to a latent state sequence using the encoder; recursively predicting the current latent state using the state transition network based on the previous latent state and action; mapping the latent state sequence to a physical indicator prediction sequence using the physical prediction head; and optimizing the model parameters by minimizing the joint loss function $\mathcal{L}$; the joint loss function includes the world model base loss $\mathcal{L}_{\mathrm{wm}}$ and the physical prediction loss $\mathcal{L}_{\mathrm{phy}}$, where $\mathcal{L}_{\mathrm{phy}}$ measures the error between the physical indicator prediction sequence $\hat{p}_{1:T}$ and the physical feature ground truth sequence $p_{1:T}$;
S3. Latent space policy learning constrained by physical indicators: based on the trained world model, policy optimization is performed within the latent space; the policy optimization includes: using the world model to perform multi-step imagination under the current policy $\pi_\omega$, generating a latent state sequence and the corresponding physical indicator prediction sequence over the future horizon; constructing an objective function $J(\omega)$ that incorporates physical penalties, in which a penalty term based on the predicted future physical indicators is included when calculating the cumulative return; and updating the policy parameters by optimizing the objective function;
S4. Policy deployment and online control execution: the trained policy model is deployed to the vehicle controller and generates vehicle control commands from real-time environmental observations.

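2. The method according to claim 1, characterized in that, in step S1, the ground truth values of the physical features $p_t$ include a safety indicator $p_t^{\mathrm{safe}}$ and a control stability indicator $p_t^{\mathrm{stab}}$; the safety indicator $p_t^{\mathrm{safe}}$ is calculated from the time to collision $\mathrm{TTC}_t$, which is calculated from the relative distance $d_t$ and the relative velocity $\Delta v_t$ between the ego vehicle and the vehicle in front; and the control stability indicator $p_t^{\mathrm{stab}}$ is calculated from the rate of change of the vehicle control variables.

3. The method according to claim 1, characterized in that, in step S2, the world model base loss $\mathcal{L}_{\mathrm{wm}}$ includes a reconstruction loss, which ensures that the latent state can reconstruct the environmental observation information, and a KL divergence loss, which regularizes the distribution of the latent state.

4. The method according to claim 1, characterized in that, in step S2, the joint loss function is $\mathcal{L} = \mathcal{L}_{\mathrm{wm}} + \lambda_{\mathrm{phy}}\,\mathcal{L}_{\mathrm{phy}}$, where $\lambda_{\mathrm{phy}}$ is the weighting coefficient of the physical prediction loss.

5. The method according to claim 1, characterized in that, in step S3, the objective function $J(\omega)$ incorporating the physical penalties is expressed as: $$J(\omega) = \mathbb{E}_{\pi_\omega}\left[\sum_{k=1}^{H} \gamma^{k-1}\left(r_{t+k} - \lambda_{\mathrm{safe}}\,\hat{p}^{\mathrm{safe}}_{t+k} - \lambda_{\mathrm{stab}}\,\hat{p}^{\mathrm{stab}}_{t+k}\right)\right]$$ where $r_{t+k}$ is the basic task reward, $\gamma$ is the discount factor, $\lambda_{\mathrm{safe}}$ and $\lambda_{\mathrm{stab}}$ are the penalty weights, and $\hat{p}^{\mathrm{safe}}_{t+k}$ and $\hat{p}^{\mathrm{stab}}_{t+k}$ are the safety and stability indicators at future moments predicted by the world model, with the sum taken over the $H$ imagined future steps.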

6. The method according to any one of claims 1-5, characterized in that, in step S2, the state transition network adopts a recurrent neural network or a temporal model based on an attention mechanism.

7. The method according to any one of claims 1-5, characterized in that, in step S2, the physical prediction head outputs a deterministic sequence of physical indicator values or a probability distribution over the physical indicators.

8. The method according to any one of claims 1-5, characterized in that, in step S3, the constraint for policy optimization is implemented by: introducing the predicted values of the physical indicators as a penalty term into the reward function, or using the Lagrange multiplier method to dynamically adjust the constraint weights, or adding a safety projection layer based on physical rules at the output of the policy network.

9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the method according to any one of claims 1-8.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps of the method according to any one of claims 1-8.