Path marking and automatic planning method for personalized virtual laparoscopic surgical robot

WO2026137488A1 · PCT designated stage · Publication Date: 2026-07-02 · QINGDAO UNIV


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: QINGDAO UNIV
Filing Date: 2024-12-30
Publication Date: 2026-07-02

Application Information


IPC: B25J9/16; A61B34/10; A61B34/20; A61B34/30; G06N3/092; G06N3/08


Abstract

A reinforcement learning method for path marking and automatic planning for a personalized virtual laparoscopic surgical robot, comprising: defining a state space and an action space of the virtual laparoscopic surgical robot in an abdominal cavity environment (S1), wherein the state space is used for describing current configurations of a surgical environment, and the action space is used for defining actions that can be executed by a robotic arm in the abdominal cavity environment; designing a reward function mapping the state space and the action space to a real-valued evaluation index, for representing an immediate reward obtained by executing action a in state S (S2); constructing a hierarchical DQN architecture, for decomposing a surgical path planning task into two levels: high-level policy planning and low-level action execution (S3); and performing reinforcement learning training on a path marking and automatic planning task of the virtual laparoscopic surgical robot on the basis of the hierarchical DQN architecture (S4). The method is a learning method capable of optimizing surgical path planning under the uncertainty of virtual laparoscopic surgery by incorporating systematic considerations of surgical safety.


Description

Path marking and automatic planning method for personalized virtual laparoscopic surgical robot

Technical Field

[0001] This application relates to the field of virtual medical technology, specifically to a path marking and automatic planning method for a personalized virtual laparoscopic surgical robot.

Background Technology

[0002] With the development of minimally invasive surgical techniques, laparoscopic surgical robots are playing an increasingly important role in clinical applications. However, traditional surgical path planning methods have many limitations when facing complex abdominal environments: firstly, they struggle to effectively handle the high uncertainty and dynamic changes of the surgical environment; secondly, they lack a systematic consideration of surgical safety; and thirdly, they cannot achieve real-time path optimization and adjustment. Existing technologies mainly rely on pre-set path planning algorithms, which are difficult to adapt to individualized patient anatomical differences and cannot effectively cope with unexpected situations during surgery. Furthermore, traditional methods perform poorly when dealing with multi-objective optimization problems (such as simultaneously considering safety, efficiency, and smoothness), making it difficult to meet the needs of precise minimally invasive surgery.

[0003] Chinese patent application CN119055330A discloses an ultrasound-guided intelligent puncture path planning method and system. In a reinforcement learning environment, it uses an improved Deep Q-Network algorithm for global path planning and, based on the global path, a newly designed Advantage Actor-Critic algorithm for local path optimization. An adaptive reward function is designed according to the path's safety, efficiency, and accuracy. Multiple agents are introduced, each responsible for a different sub-task such as safety, efficiency, or accuracy; a knowledge distillation mechanism enables knowledge sharing and collaborative learning among the agents; and an integrated teacher policy, constructed from the policies of all agents, guides the learning process of each agent. Combined with the learned Q-values, an improved Rapidly-exploring Random Trees algorithm is used for path sampling and optimization, and optimization iterates until convergence or until a preset number of iterations is reached, yielding the optimal puncture path. However, this method cannot be directly applied to path marking and automatic planning for personalized virtual laparoscopic surgical robots, nor can it automatically adjust the planning strategy to the personalized virtual anatomical models of different patients so as to adapt to individual cases.

[0004] Therefore, there is an urgent need for a new reinforcement learning method for path marking and automatic planning in personalized virtual laparoscopic surgical robots, so as to achieve smarter and safer surgical path planning.

Summary of the Invention

[0005] This application aims to at least partially address one of the technical problems in related technologies. To this end, this application provides a path marking and automatic planning method for an individualized virtual laparoscopic surgical robot, which can optimize surgical path planning to address the uncertainties of virtual laparoscopic surgery by incorporating systematic considerations of surgical safety.

[0006] To achieve the above objectives, in a first aspect, this application provides a reinforcement learning method for path marking and automatic planning applied to a virtual laparoscopic surgical robot, including:

[0007] Defining the state space and action space of the virtual laparoscopic surgical robot in the abdominal cavity environment, wherein the state space describes the current configuration of the surgical environment, including the current position of the robotic arm, the position of the target organ, and the relative positions of the surrounding anatomical structures, and the action space defines the actions that the robotic arm can perform in the abdominal cavity environment, including discrete actions and continuous actions;

[0008] Designing a reward function that maps the state space and action space to a real-valued evaluation index, representing the immediate reward obtained by performing action a in state S;

[0009] Constructing a hierarchical DQN architecture that decomposes the surgical path planning task into two levels: high-level policy planning and low-level action execution. In the hierarchical DQN architecture, the high-level Q-network formulates the overall surgical path strategy from global information, while the low-level Q-network focuses on executing fine local actions;

[0010] Performing reinforcement learning training on the path marking and automatic planning task of the virtual laparoscopic surgical robot based on the hierarchical DQN architecture.

[0011] Preferably, the state space further includes environmental features of the robotic arm of the virtual laparoscopic surgical robot, including the current posture, velocity, and acceleration information detected by the arm's sensors. The state S is represented by a multidimensional vector as follows: $S = [P_{arm}, P_{organ}, P_r, E_o]$

[0012] Here, $P_{arm} = (x_a, y_a, z_a)$ represents the current position of the robotic arm, $P_{organ} = (x_o, y_o, z_o)$ represents the position of the target organ, $P_r$ represents the position coordinates of the surrounding anatomical structures relative to the robotic arm, and $E_o$ represents the environmental features of the robotic arm.

[0013] Preferably, the discrete actions are predefined actions, including forward, backward, left, right, up, and down operations, each of which moves a fixed step size along a fixed direction. The continuous actions are parameterized as an action vector $a = (d_x, d_y, d_z, l)$, where $d = (d_x, d_y, d_z)$ is the direction vector and $l$ is the step size; the direction vector and step size are dynamically adjusted according to the surgical scenario.

[0014] Preferably, the reward function includes a goal-oriented reward and a safety reward. The goal-oriented reward characterizes how the distance to the target changes during surgical path planning. The safety reward characterizes whether a safe distance from important anatomical structures is maintained during surgical path planning.

[0015] Preferably, the reward function is expressed as: $R(s_t, a_t, s_{t+1}) = w_1 R_g(s_t, a_t, s_{t+1}) + w_2 R_s(s_t, a_t, s_{t+1})$

[0016] Here, $s_t$ is the current state; $s_{t+1}$ is the new state after performing action a; $a_t$ is the action vector; $w_1$, $w_2$, $w_3$ are weight coefficients; and $R_g$ is the goal-oriented reward, defined as: $R_g(s_t, a_t, s_{t+1}) = \|p_a(s_{t+1}) - p_t\|_2 - \|p_a(s_t) - p_t\|_2$

[0017] Here, $p_a(s_{t+1})$ is the new position of the robotic arm after performing action a, $p_a(s_t)$ is the original position of the robotic arm before action a is executed, and $p_t$ is the target position. $\|p_a(s_{t+1}) - p_t\|_2$ is the distance from the new state to the target state in surgical path planning, and $\|p_a(s_t) - p_t\|_2$ is the distance from the current state to the target state.

[0018] $R_s$ is the safety reward, defined as follows:

[0019] Here, $p_a(s_{t+1})$ is the new position of the robotic arm after performing action a, $p_i(s_t)$ is the position of the i-th anatomical structure, $N$ is the total number of anatomical structures in the current abdominal environment, and $\sigma$ adjusts the sensitivity of the distance penalty in the safety reward.

[0020] Preferably, the hierarchical DQN architecture is represented as follows: $Q(s,a) = Q_h(s, a_h; \theta_h) + Q_l(s, a_l; \theta_l)$

[0021] Here, $Q(s,a)$ is the expected return of taking action a in state S; $Q_h$ and $Q_l$ are the expected returns computed by high-level policy planning and low-level action execution, respectively; $\theta_h$ and $\theta_l$ are the corresponding network parameters; $a_h$ is the action of high-level policy planning, used to define the stages of the operation; and $a_l$ is the operation performed at the low level, including the specific displacement and angle adjustments of the robotic arm.

[0022] Preferably, the hierarchical DQN architecture is configured with an attention module, enabling the network to adaptively focus on the state information around the danger zone. The attention module is defined as follows: $A(s) = \mathrm{softmax}(W_a \tanh(W_s s + b_s) + b_a)$, $Q_{risk}(s,a) = Q(s,a) \cdot A(s)$

[0023] Here, $A(s)$ is the attention weight; $s$ is the input state vector; $W_a$ and $W_s$ are the attention weight matrix and the state mapping weight matrix, respectively; $b_a$ and $b_s$ are the attention bias and the state mapping bias; $\tanh$ is the activation function; softmax transforms the output into a probability distribution; and $Q_{risk}(s,a)$ is the expected return that takes the perceived surgical risk into account.

[0024] Preferably, the reinforcement learning training of the path marking and automatic planning task of the virtual laparoscopic surgical robot based on the hierarchical DQN architecture includes:

[0025] Training initialization, which includes setting network parameters and hyperparameters;

[0026] Surgical interaction data collection: the actual operation of the robotic arm of a laparoscopic surgical robot is simulated in a virtual surgical environment, and state, action, reward, and transition data are collected during the operation. During collection, an ε-greedy policy is adopted to balance exploration and exploitation: at each decision, a random action is selected with probability ε and the currently estimated optimal action is selected with probability 1-ε;

[0027] Experience replay training: batches of data are randomly sampled from the surgical interaction data for learning, and the hierarchical DQN architecture updates the network parameters by minimizing the mean squared error between the predicted Q-value and the target Q-value.
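The ε-greedy selection and the mean-squared-error objective described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function names are hypothetical, and the one-step bootstrapped target `r + γ·max Q(s', a')` is the standard DQN target, assumed here because the text does not spell out how the target Q-value is formed.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def td_target(reward, next_q_values, gamma, done):
    """Standard one-step DQN target (assumed form, not quoted from the source)."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def mse_loss(predicted, targets):
    """Mean squared error between predicted and target Q-values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(predicted)
```

With `epsilon=0.0` the selection is purely greedy and with `epsilon=1.0` purely random, which makes the exploration-exploitation trade-off explicit.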

[0028] Preferably, setting network parameters and hyperparameters includes:

[0029] Setting initial values for the weights and biases of the hierarchical DQN architecture;

[0030] Creating replicas (target networks) of the high-level Q-network and the low-level Q-network to compute the target Q-value;

[0031] Establishing an experience replay buffer to store experiences from the interaction between the surgical robot and the environment, where each experience includes a state, an action, a reward, the next state, and a flag indicating whether the episode has ended;

[0032] Setting the hyperparameters, that is, parameters that are not learned during training, including the learning rate, the discount factor, and the exploration rate ε.
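As an illustration of the experience replay buffer described above, here is a minimal sketch using only the Python standard library. The class name, field names, and the fixed-capacity eviction policy are assumptions made for the example; the text only specifies the five fields stored per experience.

```python
import random
from collections import deque, namedtuple

# One stored experience: state, action, reward, next state, end-of-episode flag.
Experience = namedtuple("Experience", "state action reward next_state done")


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest experiences are dropped first (assumed policy)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform random sampling without replacement.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```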

[0033] Preferably, the reinforcement learning training further includes:

[0034] Using the Double DQN algorithm to reduce the overestimation of Q-values;

[0035] Using prioritized experience replay to focus on more valuable surgical experiences;

[0036] Using gradient clipping to ensure training stability.
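The three refinements listed above can each be sketched in a few lines. The forms below are the commonly used ones (Double DQN action selection by the online network with evaluation by the target network, priorities proportional to |TD error|^α, and clipping by global norm); the source does not give formulas, so the function names and the `alpha` and `eps` defaults are assumptions.

```python
import math


def double_dqn_target(reward, next_q_online, next_q_target, gamma, done):
    """Select the next action with the online network, evaluate it with the target network."""
    if done:
        return reward
    best = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best]


def priority_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha."""
    scaled = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]
```

In `double_dqn_target`, decoupling action selection from action evaluation is what reduces the upward bias of the plain `max` operator.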

[0037] In a second aspect, this application provides a path marking and automatic planning model for a personalized virtual laparoscopic surgical robot. The model is trained by reinforcement learning based on a hierarchical DQN architecture, which decomposes the surgical path planning task into two levels: high-level policy planning and low-level action execution. High-level policy planning formulates an overall surgical path strategy from global information, while low-level action execution focuses on fine-grained local actions. The model also includes a state space and an action space for the virtual laparoscopic surgical robot within the abdominal cavity environment. The state space describes the current configuration of the surgical environment, including the current position of the robotic arm, the position of the target organ, and the relative positions of the surrounding anatomical structures. The action space defines the actions that the robotic arm can perform within the abdominal cavity environment, including discrete and continuous actions. The model is configured with a reward function that maps the state space and the action space to a real-valued evaluation index, representing the immediate reward obtained by performing action a in state S.

[0038] In a third aspect, this application provides a path marking and automatic planning method based on the above path marking and automatic planning model, the method comprising:

[0039] Generating an initial path based on the hierarchical DQN architecture;

[0040] Performing global path planning in combination with path search algorithms, and optimizing path nodes using dynamic programming;

[0041] Optimizing the path with a B-spline interpolation algorithm, and minimizing the path length by gradient descent and convex optimization;

[0042] Using collision detection algorithms to ensure path safety.
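To make the path smoothing and collision check above more concrete, here is a small sketch. It evaluates a uniform cubic B-spline over consecutive waypoint quadruples (which approximates rather than strictly interpolates the waypoints) and tests the sampled points against obstacles. Modeling obstacles as spheres with a clearance margin is an assumption made for the example; the source does not specify the collision model, and the function names are hypothetical.

```python
import math


def cubic_bspline_point(p0, p1, p2, p3, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1)."""
    b0 = (1 - t) ** 3 / 6
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
    b3 = t**3 / 6
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))


def smooth_path(waypoints, samples_per_segment=4):
    """Sample a smooth curve from a list of 3D waypoints (needs at least 4)."""
    points = []
    for i in range(len(waypoints) - 3):
        for k in range(samples_per_segment):
            t = k / samples_per_segment
            points.append(cubic_bspline_point(*waypoints[i:i + 4], t))
    return points


def is_collision_free(path, obstacles, clearance=0.0):
    """obstacles: list of (center, radius) spheres (assumed obstacle model)."""
    return all(math.dist(p, center) >= radius + clearance
               for p in path for center, radius in obstacles)
```

A path that fails `is_collision_free` would be sent back to the planner for re-planning; how that loop is organized is not described in the source.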

[0043] Preferably, the path marking and automatic planning method further includes a real-time path adjustment strategy, which is configured as follows:

[0044] State estimation and prediction are performed using Kalman filters and recursive Bayesian estimation;

[0045] A particle filtering algorithm is used to handle nonlinear state changes and is combined with a motion prediction model to estimate organ deformation;

[0046] The planned path is locally adjusted using a model predictive control algorithm and receding-horizon optimization, combined with an adaptive sampling strategy and the dynamic window method. The adaptive sampling strategy dynamically adjusts the sampling frequency and distribution according to the current environmental state and target requirements.
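As a small illustration of the Kalman-filter state estimation mentioned above, the following sketch runs one predict-and-update cycle of a scalar filter. It assumes a random-walk motion model (the state is expected to stay constant between measurements) and scalar noise variances `q` and `r`; the actual state model used for organ motion is not given in the source.

```python
def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: previous state estimate and its variance
    z:    new measurement
    q, r: process and measurement noise variances (assumed known)
    """
    # Predict: random-walk model, so the estimate carries over and uncertainty grows.
    x_pred = x
    p_pred = p + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1 - gain) * p_pred
    return x_new, p_new
```

For example, starting from `x=0.0, p=1.0` and observing `z=2.0` with `q=0.0, r=1.0` gives a gain of 0.5, so the estimate moves halfway to the measurement.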

[0047] In a fourth aspect, this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the above-described path marking and automatic planning method.

[0048] In a fifth aspect, this application provides a computer-readable storage medium including a computer program that, when run on an electronic device, causes the electronic device to perform the aforementioned path marking and automatic planning method.

[0049] In a sixth aspect, this application provides a personalized virtual surgical operation device, including the aforementioned electronic device.

[0050] Based on the above technical solution, the reinforcement learning method for path marking and automatic planning for personalized virtual laparoscopic surgical robots proposed in this application has at least one of the following beneficial effects compared to the prior art:

[0051] 1. This application adopts a hierarchical DQN architecture, decomposing the surgical path planning task into two layers: a high-level policy planning layer and a low-level action execution layer. The high-level network is responsible for global path planning, while the low-level network ensures the accuracy of local operations. On this basis, a complete Markov Decision Process (MDP) framework is designed, including a comprehensive state space description, a flexible action space design, and a multi-objective reward function, realizing an intelligent decision-making system.

[0052] 2. To ensure surgical safety, this application introduces a surgical risk perception mechanism, enhances the perception of dangerous areas through an attention module, and designs a distance-based safety reward function and a collision detection algorithm. Through this multi-layered safety assurance mechanism, safer surgical path planning is achieved for personalized virtual laparoscopic surgical robots.

[0053] 3. This application can automatically adjust the planning strategy according to the virtual anatomical models of different patients, so as to adapt to individual cases.

[0054] Other features and advantages of this application will be set forth in the following description and will be apparent in part from the description, or may be realized by practicing the application. The purpose and other advantages of this application can be realized and obtained by means of the structures particularly pointed out in the written description and the accompanying drawings.

Description of the Drawings

[0055] Figure 1 is a flowchart of the reinforcement learning method for path marking and automatic planning applied to a virtual laparoscopic surgical robot according to this application;

[0056] Figure 2 is a block diagram of the path marking and planning system for the virtual laparoscopic surgical robot based on reinforcement learning in this application;

[0057] Figure 3 is a schematic diagram of liver cutting path planning and risk point identification based on reinforcement learning in this application;

[0058] Figure 4 is a schematic diagram of the acquisition and planning of the optimal path for liver resection in this application.

Detailed Description

[0059] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with specific embodiments and the accompanying drawings.

[0060] The terminology used in the embodiments of this application is for the purpose of describing particular embodiments only and is not intended to limit the embodiments of this application. The singular forms "a," "an," and "the" as used in the embodiments of this application are also intended to include the plural forms unless the context clearly indicates otherwise.

[0061] To address the shortcomings of existing technologies, this application aims to provide a reinforcement learning method for path marking and automatic planning in virtual laparoscopic surgical robots. By employing a hierarchical DQN architecture, the surgical path planning task is decomposed into two layers: a high-level policy planning layer and a low-level action execution layer. The high-level network is responsible for global path planning, while the low-level network ensures local operational accuracy. On this basis, a complete Markov Decision Process (MDP) framework is designed, including a comprehensive state-space description, a flexible action-space design, and a multi-objective reward function, thus realizing an intelligent decision-making system.

[0062] The basic idea of this application is to combine artificial intelligence (AI) with minimally invasive surgery and realize intelligent path planning for surgical robots through deep reinforcement learning algorithms. It is mainly applied to robot-assisted surgical systems in laparoscopic surgery, providing technical support for critical path marking, precise navigation and path optimization of surgical robots.

[0063] Example 1

[0064] As shown in Figure 1, in order to develop a safer and more intelligent reinforcement learning method for path marking and automatic planning for virtual laparoscopic surgical robots, the inventors conducted in-depth research on artificial intelligence and machine learning technologies and propose such a method, including:

[0065] S1. Define the state space and action space of the virtual laparoscopic surgical robot in the abdominal cavity environment, wherein the state space describes the current configuration of the surgical environment, including the current position of the robotic arm, the position of the target organ, and the relative positions of the surrounding anatomical structures, and the action space defines the actions that the robotic arm can perform in the abdominal cavity environment, including discrete actions and continuous actions;

[0066] S2. Design a reward function that maps the state space and action space to a real-valued evaluation index, representing the immediate reward obtained by performing action a in state S;

[0067] S3. Construct a hierarchical DQN architecture that decomposes the surgical path planning task into two levels: high-level policy planning and low-level action execution. In the hierarchical DQN architecture, the high-level Q-network formulates the overall surgical path strategy from global information, while the low-level Q-network focuses on executing fine local actions;

[0068] S4. Perform reinforcement learning training on the path marking and automatic planning task of the virtual laparoscopic surgical robot based on the hierarchical DQN architecture.

[0069] This application employs a hierarchical DQN architecture, decomposing the surgical path planning task into two layers: a high-level policy planning layer and a low-level action execution layer. The high-level network is responsible for global path planning, while the low-level network ensures local operational accuracy. On this basis, a complete Markov Decision Process (MDP) framework is designed, including a comprehensive state-space description, a flexible action-space design, and a multi-objective reward function, thus realizing an intelligent decision-making system.

[0070] Preferably, the state space further includes environmental features of the robotic arm of the virtual laparoscopic surgical robot, including the current posture, velocity, and acceleration information detected by the arm's sensors. The state S is represented by a multidimensional vector as follows: $S = [P_{arm}, P_{organ}, P_r, E_o]$

[0071] Each state S describes the current configuration of the surgical environment, where $P_{arm} = (x_a, y_a, z_a)$ represents the current position of the robotic arm, $P_{organ} = (x_o, y_o, z_o)$ represents the position of the target organ, $P_r$ represents the position coordinates of the surrounding anatomical structures relative to the robotic arm, and $E_o$ represents the environmental features of the robotic arm.

[0072] This application incorporates the environmental features $E_o$ of the robotic arm into the state space description. This state representation comprehensively describes the current configuration of the surgical environment, providing the information that reinforcement learning algorithms need to perform path planning.

[0073] This state-space design specifically considers the characteristics of virtual laparoscopic surgery. By integrating the current position of the robotic arm $P_{arm}$, the target organ position $P_{organ}$, the relative positional relationship $P_r$, and the environmental features of the robotic arm $E_o$ into a unified state representation, the system can comprehensively perceive the virtual surgical environment. This allows it to automatically adjust planning strategies based on the virtual anatomical models of different patients, adapting to individual cases.
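The state representation $S = [P_{arm}, P_{organ}, P_r, E_o]$ can be illustrated with a small container that flattens the four components into one vector. This is only a sketch: the class and field names are hypothetical, and the ordering and length of the $E_o$ feature list (posture, velocity, acceleration) are assumptions, since the source does not fix a layout.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def relative_positions(p_arm: Vec3, structures: List[Vec3]) -> List[Vec3]:
    """Positions of surrounding structures expressed relative to the arm (P_r)."""
    return [tuple(s_i - a_i for s_i, a_i in zip(s, p_arm)) for s in structures]


@dataclass
class SurgicalState:
    p_arm: Vec3          # P_arm: current arm position
    p_organ: Vec3        # P_organ: target organ position
    p_rel: List[Vec3]    # P_r: structures relative to the arm
    e_o: List[float]     # E_o: arm features (layout assumed for illustration)

    def to_vector(self) -> List[float]:
        """Flatten the state into a single list, in the order S = [P_arm, P_organ, P_r, E_o]."""
        flat = list(self.p_arm) + list(self.p_organ)
        for rel in self.p_rel:
            flat.extend(rel)
        flat.extend(self.e_o)
        return flat
```

The vector length grows with the number of anatomical structures, so a fixed-size network input would need padding or a fixed structure count; the source does not say which is used.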

[0074] Preferably, the discrete actions are predefined actions, including forward, backward, left, right, up, and down operations, each of which moves a fixed step size along a fixed direction. The continuous actions are parameterized as an action vector $a = (d_x, d_y, d_z, l)$, where $d = (d_x, d_y, d_z)$ is the direction vector and $l$ is the step size; the direction vector and step size are dynamically adjusted according to the surgical scenario.

[0075] This section defines the actions that the surgical robot (robotic arm) can perform in the abdominal cavity environment. This application divides actions into discrete and continuous types. In the discrete action space, the robotic arm's actions are drawn from a set of predefined discrete actions, suitable for simple, fast decision-making scenarios (such as robotic arm locking, fixed surgical angle, and task pause). These basic actions include forward, backward, left, right, up, and down movements, each moving a fixed step length in a fixed direction, and are represented by discrete identifiers such as $A = \{a_1, a_2, a_3, a_4, a_5, a_6\}$. In the continuous action space, the robotic arm can move in any direction and with any step length in three-dimensional space, suitable for scenarios requiring high precision and flexibility. Continuous actions are represented parametrically: the direction vector $d = (d_x, d_y, d_z)$ and the step size $l$ together form the action vector $a = (d_x, d_y, d_z, l)$. The direction vector and step size are dynamically adjusted according to the surgical scenario. This flexible action space design enables the surgical robot to perform precise path planning and operation in the complex abdominal cavity environment, ensuring real-time response in the virtual laparoscopic surgical environment.
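A brief sketch of how the two action types could update the arm position follows. The axis convention for the six discrete moves, the `FIXED_STEP` value, and the choice to normalize the direction vector before scaling by the step size are all assumptions for illustration; the source only states that discrete actions use a fixed step along a fixed direction and that continuous actions are parameterized as $(d_x, d_y, d_z, l)$.

```python
import math

FIXED_STEP = 1.0  # hypothetical fixed step size for discrete actions

# Hypothetical axis convention for the six discrete actions a1..a6.
DISCRETE_ACTIONS = {
    "forward": (0, 1, 0), "backward": (0, -1, 0),
    "left": (-1, 0, 0), "right": (1, 0, 0),
    "up": (0, 0, 1), "down": (0, 0, -1),
}


def apply_discrete(position, name, step=FIXED_STEP):
    """Move a fixed step along the fixed direction of the named action."""
    direction = DISCRETE_ACTIONS[name]
    return tuple(p + step * d for p, d in zip(position, direction))


def apply_continuous(position, action):
    """Apply a = (d_x, d_y, d_z, l): move l units along the normalized direction."""
    dx, dy, dz, step = action
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    if norm == 0:
        return tuple(position)  # no direction given, so no movement
    return tuple(p + step * d / norm for p, d in zip(position, (dx, dy, dz)))
```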

[0076] Preferably, the reward function includes a goal-oriented reward and a safety reward. The goal-oriented reward characterizes how the distance to the target changes during surgical path planning. The safety reward characterizes whether a safe distance from important anatomical structures is maintained during surgical path planning.

[0077] Preferably, the reward function is expressed as: $R(s_t, a_t, s_{t+1}) = w_1 R_g(s_t, a_t, s_{t+1}) + w_2 R_s(s_t, a_t, s_{t+1})$

[0078] Here, $s_t$ is the current state; $s_{t+1}$ is the new state after performing action a; $a_t$ is the action vector; $w_1$, $w_2$, $w_3$ are weight coefficients; and $R_g$ is the goal-oriented reward, a key focus of automatic surgical path planning and marking systems, defined as: $R_g(s_t, a_t, s_{t+1}) = \|p_a(s_{t+1}) - p_t\|_2 - \|p_a(s_t) - p_t\|_2$

[0079] Here, $p_a(s_{t+1})$ is the new position of the robotic arm after performing action a, $p_a(s_t)$ is the original position of the robotic arm before action a is executed, and $p_t$ is the target position. $\|p_a(s_{t+1}) - p_t\|_2$ is the distance from the new state to the target state in surgical path planning, and $\|p_a(s_t) - p_t\|_2$ is the distance from the current state to the target state.

[0080] $R_s$ is the safety reward, which ensures that the surgical robot maintains a safe distance from surrounding important anatomical structures (such as blood vessels, organs, and tumors) during motion path planning, preventing collisions or damage. It is defined as follows:

[0081] Here, $p_a(s_{t+1})$ is the new position of the robotic arm after performing action a, $p_i(s_t)$ is the position of the i-th anatomical structure, $N$ is the total number of anatomical structures in the current abdominal environment, and $\sigma$ adjusts the sensitivity of the distance penalty in the safety reward.

[0082] This reward mechanism provides a negative reward (penalty) when the robotic arm approaches a danger zone. The reward function is specifically designed to meet the safety requirements of virtual laparoscopic surgery. By calculating the distance between the robotic arm and key anatomical structures in the virtual environment, it dynamically assesses operational risk and ensures that the generated path avoids danger zones.
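The goal-oriented term and the weighted combination can be sketched as below. The safety term is passed in as a precomputed number because its formula is not reproduced in this text. Note that, taken literally, $R_g$ is negative when the arm moves closer to the target, so whether approaching is rewarded depends on the sign of $w_1$; the negative `w1` in the usage note is purely an illustrative assumption, not something the source states.

```python
import math


def goal_reward(p_next, p_prev, p_target):
    """R_g as written: new distance to target minus previous distance to target."""
    return math.dist(p_next, p_target) - math.dist(p_prev, p_target)


def total_reward(r_goal, r_safety, w1, w2):
    """R = w1 * R_g + w2 * R_s, with R_s supplied by the caller."""
    return w1 * r_goal + w2 * r_safety
```

For a step from (2, 0, 0) to (1, 0, 0) with the target at the origin, `goal_reward` returns -1.0; with the assumed weights `w1=-1.0, w2=1.0` and a safety penalty of -0.5, `total_reward` gives 0.5.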

[0083] For virtual surgical path planning, this application improves upon the Deep Q-Network (DQN) to better adapt to the laparoscopic surgical environment. To this end, a hierarchical DQN architecture is designed, expressed as follows: $Q(s,a) = Q_h(s, a_h; \theta_h) + Q_l(s, a_l; \theta_l)$

[0084] Here, $Q(s,a)$ is the expected return of taking action a in state S; $Q_h$ and $Q_l$ are the expected returns computed by high-level policy planning and low-level action execution, respectively; $\theta_h$ and $\theta_l$ are the corresponding network parameters; $a_h$ is the action of high-level policy planning, used to define the stages of surgery, such as selecting the main puncture point and determining the approach path to the organ; and $a_l$ is the operation performed at the low level, including the specific displacement and angle adjustments of the robotic arm.
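The additive decomposition $Q(s,a) = Q_h + Q_l$ can be illustrated with plain dictionaries standing in for the two network outputs at a fixed state. The action labels are hypothetical. One consequence worth noting: if the two terms really are independent given the state, as the formula suggests, the best joint action is simply the best high-level action paired with the best low-level action.

```python
def combined_q(q_high, q_low):
    """Q(s, a) for every joint action a = (a_h, a_l) as the sum Q_h + Q_l."""
    return {(a_h, a_l): v_h + v_l
            for a_h, v_h in q_high.items()
            for a_l, v_l in q_low.items()}


def greedy_joint_action(q_high, q_low):
    """The sum is separable, so each level can be maximized on its own."""
    a_h = max(q_high, key=q_high.get)
    a_l = max(q_low, key=q_low.get)
    return a_h, a_l
```

In a full hierarchical design the low-level values would typically be conditioned on the chosen high-level action; the formula in the text does not show such conditioning, so it is not modeled here.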

[0085] Preferably, the high-order DQN architecture is configured with an attention module, enabling the high-order DQN architecture network to adaptively focus on the state information around the danger zone. The attention module is defined as follows: A(s) = softmax(W a tanh(W s s+b s )+b a Q risk (s,a)=Q(s,a)·A(s)

[0086] Where $A(s)$ is the attention weight, $s$ is the input state vector, $W_a$ and $W_s$ are the attention weight matrix and the state mapping weight matrix, respectively, $b_a$ and $b_s$ are the attention bias and the state mapping bias, respectively, tanh is the activation function, softmax is used to transform the output into a probability distribution, and $Q_{risk}(s,a)$ represents the expected return that takes into account the perceived surgical risk.
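The attention computation can be sketched in plain Python as follows. The text does not state the output dimension of $A(s)$; this sketch assumes one attention weight per action, so that the product $Q(s,a) \cdot A(s)$ is taken element-wise over actions. The helper names are illustrative.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def attention(s, W_s, b_s, W_a, b_a):
    """A(s) = softmax(W_a tanh(W_s s + b_s) + b_a)."""
    hidden = [math.tanh(h + b) for h, b in zip(matvec(W_s, s), b_s)]
    logits = [u + b for u, b in zip(matvec(W_a, hidden), b_a)]
    return softmax(logits)

def risk_q(q_values, weights):
    """Q_risk(s, a) = Q(s, a) * A(s), assuming one weight per action."""
    return [q * w for q, w in zip(q_values, weights)]
```

With all-zero parameters the weights are uniform; a larger entry in $b_a$ shifts weight toward the corresponding action.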

[0087] As mentioned above, this application also introduces a surgical risk perception mechanism. By adding an attention module to the DQN, the network can adaptively focus on the state information around dangerous areas (such as important blood vessels and nerves), thereby generating a safer surgical path.

[0088] This application employs a reinforcement learning algorithm to plan surgical paths, learning a safe and efficient surgical path planning and labeling strategy. The generated path ensures that the surgical objective is reached while avoiding dangerous areas, and maintains smooth movement. This design fully considers the specific needs of laparoscopic surgical robot operation tasks, enabling reinforcement learning algorithms to better serve clinical surgical applications. Compared to traditional surgical path planning methods, this reinforcement learning method has stronger expressive and generalization capabilities, better adapting to individualized differences in abdominal anatomy and the complexity of the surgical environment, providing reliable path planning support for precise minimally invasive surgery.

[0089] Preferably, the steps of reinforcement learning training for path labeling and automatic planning tasks of the virtual laparoscopic surgical robot based on a high-order DQN architecture include:

[0090] Training initialization includes setting network parameters and hyperparameters; specifically:

[0091] Set initial values for the weights and biases of the high-order DQN architecture;

[0092] Copies of the high-level Q-network and the low-level Q-network are created as target networks for calculating the target Q-value. A target network has the same structure as its main network but separately maintained parameters, and it is mainly used to stabilize the training process. The parameters of the main network are continuously updated during training, while the parameters of the target network are updated at a lower frequency to reduce the instability of Q-value calculation caused by frequent parameter updates.

[0093] Establish an experience replay buffer to store experiences during the interaction between the surgical robot and the environment. Each experience includes a state, an action, a reward, the next state, and a flag indicating whether the episode has ended.

[0094] Hyperparameters refer to parameters that are not obtained through learning during the training process, including learning rate, discount factor, and exploration rate ε.
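The initialization items above can be sketched as follows. The buffer stores the five fields named in the text; the capacity, the field names, and the hyperparameter values are assumptions chosen for illustration, since the text gives no concrete values.

```python
import random
from collections import deque, namedtuple

# One stored experience: state, action, reward, next state, end-of-episode flag.
Experience = namedtuple("Experience", "state action reward next_state done")

class ReplayBuffer:
    """Fixed-capacity buffer; the oldest experience is dropped when full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Draw a uniform random batch without replacement."""
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

# Hypothetical hyperparameter values (not taken from the source).
HYPERPARAMS = {"learning_rate": 1e-3, "discount_factor": 0.99, "epsilon": 0.1}
```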

[0095] Surgical interaction data is collected by simulating the actual operation of the robotic arm of a laparoscopic surgery robot in a virtual surgical environment and recording the states, actions, rewards, and state transitions during the surgery. During data collection, an ε-greedy strategy is adopted to balance exploration and exploitation, that is, to balance exploring new paths against exploiting existing knowledge. Under the ε-greedy strategy, at each decision a random action is selected with probability ε and the currently estimated optimal action is selected with probability 1-ε.
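A minimal sketch of the ε-greedy rule just described, assuming the Q-values of the current state are available as a list indexed by action; the function name is illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest estimated Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Setting `epsilon` to 0 makes the choice purely greedy, and setting it to 1 makes it purely random.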

[0096] During the experience replay training phase, batch data is randomly sampled from surgical interaction data for learning. The high-order DQN architecture updates the network parameters by minimizing the mean square error between the predicted Q value and the target Q value to adapt to the special needs of the surgical environment and ensure that the generated path is both safe and efficient.
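The mean square error objective can be sketched as follows. The text does not spell out the target Q value; this sketch assumes the standard one-step DQN target, $r + \gamma \max_{a'} Q(s', a')$, with the bootstrap term dropped at the end of an episode.

```python
def td_target(reward, next_q_values, done, gamma):
    """One-step target: r if the episode ended, else r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def mse_loss(predicted, targets):
    """Mean square error between predicted and target Q values over a batch."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(predicted)
```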

[0097] Preferably, the steps of conducting reinforcement learning training further include:

[0098] Use the double DQN algorithm to reduce the overestimation problem of Q-values;

[0099] Double DQN is an improved DQN algorithm that reduces the overestimation problem of Q-values by separating the action selection and evaluation processes. In double DQN, one network (the main network) selects actions, while another network (the target network) evaluates the value of the selected actions. This separation helps reduce the risk of overestimating Q-values because the network evaluating actions is not affected by potential biases in action selection.
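The separation of action selection and evaluation described above (commonly known as Double DQN) can be sketched as a target computation. The argument names are illustrative; `next_q_main` and `next_q_target` are assumed to hold the two networks' Q-values for the next state.

```python
def double_dqn_target(reward, next_q_main, next_q_target, done, gamma):
    """Select the next action with the main network, evaluate it with the target network."""
    if done:
        return reward
    best = max(range(len(next_q_main)), key=next_q_main.__getitem__)
    return reward + gamma * next_q_target[best]
```

For example, with main-network values `[0.2, 0.9]` and target-network values `[5.0, 1.0]`, the main network picks action 1 and the target network scores it as 1.0, so the large value 5.0 is not used.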

[0100] Use prioritized experience replay to focus on more valuable surgical experiences;

[0101] Prioritized experience replay is a variation of experience replay that samples experiences based on their importance (priority). In this approach, experiences that lead to larger Q-value updates (e.g., those that differ significantly from the expected Q-value) are given higher priority. This allows the network to focus more on experiences that are most helpful to the learning process, thereby improving learning efficiency.
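A minimal sketch of priority-based sampling follows. It assumes the proportional variant, in which the priority is the absolute temporal-difference error raised to a power; the exponent `alpha` and the small constant `eps` are assumptions, as the text does not give a priority formula.

```python
import random

def priorities_from_td_errors(td_errors, alpha=0.6, eps=1e-6):
    """Larger absolute TD error gives higher priority; eps keeps every priority positive."""
    return [(abs(e) + eps) ** alpha for e in td_errors]

def sample_indices(priorities, batch_size, rng=random):
    """Sample buffer indices with probability proportional to priority."""
    return rng.choices(range(len(priorities)), weights=priorities, k=batch_size)
```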

[0102] Gradient clipping is used to ensure training stability.

[0103] Gradient clipping is used to prevent gradients from becoming too large during training, which can lead to training instability. In gradient clipping, if the norm of the gradient exceeds a preset threshold, the gradient is scaled down to that threshold. This maintains gradient stability, avoids large jumps during optimization, and thus improves training stability.
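The norm-based scaling described above can be sketched on a flat list of gradient values. The function name and the use of the Euclidean norm are assumptions made for illustration.

```python
import math

def clip_by_norm(grad, max_norm):
    """Scale grad so its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)  # already within the threshold
    scale = max_norm / norm
    return [g * scale for g in grad]
```

Clipping preserves the gradient direction and only shortens its length.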

[0104] To further improve training effectiveness, this application employs several optimization techniques, including using double DQN to reduce Q-value overestimation, using prioritized experience replay to focus on more valuable surgical experiences, and using gradient clipping to ensure training stability. The aim is to better adapt the training process to the complexity and uncertainty of individualized laparoscopic surgical environments, ultimately learning more reliable surgical path planning strategies. The entire training process involves continuous iteration and optimization until the system can generate surgical paths that meet both safety and efficiency requirements.

[0105] Example 2

[0106] This application provides a path labeling and automatic planning model for a personalized virtual laparoscopic surgical robot. The model is trained using reinforcement learning based on a high-order DQN architecture. This architecture decomposes the surgical path planning task into two levels: high-level policy planning and low-level action execution. The high-level policy planning is responsible for formulating an overall surgical path strategy based on global information, while the low-level action execution focuses on the execution of local fine-grained actions. The model also includes a state space and an action space for the virtual laparoscopic surgical robot within the abdominal environment. The state space describes the current configuration of the surgical environment, including the current position of the robotic arm, the target organ position, and the relative positions of surrounding anatomical structures. The action space defines the actions that the robotic arm can perform within the abdominal environment, including discrete and continuous actions. The model is configured with a reward function between the state space, action space, and a real-valued evaluation index, representing the immediate reward obtained by performing action a in state S.

[0107] This path marking and automatic planning model adopts a high-order DQN architecture, decomposing the surgical path planning task into two levels: a high-level policy planning layer and a low-level action execution layer. The high-level network is responsible for global path planning, while the low-level network ensures local operational accuracy. Based on this, a complete Markov Decision Process (MDP) framework was designed, including a comprehensive state-space description, flexible action-space design, and a multi-objective reward function, thus realizing an intelligent decision-making system.

[0108] Preferably, the state space further includes environmental features of the virtual laparoscopic surgical robot arm, including current posture information, velocity, and acceleration information detected by the robot arm's sensors; the state S is represented by a multidimensional vector as follows: $S = [P_{arm}, P_{organ}, P_r, E_o]$

[0109] Among them, $P_{arm} = (x_a, y_a, z_a)$ represents the current position of the robotic arm; $P_{organ} = (x_o, y_o, z_o)$ represents the location of the target organ; $P_r$ represents the position coordinates of the surrounding anatomical structures relative to the robotic arm; and $E_o$ refers to the environmental characteristics of the robotic arm.
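The state vector can be assembled as in the sketch below. It assumes that $P_r$ is obtained by subtracting the arm position from each structure's absolute position and that all components are flattened into one list; the function and argument names are illustrative.

```python
def build_state(p_arm, p_organ, structures, env_features):
    """Flatten [P_arm, P_organ, P_r, E_o] into a single state vector."""
    # P_r: coordinates of each surrounding structure relative to the arm.
    p_rel = [c - a for pos in structures for c, a in zip(pos, p_arm)]
    return list(p_arm) + list(p_organ) + p_rel + list(env_features)
```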

[0110] Preferably, the discrete actions are predefined actions, including forward, backward, left, right, up, and down operations, wherein each operation moves a fixed step size along a fixed direction; the continuous actions are parameterized as an action vector $a = (d_x, d_y, d_z, I)$, where $d = (d_x, d_y, d_z)$ is the direction vector and $I$ represents the step size, and the direction vector and step size are dynamically adjusted according to the surgical scenario.
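Both action types can be sketched as position updates. The mapping of the six named operations to coordinate axes and the normalization of the direction vector are assumptions; the text fixes neither.

```python
import math

# Hypothetical axis convention for the six predefined operations.
DISCRETE_ACTIONS = {
    "forward": (1, 0, 0), "backward": (-1, 0, 0),
    "left": (0, 1, 0), "right": (0, -1, 0),
    "up": (0, 0, 1), "down": (0, 0, -1),
}

def apply_discrete(position, name, step=1.0):
    """Move a fixed step along the fixed direction of a predefined action."""
    direction = DISCRETE_ACTIONS[name]
    return tuple(p + step * d for p, d in zip(position, direction))

def apply_continuous(position, action):
    """Apply a = (d_x, d_y, d_z, I): move I units along the normalized direction."""
    dx, dy, dz, step = action
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    if norm == 0.0:
        return tuple(position)  # no direction given, stay in place
    return tuple(p + step * c / norm for p, c in zip(position, (dx, dy, dz)))
```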

[0111] Preferably, the reward function includes a goal-oriented reward and a safety reward. The goal-oriented reward is used to characterize the comparative relationship between the surgical path planning process and the distance to the goal. The safety reward is used to characterize maintaining a safe distance from important anatomical structures during the surgical path planning process.

[0112] Preferably, the reward function is expressed as: $R(s_t, a_t, s_{t+1}) = w_1 R_g(s_t, a_t, s_{t+1}) + w_2 R_s(s_t, a_t, s_{t+1})$

[0113] Among them, $s_t$ indicates the current state; $s_{t+1}$ represents the new state after performing action $a$; $a_t$ represents the action vector; $w_1$, $w_2$, $w_3$ are weight coefficients; and $R_g$ is the goal-oriented reward, defined as follows: $R_g(s_t, a_t, s_{t+1}) = \|p_a(s_{t+1}) - p_t\|_2 - \|p_a(s_t) - p_t\|_2$

[0114] Where $p_a(s_{t+1})$ is the new position of the robotic arm after performing action $a$, $p_a(s_t)$ represents the original position of the robotic arm before action $a$ is executed, $p_t$ is the target position, $\|p_a(s_{t+1}) - p_t\|_2$ represents the distance from the new state to the target state in surgical path planning, and $\|p_a(s_t) - p_t\|_2$ represents the distance from the current state to the target state.
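The goal-oriented term and the weighted sum can be sketched as follows. The sketch reproduces the goal-oriented expression term by term as written, so the value is negative when the arm moves closer to the target; it makes no assumption about the sign of the weights. The safety term is passed in as a number because its formula is not reproduced in this passage.

```python
import math

def goal_reward(p_new, p_old, p_target):
    """R_g: distance to target after the action minus distance before it."""
    return math.dist(p_new, p_target) - math.dist(p_old, p_target)

def total_reward(r_goal, r_safety, w1, w2):
    """R = w1 * R_g + w2 * R_s."""
    return w1 * r_goal + w2 * r_safety
```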

[0115] $R_s$ is the safety reward, defined as follows:

[0116] Where $p_a(s_{t+1})$ is the new position of the robotic arm after performing action $a$, $p_i(s_t)$ represents the position of the $i$-th anatomical structure, $N$ represents the total number of anatomical structures in the current abdominal environment, and $\sigma$ adjusts the sensitivity of the distance penalty in the safety reward.

[0117] Preferably, the high-order DQN architecture is represented as follows: $Q(s,a) = Q_h(s, a_h; \theta_h) + Q_l(s, a_l; \theta_l)$

[0118] Where $Q(s,a)$ represents the expected return of taking action $a$ in state $s$; $Q_h$ and $Q_l$ represent the expected returns computed by high-level strategic planning and low-level action execution, respectively; $\theta_h$ and $\theta_l$ are the corresponding network parameters; $a_h$ represents the high-level strategic planning action, used to define the stages of the operation; and $a_l$ represents the low-level operation, including the specific displacement and angle adjustments of the robotic arm.

[0119] Preferably, the high-order DQN architecture is configured with an attention module, enabling the network to adaptively focus on the state information around the danger zone. The attention module is defined as follows: $A(s) = \mathrm{softmax}(W_a \tanh(W_s s + b_s) + b_a)$, $Q_{risk}(s,a) = Q(s,a) \cdot A(s)$

[0120] Where $A(s)$ is the attention weight, $s$ is the input state vector, $W_a$ and $W_s$ are the attention weight matrix and the state mapping weight matrix, respectively, $b_a$ and $b_s$ are the attention bias and the state mapping bias, respectively, tanh is the activation function, softmax is used to transform the output into a probability distribution, and $Q_{risk}(s,a)$ represents the expected return that takes into account the perceived surgical risk.

[0121] Example 3

[0122] As shown in Figure 2, this application provides a path labeling and planning system for a virtual laparoscopic surgical robot based on reinforcement learning. This system adopts a high-order DQN architecture, decomposing the surgical path planning task into two layers: high-level policy planning and low-level action execution. The high-level network is responsible for global path planning, while the low-level network ensures local operational accuracy. Based on this, a complete Markov Decision Process (MDP) framework is designed, including a comprehensive state-space description, flexible action-space design, and a multi-objective reward function, thus realizing an intelligent decision-making system.

[0123] First, the virtual surgical path is modeled as a Markov decision process (MDP). In this part, the state space description and action space description of the surgical robot in the abdominal cavity environment are defined, and the corresponding reward function is set.

[0124] State Space: First, we define the state space description of the surgical robot in the abdominal environment. Each state S describes the current configuration of the surgical environment, including the current position of the robotic arm, the position of the target organ, and the relative positions of surrounding anatomical structures (such as blood vessels and tumors). State S can be represented by a multidimensional vector as follows: $S = [P_{arm}, P_{organ}, P_r, E_o]$

[0125] Among them, $P_{arm} = (x_a, y_a, z_a)$ represents the current position of the robotic arm, expressed in three-dimensional coordinates; $P_{organ} = (x_o, y_o, z_o)$ represents the target organ location, indicating the target organ for the virtual surgical robot arm; and $P_r$ gives the coordinates of the surrounding organs relative to the robotic arm along the robotic arm's path. Additionally, this application defines the environmental characteristics $E_o$ of the robotic arm, which incorporate current posture, velocity, and acceleration information detected by the robotic arm's sensors. This comprehensive state representation provides a complete picture of the current configuration of the surgical environment, offering essential information for reinforcement learning algorithms to plan paths.

[0126] This state-space design specifically considers the characteristics of virtual laparoscopic surgery. By integrating the current position $P_{arm}$ of the robotic arm, the target organ location $P_{organ}$, the relative positional relationship $P_r$, and the environmental characteristics $E_o$ into a unified state representation, the system can fully perceive the virtual surgical environment.

[0127] Action Space: Defines the possible actions that the surgical robot (robotic arm) can perform in the abdominal cavity environment. This application categorizes action types into discrete and continuous. In the discrete action space, the robotic arm's actions are divided into a set of predefined discrete actions, suitable for simple, fast decision-making scenarios (such as robotic arm locking, fixed surgical angle, task pause, etc.). These basic actions include forward, backward, left, right, up, and down movements, each moving a fixed step length in a fixed direction, represented by discrete identifiers, such as $A = \{a_1, a_2, a_3, a_4, a_5, a_6\}$. In the continuous action space, the robotic arm can move in any direction and with any step length in three-dimensional space, which suits scenarios requiring high precision and flexibility. Actions are represented parametrically: the direction vector $d = (d_x, d_y, d_z)$ and the step size $I$ form the action vector $a = (d_x, d_y, d_z, I)$. The direction vector and step size are dynamically adjusted according to the surgical scenario. This flexible action space design enables the surgical robot to perform precise path planning and operation in the complex abdominal cavity environment.

[0128] Reward function: The reward function includes goal-oriented reward and safety reward. Goal-oriented reward is used to characterize the comparison between the surgical path planning process and the distance to the goal; safety reward is used to characterize maintaining a safe distance from important anatomical structures during the surgical path planning process. The specific content is the same as in Example 1.

[0129] Next, a reinforcement learning algorithm is constructed: In virtual surgical path planning, this application innovatively improves the Deep Q-Network (DQN) to better adapt to the laparoscopic surgical environment. First, a high-order DQN architecture is designed, which is the same as that in Example 1 and will not be described again here.

[0130] Secondly, a surgical risk perception mechanism is introduced. By adding an attention module to the DQN, the network can adaptively focus on the state information around dangerous areas (such as important blood vessels and nerves), thereby generating a safer surgical path. The definition of the surgical risk perception mechanism is the same as that in Example 1. The surgical risk perception mechanism enables the DQN to adaptively focus on dangerous areas, dynamically adjust decision weights, and generate a safer surgical path.

[0131] This application employs a reinforcement learning algorithm to plan surgical paths, learning safe and efficient surgical path planning and labeling strategies. The generated paths ensure reaching the surgical target while avoiding dangerous areas, and maintain smooth movement. This design fully considers the specific needs of laparoscopic surgical robot operations, enabling reinforcement learning algorithms to better serve clinical surgical applications. Compared to traditional surgical path planning methods, this reinforcement learning method has stronger expressive and generalization capabilities, better adapting to individualized differences in abdominal anatomy and the complexity of the surgical environment, providing reliable path planning support for precise minimally invasive surgery.

[0132] The training process then proceeds, specifically including the environment interaction phase, the experience replay phase, and the network update phase, the details of which are the same as in Example 1. Finally, the results are exported for path planning and execution. The trained Deep Q-Network (i.e., the high-order DQN architecture) generates an initial path, which is combined with a path search algorithm for global path planning, while dynamic programming is used to optimize path nodes. To ensure path smoothness, the system uses B-spline interpolation for path optimization and minimizes the path length through gradient descent and convex optimization methods, while a collision detection algorithm ensures path safety.
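B-spline smoothing of a waypoint path can be sketched with the uniform cubic B-spline basis. This is a swapped-in simplification: the curve approximates the waypoints (treated as control points) rather than interpolating them exactly, and the sampling density is an assumption.

```python
def bspline_point(p0, p1, p2, p3, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1]."""
    b0 = (1 - t) ** 3 / 6
    b1 = (3 * t ** 3 - 6 * t ** 2 + 4) / 6
    b2 = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6
    b3 = t ** 3 / 6
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

def smooth_path(points, samples_per_segment=4):
    """Sample a smooth curve from at least four waypoints used as control points."""
    curve = []
    for i in range(len(points) - 3):
        for k in range(samples_per_segment):
            curve.append(bspline_point(*points[i:i + 4], k / samples_per_segment))
    curve.append(bspline_point(*points[-4:], 1.0))  # end of the last segment
    return curve
```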

[0133] During the real-time adjustment phase, the system employs Kalman filtering and recursive Bayesian estimation for state estimation and prediction, uses particle filtering to handle nonlinear state changes, and combines motion prediction models to estimate organ deformation. To achieve dynamic path planning, the system uses model predictive control algorithms and rolling time-domain optimization methods, combined with adaptive sampling strategies and dynamic window methods for local adjustments. The entire system integrates data through multi-sensor fusion algorithms and time-series data processing methods, and employs a hierarchical decision architecture and a rule-based expert system for decision control, ultimately achieving a safe and reliable surgical path planning and real-time adjustment system.

[0134] This application implements an intelligent and adaptive surgical path planning system, which can effectively improve the safety and precision of laparoscopic surgery and provide reliable technical support for precise minimally invasive surgery. This system can not only adapt to individualized patient anatomical differences but also effectively handle unexpected situations during surgery, achieving intelligent and precise surgical path planning.

[0135] Example 4

[0136] This application provides a path marking and automatic planning method based on the path marking and automatic planning model in Example 3 above, wherein the path marking and automatic planning method includes:

[0137] Generate an initial path based on the trained high-order DQN architecture;

[0138] Global path planning is performed by combining path search algorithms, and path nodes are optimized using dynamic programming methods.

[0139] A B-spline interpolation algorithm is used for path optimization, and the path length is minimized by gradient descent and convex optimization methods.

[0140] Collision detection algorithms are used to ensure path safety.

[0141] Specifically, the path search algorithm can employ the A-Star algorithm, while dynamic programming is an optimization method that breaks down complex problems into simpler subproblems. In path planning, dynamic programming can be used to optimize each node on the path, making the entire path more efficient. Collision detection algorithms can employ methods such as geometric collision detection.
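The A-Star search mentioned above can be sketched on a small occupancy grid. The two-dimensional, four-connected grid with unit step cost and a Manhattan-distance heuristic is a simplifying assumption; the surgical setting described here is three-dimensional.

```python
import heapq

def a_star(grid, start, goal):
    """Shortest 4-connected path on a grid where 1 marks a blocked cell; None if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):  # Manhattan distance, admissible for unit-cost moves
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:  # walk back to the start
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                came_from[nxt] = cell
                heapq.heappush(open_heap, (new_cost + h(nxt), new_cost, nxt))
    return None
```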

[0142] Figure 3 shows a schematic diagram of liver cutting path planning and risk point identification based on reinforcement learning in this application, which illustrates the risk point identification of the apex of the hepatic vein surface.

[0143] Figure 4 shows a schematic diagram of the acquisition and planning of optimal path information for liver resection in this application, illustrating the complete process for liver surgery planning and simulation.

[0144] Regarding the creation of individualized virtual simulation models: This is a 3D model created based on the patient's specific anatomical structure to simulate the surgical procedure. Different colors in the model represent different tissues and organs, such as the liver and blood vessels.

[0145] 3D Surgical Path Visualization: In this step, the surgical path is visualized in a 3D model. This helps surgeons plan and understand the surgical steps before the actual operation.

[0146] 3D surgical path information acquisition: Extract detailed information about the surgical path from the 3D model, including the path length, angle, and relationship with surrounding structures.

[0147] Automated vascular region identification: Automated image processing techniques are used to identify and label vascular regions. This is crucial for surgical planning, as it is necessary to avoid damaging important blood vessels.

[0148] Vascular region marking information: Vascular regions are marked in the model to provide a reference for surgery. This information helps surgeons avoid key blood vessels during surgery and reduce the risk of bleeding.

[0149] Automatic identification of tumor areas: Automatically identifies and marks tumor areas, which is crucial for determining the extent of surgical resection.

[0150] Proximity to tumor area marker information: Marking tumor areas in the model helps surgeons accurately locate and remove tumors during surgery.

[0151] Export a readable file containing the marking information: Export the 3D model containing all the marking information as a readable file, which can be used in surgical navigation systems or shared with other medical team members.

[0152] The entire process demonstrates how advanced image processing and 3D modeling techniques can be used to assist surgical planning, improving the accuracy and safety of the operation. By planning the path and key structures in detail before surgery, better preparation for the operation can be achieved, reducing the risks during the procedure.

[0153] Preferably, the path marking and automatic planning method further includes a real-time path adjustment strategy, which is configured as follows:

[0154] State estimation and prediction are performed using Kalman filters and recursive Bayesian estimation.

[0155] The Kalman filter is a recursive algorithm used to estimate the state of a dynamic system from a series of incomplete and noisy measurements. In surgical robots, it can be used to estimate the current state of robotic arms and organs. Recursive Bayesian estimation is a method based on Bayes' theorem used to update beliefs or estimates of unknown variables after new evidence is obtained. In the surgical environment, it can be used to update estimates of organ position and shape.
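One predict-and-update cycle of a Kalman filter can be sketched for a single scalar quantity. The constant-position motion model and the scalar noise variances `q` and `r` are assumptions made to keep the example short.

```python
def kalman_step(x, p, z, q, r):
    """One scalar Kalman cycle: estimate x with variance p, measurement z."""
    # Predict: constant-position model, uncertainty grows by process noise q.
    x_pred = x
    p_pred = p + q
    # Update: blend prediction and measurement using the Kalman gain.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1 - k) * p_pred
    return x_new, p_new
```

With equal prediction and measurement variances the gain is 0.5, so the new estimate lies halfway between the prediction and the measurement.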

[0156] Nonlinear state changes are handled using a particle filtering algorithm, which is combined with a motion prediction model to estimate organ deformation. Particle filtering is a recursive Bayesian estimation algorithm based on the Monte Carlo method, used for state estimation of nonlinear and non-Gaussian systems. In surgical robots, it can be used to handle the nonlinear motion and deformation of organs. The motion prediction model is used to predict organ motion and deformation during surgery, thus predicting the organ's response under surgical manipulation.
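A bootstrap particle filter step can be sketched for a scalar state. The Gaussian measurement likelihood, the scalar state, and the `motion` callable standing in for the motion prediction model are all assumptions for illustration.

```python
import math
import random

def particle_filter_step(particles, measurement, motion, motion_noise, meas_noise, rng=random):
    """Predict each particle, weight it by measurement likelihood, then resample."""
    # Predict: apply the motion model and add process noise.
    predicted = [motion(p) + rng.gauss(0.0, motion_noise) for p in particles]
    # Weight: unnormalized Gaussian likelihood of the measurement.
    weights = [math.exp(-0.5 * ((measurement - p) / meas_noise) ** 2) for p in predicted]
    if sum(weights) == 0.0:
        weights = [1.0] * len(predicted)  # fall back to uniform weights
    # Resample with probability proportional to weight.
    return rng.choices(predicted, weights=weights, k=len(predicted))
```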

[0157] The planned path is locally adjusted using a Model Predictive Control (MPC) algorithm and a rolling time-domain optimization method, combined with an adaptive sampling strategy and the dynamic window method. The adaptive sampling strategy dynamically adjusts the sampling frequency and distribution according to the current environmental state and target requirements.

[0158] MPC is a control strategy that computes control actions by solving an optimization problem at each time step. In surgical robots, MPC can be used to predict future state changes and optimize paths.

[0159] Rolling time-domain optimization (also known as receding-horizon optimization) is a strategy that solves an optimization problem at each time step, then "rolls" forward to the next time step and repeats this process. This allows the system to take into account the latest information at each time step.
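The solve-then-roll-forward loop can be sketched for a one-dimensional position. The discrete control set, the quadratic tracking cost, and the exhaustive search over control sequences are assumptions chosen so the example stays small and self-contained.

```python
from itertools import product

def mpc_step(x, target, controls, horizon):
    """Search all control sequences over the horizon and return the first control of the best one."""
    best_seq, best_cost = None, float("inf")
    for seq in product(controls, repeat=horizon):
        pos, cost = x, 0.0
        for u in seq:
            pos += u
            cost += (pos - target) ** 2  # quadratic tracking cost
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0]

def run(x, target, steps, controls=(-1.0, 0.0, 1.0), horizon=3):
    """Apply only the first control at each step, then re-plan from the new state."""
    trajectory = [x]
    for _ in range(steps):
        x += mpc_step(x, target, controls, horizon)
        trajectory.append(x)
    return trajectory
```

Calling `run(0.0, 3.0, 5)` moves the position one unit per step until it reaches the target and then holds it there.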

[0160] The dynamic window method defines a "window" around the robot and performs path planning within this window. As the robot moves, this window moves and adjusts accordingly.

[0161] Through these methods, the path planning system of surgical robots can better cope with uncertainties and dynamic changes during surgery, thereby improving the safety and accuracy of the surgery.

[0162] Example 5

[0163] This application provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor executes the program to implement the path marking and automatic planning method in Example 4 above and thereby achieves the following functions: a high-order DQN architecture is adopted to decompose the surgical path planning task into two levels, high-level policy planning and low-level action execution. The high-level network is responsible for global path planning, while the low-level network ensures local operation accuracy. Based on this, a complete Markov Decision Process (MDP) framework is designed, including a comprehensive state-space description, a flexible action-space design, and a multi-objective reward function, realizing an intelligent decision-making system. To ensure surgical safety, this application introduces a surgical risk perception mechanism, enhances the perception of dangerous areas through an attention module, and designs a distance-based safety reward function and a collision detection algorithm. Through a multi-layered safety assurance mechanism, safer surgical path planning for individualized virtual laparoscopic surgical robots is achieved.

[0164] Example 6

[0165] Based on the same technical concept, this application also provides a computer-readable storage medium storing a computer program. When the computer program runs on a computer or processor, it causes the computer or processor to execute the path marking and automatic planning method in Example 4 above and thereby achieves the following functions: a high-order DQN architecture is adopted to decompose the surgical path planning task into two levels, high-level policy planning and low-level action execution. The high-level network is responsible for global path planning, while the low-level network ensures local operational accuracy. Based on this, a complete Markov Decision Process (MDP) framework is designed, including a comprehensive state-space description, a flexible action-space design, and a multi-objective reward function, realizing an intelligent decision-making system. To ensure surgical safety, this application introduces a surgical risk perception mechanism, enhances the perception of dangerous areas through an attention module, and designs a distance-based safety reward function and a collision detection algorithm. Through a multi-layered safety assurance mechanism, safer surgical path planning for individualized virtual laparoscopic surgical robots is achieved.

[0166] Example 7

[0167] This application provides a personalized virtual surgical operation device, including the electronic device described in Example 5 above. This electronic device is the controller within the personalized virtual surgical operation device. The electronic device includes: a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor executes the program to implement the path marking and automatic planning method described in Example 4 above and thereby achieves the following functions: a high-order DQN architecture is adopted to decompose the surgical path planning task into two levels, high-level policy planning and low-level action execution. The high-level network is responsible for global path planning, while the low-level network ensures local operational accuracy. Based on this, a complete Markov Decision Process (MDP) framework is designed, including a comprehensive state-space description, a flexible action-space design, and a multi-objective reward function, realizing an intelligent decision-making system. To ensure surgical safety, this application introduces a surgical risk perception mechanism, enhances the perception of dangerous areas through an attention module, and designs a distance-based safety reward function and a collision detection algorithm. Through a multi-layered safety assurance mechanism, safer surgical path planning for personalized virtual laparoscopic surgical robots is achieved. This personalized virtual surgical device allows for path marking and automatic planning of a personalized virtual laparoscopic surgical robot before actual surgery. It can adapt to the differences in anatomical structure of individual patients, achieving intelligent and precise surgical path planning.

[0168] The foregoing has described specific embodiments of the present application. Furthermore, the processes depicted in the accompanying drawings do not necessarily require a specific or sequential order to achieve the desired results. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0169] In the description of the embodiments of this application, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of the embodiments of this application. In the embodiments of this application, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. Furthermore, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in the embodiments of this application, as well as the features of different embodiments or examples.

[0170] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance, as implicitly specifying the number of indicated technical features, or as implying any ordering. Thus, features defined with "first" and "second" may explicitly or implicitly include at least one of those features, and the terms serve only to distinguish the features from one another. In the description of embodiments of this application, "multiple" means at least two, such as two, three, etc., unless otherwise explicitly specified.

[0171] Any process or method description in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing custom logic functions or processes, and the scope of preferred embodiments of this application includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order according to the functions involved, as should be understood by those skilled in the art to which embodiments of this application pertain.

[0172] The above description is only a preferred embodiment of the present application and is not intended to limit the present application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present application should be included within the scope of protection of the present application.

Claims

1. A reinforcement learning method for path marking and automatic planning for a personalized virtual laparoscopic surgical robot, characterized in that the method comprises: defining a state space and an action space of the virtual laparoscopic surgical robot in the abdominal cavity environment, wherein the state space describes the current configuration of the surgical environment, including the current position of the robotic arm, the position of the target organ, and the relative positions of the surrounding anatomical structures, and the action space defines the actions that the robotic arm can perform in the abdominal cavity environment, including discrete actions and continuous actions; designing a reward function that maps the state space and the action space to a real-valued evaluation index, representing the immediate reward obtained by performing action a in state S; constructing a hierarchical DQN architecture that decomposes the surgical path planning task into two levels, high-level policy planning and low-level action execution, wherein the high-level Q-network formulates the overall surgical path strategy by combining global information and the low-level Q-network focuses on the execution of fine local actions; and performing reinforcement learning training on the path marking and automatic planning task of the virtual laparoscopic surgical robot on the basis of the hierarchical DQN architecture.

2. The path marking and automatic planning reinforcement learning method according to claim 1, characterized in that the state space further includes environmental features of the virtual laparoscopic surgical robot arm, including current posture information, velocity, and acceleration information detected by the robot arm's sensors; the state S is represented by a multidimensional vector as follows: $S = [P_{arm}, P_{organ}, P_r, E_o]$, where $P_{arm} = (x_a, y_a, z_a)$ represents the current position of the robotic arm, $P_{organ} = (x_o, y_o, z_o)$ represents the position of the target organ, $P_r$ represents the position coordinates of the surrounding anatomical structures relative to the robotic arm, and $E_o$ represents the environmental features of the robotic arm.
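
As an illustration only (not part of the claim), the state vector S could be assembled as in the following minimal sketch. The function name `build_state`, the choice to pass the surrounding structures in absolute coordinates, and the flat layout of the environmental features are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def build_state(p_arm: Vec3, p_organ: Vec3,
                structures: Sequence[Vec3],
                env: Sequence[float]) -> List[float]:
    """Flatten S = [P_arm, P_organ, P_r, E_o] into one list of numbers.

    Assumption: `structures` holds absolute positions, so P_r is obtained
    by subtracting the arm position from each of them.
    """
    p_r = [c - a for s in structures for c, a in zip(s, p_arm)]
    # `env` stands in for E_o (posture, velocity, acceleration readings).
    return [*p_arm, *p_organ, *p_r, *env]


state = build_state((1, 2, 3), (4, 5, 6), [(2, 2, 2)], [0.1])
```

With N surrounding structures and k environmental readings, the resulting vector has 6 + 3N + k entries.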

3. The path marking and automatic planning reinforcement learning method according to claim 1, characterized in that the discrete actions are predefined actions, including forward, backward, left, right, up, and down operations, where each operation moves a fixed step size along a fixed direction; the continuous actions are parameterized as an action vector $a = (d_x, d_y, d_z, I)$, where $d = (d_x, d_y, d_z)$ is the direction vector and $I$ is the step size, and the direction vector and step size are dynamically adjusted according to the surgical scenario.
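
The two kinds of action can be sketched as follows. This is a hedged example: the mapping of "forward", "left", "up", and so on to coordinate axes, the normalization of the direction vector, and the function names are all assumptions, since the claim does not fix them.

```python
import math

# Assumed axis convention: x = right, y = forward, z = up.
DISCRETE_ACTIONS = {
    "forward": (0, 1, 0), "backward": (0, -1, 0),
    "left": (-1, 0, 0), "right": (1, 0, 0),
    "up": (0, 0, 1), "down": (0, 0, -1),
}


def discrete_step(pos, name, step):
    """Move a fixed step along one of the six predefined directions."""
    d = DISCRETE_ACTIONS[name]
    return tuple(p + step * c for p, c in zip(pos, d))


def continuous_step(pos, action):
    """Apply a = (d_x, d_y, d_z, I): move I along the normalized direction d."""
    dx, dy, dz, step = action
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    return tuple(p + step * c / norm for p, c in zip(pos, (dx, dy, dz)))
```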

4. The path marking and automatic planning reinforcement learning method according to claim 1, characterized in that the reward function includes a goal-oriented reward and a safety reward; the goal-oriented reward characterizes the change in distance to the target during surgical path planning, and the safety reward characterizes the maintenance of a safe distance from important anatomical structures during surgical path planning.

5. The path marking and automatic planning reinforcement learning method according to claim 4, characterized in that the reward function is expressed as: $R(s_t, a_t, s_{t+1}) = w_1 R_g(s_t, a_t, s_{t+1}) + w_2 R_s(s_t, a_t, s_{t+1})$, where $s_t$ denotes the current state, $s_{t+1}$ denotes the new state after performing action a, $a_t$ denotes the action vector, and w1, w2, w3 are weight coefficients; $R_g$ is the goal-oriented reward, defined as follows: $R_g(s_t, a_t, s_{t+1}) = \|p_a(s_{t+1}) - p_t\|_2 - \|p_a(s_t) - p_t\|_2$, where $p_a(s_{t+1})$ is the new position of the robotic arm after performing action a, $p_a(s_t)$ is the original position of the robotic arm before action a is executed, $p_t$ is the target position, $\|p_a(s_{t+1}) - p_t\|_2$ is the distance from the new state to the target state in the surgical path planning, and $\|p_a(s_t) - p_t\|_2$ is the distance from the current state to the target state; $R_s$ is the safety reward, defined as follows: [equation not reproduced in the source text], where $p_a(s_{t+1})$ is the new position of the robotic arm after performing action a, $p_i(s_t)$ is the position of the i-th anatomical structure, N is the total number of anatomical structures in the current abdominal environment, and $\sigma$ adjusts the sensitivity of the distance penalty in the safety reward.
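
A minimal sketch of the goal-oriented term and the weighted sum is given below, for illustration only. The function names are assumptions. Because the safety reward formula is not reproduced in this text, the sketch takes the safety term as a precomputed number rather than guessing its form.

```python
import math


def goal_reward(p_new, p_old, p_target):
    """R_g as written in the claim: ||p_new - p_t|| - ||p_old - p_t||."""
    return math.dist(p_new, p_target) - math.dist(p_old, p_target)


def total_reward(r_goal, r_safety, w1, w2):
    """R = w1 * R_g + w2 * R_s, with R_s supplied by the caller."""
    return w1 * r_goal + w2 * r_safety


# The arm moves from distance 2 to distance 1 from the target.
r_g = goal_reward((1, 0, 0), (2, 0, 0), (0, 0, 0))
```

Note that with the sign convention as written, R_g is negative when the arm moves closer to the target, so whether w1 is chosen negative or R_g is treated as a cost is left open here.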

6. The path marking and automatic planning reinforcement learning method according to claim 1, characterized in that the hierarchical DQN architecture is represented as follows: $Q(s, a) = Q_h(s, a_h; \theta_h) + Q_l(s, a_l; \theta_l)$, where $Q(s, a)$ represents the expected return of taking action a in state S; $Q_h$ and $Q_l$ represent the expected returns computed by high-level policy planning and low-level action execution, respectively; $\theta_h$ and $\theta_l$ are the corresponding network parameters; $a_h$ represents the action of high-level policy planning, used to define the stages of the operation; and $a_l$ represents the operations performed at the low level, including the specific displacement and angle adjustments of the robotic arm.
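
The additive decomposition can be illustrated with a small table-based sketch in which the two Q-networks are replaced by dictionaries of already computed values. The stage and motion names are hypothetical, and selecting the joint action by maximizing the sum is an assumption about how the combined value would be used.

```python
def combined_q(q_high, q_low):
    """Q(s, a) = Q_h(s, a_h) + Q_l(s, a_l) for every pair (a_h, a_l)."""
    return {(ah, al): qh + ql
            for ah, qh in q_high.items()
            for al, ql in q_low.items()}


def best_joint_action(q_high, q_low):
    """Pick the (a_h, a_l) pair with the largest combined value."""
    table = combined_q(q_high, q_low)
    return max(table, key=table.get)


# Hypothetical outputs of the high-level and low-level networks for one state.
q_h = {"approach": 1.0, "dissect": 0.5}
q_l = {"move_x": 0.2, "move_z": 0.7}
choice = best_joint_action(q_h, q_l)
```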

7. The path marking and automatic planning reinforcement learning method according to claim 6, characterized in that the hierarchical DQN architecture is configured with an attention module, enabling the hierarchical DQN network to adaptively focus on the state information around the danger zone; the attention module is defined as follows: $A(s) = \mathrm{softmax}(W_a \tanh(W_s s + b_s) + b_a)$; $Q_{risk}(s, a) = Q(s, a) \cdot A(s)$, where $A(s)$ is the attention weight, $s$ is the input state vector, $W_a$ and $W_s$ are the attention weight matrix and the state mapping weight matrix, respectively, $b_a$ and $b_s$ are the attention bias and the state mapping bias, respectively, tanh is the activation function, softmax transforms the output into a probability distribution, and $Q_{risk}(s, a)$ represents the expected return that takes the perceived surgical risk into account.
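
The attention formula can be sketched in plain Python as follows, for illustration only. The helper names are assumptions, and so is the reading that A(s) has one weight per action so that the product Q(s,a)·A(s) is taken element by element; the claim does not state the shapes.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def attention(s, W_s, b_s, W_a, b_a):
    """A(s) = softmax(W_a tanh(W_s s + b_s) + b_a)."""
    h = [math.tanh(v + b) for v, b in zip(matvec(W_s, s), b_s)]
    return softmax([v + b for v, b in zip(matvec(W_a, h), b_a)])


def risk_q(q_values, weights):
    """Q_risk(s, a) = Q(s, a) * A(s), assumed element-wise per action."""
    return [q * a for q, a in zip(q_values, weights)]
```

Because softmax outputs sum to 1, the weights redistribute emphasis across entries rather than scaling all Q-values up or down together.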

8. The path marking and automatic planning reinforcement learning method according to any one of claims 1-7, characterized in that the reinforcement learning training of the path marking and automatic planning task of the virtual laparoscopic surgical robot on the basis of the hierarchical DQN architecture comprises: training initialization, including setting network parameters and hyperparameters; surgical interaction data collection, in which the actual operation of the robotic arm of the laparoscopic surgical robot is simulated in a virtual surgical environment and state, action, reward, and transition data are collected during the operation, wherein an ε-greedy strategy is adopted during collection to balance exploration and exploitation, the ε-greedy strategy being defined as selecting a random action with probability ε and the currently estimated optimal action with probability 1-ε at each decision; and experience replay training, in which batches of data are randomly sampled from the surgical interaction data for learning, and the hierarchical DQN architecture updates the network parameters by minimizing the mean squared error between the predicted Q-value and the target Q-value.
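
The ε-greedy selection rule and the mean-squared-error objective described in this claim can be sketched as follows. This is a minimal illustration with assumed function names; the Q-values are passed in as a plain list rather than produced by a network.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def mse_loss(predicted, target):
    """Mean squared error between predicted and target Q-values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
loss = mse_loss([1.0, 2.0], [0.0, 4.0])
```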

9. The path marking and automatic planning reinforcement learning method according to claim 8, characterized in that setting the network parameters and hyperparameters comprises: setting initial values for the weights and biases of the hierarchical DQN architecture; creating replicas of the high-level Q-network and the low-level Q-network for computing the target Q-value; and establishing an experience replay buffer to store experiences gathered during the interaction between the surgical robot and the environment, each experience including a state, an action, a reward, a next state, and a flag indicating whether the episode has ended; wherein the hyperparameters are parameters that are not obtained through learning during the training process, including the learning rate, the discount factor, and the exploration rate ε.
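
An experience replay buffer holding the five fields named in this claim could look like the sketch below. The class name, the fixed capacity with oldest-first eviction, and uniform sampling are assumptions made for the example.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", "state action reward next_state done")


class ReplayBuffer:
    """Fixed-capacity store; the oldest experiences are dropped first."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self._items.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a batch uniformly at random, without replacement."""
        return random.sample(list(self._items), batch_size)

    def __len__(self):
        return len(self._items)


buf = ReplayBuffer(capacity=2)
for step in range(3):
    buf.push([step], 0, float(step), [step + 1], step == 2)
batch = buf.sample(2)
```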

10. The path marking and automatic planning reinforcement learning method according to claim 8, characterized in that the reinforcement learning training further comprises: using the Double DQN algorithm to reduce the overestimation of Q-values; using prioritized experience replay to focus on more valuable surgical experiences; and using gradient clipping to ensure training stability.
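
Two of these techniques can be sketched briefly. In Double DQN, the online network selects the next action and the target network evaluates it; gradient clipping is shown here as clipping by global L2 norm, which is one common variant and an assumption about the variant intended. Function names are illustrative.

```python
import math


def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """y = r + gamma * Q_target(s', argmax_a Q_online(s', a)); y = r at episode end."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]


def clip_by_norm(grad, max_norm):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


y = double_dqn_target(1.0, False, 0.5, [0.2, 0.8], [10.0, 4.0])
```

In the usage line, the online network prefers action 1, so the target uses the target network's value for action 1 (4.0) rather than its own maximum (10.0), which is how the overestimation is reduced.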

11. A path marking and automatic planning model for a personalized virtual laparoscopic surgical robot, characterized in that the path marking and automatic planning model is trained using reinforcement learning based on a hierarchical DQN architecture, the architecture decomposing the surgical path planning task into two levels, high-level policy planning and low-level action execution, wherein the high-level policy planning formulates the overall surgical path strategy based on global information and the low-level action execution focuses on the execution of fine-grained local actions; the path marking and automatic planning model further includes a state space and an action space for the virtual laparoscopic surgical robot within the abdominal environment, wherein the state space describes the current configuration of the surgical environment, including the current position of the robotic arm, the position of the target organ, and the relative positions of the surrounding anatomical structures, and the action space defines the actions that the robotic arm can perform within the abdominal environment, including discrete and continuous actions; and the path marking and automatic planning model is configured with a reward function that maps the state space and the action space to a real-valued evaluation index, representing the immediate reward obtained by performing action a in state S.

12. A path marking and automatic planning method based on the path marking and automatic planning model of claim 11, characterized in that the method includes the following steps: generating an initial path on the basis of the hierarchical DQN architecture; performing global path planning in combination with path search algorithms, and optimizing path nodes using dynamic programming methods; optimizing the path using a B-spline interpolation algorithm, and minimizing the path length by gradient descent and convex optimization methods; and ensuring path safety using collision detection algorithms.
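
The final step, collision detection, can be illustrated with a deliberately simple sketch. Modeling each anatomical structure as a sphere, adding a safety margin, and testing only the waypoints (not the segments between them) are all assumptions made for this example; the claim does not specify the algorithm.

```python
import math


def first_collision(path, obstacles, margin):
    """Return the index of the first waypoint that violates the clearance, or None.

    `obstacles` is a list of (center, radius) spheres. A waypoint violates the
    clearance when its distance to a center is below radius + margin.
    """
    for i, point in enumerate(path):
        for center, radius in obstacles:
            if math.dist(point, center) < radius + margin:
                return i
    return None


path = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
obstacles = [((2, 0.5, 0), 0.4)]
hit = first_collision(path, obstacles, margin=0.2)
```

A denser sampling of the smoothed path, or a segment-to-sphere test, would be needed to rule out collisions between waypoints.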

13. The path marking and automatic planning method according to claim 11, characterized in that the path marking and automatic planning method further includes a real-time path adjustment strategy configured as follows: state estimation and prediction are performed using Kalman filters and recursive Bayesian estimation; a particle filtering algorithm is used to handle nonlinear state changes and is combined with a motion prediction model to estimate organ deformation; and the planned path is locally adjusted using a model predictive control algorithm and a receding-horizon optimization method, combined with an adaptive sampling strategy and the dynamic window approach, wherein the adaptive sampling strategy dynamically adjusts the sampling frequency and distribution according to the current environmental state and target requirements.
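
The Kalman-filter part of the state estimation can be illustrated with a one-dimensional predict-and-update step. This is a textbook sketch, not the claimed configuration: the scalar state, the identity motion model, and the parameter names are assumptions.

```python
def kalman_step(x, p, z, q, r):
    """One scalar Kalman filter step.

    x, p: previous estimate and its variance
    z:    new measurement
    q, r: process and measurement noise variances
    """
    # Predict with an identity motion model (the state is assumed to persist).
    x_pred = x
    p_pred = p + q
    # Update: blend prediction and measurement by the Kalman gain.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new


x, p = kalman_step(x=0.0, p=1.0, z=2.0, q=0.0, r=1.0)
```

With equal prediction and measurement variances the gain is 0.5, so the estimate lands halfway between the prior and the measurement and its variance is halved.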

14. An electronic device, characterized in that it includes a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to implement the path marking and automatic planning method as described in claim 12.

15. A computer-readable storage medium storing a computer program, characterized in that, when the computer program runs on a computer or processor, it causes the computer or processor to perform the path marking and automatic planning method as described in claim 12.

16. A personalized virtual laparoscopic surgical operating device, characterized in that it includes the electronic device as described in claim 14.