Underground pipeline generation method, device and equipment based on virtual scene visual navigation

Virtual scene visual navigation combined with DDPG learning generates smooth 3D pipeline paths, addressing the heavy computational load and non-smooth paths that traditional algorithms produce in complex spaces. The method performs automatic multi-objective optimization and adapts to new environments, reducing construction difficulty and resource waste.

CN122199880A · Pending · Publication Date: 2026-06-12 · SHANGHAI YINGYI URBAN PLANNINGDESIGN CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SHANGHAI YINGYI URBAN PLANNINGDESIGN CO LTD
Filing Date: 2026-05-18
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Traditional underground pipeline route planning algorithms suffer from high computational costs, uneven paths, difficulty in multi-objective optimization, and lack of generalization ability in complex 3D spaces, leading to engineering construction difficulties and resource waste.

Method used

A virtual scene-based visual navigation method is adopted, which generates smooth 3D pipeline paths through policy networks and value networks, uses the DDPG learning system for autonomous exploration and iterative optimization, and combines engineering specifications for multi-objective automatic balancing optimization.

Benefits of technology

It generates pipeline paths with high continuity and smooth curvature, reducing construction difficulty and material waste. It has strong generalization ability, can be efficiently applied in new environments, automatically balances multiple design objectives, and reduces the workload of manual verification.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a method, device and equipment for generating underground pipelines based on virtual scene visual navigation, belonging to the field of engineering pipeline design. The method includes: obtaining original pipeline data and constructing a three-dimensional virtual scene of the underground pipeline; continuously sampling the pipeline path in the three-dimensional virtual scene at a predetermined step length to form an offline training dataset; performing offline supervised learning on the state-action data pairs to generate an initial policy model; autonomously exploring the three-dimensional virtual scene based on the DDPG learning system and iterating the initial policy model into a pipeline navigation model; and, based on a target start point and a target end point, inputting real-time perception information within the current predetermined range collected by a virtual camera into the pipeline navigation model to generate a coherent target three-dimensional underground pipeline path. With this processing scheme, smooth angle adjustments can be made in response to slight changes in local terrain and environmental features, thereby generating a three-dimensional pipeline path with high continuity and smooth curvature.

Description

Technical Field

[0001] This application relates to the field of engineering pipeline design, and in particular to a method, apparatus and computer equipment for generating underground pipelines based on virtual scene visual navigation.

Background Technology

[0002] Underground pipeline systems are a crucial component of urban infrastructure. With accelerating urbanization, underground space is becoming increasingly congested, making route planning for new underground pipelines exceptionally complex. Traditional underground pipeline route planning typically relies on manual design or conventional heuristic search algorithms (such as the A* algorithm).

[0003] However, traditional rule-based algorithms have several limitations in practical applications: 1. In complex 3D continuous space, traditional grid search algorithms are prone to the "curse of dimensionality," resulting in an explosive increase in computational cost; 2. Traditional algorithms typically generate paths constrained by the grid shape, so the resulting pipeline paths are often rigid, "zigzag" or "stepped". In actual engineering construction, such discrete paths often lead to concentrated pipeline stress, increased fluid resistance, and difficulty in bending pipes, making it hard to meet actual engineering requirements for pipeline smoothness; 3. Lack of generalization ability: traditional algorithms mostly search a single map, and when the underground environment changes, a global recalculation is required; 4. Difficulty in performing multi-objective optimization: underground pipeline design usually needs to satisfy multiple conflicting objectives at once, such as a shorter path, fewer turns, burial depth limits, obstacle avoidance and setback clearances. Traditional rule-based algorithms have difficulty finding the best balance among these objectives.

[0004] Therefore, there is an urgent need for a method that can understand design rules, has strong generalization ability, and can output smooth and reasonable pipeline paths.

Summary of the Invention

[0005] Therefore, in order to overcome the shortcomings of the prior art, the present invention provides a method, apparatus and computer equipment for generating underground pipelines based on virtual scene visual navigation, which has generalization ability and can smoothly adjust the angle according to the slight changes in local terrain and environmental features, thereby generating a three-dimensional pipeline path with high continuity and smooth curvature.

[0006] To achieve the above objectives, this invention provides a method for generating underground pipelines based on virtual scene visual navigation, comprising: acquiring original pipeline data, and constructing a three-dimensional virtual scene of the underground pipeline and corresponding ground surface information; continuously sampling the pipeline path in the three-dimensional virtual scene using a virtual camera at a predetermined step size, collecting local state data and corresponding action data at each step to form the state-action data pairs required for supervised learning, and forming an offline training dataset; performing offline supervised learning on the state-action data pairs using a policy network and a value network to generate an initial policy model with basic obstacle avoidance and target approach capabilities; autonomously exploring the three-dimensional virtual scene using the policy network and value network based on the DDPG learning system, and iterating the initial policy model into a pipeline navigation model based at least on the ground surface information; and, based on the target start point and target end point, inputting real-time perception information within the current predetermined range collected by the virtual camera into the pipeline navigation model to generate a coherent target three-dimensional underground pipeline path that avoids all potential obstacles.

[0007] In one embodiment, the process of calling a virtual camera to continuously sample the pipeline path in the 3D virtual scene at predetermined step lengths, collecting local state data and corresponding action data at each step to form state-action data pairs required for supervised learning, and forming an offline training dataset, includes: calling a virtual camera to sample the pipeline path in the 3D virtual scene at predetermined step lengths, calculating and generating a local image of the current viewpoint corresponding to each step length sampling in real time; obtaining local spatial coordinates corresponding to the local image based on the current coordinate information of the virtual camera; obtaining the pipeline direction, depth from the ground, and relative position of the target point of the current pipe segment corresponding to the current viewpoint from the 3D virtual scene based on the local spatial coordinates, forming vector features, and storing the vector features and the local image as state data; obtaining the relative azimuth and tilt angle of the next node connected to the current pipe segment, forming action data; and storing the state data and the action data as state-action data pairs corresponding to the local image to form an offline training dataset.
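The state-action pairing described in this embodiment can be illustrated with a minimal Python sketch. The record layout, the function names (`heading`, `action_to_next`), and the angle conventions (azimuth measured from the +X axis, tilt taken as the pitch of the next segment above the horizontal) are assumptions made for illustration; the source does not fix them.

```python
import math
from dataclasses import dataclass


@dataclass
class StateActionPair:
    """One offline training sample (field layout is an assumption)."""
    image_id: str    # reference to the rendered local image
    features: tuple  # pipe direction, depth from ground, relative target position
    action: tuple    # (relative azimuth, tilt angle) in degrees


def heading(p, q):
    """Azimuth and pitch, in degrees, of the segment p -> q."""
    dx, dy, dz = q[0] - p[0], q[1] - p[1], q[2] - p[2]
    azimuth = math.degrees(math.atan2(dy, dx))
    pitch = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return azimuth, pitch


def wrap_deg(angle):
    """Wrap an angle to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def action_to_next(prev, cur, nxt):
    """Action label: turn needed at `cur` to head towards the next node."""
    az_in, _ = heading(prev, cur)
    az_out, pitch_out = heading(cur, nxt)
    return wrap_deg(az_out - az_in), pitch_out
```

For a pipe running along +X that turns towards +Y at the current node, `action_to_next((0, 0, -2), (1, 0, -2), (1, 1, -2))` yields a relative azimuth of about 90 degrees and a tilt of 0 degrees.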

[0008] In one embodiment, the offline supervised learning of the state-action data pairs using a policy network and a value network to generate an initial policy model with basic obstacle avoidance and target approach capabilities includes: training the policy network with state data from the state-action data pairs as input and action data from the state-action data pairs as output; using the input and output of the policy network as input features of the value network and obtaining the corresponding cumulative expected reward; calculating the error between the predicted action distribution output by the policy network and the corresponding real action data in the dataset; minimizing the error using a backpropagation algorithm, iteratively updating the network parameters of the policy network, achieving behavior cloning to complete the initialization of the basic policy, and generating an initial policy model with basic obstacle avoidance and target approach capabilities.
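As a rough illustration of the behavior cloning step, the sketch below fits a policy to state-action pairs by minimizing the squared error between predicted and recorded actions. It deliberately substitutes a linear model with hand-written gradients for the policy network and backpropagation described above, and it omits the value network; the learning rate and epoch count are arbitrary assumptions.

```python
def predict(w, b, state):
    """Linear stand-in for the policy network: action = W @ state + b."""
    return [sum(wi * si for wi, si in zip(row, state)) + bk
            for row, bk in zip(w, b)]


def mse(w, b, pairs):
    """Mean squared error between predicted and recorded actions."""
    total = 0.0
    for state, action in pairs:
        total += sum((p - t) ** 2
                     for p, t in zip(predict(w, b, state), action))
    return total / len(pairs)


def behavior_clone(pairs, lr=0.1, epochs=500):
    """Fit W and b by batch gradient descent on the squared error."""
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    w = [[0.0] * n_in for _ in range(n_out)]
    b = [0.0] * n_out
    for _ in range(epochs):
        grad_w = [[0.0] * n_in for _ in range(n_out)]
        grad_b = [0.0] * n_out
        for state, action in pairs:
            pred = predict(w, b, state)
            for k in range(n_out):
                err = pred[k] - action[k]
                grad_b[k] += err
                for j in range(n_in):
                    grad_w[k][j] += err * state[j]
        for k in range(n_out):
            b[k] -= lr * grad_b[k] / len(pairs)
            for j in range(n_in):
                w[k][j] -= lr * grad_w[k][j] / len(pairs)
    return w, b
```

In a real system the linear map would be replaced by the fused visual and navigation network, but the training signal (the error against the recorded action) is the same.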

[0009] In one embodiment, training the policy network by taking state data from the state-action data pairs as input and action data from the state-action data pairs as output includes: extracting local images from the state data of the state-action data pairs in the offline training dataset and converting the local images into visual feature vectors; converting the vector features in the state data into navigation feature vectors; combining the visual feature vectors and the navigation feature vectors into a fused feature vector; and training the policy network by taking the fused feature vector as input and action data from the state-action data pairs as output.

[0010] In one embodiment, the step of autonomously exploring the 3D virtual scene using the policy network and the value network based on the DDPG learning system, and iterating the initial policy model into a pipeline navigation model based at least on the ground surface information, includes: the policy network continuously and autonomously exploring the 3D virtual scene and outputting an action a based on the current state s at the current time step; the 3D virtual scene executing action a, updating the state to s', and calculating an immediate reward r from a composite reward function containing multi-objective constraints that include the ground surface information, thereby generating a transition tuple (s,a,r,s',done), where done is a Boolean termination flag; forming an experience replay pool for iterative training from the transition tuples (s,a,r,s',done) of different time steps; randomly sampling a mini-batch of transition tuples from the experience replay pool and alternately training and updating the value network and the policy network, thereby continuously optimizing the navigation strategy in the initial policy model; and stopping the iteration when the number of training rounds reaches a preset upper limit, or when the average reward over a predetermined number of consecutive rounds converges, thus iterating the initial policy model into the pipeline navigation model.
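The bookkeeping in this embodiment (composite reward, (s,a,r,s',done) tuples, experience replay pool, stopping rule) might be organized as in the following sketch. The reward terms, the numeric weights in `WEIGHTS`, the preferred depth, and the convergence test comparing two consecutive windows are all assumptions; the source does not publish these values.

```python
import random
from collections import deque
from statistics import mean

# Hypothetical weights; the source gives no numeric values.
WEIGHTS = {"progress": 1.0, "collision": 100.0, "turn": 0.05, "depth": 0.5}


def composite_reward(progress_m, collided, turn_deg, depth_m, target_depth_m=1.0):
    """Weighted sum of multi-objective terms (all terms are assumptions)."""
    reward = WEIGHTS["progress"] * progress_m               # approach the target
    reward -= WEIGHTS["turn"] * abs(turn_deg)               # prefer fewer turns
    reward -= WEIGHTS["depth"] * abs(depth_m - target_depth_m)  # burial depth
    if collided:
        reward -= WEIGHTS["collision"]                      # obstacle penalty
    return reward


class ReplayPool:
    """Fixed-capacity experience replay pool of (s, a, r, s', done) tuples."""

    def __init__(self, capacity=100_000, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, bool(done)))

    def sample(self, batch_size):
        # Raises ValueError if fewer than batch_size tuples are stored.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def should_stop(episode, max_episodes, episode_rewards, window=20, tol=0.01):
    """Stop at the round limit or when the windowed mean reward stabilizes."""
    if episode >= max_episodes:
        return True
    if len(episode_rewards) < 2 * window:
        return False
    recent = mean(episode_rewards[-window:])
    previous = mean(episode_rewards[-2 * window:-window])
    return abs(recent - previous) < tol
```

Because the pool is a bounded `deque`, the oldest transitions are discarded once the capacity is reached, which is one common choice rather than something the source specifies.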

[0011] In one embodiment, the step of randomly sampling a mini-batch of the transition tuples from the experience replay pool, and alternately training and updating the value network and the policy network to continuously optimize the navigation strategy in the initial policy model, includes: initializing a corresponding policy target network and value target network for the policy network and the value network, respectively; calculating a temporal-difference target based on the immediate reward r, the termination flag done, and the evaluation value of the value target network; updating the network parameters of the value network by minimizing the mean square error between the temporal-difference target and the current value network's predicted value; updating the network parameters of the policy network using a deterministic policy gradient mechanism to maximize the current value network's evaluation of the policy network's output action; and simultaneously updating the parameters of the value target network and the policy target network using a soft update method to continuously optimize the navigation strategy in the initial policy model.
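The three numerical pieces of this update can be written out as a small sketch in the form commonly used for DDPG. The discount factor `gamma` and the soft-update rate `tau` are standard DDPG hyperparameters that the source does not name, so their presence and default values here are assumptions.

```python
def td_target(r, done, q_next, gamma=0.99):
    """Temporal-difference target y = r + gamma * (1 - done) * Q'(s', mu'(s')).

    q_next is the value target network's evaluation of the next state and
    the policy target network's action; gamma is an assumed discount factor.
    """
    return r + gamma * (0.0 if done else 1.0) * q_next


def critic_loss(q_pred, targets):
    """Mean square error between value predictions and TD targets."""
    return sum((q - y) ** 2 for q, y in zip(q_pred, targets)) / len(targets)


def soft_update(target_params, online_params, tau=0.005):
    """Soft update: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * o + (1.0 - tau) * t
            for t, o in zip(target_params, online_params)]
```

When `done` is true the bootstrap term is dropped, so the target reduces to the immediate reward, which is why the termination flag appears in the calculation.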

[0012] In one embodiment, the method further includes: exporting the three-dimensional continuous coordinate points of the target three-dimensional underground pipeline path into a standardized data table, and simultaneously rendering and generating an HTML scene file for intuitive review.
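A minimal sketch of the export step follows, assuming a CSV file as the "standardized data table" and a plain HTML table as a stand-in for the rendered scene file. The column names and the HTML layout are hypothetical; the source does not describe either format.

```python
import csv
import html
import io


def path_to_csv(points):
    """Serialize 3D path points to CSV text (column names are assumptions)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["index", "x", "y", "z"])
    for i, (x, y, z) in enumerate(points):
        writer.writerow([i, x, y, z])
    return buf.getvalue()


def path_to_html(points, title="Pipeline path"):
    """Render the points as a simple HTML table for quick review.

    This is only a tabular placeholder, not a 3D scene.
    """
    rows = "\n".join(
        f"<tr><td>{i}</td><td>{x:.3f}</td><td>{y:.3f}</td><td>{z:.3f}</td></tr>"
        for i, (x, y, z) in enumerate(points)
    )
    return (
        "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\">"
        f"<title>{html.escape(title)}</title></head><body>\n"
        "<table>\n<tr><th>index</th><th>x</th><th>y</th><th>z</th></tr>\n"
        f"{rows}\n</table>\n</body></html>\n"
    )
```

Both functions return strings, so the caller decides where to write the files.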

[0013] A device for generating underground pipelines based on virtual scene visual navigation includes: a data acquisition module for acquiring raw pipeline data and constructing a three-dimensional virtual scene of the underground pipeline and corresponding ground surface information; an offline dataset construction module for continuously sampling the pipeline path in the three-dimensional virtual scene using a virtual camera at a predetermined step size, collecting local state data and corresponding action data at each step to form state-action data pairs required for supervised learning, forming an offline training dataset; an offline learning module for performing offline supervised learning on the state-action data pairs using a policy network and a value network to generate an initial policy model with basic obstacle avoidance and target approach capabilities; an online learning module for autonomously exploring the three-dimensional virtual scene using a policy network and a value network based on the DDPG learning system, and iteratively generating a pipeline navigation model based at least on ground surface information; and a path generation module for inputting real-time perception information within a predetermined range collected by the virtual camera into the pipeline navigation model based on the target start point and target end point to generate a coherent target three-dimensional underground pipeline path that avoids all potential obstacles.

[0014] A computer device includes a memory and a processor, the memory storing a computer program, characterized in that the processor executes the computer program to implement the steps of the above-described method.

[0015] A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the above-described method.

[0016] Compared with the prior art, the advantages of the present invention are as follows: (1) The pipeline generated through the policy network and value network is smoother and more natural. Its output in three-dimensional space is a dual-channel continuous action value representing the horizontal and vertical deflection angles, which closely matches the physical reality of engineering laying and breaks through the limitations of discrete action space. This decision-making mechanism based on continuous action distribution enables the model to make smooth angle adjustments with small changes in local terrain and environmental features, thereby generating a three-dimensional pipeline path with high continuity and smooth curvature. This not only fundamentally eliminates the visual jaggedness, but also greatly reduces the difficulty of engineering construction and the pipe breakage rate, fully meeting the rigid engineering requirements of fluid dynamics performance and pipeline physical laying.

[0017] (2) It possesses excellent spatial environment generalization ability and can be applied across scenarios without retraining. Moreover, during the reinforcement learning training process, the agent deeply refines the general dynamic decision-making logic based on local visual perception to identify obstacles and autonomously make avoidance actions. By transforming the complex global planning problem into a real-time response problem based on the current local state, the model is endowed with strong environmental adaptability. Therefore, when facing a completely unfamiliar underground environment (such as underground spaces in different cities, complex strata containing unknown old pipelines, etc.), as long as the geometric features and spatial distribution patterns of obstacles in the new environment have a certain similarity to the training set, the model can still be directly deployed and efficiently and robustly complete the pipeline generation work, greatly saving the time and computing power cost of re-collecting data and re-training the model for new projects.

[0018] (3) By deeply integrating engineering field standards, the system achieves complex multi-objective automatic balance optimization. Through a value network and a carefully designed composite reward function in the reinforcement learning training mechanism, abstract design conflicts are quantified. This composite reward function not only covers basic target-approach objectives and collision penalties, but also integrates standard knowledge of the underground pipeline planning field extracted through a knowledge base (such as a retrieval-augmented generation system), for example minimum turning radius requirements for specific pipelines, soil cover depth restrictions, and the economic constraint of prioritizing the use of existing pipe corridors. In the online strategy optimization stage, the agent explores through massive simulated trial and error, and can automatically weigh the various penalties and rewards in a multi-dimensional constraint space that includes obstacle avoidance safety, the economy of shorter paths, the construction convenience of fewer turns, and burial depth requirements such as staying close to the ground. Ultimately, the model can automatically find, among multiple conflicting design objectives, a pipeline design scheme that strictly conforms to industry design standards and has relatively better comprehensive benefits, thereby significantly reducing the workload of repeated manual verification and modification.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 is a flowchart illustrating the method for generating underground pipelines based on virtual scene visual navigation in an embodiment of the present invention; Figure 2 is a structural block diagram of an underground pipeline generation device based on virtual scene visual navigation in an embodiment of the present invention; Figure 3 is a schematic diagram of a computer device in an embodiment of the present invention.

Detailed Implementation

[0021] The embodiments of this application will now be described in detail with reference to the accompanying drawings.

[0022] The following specific examples illustrate the implementation of this application. Those skilled in the art can easily understand other advantages and effects of this application from the content disclosed in this specification. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. This application can also be implemented or applied through other different specific embodiments, and the details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of this application. It should be noted that, in the absence of conflict, the following embodiments and features in the embodiments can be combined with each other. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0023] It should be noted that the following description covers various aspects of embodiments within the scope of protection of this invention. It will be apparent that the aspects described herein can be embodied in a wide variety of forms, and any particular structure and / or function described herein is merely illustrative. Based on this application, those skilled in the art will understand that one aspect described herein can be implemented independently of any other aspect, and two or more of these aspects can be combined in various ways. For example, any number and aspects set forth herein can be used to implement the device and / or practice the method. Additionally, this device and / or method can be implemented using other structures and / or functionalities besides one or more of the aspects set forth herein.

[0024] It should also be noted that the illustrations provided in the following embodiments are only schematic representations of the basic concept of this application. The drawings only show the components related to this application and are not drawn according to the actual number, shape and size of the components in the actual implementation. In the actual implementation, the form, quantity and proportion of each component can be arbitrarily changed, and the layout of the components may also be more complex.

[0025] Furthermore, specific details are provided in the following description to facilitate a thorough understanding of the examples. However, those skilled in the art will understand that the described aspects can be practiced without these specific details.

[0026] This application provides a method for generating underground pipelines based on virtual scene visual navigation, applied to a server or terminal. The terminal can be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable smart devices. The server can be a standalone server or a server cluster consisting of multiple servers. As shown in Figure 1, the method for generating underground pipelines based on virtual scene visual navigation includes the following steps:

Step 101: Obtain the original pipeline data and construct a three-dimensional virtual scene of the underground pipeline and the corresponding ground surface information.

[0027] The server retrieves raw pipeline data and constructs a 3D virtual scene of the underground pipelines and corresponding ground surface information. The server reads raw pipeline data from external data sources. These external data sources include, but are not limited to: underground pipeline survey databases, pipeline ledgers in CSV or Excel format, DWG / DXF vector graphics from AutoCAD, and LiDAR point cloud data. The raw pipeline data must include at least: the 3D coordinates (X, Y, Z) of the pipeline centerline, the burial depth of the pipeline segment's start / end point, pipeline material, pipe diameter or cross-sectional dimensions, and pipeline type identification (e.g., water supply, drainage, gas, electricity, communication, etc.). The server executes a data reading script to read a table file (e.g., in Excel spreadsheet format) containing the raw pipeline data. Based on the detailed engineering data recorded in the table, such as pipeline dimensions, number of holes, and burial depth data, the server reconstructs the 3D virtual scene of the underground pipelines.

[0028] For example, the server can read records from an Excel spreadsheet containing the coordinates of pipeline segment nodes (X1, Y1, Z1; X2, Y2, Z2), pipe diameter DN300, material ductile iron, and type "water supply". For segments lacking burial depth information, the server can perform interpolation calculations based on the difference between the bottom elevation of the manholes at both ends and the ground elevation.
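The burial depth interpolation mentioned above can be sketched as follows, assuming the depth at each manhole is the ground elevation minus the manhole bottom elevation and that depth varies linearly along the segment. The linear rule and the function names are assumptions; the source only states that interpolation is performed.

```python
def end_depth(ground_elev_m, manhole_bottom_elev_m):
    """Burial depth at a manhole: ground elevation minus bottom elevation."""
    return ground_elev_m - manhole_bottom_elev_m


def interpolate_depth(depth_start_m, depth_end_m, t):
    """Linearly interpolated depth at fraction t (0 = start, 1 = end)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return depth_start_m + t * (depth_end_m - depth_start_m)
```

For manholes with depths of 1.5 m and 2.5 m, the midpoint of the segment (`t = 0.5`) is assigned a depth of 2.0 m.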

[0029] Based on the raw pipeline data read, the server constructs a 3D scene of the underground pipeline using a 3D graphics engine (such as OpenGL, DirectX, or Unity3D). Specifically, the construction method is as follows: using the central axis of each pipeline segment as a reference, a 3D mesh model of the pipeline is generated according to the obtained pipe diameter (or rectangular cross-sectional dimensions); for circular cross-section pipes, a cylindrical model is formed by rotating the polygonal cross-section around the central axis; for rectangular ditches, a box-shaped model is directly formed by extruding the cross-section. The 3D virtual underground pipeline scene contains at least one complete pipeline path, which consists of several pipe segments and connecting nodes (such as elbows, tees, and manholes). Each pipe segment has a clearly defined direction, length, diameter, burial depth, and material properties.

[0030] All pipeline models use a real-world coordinate system (such as the CGCS2000 National Geodetic Coordinate System) to achieve spatial positioning with meter-level accuracy. At the same time, the server binds attribute data to each pipeline model, including but not limited to: unique pipeline ID, type, material, laying date, and ownership unit.

[0031] In a 3D virtual scene, the server can also assign visual features to different types of pipelines based on pipeline type identifiers to differentiate them: 1. Use different color mappings to distinguish different types of pipelines such as water supply, drainage, electricity, and communication. For example, water supply pipelines are displayed as "blue", drainage pipelines as "brown", gas pipelines as "yellow", electricity pipelines as "red", and communication pipelines as "green".

[0032] 2. Model the diameter or width of the pipeline at 1:1 scale in the 3D model based on the pipe diameter or cross-sectional dimensions. For pipelines with a diameter less than 200mm, a minimum display width (e.g., 2 pixels or 5cm model size) can be set during visual rendering so that they do not become too thin to identify. The geometric dimensions of the pipeline are strictly modeled in 3D according to the values recorded in the data table. For data lacking direct dimensions, the server can automatically estimate the cross-sectional dimensions based on preset rules (e.g., based on the recorded number of cable holes).

[0033] 3. The burial depth and direction of the pipeline are accurately presented according to the absolute or relative elevation system recorded in the table, thereby restoring the spatial location. Specifically, based on the absolute elevation value (Z value) of the pipeline centerline or outer top (crown), the depth information can be expressed in one or more of the following ways: (a) Depth ruler or dynamic floating label displayed in the 3D scene; (b) Color temperature mapping: warm colors (red-orange) are used for shallow buried pipelines (burial depth < 1.5m), and cool colors (dark blue) are used for deep buried pipelines (burial depth > 3m); (c) Depth profile views, which show the vertical distance of pipelines relative to the ground surface after cutting.

[0034] 4. Automatically add marker entities such as manholes at pipeline intersections or key nodes to mark auxiliary facilities.

[0035] 5. Generate ground surface information corresponding to underground pipelines within the same 3D scene. The server fits and generates ground surface information containing elevation undulations to serve as the basis for subsequent calculations of "depth from ground level". Sources of ground surface information include: the server can generate a terrain mesh by reading digital elevation model (DEM) or digital surface model (DSM) data from the original pipeline data; or, the server can perform Delaunay triangulation on ground points (such as manhole cover elevation points, scattered ground points) attached to the pipeline survey data to generate an irregular triangular network (TIN) as the ground model.

[0036] Ground surface information is presented as a semi-transparent mesh (transparency set to 30%-50%) or a solid model with realistic textures, and reference elements such as road boundaries, building base outlines, and green belts are overlaid on the surface model. The surface model and the underground pipeline model strictly share the same vertical reference plane (such as the 1985 National Elevation Datum) to ensure the accuracy of pipeline burial depth calculations.
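The visual encodings listed in items 1 to 3 above reduce to a few lookups, sketched below. The colors, the 200 mm and 5 cm limits, and the 1.5 m and 3 m thresholds come from the text; the "neutral" band for depths between 1.5 m and 3 m and the fallback color are assumptions, since the source leaves those cases unspecified.

```python
# Type-to-color mapping taken from the description.
TYPE_COLORS = {
    "water supply": "blue",
    "drainage": "brown",
    "gas": "yellow",
    "electricity": "red",
    "communication": "green",
}

MIN_DISPLAY_WIDTH_M = 0.05  # 5 cm minimum model size for thin pipes


def pipeline_color(pipe_type, default="gray"):
    """Display color for a pipeline type ('gray' fallback is an assumption)."""
    return TYPE_COLORS.get(pipe_type, default)


def display_width_m(diameter_mm):
    """1:1 width, with a minimum display width for pipes under 200 mm."""
    width = diameter_mm / 1000.0
    if diameter_mm < 200:
        return max(width, MIN_DISPLAY_WIDTH_M)
    return width


def depth_band(burial_depth_m):
    """Color-temperature band for a burial depth in meters."""
    if burial_depth_m < 1.5:
        return "warm"     # red-orange, shallow burial
    if burial_depth_m > 3.0:
        return "cool"     # dark blue, deep burial
    return "neutral"      # assumed band; not specified in the source
```

A DN300 water supply pipe buried at 1.0 m would therefore be drawn 0.3 m wide, in blue when colored by type or in the warm band when colored by depth.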

[0037] In one embodiment, the server will also perform collision detection on the three-dimensional virtual scene of underground pipelines and the corresponding ground surface information, thereby further ensuring the accuracy of the input basic data and intuitively displaying the current underground pipeline environment.

[0038] The server can perform at least one of the following collision detection methods: pipeline-to-pipeline collision detection, pipeline-to-surface collision detection, and pipeline-to-underground structure collision detection.

[0039] Pipeline-to-pipeline collision detection traverses the 3D mesh model of all pipelines, employing a bounding box hierarchy (BVH) acceleration algorithm to detect spatial intersections or gaps less than a safety threshold between different pipelines. The safety threshold is set according to pipeline type: for example, the minimum clearance between gas and electrical pipes is ≥0.5m, and the minimum clearance between water and drainage pipes is ≥0.3m. When insufficient clearance or direct intersection is detected, the server highlights the collision area in the 3D scene and generates a collision report containing the collision point coordinates, the colliding pipeline ID, and the clearance value.
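A simplified view of the clearance check is sketched below: a per-type-pair threshold lookup plus a conservative axis-aligned bounding box test of the kind a BVH would apply at its leaves. It only flags candidate pairs; an exact mesh or segment distance test would still be needed afterwards. The fallback clearance and the function names are assumptions.

```python
# Minimum clearances from the description, keyed by unordered type pair.
MIN_CLEARANCE_M = {
    frozenset({"gas", "electricity"}): 0.5,
    frozenset({"water supply", "drainage"}): 0.3,
}
DEFAULT_CLEARANCE_M = 0.3  # hypothetical fallback for unlisted pairs


def required_clearance(type_a, type_b):
    """Safety threshold in meters for a pair of pipeline types."""
    return MIN_CLEARANCE_M.get(frozenset({type_a, type_b}), DEFAULT_CLEARANCE_M)


def segment_aabb(p, q, radius):
    """Axis-aligned bounding box of a pipe segment p-q with given radius."""
    lo = tuple(min(a, b) - radius for a, b in zip(p, q))
    hi = tuple(max(a, b) + radius for a, b in zip(p, q))
    return lo, hi


def may_violate(box_a, box_b, clearance):
    """True if the boxes overlap or lie within `clearance` on every axis.

    Conservative: it never misses a real violation, but it can flag pairs
    that an exact distance test later clears.
    """
    (a_lo, a_hi), (b_lo, b_hi) = box_a, box_b
    return all(
        a_lo[i] - clearance <= b_hi[i] and b_lo[i] - clearance <= a_hi[i]
        for i in range(3)
    )
```

Two parallel pipes of radius 0.1 m whose axes are 0.6 m apart have a 0.4 m gap, so a gas and electricity pair is flagged against the 0.5 m threshold.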

[0040] Pipeline-to-surface collision detection calculates the difference (i.e., burial depth) between the top elevation of each pipeline segment and the elevation of the corresponding surface model. If this difference is less than the preset minimum cover depth (for example, water supply pipes under a roadway require ≥1.0m), it is judged as "shallow burial collision" or "exposed risk", an early warning is issued, and the pipeline segment is displayed flashing in the 3D scene.
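The cover depth test can be illustrated with the short sketch below. The pipe top elevation is assumed to be the centerline elevation plus half the outer diameter, and a non-positive cover is assumed to mean "exposed risk"; the source names both warning labels but does not say how they are distinguished.

```python
def cover_depth(ground_elev_m, center_elev_m, outer_diameter_m):
    """Soil cover: ground elevation minus pipe top elevation."""
    pipe_top = center_elev_m + outer_diameter_m / 2.0
    return ground_elev_m - pipe_top


def classify_cover(cover_m, min_cover_m):
    """Label a pipe segment by comparing its cover with the minimum."""
    if cover_m <= 0.0:
        return "exposed risk"              # assumed rule: pipe top at or above ground
    if cover_m < min_cover_m:
        return "shallow burial collision"
    return "ok"
```

For a water supply pipe under a roadway (minimum cover 1.0 m), a segment with 0.8 m of cover is labeled "shallow burial collision".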

[0041] Pipeline-underground structure collision detection refers to converting the structure models, such as pile foundations, diaphragm walls, and basements, into triangular meshes when they are imported into the scene, and then performing precise triangular intersection tests with the pipeline models to detect whether there is any intrusion or penetration.

[0042] After the detection is completed, the server outputs a collision detection report (HTML or PDF format), locating each collision point using 3D scene markers. Users can click on the markers in the attribute panel interface that communicates with the server to automatically focus on the collision area. The attribute panel also displays collision details and modification suggestions (e.g., "It is recommended to move the gas pipeline 0.6m south"). The attribute panel interface receives the user's input adjustments and adjusts the 3D virtual scene and corresponding ground surface information based on these adjustments, then saves the adjusted content as a new version.

[0043] Step 102: In the pipeline path of the 3D virtual scene, the virtual camera is called to continuously sample according to the predetermined step size, and the local state data and corresponding action data of each step are collected to form the state-action data pairs required for supervised learning, forming an offline training dataset.

[0044] The server calls the virtual camera to continuously sample the pipeline path in the 3D virtual scene according to a predetermined step size, and collects the local state data and corresponding action data of each step to form the state-action data pairs required for supervised learning, which constitute the offline training dataset.

[0045] To generate high-quality training data for the Behavior Cloning (BC) phase, the server samples existing, manually validated successful pipeline paths in the 3D virtual scene, extracting continuous segments as routes. For example, for an "information cable" type pipeline in the 3D virtual scene with a total length of approximately 4000 meters, the server can slice the pipeline path at the predetermined step (1 meter increments). In one embodiment, to expand the data volume and enhance the model's orientation-aware stability, the server can traverse the path both forward and backward, obtaining approximately 8000 offline training slice samples.
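The slicing arithmetic above can be illustrated with a short sketch. It assumes the path is available as a 3D polyline and that a sample is a (current node, next node) pair; `resample_polyline` and `slice_samples` are hypothetical helper names, not part of the described system.

```python
import math


def resample_polyline(points, step=1.0):
    """Return points spaced `step` apart (by arc length) along a 3D polyline."""
    out = [tuple(points[0])]
    carry = 0.0  # distance travelled since the last emitted sample
    for p, q in zip(points, points[1:]):
        seg = math.dist(p, q)
        d = step - carry  # distance into this segment of the next sample
        while d <= seg + 1e-9:
            t = d / seg
            out.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
            d += step
        carry = seg - (d - step)
    return out


def slice_samples(points, step=1.0):
    """Build (current node, next node) pairs in both travel directions."""
    def pairs(nodes):
        return list(zip(nodes, nodes[1:]))

    forward = resample_polyline(points, step)
    backward = resample_polyline(points[::-1], step)
    return pairs(forward) + pairs(backward)
```

With a 4000 m path and a 1 m step, each direction contributes 4000 pairs, which matches the figure of approximately 8000 samples given in the text.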

[0046] Within the 3D scene object, the server uses a virtual camera to calculate and generate a local image of the current viewpoint in real time. Specifically, the server instantiates a virtual camera with a fixed resolution (e.g., 256×256 pixels) and a specific horizontal field of view (e.g., 120 degrees) at the current sampling node; the virtual camera carries the corresponding current coordinate information. Based on the absolute spatial orientation of the current node, the server performs perspective rendering of the filtered local obstacles, simultaneously generating two types of visual perception images, and extracts the current node's state (X) and action (Y) to form the data pairs required for supervised learning.

[0047] (1) The state data X contains visually perceived images (local images) and vector features.

[0048] ① Visual perception image: a visual depth map and a real-world image (a type image, colored according to pipeline type) within a certain range (e.g., 20 meters) of the current node. The virtual camera can calculate and generate the depth map and the type image in real time, and their resolution is uniformly standardized to 256×256 pixels in the demonstration server. The visual depth map records the physical depth of obstacles within the field of view relative to the camera. To optimize gradient feature extraction in the subsequent neural network, the server applies a reverse mapping to the original depth map (e.g., Reverse Depth = Maximum Perception Distance (20.0) - Actual Depth), so that pixel values of obstacles closer to the camera are larger (up to 20.0), while distant background areas without obstacles have a value of zero. The larger the actual depth, the smaller the displayed grayscale value, i.e., closer to pure black; the smaller the actual depth, the larger the displayed grayscale value, gradually approaching pure white. ② Vector features: the server extracts and calculates macro navigation features to generate the vector features, which include: the relative azimuth (Target Rel Azimuth) and relative pitch (Target Rel Pitch) towards the final destination, the absolute spatial straight-line distance (Target Dis) from the current point to the final path endpoint, the vertical height difference (RelZ Dis Final), the relative elevation difference (Dis To Ground) from the current coordinates to the fitted ground, and the current absolute pitch (Abs Pitch).
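The reverse depth mapping can be sketched as below. The clamping of depths beyond the 20 m perception range to zero follows the description of empty background; the function names and the nested-list image representation are assumptions for illustration only.

```python
MAX_RANGE_M = 20.0  # maximum perception distance from the text


def reverse_depth(depth_m, max_range=MAX_RANGE_M):
    """Map an actual depth to a reverse depth: near obstacles get large values.

    Depths at or beyond the perception range (including 'no hit', passed in
    as float('inf')) map to 0.0.
    """
    clamped = min(max(depth_m, 0.0), max_range)
    return max_range - clamped


def reverse_depth_map(depth_map, max_range=MAX_RANGE_M):
    """Apply the mapping to a depth image stored as a list of rows."""
    return [[reverse_depth(d, max_range) for d in row] for row in depth_map]
```

An obstacle 5 m from the camera therefore receives the value 15.0, while an empty background pixel receives 0.0.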

[0049] (2) Action data Y includes: the relative azimuth and tilt angles of the next adjacent node relative to the current point. The server can extract the relative azimuth (Rel Azimuth To Next) and relative tilt angle (Rel Pitch To Next) of the next connected node relative to the local coordinate system of the current node and store them as two-dimensional continuous value variables. If the current point is very close to the endpoint (e.g., less than 0.1 meters) or is already the last node, the extraction stops.

[0050] The server collects local state data and corresponding action data for each step to form the state-action data pairs required for supervised learning. All the sampled and extracted images and scalar data are packaged, compressed, and output as a standard dataset file. The depth map, type map, and scalar data are not directly assembled into a multidimensional matrix in physical storage. Instead, to save GPU memory and maintain data type accuracy, they are compressed and stored as independent array sets aligned by sample index (N). The compressed dataset file serves as the offline training dataset for the behavior cloning phase, aiming to quickly obtain an initial policy with basic obstacle avoidance and target approach capabilities.

[0051] Step 103: Use a policy network and a value network to perform offline supervised learning on the state-action data pairs to generate an initial policy model with basic obstacle avoidance and target approach capabilities.

[0052] The server employs a policy network and a value network to perform offline supervised learning on state-action data pairs, generating an initial policy model with basic obstacle avoidance and target approach capabilities. This initial policy model can be based on the Python language and the TensorFlow / Keras deep learning framework, constructing an Actor-Critic (AC) reinforcement learning architecture as its basic framework to perceive multimodal data. Multimodal data includes state data and action data. The server uses a policy network (Actor network) to fit a policy function. The input is the current state information X (States), and the output is the continuous dual-channel action data Y (Actions, including ΔAzimuth and ΔPitch). The action data Y represents the horizontal deflection angle (ΔAzimuth) and vertical deflection angle (ΔPitch) of the pipeline when it moves forward 1 meter (the predetermined step size is 1 m).

[0053] The Critic network is responsible for fitting the value function. Its network structure extends the Actor network structure, combining the input and output values of the Actor network to form new input values X (States, Actions), with an output of a 1-dimensional scalar Y (Value). This 1-dimensional scalar Y (Value) is used to predict the expected return that can be obtained after performing a specific action in the current state, thus serving as a baseline to guide the gradient updates of the Actor network and accelerate model convergence. The server uses the input and output of the policy network to form the input features X of the value network and obtains the corresponding expected return as the 1-dimensional scalar Y.

[0054] The server first executes an offline training script. Without actual exploration, the Actor network undergoes supervised learning on the behavior cloning offline training dataset. By fitting expert trajectories, the initial policy model learns to output reasonable avoidance and endpoint-approaching actions based on the obstacle images it receives as input. For example, the offline training configuration uses an RTX 5000 GPU with 16 GB of VRAM and a batch size of 256. After approximately 500 epochs and about 3 hours of iterative training, the initial policy model achieves initial convergence, and the weights with the minimum loss are kept as the initial model. This approach addresses the cold start difficulty in reinforcement learning.

[0055] Step 104: Using a policy network and a value network based on the DDPG learning system, the three-dimensional virtual scene is explored autonomously, and a pipeline navigation model is generated iteratively based on at least ground surface information from the initial policy model.

[0056] The server uses a policy network and a value network based on the DDPG learning system to autonomously explore the 3D virtual scene, and iteratively generates a pipeline navigation model based on at least ground surface information from the initial policy model.

[0057] After the initial policy model acquires basic "common sense," the server executes an online evolution script, allowing the agent to enter a dynamic 3D simulation environment for "real-world training." This stage employs the DDPG (Deep Deterministic Policy Gradient) learning system, where the server replicates the Actor and Critic networks obtained from offline training as target networks (target actor, target critic) to address the "moving target" problem commonly encountered in reinforcement learning training.

[0058] During this phase, the model is placed in a simulation environment within the 3D virtual scene for extensive autonomous exploration. Training is divided into multiple episodes, each containing multiple steps. Each step includes: (1) State perception and action prediction: the server takes the given route starting point as the initial state of the episode. At the front end of the pipeline currently being constructed, the server collects local visual images and navigation parameters in real time and uses the Actor network to output the predicted deflection angles. In one embodiment, to encourage the model to explore unknown space, the server can also add appropriate random noise to the output action and then drive the virtual pipeline to extend 1 meter in the environment (the step size here is 1 m). (2) Value estimation: the Critic network calculates the estimated value based on the current state and the action taken. (3) Reward determination and standards integration: the server calls a custom r (Reward) function to calculate the r value fed back from the environment. The server can iteratively generate the pipeline navigation model from the initial policy model based solely on ground surface information, or based on a composite reward function formed from ground surface information, goal-oriented rewards, state and specification penalties, and economic cost rewards. Specifically, a RAG (Retrieval-Augmented Generation) server can deeply integrate the tedious spatial calculations with the design standards of the engineering field.
The specific composite reward function (RewardFunction) is designed as follows: ① Goal-oriented reward: successfully reaching the final target yields a one-time high reward (+Reward). ② Error penalty: if the agent collides with an existing utility tunnel or foundation while moving, or breaks through the ground surface in a physically impossible way, a severe penalty (-Penalty) is imposed and the current episode is terminated early. ③ State and specification penalties: based on the specification knowledge base transmitted from the RAG server, a small state penalty is applied at each step for local states in the path whose turning angle is too large (violating the radius of curvature requirement) or whose burial depth is non-compliant (too shallow and easily damaged by external forces, or too deep and increasing the earthwork volume). ④ Economic cost incentives: continuous cost-saving rewards are given for behaviors that prioritize the use of existing idle utility tunnel space and minimize the total length of the path while meeting the requirements of the regulations.
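The four reward terms can be sketched as a single function. Every numeric value below (reward magnitudes, turning and cover limits, bonus and cost weights) is a placeholder chosen for illustration; the text does not disclose the actual values, and in the described system the limits would come from the specification knowledge base. `StepInfo` and `composite_reward` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    reached_goal: bool = False
    collided: bool = False        # hit a tunnel/foundation or broke through the ground
    turn_angle_deg: float = 0.0   # deflection applied at this step
    cover_depth_m: float = 1.5    # burial depth at the new pipe front
    in_idle_tunnel: bool = False  # step lies inside existing idle tunnel space
    step_length_m: float = 1.0


def composite_reward(info, *, goal_reward=100.0, collision_penalty=100.0,
                     max_turn_deg=5.0, min_cover_m=1.0, max_cover_m=3.0,
                     state_penalty=0.5, tunnel_bonus=0.2, length_cost=0.05):
    """Return (reward, episode_done) for one step. All weights are placeholders."""
    if info.collided:                       # (2) error penalty, episode ends early
        return -collision_penalty, True
    if info.reached_goal:                   # (1) one-time goal reward
        return goal_reward, True
    reward = 0.0
    if abs(info.turn_angle_deg) > max_turn_deg:                      # (3) curvature
        reward -= state_penalty
    if not (min_cover_m <= info.cover_depth_m <= max_cover_m):       # (3) burial depth
        reward -= state_penalty
    if info.in_idle_tunnel:                 # (4) reuse of idle tunnel space
        reward += tunnel_bonus
    reward -= length_cost * info.step_length_m  # (4) shorter paths cost less
    return reward, False
```

Treating the collision check first means a step that both collides and reaches the goal is penalized, which is one possible design choice rather than something the text specifies.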

[0059] (4) Parameter update: compute the temporal-difference (TD) target from the actual reward feedback and the estimated Q value, then calculate the mean squared error between this target and the current value network prediction. Update the Critic network through backpropagation and use a soft update strategy to update the weights of the target networks.
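For orientation, the TD target and the soft update used in standard DDPG can be written as the following sketch. The discount factor `gamma` and the soft-update rate `tau` are not given in the text, so the defaults here are assumptions; weights are represented as flat lists of floats purely for illustration.

```python
def td_target(reward, next_q_from_target, done, gamma=0.99):
    """Standard DDPG target: r + gamma * Q_target(s', actor_target(s')).

    When the episode has ended, no future value is bootstrapped.
    """
    return reward + (0.0 if done else gamma * next_q_from_target)


def critic_mse(predicted_q, targets):
    """Mean squared error between critic predictions and TD targets."""
    return sum((q - y) ** 2 for q, y in zip(predicted_q, targets)) / len(targets)


def soft_update(target_weights, online_weights, tau=0.005):
    """Move each target weight a small fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target_weights, online_weights)]
```

Because the target networks change only slowly, the regression target for the Critic stays comparatively stable between updates, which is how the "moving target" problem mentioned earlier is mitigated.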

[0060] After each step is completed, the model proceeds to the next step and continues looping until the endpoint is reached or a fatal exception occurs. The server completes large-scale online trial-and-error training of up to 10,000 episodes. Through continuous self-correction, the performance of the pipeline navigation model in handling extreme and adverse collision situations not encountered in the offline training set is greatly improved, ultimately yielding a pipeline navigation model with strong generalization capabilities. This pipeline navigation model can be a model trained by a single network or an agent trained through the above steps.

[0061] Step 105: Based on the target starting point and target ending point, input the real-time perception information collected by the virtual camera within the current predetermined range into the pipeline navigation model to generate a coherent target three-dimensional underground pipeline path that avoids all potential obstacles.

[0062] Based on the target start and end points, the server inputs real-time perception information collected by the virtual camera within the current predetermined range into the pipeline navigation model, generating a coherent 3D underground pipeline path that avoids all potential obstacles. In one embodiment, in a novel underground engineering planning task, the engineer only needs to arbitrarily specify new start and end coordinates in 3D space. The server can then load the pipeline navigation model and, without any global path planning guidance, rely solely on real-time perception information collected by its front-end virtual camera within a local 20-meter range to automatically determine the current situation of the movement from the target start point to the target end point, gradually outputting smooth and continuous deflection angles. Through continuous progress, a coherent underground pipeline path that avoids all potential obstacles is ultimately "grown" in 3D space.
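The step-by-step "growing" of a path from deflection angles can be sketched as follows. The trained navigation model is replaced here by a stand-in policy, `steer_to_goal`, which ignores perception entirely and simply turns toward the goal by at most 5° per step; it is an assumption used only to make the loop runnable. The compass convention (azimuth 0 along +Y, clockwise positive) is likewise assumed.

```python
import math


def steer_to_goal(pos, goal, heading_deg, pitch_deg, max_delta_deg=5.0):
    """Stand-in for the navigation model: clipped turn toward the goal."""
    dx, dy, dz = (g - p for p, g in zip(pos, goal))
    want_az = math.degrees(math.atan2(dx, dy))
    want_pitch = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    d_az = (want_az - heading_deg + 180.0) % 360.0 - 180.0
    d_pitch = want_pitch - pitch_deg

    def clip(v):
        return max(-max_delta_deg, min(max_delta_deg, v))

    return clip(d_az), clip(d_pitch)


def grow_path(start, goal, heading_deg, pitch_deg, policy,
              step=1.0, tol=1.0, max_steps=10000):
    """Extend the pipe one step at a time using (delta azimuth, delta pitch)."""
    pos = tuple(start)
    path = [pos]
    for _ in range(max_steps):
        if math.dist(pos, goal) <= tol:
            break
        d_az, d_pitch = policy(pos, goal, heading_deg, pitch_deg)
        heading_deg += d_az
        pitch_deg += d_pitch
        h, p = math.radians(heading_deg), math.radians(pitch_deg)
        pos = (pos[0] + step * math.cos(p) * math.sin(h),
               pos[1] + step * math.cos(p) * math.cos(h),
               pos[2] + step * math.sin(p))
        path.append(pos)
    return path
```

In the described system the `policy` argument would be the pipeline navigation model fed with the virtual camera's local images and vector features; only the integration of deflection angles into 3D positions is illustrated here.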

[0063] The pipeline navigation model generated by the aforementioned method produces smoother and more natural pipelines. Its output in three-dimensional space is a dual-channel continuous action value representing horizontal and vertical deflection angles, highly consistent with the physical reality of engineering laying and overcoming the limitations of discrete action space. This decision-making mechanism based on continuous action distribution allows the model to smoothly adjust angles according to minor changes in local terrain and environmental features, thereby generating a three-dimensional pipeline path with high continuity and smooth curvature. This not only fundamentally eliminates visual jaggedness but also significantly reduces the difficulty of engineering construction and the pipe breakage rate, fully meeting the rigid engineering requirements of fluid dynamics performance and pipeline physical laying. The pipeline navigation model possesses excellent spatial environment generalization capabilities, requiring no retraining for cross-scenario applications. Furthermore, during reinforcement learning training, the agent deeply refines the general dynamic decision-making logic based on local visual perception to identify obstacles and autonomously make avoidance actions. By transforming the complex global planning problem into a real-time response problem based on the current local state, the model is endowed with strong environmental adaptability. 
Therefore, when facing a completely unfamiliar underground environment (such as underground spaces in different cities, or complex strata containing unknown old pipelines), as long as the geometric features and spatial distribution patterns of obstacles in the new environment have a certain similarity to the training set, the model can still be directly deployed and complete pipeline generation efficiently and robustly, greatly saving the time and computing power costs of re-collecting data and retraining the model for new projects. During the pipeline navigation model training process, engineering domain standards are deeply integrated to achieve complex multi-objective automatic balancing optimization. A carefully designed composite reward function is implemented in the reinforcement learning training mechanism through the value network, quantifying abstract design conflicts. This composite reward function not only covers basic goal-seeking objectives and collision penalties but also integrates underground pipeline planning standards extracted through a knowledge base such as a retrieval-augmented generation (RAG) server (e.g., minimum turning radius requirements for specific pipelines, soil cover depth restrictions, and economic constraints on prioritizing the use of existing utility tunnels). During the online policy optimization phase, the agent, through massive simulated trial-and-error exploration, can automatically balance the various penalties and rewards within a multi-dimensional constraint space that includes obstacle avoidance safety, the economy of shorter paths, the construction convenience of fewer turns, and burial depth requirements such as staying close to the ground.
Ultimately, the model can automatically find a pipeline design scheme that balances industry design specifications and has relatively better overall benefits among multiple conflicting design objectives, thereby significantly reducing the workload of repeated manual verification, modification, and optimization.

[0064] In one embodiment, a virtual camera is invoked along the pipeline path in a 3D virtual scene at predetermined step sizes to calculate and generate a local image of the current viewpoint corresponding to each step size sampling in real time. Based on the current coordinate information of the virtual camera, local spatial coordinates corresponding to the local image are obtained. Based on these local spatial coordinates, the pipeline direction, depth from the ground, and relative position of the target point of the current pipe segment corresponding to the current viewpoint are obtained from the 3D virtual scene to form vector features. These vector features are then stored as state data, corresponding to the local image. The relative azimuth and tilt angles of the next node connected to the current pipe segment are obtained to form action data. The state data and action data are stored as state-action data pairs corresponding to the local image, forming an offline training dataset.

[0065] In the 3D virtual scene, a virtual camera is invoked along the pipeline path to continuously sample at predetermined step sizes. The predetermined step size is set according to the needs of pipeline inspection or navigation tasks, and can range from 0.5 meters to 2.0 meters. When there are bends or branch nodes in the pipeline path, the sampling step size can be adaptively reduced (e.g., reduced to 0.2 meters) to ensure data density at critical locations. At each sampling position, the server invokes the virtual camera deployed in the 3D virtual scene to calculate and generate a local image corresponding to the current sampling position and current viewpoint in real time.
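A minimal sketch of the adaptive step rule follows. The text gives the step range (0.5 m to 2.0 m) and the reduced step (0.2 m) but not how a bend is detected; the 10° turning-angle threshold and both function names are assumptions.

```python
import math


def turning_angle_deg(p_prev, p_cur, p_next):
    """Angle in degrees between the incoming and outgoing path segments."""
    u = [b - a for a, b in zip(p_prev, p_cur)]
    v = [b - a for a, b in zip(p_cur, p_next)]
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    cos_t = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def sampling_step(base_step_m, turn_deg, is_branch=False,
                  bend_threshold_deg=10.0, reduced_step_m=0.2):
    """Use the reduced step near bends or branch nodes, else the base step."""
    if not 0.5 <= base_step_m <= 2.0:
        raise ValueError("base step must lie in the 0.5 m to 2.0 m range")
    if is_branch or turn_deg > bend_threshold_deg:
        return reduced_step_m
    return base_step_m
```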

[0066] Specifically, this includes: 1. Virtual camera parameter configuration: The virtual camera adopts a first-person perspective or a third-person following perspective. Camera intrinsic parameters include: field of view (FOV) set to 60°~90°, resolution set to 224×224 or 256×256 pixels, and image format as RGB three-channel color image or RGB-D depth image (including depth channel).

[0067] 2. Viewpoint setting: The current viewpoint is set according to the pipeline navigation task requirements, including but not limited to: a) forward view along the pipeline direction; b) top view (used to observe the pipeline direction and ground reference objects); c) multi-view fusion (simultaneously acquiring three images from the forward, left, and right directions and stitching them together as the state input).

[0068] 3. Image generation timing: Each step sampling triggers a virtual camera rendering, generating one or more local images corresponding to the current sampling position.

[0069] Based on the current coordinate information of the local image captured by the virtual camera in the above steps, obtain the local spatial coordinates corresponding to the local image.

[0070] Based on these local spatial coordinates, a visual perception model, obtained through image processing algorithms or a pre-trained model, is used to extract vector features related to the current sampling position from the 3D virtual scene. These vector features are then stored as state data, corresponding to the local image. The vector features include one or more of the following: i) Pipeline orientation: By detecting the vanishing point of the pipeline wall or the direction of the bottom / top edge of the pipe in the local image, calculate the orientation angle (0°~360°) of the current pipe segment in the horizontal plane, as well as the deflection angle (left / right) relative to the direction of travel. ii) Depth from ground level: When the local image is an RGB-D image, the depth value corresponding to the center region of the image is directly read, and combined with the installation height of the virtual camera, the vertical depth of the top of the pipeline or the center line of the pipeline at the current sampling point from the ground surface is calculated. If only RGB images are used, the boundary between the ground surface and the pipe wall is identified through semantic segmentation, and the depth is estimated by combining the known camera parameters; iii) Relative position of the target point: Obtain the relative position information of the final target point or the next key node (such as a maintenance well, valve, or branch point) of the current pipeline path relative to the current sampling point. 
The relative position information includes: horizontal distance, azimuth difference (the angular deviation of the target relative to the current direction of travel), and vertical elevation difference; iv) Supplemental status data: the pipe diameter of the current pipe segment (estimated by the distance between pipe walls in the image), the pipeline type (identified by color or texture, such as blue for water supply and yellow for gas), and the cumulative travel length of the current sampling point.

[0071] The pipeline direction, the depth from the ground, and the relative position of the target point together constitute the state data of the current sampling point, denoted as s. The supplementary state data in s is optional.

[0072] Obtain information about the next node connected to the current pipe segment. The next node can be: the end node of the current pipe segment, a branch node (tee, cross), or the next waypoint in the path planning.

[0073] Based on the spatial position of the next node relative to the current sampling point, action data 'a' is calculated and generated. Action data includes the relative azimuth, the tilt angle, and other optional action components. Action data 'a' contains scalar data such as relative target azimuth / pitch, remaining distance, current attitude, and altitude above ground, providing macroscopic navigation guidance; this action data constitutes a navigation feature vector.

[0074] The relative azimuth angle is the angle between the projection direction of the next node onto the horizontal plane of the current sampling point and the current direction of travel. It ranges from -180° to +180°, where negative values indicate a left turn and positive values indicate a right turn.

[0075] The inclination angle is the angle between the line connecting the next node and the current sampling point in the vertical plane and the horizontal plane. The value ranges from -90° to +90°, where negative values indicate downward (entering a deeper section of the pipe) and positive values indicate upward (climbing towards the ground surface).
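The two angle definitions above can be computed as in the following sketch. It assumes a compass-style convention (bearing 0 along +Y, increasing clockwise toward +X, Z up), which makes positive relative azimuth a right turn as stated; the coordinate convention itself and the function names are assumptions.

```python
import math


def wrap_deg(angle):
    """Wrap an angle in degrees into the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def relative_action(cur, nxt, heading_deg):
    """Return (relative azimuth, inclination) from the current point to the next node.

    cur / nxt are (x, y, z) coordinates; heading_deg is the current direction
    of travel as a compass bearing.
    """
    dx, dy, dz = (n - c for c, n in zip(cur, nxt))
    bearing = math.degrees(math.atan2(dx, dy))          # 0 = +Y, clockwise positive
    rel_azimuth = wrap_deg(bearing - heading_deg)       # < 0 left turn, > 0 right turn
    inclination = math.degrees(math.atan2(dz, math.hypot(dx, dy)))  # < 0 downward
    return rel_azimuth, inclination
```

Wrapping the azimuth difference keeps a turn from, say, a heading of 350° to a bearing of 10° at +20° rather than -340°.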

[0076] Optional action components can be forward speed control signals (such as maintaining speed, decelerating, or stopping), or node type identifiers (such as "straight ahead", "left turn", "right turn", "up", or "down").

[0077] The relative azimuth and tilt angles constitute action data 'a', representing the directional control commands required to navigate from the current sampling point to the next node.

[0078] The state data s obtained in the above steps is associated with the action data a to form a state-action data pair (s, a) corresponding to the current local image. In one embodiment, the original local image $I_t$ can optionally also be stored in the data pair for subsequent end-to-end (image to action) supervised learning.

[0079] State-action data pairs can be stored using the following data structure: $\mathrm{Sample}_t = \{\text{local image } I_t,\ \text{state data } s_t,\ \text{action data } a_t\}$, where the state data s = {pipeline direction, depth from ground, relative position of target point, optional supplementary data}, and the action data a = {relative azimuth angle, tilt angle, optional action components}.
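One possible in-memory layout for this sample structure is sketched below. The field names, units, and the use of Python dataclasses are assumptions for illustration; the text only lists the components.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class StateData:
    pipeline_direction_deg: float
    depth_from_ground_m: float
    # (horizontal distance m, azimuth difference deg, elevation difference m)
    target_relative_position: Tuple[float, float, float]
    supplementary: Dict[str, Any] = field(default_factory=dict)  # optional


@dataclass
class ActionData:
    relative_azimuth_deg: float
    tilt_deg: float
    optional_components: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Sample:
    t: int
    state: StateData
    action: ActionData
    local_image: Optional[Any] = None  # I_t, kept only for end-to-end training


# Usage: one sample at step t = 1, without the optional image.
sample = Sample(
    t=1,
    state=StateData(92.0, 1.4, (120.0, -3.0, 0.5), {"pipe_type": "water"}),
    action=ActionData(relative_azimuth_deg=2.0, tilt_deg=-0.5),
)
```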

[0080] The state-action data pairs generated at all sampling points (t=1,2,…,N) are aggregated to form an offline training dataset. To ensure the quality and coverage of the dataset, the dataset can encompass the following diversity conditions: Different types of pipelines (water supply, drainage, gas, electricity); Different pipe diameters (DN100~DN1000); Different burial depths (0.5m~5.0m); Different node types (straight-through, elbow, tee, cross, reducer); Different lighting / texture conditions (simulating imaging effects from different sensors).

[0081] In one embodiment, after the offline training dataset is constructed, at least one of the following post-processing operations can be performed: data cleaning, data augmentation, and normalization. Data cleaning removes data pairs with missing states or obvious errors caused by virtual camera rendering anomalies. Data augmentation applies random transformations to local images, including random rotation (±10°), brightness adjustment, and Gaussian noise addition, to enhance the generalization ability of the training data. Normalization normalizes the components of the state and action data to a uniform numerical range (e.g., [-1, 1] or [0, 1]), facilitating the subsequent training of supervised learning models.
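Two of the post-processing operations, cleaning and normalization, can be sketched as follows. Min-max scaling is used here as one common way to reach the [-1, 1] range; the text does not say which normalization is applied, and the function names are hypothetical.

```python
import math


def clean(pairs):
    """Drop (state, action) pairs containing missing (None) or NaN components."""
    def valid(vec):
        return all(v is not None and not math.isnan(v) for v in vec)

    return [(s, a) for s, a in pairs if valid(s) and valid(a)]


def minmax_scale(values, lo=-1.0, hi=1.0):
    """Linearly rescale one component so its minimum maps to lo and maximum to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:  # constant component: place it at the middle of the range
        return [(lo + hi) / 2.0 for _ in values]
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]
```

For instance, burial depths of 0.5 m, 2.75 m, and 5.0 m map to -1, 0, and 1 respectively under this scaling.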

[0082] The aforementioned method systematically generates state-action data pairs by sampling along the path in a 3D virtual pipeline scene at step lengths, combining virtual camera imaging and spatial computation, thus constructing a high-quality offline training dataset. It can automatically and efficiently complete large-scale data acquisition in a simulation environment without manual intervention; it can traverse various scenes in the pipeline path (bends, branches, diameter changes, depth variations, etc.), ensuring dataset diversity and achieving high scene coverage; and the entire process is based on precise 3D spatial calculations, avoiding subjective errors from manual annotation and guaranteeing annotation accuracy, thereby significantly improving the convergence speed and final navigation accuracy of the navigation model.

[0083] In one embodiment, a policy network and a value network are used to perform offline supervised learning on state-action data pairs to generate an initial policy model with basic obstacle avoidance and target approach capabilities. This includes: training the policy network with state data from the state-action data pair as input and action data from the state-action data pair as output; using the input and output of the policy network as input features for the value network and obtaining the corresponding cumulative expected reward; calculating the error between the predicted action distribution output by the policy network and the corresponding real action data in the dataset; minimizing the error using a backpropagation algorithm, iteratively updating the network parameters of the policy network, achieving behavior cloning to complete the initialization of the basic policy, and generating an initial policy model with basic obstacle avoidance and target approach capabilities.

[0084] The server uses a policy network to learn and train by taking state data from state-action data pairs as input and action data from state-action data pairs as output.

[0085] The initial policy model training process includes the following sub-steps: (1) Multimodal input data reading and preprocessing: the server extracts and constructs the two-branch state input (X) and the action supervision labels (Y) required by the policy network from the offline dataset. ① Image state input ($X_{img}$): extract the depth map and the type map. Normalize the depth map values to the [0, 1] range, and scale the type map values by a preset ratio (e.g., divide by 20). Then stack the two along the channel dimension to form a two-channel image tensor of shape (N, 256, 256, 2), which provides spatial perception of local geometry and obstacles.

[0086] ② Scalar state input ($X_{scalar}$): the system extracts the scalar features (6 dimensions in this embodiment) that carry global navigation information. These features include: the relative azimuth (Target Rel Azimuth) and relative tilt (Target Rel Pitch) towards the final destination, the absolute spatial straight-line distance (Target Dis) from the current point to the final path endpoint, the vertical height difference (RelZ Dis Final), the relative elevation difference (Dis To Ground) from the current coordinates to the fitted ground, and the current absolute tilt (Abs Pitch). The server standardizes / normalizes these 6-dimensional scalars to form a tensor of shape (N, 6).

[0087] ③ Action label (Y): extract the horizontal relative deflection angle and the vertical relative tilt angle, scale them by a fixed ratio (e.g., divide by 5) so that the value range of the action label matches the output boundary of the neural network (approximately the [-1, 1] interval), and filter out abnormal samples containing invalid values (NaN) to form an action label tensor of shape (N, 2).
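The label scaling and NaN filtering of step ③ can be sketched as below. Returning the indices of the kept samples is an assumption made so that the image and scalar arrays, which are stored separately but aligned by sample index, can be filtered consistently; `scale_labels` is a hypothetical name.

```python
import math

ACTION_SCALE = 5.0  # "divide by 5" from the text


def scale_labels(actions, scale=ACTION_SCALE):
    """Scale (delta azimuth, delta pitch) labels and drop rows containing NaN.

    Returns (kept_indices, labels); kept_indices lets the caller apply the
    same filter to the index-aligned image and scalar inputs.
    """
    kept, labels = [], []
    for i, (d_az, d_pitch) in enumerate(actions):
        if math.isnan(d_az) or math.isnan(d_pitch):
            continue
        kept.append(i)
        labels.append((d_az / scale, d_pitch / scale))
    return kept, labels
```

Deflections larger than 5° in magnitude would scale to values outside [-1, 1], which is consistent with the text describing the target interval as approximate.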

[0088] (2) Constructing the core neural network modules: the server then builds a core network architecture based on the TensorFlow / Keras framework, which specifically includes three sub-processing units. ① Visual encoder: as a shared structural template, it receives the aforementioned dual-channel image state input. In this example, the main body of the network consists of three consecutive cascaded layers: "2D convolutional layer (Conv2D) → max pooling layer (MaxPooling) → batch normalization layer (Batch Normalization)", and a dropout layer (Dropout) is introduced. Finally, through a flattening layer (Flatten) and a fully connected layer (Dense), the high-dimensional image is compressed and output as a low-dimensional (e.g., 32-dimensional) visual feature vector.

[0089] ②Actor Network (Policy Network) A multimodal dual-branch input architecture is adopted. The image branch is connected to the visual encoder to obtain 32-dimensional visual features, and the scalar branch is encoded into 16-dimensional features through a fully connected layer with a Tanh activation function. After the two features are concatenated, they pass through a hidden layer containing 128 neurons and a dropout layer in sequence. Finally, the output layer with a Tanh activation function outputs 2-dimensional continuous actions (azimuth and tilt).

[0090] ③ Critic Network (Value Network) The server uses the input and output of the policy network as the value approximation Q(s,a) for the state-action pair. It then uses these as the input features of the value network and obtains the corresponding cumulative expected reward Q-value. The state branch concatenates the features output by the visual encoder with the navigation vector features and integrates them into 64 dimensions. The action branch separately encodes the input action features into 64 dimensions, then fuses and concatenates the state and action features. After dimensionality reduction mapping through multiple fully connected network layers, a single neuron is finally output as the estimated Q-value.

[0091] (3) Target construction and Critic Network update based on engineering approximation value In each training step (Train Step), the server constructs an engineered approximate value target (Target Q) with specific physical meaning. The server calculates Target Q based on the current state and the action taken, using the following reward and penalty terms: ① Base Reward: A reward is given for moving toward the target by calculating the cosine similarity between the model's output action direction and the real target direction; ② Slope Penalty: Calculates the coherence of pitch angle changes and penalizes violent jitter; ③ Safety collision assessment: Extracts depth features of the central region of the local image to predict the collision risk (the server retains the interface for this collision penalty term, and the corresponding penalty weight coefficient can be configured according to the strength of the safety constraints during actual training).
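A minimal sketch of how the three terms above might be combined into a single Target Q is given below. The function names, the additive form, and the weights `w_slope` and `w_collision` are assumptions; the description states only which terms are used and that the collision weight is configurable (it defaults to 0 here, mirroring the "retained interface").

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two direction vectors (0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def engineered_target_q(action_dir, target_dir, pitch_prev, pitch_now,
                        collision_risk=0.0, w_slope=0.5, w_collision=0.0):
    """Base reward minus slope and collision penalties (hypothetical weights)."""
    base_reward = cosine_similarity(action_dir, target_dir)
    slope_penalty = w_slope * abs(pitch_now - pitch_prev)  # penalize pitch jitter
    collision_penalty = w_collision * collision_risk       # optional safety term
    return base_reward - slope_penalty - collision_penalty


if __name__ == "__main__":
    print(engineered_target_q((1, 0, 0), (1, 0, 0), 0.0, 0.0))
    print(engineered_target_q((1, 0, 0), (0, 1, 0), 0.0, 0.2))
```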

[0092] Based on the synthesized Target Q, the Critic network updates its own network parameters using the backpropagation algorithm by minimizing the mean squared error (MSE) between its predicted value (Predicted Q) and Target Q. MSE = Error between predicted and actual actions + Weight coefficients * Value term loss, where the value term loss is typically negative and represents the accumulated expected reward.

[0093] (4) Hybrid loss-driven Actor Network update The parameter updates of the Actor network employ a hybrid loss function, primarily driven by supervised labels and secondarily by value guidance: $\text{Loss}_{\text{Actor}} = \text{Loss}_{\text{BC}} + \lambda \times \text{Loss}_{Q}$. Here, $\text{Loss}_{\text{BC}}$ is the behavior cloning loss, i.e., the mean squared error between the model's predicted action and the action label, which prompts the model to imitate expert trajectories; $\text{Loss}_{Q}$ is the value-based loss (usually the negative of the Critic's predicted Q-value), which prompts the model to output actions that yield high Q-values; and λ is the value-guided weight coefficient (set to a lightweight weight of 0.01 in this embodiment). This hybrid loss ensures that the model can quickly fit the obstacle avoidance common sense of the expert trajectories while using the value assessment of reinforcement learning for preliminary policy optimization.
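The hybrid loss can be illustrated numerically with a short sketch. It assumes that both loss terms are averaged over the mini-batch and that the Q-values have already been predicted by the Critic; the function name and argument layout are hypothetical.

```python
def hybrid_actor_loss(pred_actions, label_actions, q_values, lam=0.01):
    """Loss_Actor = Loss_BC + lam * Loss_Q for one mini-batch.

    pred_actions, label_actions: lists of (azimuth, tilt) pairs.
    q_values: Critic estimates for the predicted actions.
    """
    squared_errors = [
        (p - t) ** 2
        for pred, label in zip(pred_actions, label_actions)
        for p, t in zip(pred, label)
    ]
    loss_bc = sum(squared_errors) / len(squared_errors)  # behavior cloning MSE
    loss_q = -sum(q_values) / len(q_values)              # negative mean Q-value
    return loss_bc + lam * loss_q


if __name__ == "__main__":
    preds = [(0.1, -0.2), (0.4, 0.0)]
    labels = [(0.0, -0.2), (0.5, 0.1)]
    print(hybrid_actor_loss(preds, labels, q_values=[1.5, 2.0]))
```

With λ = 0.01 the value term shifts the loss only slightly, so the imitation term dominates, which matches the "primarily supervised, secondarily value-guided" design stated above.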

[0094] In each training epoch, the model alternately executes the update logic for the Actor and Critic as described above, and accumulates the loss value. In this demonstration example, the batch size is set to 256. After approximately 500 epochs and about 3 hours of iterative training, the model achieves initial convergence, obtaining the initial weight model with the minimum loss function, which is used as a high-quality initial policy for the subsequent Online Policy Optimization (Online RL) stage.

[0095] The above method, by fitting a known trajectory, enables the model to learn to make reasonable avoidance and approach-to-the-end actions based on the input obstacle image it sees. This can solve the problem of cold start difficulty in the early stages of reinforcement learning.

[0096] In one embodiment, a policy network is used for learning and training, taking state data from state-action data pairs as input and action data from state-action data pairs as output. This includes: extracting local images from state data pairs in an offline training dataset and converting the local images into visual feature vectors; converting the vector features from the state data in state-action data pairs into navigation feature vectors; combining the visual feature vectors and navigation feature vectors into a fused feature vector; and using a policy network for learning and training, taking the fused feature vector as input and action data from state-action data pairs as output.

[0097] The server acquires a local image corresponding to each state-action data pair in the offline training dataset. The local image can be a local depth image and a type image within a certain range of the current location. The server inputs the acquired local images into a pre-trained convolutional neural network (CNN) for feature extraction, mapping the image data into a fixed-dimensional visual feature vector. This visual feature vector represents the geometric distribution of perceived obstacles such as existing pipelines or building foundations and is output as a flattened 32-dimensional vector.

[0098] The server extracts vector features from state-action data pairs, obtaining state data s = {pipeline direction, depth above ground, relative position of target point}. Using a multilayer perceptron (MLP) or linear embedding layer, these discrete or continuous state data are encoded into low-dimensional dense (e.g., 16-dimensional) navigation feature vectors, which are used to provide macroscopic pipeline navigation guidance.

[0099] The server concatenates or fuses the visual feature vector and the navigation feature vector element by element along the feature dimensions to construct a fused feature vector (unified input states). The first dimension of this fused feature vector corresponds to the visual semantic feature channel, and the second dimension corresponds to the navigation state feature channel, thereby achieving the joint expression of perception and decision features while preserving the independence of multimodal information.

[0100] The policy network employs deep reinforcement learning networks (such as PPO, SAC, or DDPG models) or imitation learning networks (such as behavior cloning models). The server inputs fused feature vectors into the policy network and uses action data from state-action data pairs as supervision signals or target outputs for end-to-end learning and training. During training, the network parameters of the policy network are updated by minimizing the loss function between predicted and actual actions (e.g., mean squared error or cross-entropy loss), enabling it to output optimal navigation actions (e.g., linear velocity and angular velocity commands) given the current perception and state features. Iterative optimization continues until the loss function converges, thereby constructing an autonomous navigation model that can directly map local observations to motion control.

[0101] Through the above methods, the policy network not only integrates visual perception information and navigation state information, but also uses a structured fusion feature vector as a unified representation, which significantly improves the model's generalization ability and real-time decision-making in complex dynamic environments.

[0102] In one embodiment, a policy network and a value network are used to autonomously explore a 3D virtual scene based on the DDPG learning system, and an initial policy model is iteratively generated into a pipeline navigation model based on ground surface information. This includes: the policy network continuously explores the 3D virtual scene autonomously and outputs action a based on the current state s at the current time step; the 3D virtual scene executes action a, updates the state to s', and the value network calculates the immediate reward r based on a composite reward function with multi-objective constraints including ground surface information, generating a transition tuple (s,a,r,s',done). The transition tuples (s,a,r,s',done) at different time steps are then combined into an experience replay pool for iterative training, where done is the Boolean value corresponding to the termination flag; small batches of transition tuples are randomly sampled from the experience replay pool, and the value network and policy network are alternately trained and updated, thereby continuously optimizing the navigation strategy in the initial policy model; when the number of training rounds reaches a preset upper limit, or the average reward value of consecutive predetermined rounds converges, the iteration stops, thus iterating the initial policy model into a pipeline navigation model.

[0103] The server builds and initializes the DDPG (Deep Deterministic Policy Gradient) learning system, in which the policy network (Actor network) outputs deterministic navigation actions, and the value network (Critic network) evaluates the Q-values of state-action pairs. The server uses the initial policy model as the initial network parameters for the policy network.

[0104] The server can first initialize the 3D simulation environment of underground pipelines, load the scene topology, and filter out meaningless routes with excessively short physical lengths (e.g., less than 2 meters) to ensure the effectiveness of training samples. Subsequently, the server loads the weight parameters of the offline-trained, converged optimal policy network (best_actor) and optimal value network (best_critic). To address the network oscillation problem caused by "moving target" common in reinforcement learning, the server hard copies two identical shadow networks in memory: the target policy network (target actor) and the target value network (target critic), and initializes an experience replay buffer to store interaction data.

[0105] At the start of each training episode, a trainable path is randomly selected from the environment. The agent is initialized to its starting position and preset pose, and a final target point is set. To prevent data leakage, the server actively masks the edge entities of the "target pipeline" itself. In each iteration, the server dynamically constructs a dual-branch state input: ① Image state The virtual camera is invoked to query candidate pipelines at the current 3D coordinate point and render a local depth map and a type map. The depth map is inverted and cropped to a preset range (e.g., 0~20 meters, with larger pixel activation values for closer distances), normalized, and then superimposed with the type map to form a two-channel image tensor of (256, 256, 2).

[0106] ② Scalar state The six-dimensional macroscopic navigation scalars (relative azimuth, relative tilt, distance to target, vertical height difference, distance to ground, and current absolute tilt), which are completely consistent with those of the offline stage, are extracted in real time and spliced into a one-dimensional tensor.
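The dual-branch state construction can be sketched as follows, using plain nested lists in place of tensors. The angle convention (azimuth measured in the horizontal x-y plane, tilt measured from the horizontal, z pointing up), the wrapping of the relative azimuth to [-180, 180), and all function names are assumptions made for illustration; the description specifies only the 0~20 m range, the two image channels, and the six scalar quantities.

```python
import math


def preprocess_depth(depth_m, max_range=20.0):
    """Clip a depth map (meters) to [0, max_range], invert and normalize it.

    After this step, closer obstacles have larger values in [0, 1].
    """
    return [[1.0 - min(max(d, 0.0), max_range) / max_range for d in row]
            for row in depth_m]


def stack_channels(depth_norm, type_map):
    """Superimpose depth and type maps into an (H, W, 2) nested list."""
    return [[[d, t] for d, t in zip(d_row, t_row)]
            for d_row, t_row in zip(depth_norm, type_map)]


def scalar_state(position, target, azimuth_deg, tilt_deg, ground_z):
    """Six navigation scalars: relative azimuth, relative tilt, distance to
    target, vertical height difference, distance to ground, absolute tilt."""
    dx, dy, dz = (t - p for p, t in zip(position, target))
    horizontal = math.hypot(dx, dy)
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    target_azimuth = math.degrees(math.atan2(dy, dx))
    target_tilt = math.degrees(math.atan2(dz, horizontal))
    rel_azimuth = (target_azimuth - azimuth_deg + 180.0) % 360.0 - 180.0
    rel_tilt = target_tilt - tilt_deg
    return [rel_azimuth, rel_tilt, distance, dz, ground_z - position[2], tilt_deg]


if __name__ == "__main__":
    image = stack_channels(preprocess_depth([[0.0, 10.0, 40.0]]), [[1, 0, 2]])
    print(image)
    print(scalar_state((0.0, 0.0, -2.0), (10.0, 0.0, -2.0), 0.0, 0.0, 0.0))
```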

[0107] (3) Noise-based motion exploration and physical environment stepping The current observed state is input into the Actor network for forward inference, and the output is an initial action within the normalized interval [-1, 1]. To encourage the agent to explore the unknown state space, the server adds Gaussian noise to the output action and then clips the noisy action back to the valid interval.
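The noise injection and clipping step might look like the sketch below. The noise standard deviation `sigma` is a hypothetical value, since the description does not state one.

```python
import random


def explore_action(action, sigma=0.1, rng=None):
    """Add zero-mean Gaussian noise to each action component, then clip to [-1, 1]."""
    rng = rng or random.Random()
    noisy = [a + rng.gauss(0.0, sigma) for a in action]
    return [max(-1.0, min(1.0, a)) for a in noisy]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    print(explore_action([0.95, -0.30], sigma=0.2, rng=rng))
```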

[0108] Subsequently, the server inversely maps the network action to real physical angle increments (horizontal azimuth increment and vertical pitch increment) by a fixed ratio (e.g., multiplied by 5). Using the Angles to Vector formula, the state angles are converted into a unit direction vector in three-dimensional space, driving the agent to take a fixed step (e.g., move forward 1 meter) along this vector in the three-dimensional virtual environment, thereby updating the agent's absolute coordinates and current state. The policy network continuously explores autonomously in the three-dimensional virtual scene. At each discrete time step t, the policy network outputs action a based on the current perceived state s. State data s = {pipeline direction, depth from ground, relative position of target point}, action data a = {relative azimuth angle, tilt angle}. To avoid getting trapped in local optima, random noise can be added to action a to form an exploratory action a_exp.
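One way to realize the action-to-angle mapping and the fixed forward step is sketched below. The spherical convention (azimuth in the x-y plane measured from the +x axis, pitch measured from the horizontal, z up) and the interpretation of the scale factor 5 as degrees are assumptions; the description does not define the Angles to Vector formula explicitly.

```python
import math


def angles_to_vector(azimuth_deg, pitch_deg):
    """Convert azimuth and pitch (degrees) into a 3D unit direction vector."""
    az = math.radians(azimuth_deg)
    pt = math.radians(pitch_deg)
    return (math.cos(pt) * math.cos(az),
            math.cos(pt) * math.sin(az),
            math.sin(pt))


def step_agent(position, azimuth_deg, pitch_deg, action,
               angle_scale=5.0, step_length=1.0):
    """Apply a normalized action in [-1, 1]^2 and move one fixed step forward."""
    new_azimuth = azimuth_deg + action[0] * angle_scale  # horizontal increment
    new_pitch = pitch_deg + action[1] * angle_scale      # vertical increment
    direction = angles_to_vector(new_azimuth, new_pitch)
    new_position = tuple(p + step_length * c for p, c in zip(position, direction))
    return new_position, new_azimuth, new_pitch


if __name__ == "__main__":
    print(step_agent((0.0, 0.0, -2.0), 0.0, 0.0, (1.0, -1.0)))
```

Because each step changes the heading by at most `angle_scale` per axis, the generated polyline cannot turn sharply between consecutive nodes, which is consistent with the smooth-curvature goal of the method.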

[0109] The exploratory action a_exp is executed in the 3D virtual scene, and the agent's pose and the environmental state are updated to the next state s'. The server invokes the value network to calculate the immediate reward r for the current action based on the ground surface information.

[0110] The server triggers a state check based on the agent's new coordinates and returns a composite reward signal and a termination status (Done): ① Goal-oriented reward: If the coordinates reach the endpoint tolerance range, a very high positive reward (such as +300) is given, and the current round ends; ② Fatal violation penalty: If the pipeline is detected to have broken through the ground surface (violating the common sense of underground pipeline installation), a severe penalty (such as -150) is imposed; if it collides with other existing underground pipelines, a collision penalty (such as -100) is imposed. In either case, the round is terminated immediately. ③ Process-guided rewards and penalties: During normal, unterminated steps, a positive distance reward is given based on the reduction in the relative distance between the agent and the target point; simultaneously, a small base step penalty is applied at each step (to encourage finding the shortest path), and a large slope penalty is applied (to limit sharp bends in the vertical direction of the pipeline). The terminated state "done" is represented by a Boolean value of 0 or 1.
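The three-part reward logic above can be summarized in a short sketch. The terminal values +300, -150 and -100 are the examples given in the text; the step penalty, the slope weight, and the use of the absolute pitch change as the slope measure are hypothetical choices.

```python
def composite_reward(prev_dist, new_dist, pitch_change_deg,
                     reached_goal=False, broke_surface=False, hit_pipeline=False,
                     step_penalty=0.1, slope_weight=0.5):
    """Return (reward, done) for one environment step."""
    if reached_goal:
        return 300.0, True    # goal-oriented reward, episode ends
    if broke_surface:
        return -150.0, True   # fatal violation: pipeline above ground
    if hit_pipeline:
        return -100.0, True   # fatal violation: collision with existing pipeline
    progress = prev_dist - new_dist                    # positive when approaching
    slope_term = slope_weight * abs(pitch_change_deg)  # discourage vertical bends
    return progress - step_penalty - slope_term, False


if __name__ == "__main__":
    print(composite_reward(10.0, 9.0, 0.0))
    print(composite_reward(10.0, 9.0, 0.0, broke_surface=True))
```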

[0111] The server assembles the current state s, the action a, the immediate reward r, the next state s', and the termination flag done into a transition tuple (s, a, r, s', done), and stores the transition tuples from different time steps together in the experience replay pool for iterative training.
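A minimal experience replay pool, assuming a fixed capacity with first-in-first-out eviction and uniform sampling (neither is specified in the description), could be written as:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity pool of (s, a, r, s_next, done) transition tuples."""

    def __init__(self, capacity=100_000, seed=None):
        self._data = deque(maxlen=capacity)  # oldest tuples are dropped first
        self._rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self._data.append((s, a, r, s_next, done))

    def __len__(self):
        return len(self._data)

    def sample(self, batch_size=64):
        """Uniformly sample a mini-batch without replacement."""
        return self._rng.sample(list(self._data), batch_size)


if __name__ == "__main__":
    pool = ReplayBuffer(capacity=1000, seed=0)
    for t in range(200):
        pool.add(t, (0.0, 0.0), 1.0, t + 1, False)
    print(len(pool), len(pool.sample(64)))
```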

[0112] Once the amount of data accumulated in the experience replay pool reaches a preset threshold, the server randomly selects a small batch (e.g., 64 records) of transition tuples from the pool at each step to update the network parameters. ①Critic Value Network Update The server uses the target actor network to calculate the action a' of the next state, and the target critic network evaluates the expected value Q' of the next state. Based on the Bellman equation, a temporal difference target (TD Target) is constructed: y = r + γ × Q'(s', a') × (1 - done). Where y represents the expected value that the Critic network needs to fit during this training; r represents the immediate reward, which is the real physical feedback score given by the simulation environment immediately after performing action a (extending the pipeline forward by 1 meter and deflecting at a certain angle) in the current state s; γ is the discount factor, representing the weight of the future estimated reward, which is a constant between 0 and 1; s' represents the next state, specifically the new position the environment moves to after performing action a, and the new local depth map, new distance, etc. observed; a' represents the next action, the next deflection angle action that the model plans to perform in the next state s', which is predicted by the target actor network; Q'(s', a') represents the estimated future cumulative total score that can be obtained by performing the next action a' in the next state s', which is predicted by the target critic network; (1 - done) is a termination state cutoff term, and done is a Boolean value (0 or 1). If the current step causes the pipeline to hit a wall, break through the ground, or reach the end point, done=1, indicating that the round ends (Game Over).
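The TD target formula can be checked with a few lines of arithmetic. The discount factor 0.99 below is an assumed value; the description only requires a constant between 0 and 1.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """y = r + gamma * Q'(s', a') * (1 - done)."""
    return reward + gamma * next_q * (1.0 - float(done))


def td_targets(batch_rewards, batch_next_q, batch_done, gamma=0.99):
    """Vectorized form over a sampled mini-batch."""
    return [td_target(r, q, d, gamma)
            for r, q, d in zip(batch_rewards, batch_next_q, batch_done)]


if __name__ == "__main__":
    # Non-terminal step: the future estimate is discounted and added.
    print(td_target(1.0, 10.0, done=0, gamma=0.9))
    # Terminal step: (1 - done) = 0, so the target is the reward alone.
    print(td_target(300.0, 50.0, done=1))
```

When done = 1 the factor (1 - done) removes the bootstrapped term, so a terminal collision or goal arrival is valued by its immediate reward only.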

[0113] The current Critic network updates its weights by minimizing the mean squared error (MSE) between its predicted value Q(s,a) and the target value y. Since DDPG is an off-policy learning method, the experience replay pool allows historical experience to be sampled repeatedly, improving sample utilization.

[0114] During the training phase, the server randomly samples mini-batches of transition tuples from the experience replay pool, each tuple containing (s, a, r, s'). The server computes the target Q-value using the target value network (target critic) and the target policy network (target actor): $y_i = r_i + \gamma \cdot Q_{\text{target}}(s', \mu_{\text{target}}(s' \mid \theta^{\mu_{\text{target}}}) \mid \theta^{Q_{\text{target}}})$. Here, γ is the discount factor. The network parameters of the value network are updated by minimizing the temporal difference error (TD Error) between the current Q-value output of the value network and the target Q-value. Then, the network parameters of the policy network are updated using the policy gradient ascent method, aiming to maximize the Q-value that the value network assigns to the action output by the policy network. After every few rounds of updates, a soft update method is used to proportionally copy the parameters of the current network to the target network at a rate τ, maintaining training stability: $\theta^{\mu_{\text{target}}} \leftarrow \tau \cdot \theta^{\mu} + (1-\tau) \cdot \theta^{\mu_{\text{target}}}$. The parameters $\theta^{Q_{\text{target}}}$ are updated in the same way.

[0115] ②Actor Policy Network Update With the parameters of the Critic network fixed, and based on the deterministic policy gradient theorem, the optimization objective of the Actor network is to maximize the Critic network's evaluation of its output actions. The server optimizes the Actor network weights by calculating the loss function -Q(s, Actor(s)) and backpropagating.

[0116] After the main networks are updated, the server uses a soft update strategy to blend the weights of the main networks into the target networks as a slow exponential moving average with a very small smoothing coefficient (such as τ=0.001), thereby ensuring the stability of the training process.

[0117] In this way, in each episode, the policy network outputs actions based on the current network, and the value network provides evaluation signals based on ground surface information. The two are trained alternately and promote each other, thereby continuously optimizing the navigation strategy in the initial policy model, enabling it to gradually learn to travel along the pipeline surface area, avoid impassable terrain, and efficiently reach the target point.

[0118] Iterative training stops when the number of training rounds in the "exploration-stepping-pooling-sampling-update" cycle reaches a preset limit (e.g., 10,000 rounds), or when the average reward value over a predetermined number of consecutive rounds (e.g., 100 rounds) converges, i.e., its fluctuation falls below a preset threshold. During training, the model gradually overcomes the limitations of the offline dataset's layout and learns to autonomously handle various dynamically generated situations in complex underground spaces. The policy network successfully iterates the initial policy model into a pipeline navigation model, which can output smooth, safe, and efficient navigation action sequences in 3D pipeline scenes, relying solely on current local perception information.
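The stopping rule can be sketched as below. Convergence is interpreted here as a small change between the mean rewards of two consecutive windows; this interpretation, the tolerance `tol`, and the function name are assumptions, since the description gives only the round limit and the window length as examples.

```python
def should_stop(episode_rewards, max_episodes=10_000, window=100, tol=1.0):
    """Stop at the episode limit, or when the windowed mean reward has settled."""
    n = len(episode_rewards)
    if n >= max_episodes:
        return True
    if n < 2 * window:
        return False  # not enough history to compare two windows
    recent = sum(episode_rewards[-window:]) / window
    previous = sum(episode_rewards[-2 * window:-window]) / window
    return abs(recent - previous) < tol


if __name__ == "__main__":
    print(should_stop([5.0] * 200))          # flat rewards
    print(should_stop(list(range(200))))     # still improving
```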

[0119] The above method deeply refines the general dynamic decision-making logic based on local visual perception to identify obstacles and autonomously make avoidance actions, as well as the macroscopic approach to the target point based on navigation vectors. It also transforms the complex global planning problem into a real-time response problem based on the current local state, giving the model strong environmental adaptability.

[0120] In one embodiment, mini-batches of transition tuples are randomly sampled from the experience replay pool, and the value network and policy network are trained and updated alternately, thereby enabling the policy network to continuously optimize the navigation policy in the initial policy model, including: initializing the corresponding target policy network and target value network for the policy network and value network, respectively; calculating the temporal difference target based on the immediate reward r, the termination flag done, and the evaluation value of the target value network; updating the network parameters of the value network by minimizing the mean squared error between the temporal difference target and the current value network prediction; adopting the deterministic policy gradient mechanism of DDPG to update the network parameters of the policy network by maximizing the evaluation value of the current value network on the output action of the policy network; and adopting a soft update approach to synchronously update the parameters of the target value network and the target policy network, thereby enabling the policy network to continuously optimize the navigation policy in the initial policy model.

[0121] The server initializes the corresponding target policy network and target value network for the policy network and value network, respectively, and calculates the temporal difference target based on the immediate reward r, the termination flag done, and the evaluation value of the target value network.

[0122] The server randomly samples a mini-batch of transition tuples from the experience replay pool, each tuple being represented as (s, a, r, s', done). Random mini-batch sampling reduces the correlation between training samples, improving the stability of parameter updates and sample utilization efficiency.

[0123] The server inputs the state s and action a from the sampled transition tuple into the current value network (Critic network), whose parameters are denoted as $\theta^{Q}$, to obtain the current Q-value $Q(s, a \mid \theta^{Q})$. Simultaneously, the next state s' is input into the target policy network (target actor network), whose parameters are denoted as $\theta^{\mu_{\text{target}}}$, to obtain the next action $a' = \mu_{\text{target}}(s' \mid \theta^{\mu_{\text{target}}})$. Then (s', a') is input into the target value network (target critic network), whose parameters are denoted as $\theta^{Q_{\text{target}}}$, to calculate the target Q-value: $y_i = r_i + \gamma \cdot Q_{\text{target}}(s', \mu_{\text{target}}(s' \mid \theta^{\mu_{\text{target}}}) \mid \theta^{Q_{\text{target}}})$. Here, γ is a discount factor used to balance the weights of immediate rewards and future rewards. Then, the temporal difference error between the current Q-value and the target Q-value is calculated as the loss function.

[0124] The server minimizes the loss function described above using gradient descent (such as Adam or SGD optimizer) and updates the network parameters θ^Q of the value network along the gradient direction, so that the value network's evaluation of state-action pairs gradually approaches the true cumulative reward.

[0125] After updating the value network, the server can also update the current policy network (Actor network, denoted by parameters θ^μ). The optimization objective of the policy network is to maximize the Q-value of the value network output, thereby learning deterministic actions that yield higher cumulative rewards. The server can use Deterministic Policy Gradient (DPG) for parameter updates. The server can update the Actor and Critic networks through backpropagation.

[0126] To prevent Q-value estimation from diverging or fluctuating drastically during training, the server employs a soft update strategy to synchronize the parameters of both the target critic and target actor networks. Specifically, in each training round, the system does not directly copy the current network parameters, but rather incrementally updates the target network parameters according to a preset soft update coefficient τ (typically 0.001 or 0.005, much less than 1). Update the target policy network parameters: θ^μ' ← τ·θ^μ + (1-τ)·θ^μ'; Update the target value network parameters: θ^Q' ← τ·θ^Q + (1-τ)·θ^Q', Here, θ^μ and θ^Q are the parameters of the current policy network and the current value network, respectively, while θ^μ' and θ^Q' are the corresponding target network parameters. Through this soft update method, the changes in the target network are effectively smoothed, significantly improving the convergence stability of the training process.
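The soft update rule can be demonstrated on flat lists of parameters. Real network weights would be multi-dimensional tensors, but the arithmetic per element is the same; the function name is hypothetical.

```python
def soft_update(target_params, main_params, tau=0.001):
    """theta_target <- tau * theta_main + (1 - tau) * theta_target, element-wise."""
    return [tau * m + (1.0 - tau) * t
            for t, m in zip(target_params, main_params)]


if __name__ == "__main__":
    target = [0.0, 0.0]
    main = [1.0, -2.0]
    for _ in range(1000):
        target = soft_update(target, main, tau=0.005)
    print(target)  # has moved most of the way toward the main parameters
```

With a small τ the target parameters trail the main parameters as an exponential moving average, so the TD target changes slowly between updates.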

[0127] Within each training cycle, the system repeats the above steps: first updating the value network (minimizing Bellman error) on the same mini-batch of samples, then updating the policy network (maximizing the deterministic policy gradient), and finally soft-updating both target networks. The value network and policy network form an "evaluation-improvement" closed loop: the value network provides the policy network with accurate action evaluation signals, and the policy network continuously adjusts its action generation strategy based on these signals to pursue higher rewards. Through thousands to tens of thousands of alternating training cycles, the policy network gradually masters efficient and safe navigation behavior in 3D pipeline scenes, continuously optimizing the navigation strategy in the initial policy model, ultimately converging into a mature pipeline navigation model.

[0128] The above method effectively smooths out the changes in the target network, significantly improving the convergence stability of the training process.

[0129] In one embodiment, the method further includes: Export the continuous 3D coordinates of the target 3D underground pipeline path into a standardized data table, and simultaneously render and generate an HTML scene file for intuitive review.

[0130] The server uses a post-processing script to export the continuous 3D coordinates of the target 3D underground pipeline path into a standardized data table, and simultaneously renders and generates an HTML scene file for intuitive review. The exported coordinate data has good smoothness and spatial rationality, and can be directly imported into subsequent traditional BIM modeling software or engineering design calculation software as a reference drawing for generative-aided planning and design.

[0131] After obtaining the automatically generated pipeline routes, technicians can combine qualitative visualization with quantitative calculation indicators to conduct a comprehensive evaluation and analysis of the output pipelines. (1) 3D visualization review and hard collision verification Technical personnel can visually assess whether the overall spatial orientation of pipelines conforms to conventional engineering logic by reviewing HTML scene files or 3D views based on the BIM platform. At the same time, they can call the interference / collision check module in the BIM software to strictly verify whether there are potential risks of spatial intersection or insufficient safety clearance between the generated path entities and the existing underground pipeline network and the foundation of the structure. (2) Physical morphology and pipeline curvature assessment Based on the exported continuous coordinate matrix data, technicians or evaluation scripts can calculate the derivative of the deflection angle between adjacent nodes, verify whether the spatial bend angle and maximum curvature of the entire path meet the minimum physical bending radius requirements allowed by specific laying pipe materials (such as flexible information cables or rigid water supply and drainage pipes), and ensure the bendability of the path in actual construction. (3) Quantification of compliance and economic benefits The Z-axis elevation coordinates of each node of the pipeline are compared with the surface elevation data (mesh) to verify whether the soil cover depth of the entire line strictly conforms to the lower and upper limits of the burial depth specifications for frost-resistant soil and protection against external force damage. 
In addition, by calculating the ratio of the actual total length of the generated path to the theoretical straight-line distance between the start and end points (path efficiency), and by counting the number of turning nodes (such as elbow components), the material economy and construction convenience of the pipeline layout are quantitatively evaluated.
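The quantitative checks in items (2) and (3) can be prototyped with a short evaluation script. The sketch assumes the exported path is a list of (x, y, z) nodes with z as elevation and that the surface elevation above each node has already been looked up; the function names and the depth limits in the usage example are hypothetical.

```python
import math


def path_length(points):
    """Total polyline length of the generated path."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def path_efficiency(points):
    """Ratio of actual path length to the straight start-to-end distance."""
    straight = math.dist(points[0], points[-1])
    if straight == 0.0:
        raise ValueError("start and end points coincide")
    return path_length(points) / straight


def deflection_angles_deg(points):
    """Deflection angle at each interior node (segments must be non-zero)."""
    angles = []
    for a, b, c in zip(points, points[1:], points[2:]):
        u = [q - p for p, q in zip(a, b)]
        v = [q - p for p, q in zip(b, c)]
        cos_angle = sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_angle)))))
    return angles


def cover_depth_ok(points, surface_z, min_depth, max_depth):
    """Check that the soil cover at every node lies within the allowed range."""
    return all(min_depth <= s - p[2] <= max_depth
               for p, s in zip(points, surface_z))


if __name__ == "__main__":
    path = [(0.0, 0.0, -1.5), (1.0, 0.0, -1.5), (2.0, 0.2, -1.6)]
    print(path_efficiency(path), deflection_angles_deg(path))
    print(cover_depth_ok(path, [0.0, 0.0, 0.0], min_depth=0.7, max_depth=3.0))
```

A path efficiency close to 1 indicates a nearly straight route, and the largest deflection angle can be compared against the bend limit of the chosen pipe material.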

[0132] Based on the multi-dimensional evaluation feedback mentioned above, if a deviation is found in the generation effect on a specific indicator, technicians can adjust the weight coefficient of the corresponding penalty term in the composite reward function of the online reinforcement learning stage in a targeted manner, thereby achieving a closed-loop iterative optimization of "generation-evaluation-tuning".

[0133] In one embodiment, as shown in Figure 2, an underground pipeline generation device based on virtual scene visual navigation is provided. The device includes a data acquisition module 201, an offline dataset construction module 202, an offline learning module 203, an online learning module 204, and a path generation module 205.

[0134] The data acquisition module 201 is used to acquire raw pipeline data and construct a three-dimensional virtual scene of underground pipelines and corresponding ground surface information.

[0135] The offline dataset construction module 202 is used to continuously sample the pipeline path in the 3D virtual scene according to a predetermined step size, collect the local state data and corresponding action data of each step to form the state-action data pairs required for supervised learning, and form an offline training dataset.

[0136] The offline learning module 203 is used to perform offline supervised learning on state-action data pairs using a policy network and a value network to generate an initial policy model with basic obstacle avoidance and target approach capabilities.

[0137] The online learning module 204 is used to autonomously explore a 3D virtual scene based on the DDPG learning system using policy networks and value networks, and to iteratively generate a pipeline navigation model based on at least ground surface information from the initial policy model.

[0138] The path generation module 205 is used to input real-time perception information collected by the virtual camera within the current predetermined range into the pipeline navigation model based on the target start point and target end point, and generate a coherent target three-dimensional underground pipeline path that avoids all potential obstacles.

[0139] Specific limitations regarding the underground pipeline generation device based on virtual scene visual navigation can be found in the limitations of the underground pipeline generation method based on virtual scene visual navigation mentioned above, and will not be repeated here. Each module in the aforementioned underground pipeline generation device based on virtual scene visual navigation can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0140] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 3. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and the database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The database stores data such as 3D virtual scenes. The network interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements a method for generating underground pipelines based on virtual scene visual navigation.

[0141] Those skilled in the art will understand that the structure shown in Figure 3 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0142] In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program, and the processor executes the computer program to perform the following steps: acquiring raw pipeline data, constructing a three-dimensional virtual scene of the underground pipeline and corresponding ground surface information; continuously sampling the pipeline path in the three-dimensional virtual scene according to a predetermined step size, collecting local state data and corresponding action data at each step to form state-action data pairs required for supervised learning, forming an offline training dataset; using a policy network and a value network to perform offline supervised learning on the state-action data pairs, generating an initial policy model with basic obstacle avoidance and target approach capabilities; using the policy network and value network based on the DDPG learning system to autonomously explore the three-dimensional virtual scene, and iteratively generating a pipeline navigation model based at least on the ground surface information of the initial policy model; and inputting real-time perception information within the current predetermined range collected by the virtual camera into the pipeline navigation model based on the target start point and target end point to generate a coherent target three-dimensional underground pipeline path that avoids all potential obstacles.

[0143] In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program performs the following steps: acquiring raw pipeline data and constructing a three-dimensional virtual scene of the underground pipeline and corresponding ground surface information; continuously sampling the pipeline path in the three-dimensional virtual scene at a predetermined step size, and collecting local state data and corresponding action data at each step to form the state-action data pairs required for supervised learning, which constitute an offline training dataset; performing offline supervised learning on the state-action data pairs using a policy network and a value network to generate an initial policy model with basic obstacle avoidance and target approach capabilities; autonomously exploring the three-dimensional virtual scene using the policy network and the value network based on the DDPG learning system, and iteratively generating a pipeline navigation model from the initial policy model based at least on the ground surface information; and, based on the target start point and target end point, inputting real-time perception information collected by the virtual camera within the current predetermined range into the pipeline navigation model to generate a coherent target three-dimensional underground pipeline path that avoids all potential obstacles.

[0144] To balance the high computational demands of large-scale model parameter optimization with the efficiency and convenience of practical engineering applications, the system of the present application preferably adopts a distributed hardware and software architecture that separates model training from inference execution in actual business deployment. Specifically, the heavy-load model training phase is preferably run as a local application on a high-performance workstation equipped with a high-performance graphics processing unit (GPU), which provides the large-scale concurrent computing power required for massive 3D virtual scene interactions. The lightweight final model obtained after training converges (i.e., the model inference phase) is preferably encapsulated and deployed on a cloud server based on a browser/server (B/S) architecture. This separated deployment mechanism has two advantages: on the one hand, it avoids heavy hardware dependence for end users; on the other hand, the pipeline generation model deployed in the cloud can be deeply integrated with, and interact in real time with, existing engineering project databases, domain design specification knowledge bases (such as the underlying vector library of a RAG system), BIM platforms, and the like. Engineering planning and design personnel only need to input engineering constraints and the planned start and end coordinates through a web browser or lightweight front-end interface to call the cloud service anytime, anywhere. This flexible, integrated deployment approach significantly lowers the barrier to entry for AI-assisted design.

[0145] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application.

Claims

1. A method for generating underground pipelines based on virtual scene visual navigation, characterized in that the method comprises: acquiring raw pipeline data and constructing a three-dimensional virtual scene of underground pipelines and corresponding ground surface information; invoking a virtual camera to sample continuously along the pipeline path in the three-dimensional virtual scene at a predetermined step size, and collecting the local state data and corresponding action data at each step to form the state-action data pairs required for supervised learning, which constitute an offline training dataset; performing offline supervised learning on the state-action data pairs using a policy network and a value network to generate an initial policy model with basic obstacle avoidance and target approach capabilities; autonomously exploring the three-dimensional virtual scene using the policy network and the value network based on the DDPG learning system, and iteratively generating a pipeline navigation model from the initial policy model based at least on the ground surface information; and, based on a target start point and a target end point, inputting the real-time perception information collected by the virtual camera within the current predetermined range into the pipeline navigation model to generate a coherent target three-dimensional underground pipeline path that avoids all potential obstacles.

2. The method for generating underground pipelines according to claim 1, characterized in that invoking the virtual camera to sample continuously along the pipeline path in the three-dimensional virtual scene at a predetermined step size, and collecting the local state data and corresponding action data at each step to form the state-action data pairs required for supervised learning, which constitute an offline training dataset, comprises: invoking the virtual camera along the pipeline path in the three-dimensional virtual scene at the predetermined step size, and computing in real time a local image of the current viewpoint for each sampling step; obtaining, based on the current coordinate information of the virtual camera, the local spatial coordinates corresponding to the local image; obtaining from the three-dimensional virtual scene, based on the local spatial coordinates, the pipe direction, the depth from the ground surface, and the relative position of the target point for the current pipe segment at the current viewpoint to form vector features, and storing the vector features as state data corresponding to the local image; obtaining the relative azimuth angle and inclination angle of the next node connected to the current pipe segment to form action data; and storing the state data and the action data as a state-action data pair corresponding to the local image, thereby forming the offline training dataset.
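As an illustrative, non-limiting sketch of the state-action construction recited in claim 2, the following Python fragment derives vector features and action angles from a pipeline polyline. Several points are assumptions made only for illustration and are not taken from the application: the function names `unit`, `angles`, and `build_pairs` are hypothetical, the ground surface is treated as a constant elevation `ground_z`, the action is interpreted as the change in azimuth and inclination between the current segment and the next one, and rendering of the local image by the virtual camera is omitted.

```python
import math

def unit(v):
    """Normalize a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def angles(d):
    """Return (azimuth in the XY plane, inclination from horizontal) in radians."""
    azimuth = math.atan2(d[1], d[0])
    inclination = math.asin(max(-1.0, min(1.0, d[2])))
    return azimuth, inclination

def build_pairs(nodes, ground_z, target):
    """Build (state, action) pairs for the interior nodes of one polyline.

    state  = pipe direction (3) + depth from ground (1) + relative target position (3)
    action = (change in azimuth, change in inclination) toward the next node
    """
    pairs = []
    for i in range(1, len(nodes) - 1):
        prev, cur, nxt = nodes[i - 1], nodes[i], nodes[i + 1]
        d_cur = unit(tuple(c - p for c, p in zip(cur, prev)))
        d_nxt = unit(tuple(n - c for n, c in zip(nxt, cur)))
        az0, inc0 = angles(d_cur)
        az1, inc1 = angles(d_nxt)
        # Wrap the azimuth change into [-pi, pi].
        d_az = math.atan2(math.sin(az1 - az0), math.cos(az1 - az0))
        depth = ground_z - cur[2]  # assumes a flat ground surface
        rel_target = tuple(t - c for t, c in zip(target, cur))
        state = d_cur + (depth,) + rel_target
        pairs.append((state, (d_az, inc1 - inc0)))
    return pairs

# Hypothetical polyline: a horizontal run, a 90-degree turn, then a descent.
nodes = [(0, 0, -2), (1, 0, -2), (1, 1, -2), (1, 2, -3)]
for state, action in build_pairs(nodes, ground_z=0.0, target=nodes[-1]):
    print(state, action)
```

In a fuller implementation each pair would also carry the local image rendered at the sampled viewpoint, and the depth would be read from the ground surface information of the scene rather than from a constant.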

3. The method for generating underground pipelines according to claim 1, characterized in that performing offline supervised learning on the state-action data pairs using a policy network and a value network to generate an initial policy model with basic obstacle avoidance and target approach capabilities comprises: training the policy network with the state data of each state-action data pair as input and the action data of that state-action data pair as output; using the input and output of the policy network as input features of the value network to obtain the corresponding cumulative expected reward; calculating the error between the predicted action distribution output by the policy network and the corresponding ground-truth action data in the dataset; and minimizing the error by the backpropagation algorithm and iteratively updating the network parameters of the policy network, thereby achieving behavior cloning, completing the initialization of the basic policy, and generating the initial policy model with basic obstacle avoidance and target approach capabilities.
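The behavior cloning step of claim 3 can be illustrated, in a deliberately simplified form, as supervised regression from state data to action data. The sketch below substitutes a linear policy with hand-written gradient updates for the neural policy network and backpropagation of the application, and it omits the value network; the names `predict`, `mse`, and `train_bc`, the learning rate, and the epoch count are all hypothetical choices.

```python
def predict(W, b, s):
    """Linear policy: one output per row of W."""
    return [sum(w * x for w, x in zip(row, s)) + bk for row, bk in zip(W, b)]

def mse(W, b, states, actions):
    """Mean squared error between predicted and recorded actions."""
    total, count = 0.0, 0
    for s, a in zip(states, actions):
        for p, t in zip(predict(W, b, s), a):
            total += (p - t) ** 2
            count += 1
    return total / count

def train_bc(states, actions, lr=0.1, epochs=500):
    """Fit the linear policy to the state-action pairs by stochastic gradient descent."""
    n_in, n_out = len(states[0]), len(actions[0])
    W = [[0.0] * n_in for _ in range(n_out)]
    b = [0.0] * n_out
    for _ in range(epochs):
        for s, a in zip(states, actions):
            pred = predict(W, b, s)
            for k in range(n_out):
                err = pred[k] - a[k]  # gradient of 0.5 * err**2 w.r.t. pred[k]
                for j in range(n_in):
                    W[k][j] -= lr * err * s[j]
                b[k] -= lr * err
    return W, b

# Toy dataset (hypothetical): the recorded action is a linear function of the state.
states = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
actions = [(0.5 * x - 0.2 * y + 0.1,) for x, y in states]
W, b = train_bc(states, actions)
print(mse(W, b, states, actions))
```

The same loop structure applies when the linear map is replaced by a neural network: only the forward pass and the gradient computation change.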

4. The method for generating underground pipelines according to claim 3, characterized in that training the policy network with the state data of each state-action data pair as input and the action data of that state-action data pair as output comprises: extracting the local images from the state-action data pairs in the offline training dataset and converting the local images into visual feature vectors; converting the vector features in the state data of the state-action data pairs into navigation feature vectors; combining the visual feature vector and the navigation feature vector into a fused feature vector; and training the policy network with the fused feature vector as input and the action data of the state-action data pair as output.

5. The method for generating underground pipelines according to claim 1, characterized in that autonomously exploring the three-dimensional virtual scene using the policy network and the value network based on the DDPG learning system, and iteratively generating a pipeline navigation model from the initial policy model based at least on the ground surface information, comprises: the policy network continuously exploring the three-dimensional virtual scene autonomously and outputting an action a based on the current state s at the current time step; the three-dimensional virtual scene executing the action a, updating the state to s', and calculating an immediate reward r from a composite reward function containing the ground surface information and multi-objective constraints, thereby generating a transition tuple (s, a, r, s', done), where done is a Boolean termination flag; collecting the transition tuples (s, a, r, s', done) from different time steps into an experience replay pool for iterative training; randomly sampling mini-batches of the transition tuples from the experience replay pool, and alternately training and updating the value network and the policy network so that the policy network continuously optimizes the navigation policy of the initial policy model; and stopping the iteration when the number of training rounds reaches a preset limit or when the average reward over a predetermined number of consecutive rounds converges, whereby the initial policy model becomes the pipeline navigation model.
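By way of a minimal sketch only, the experience replay pool and the stopping condition of claim 5 might be organized as follows. The names `Transition`, `ReplayPool`, and `should_stop`, the pool capacity, the averaging window, and the tolerance are hypothetical, and the convergence test shown (comparing the mean reward of two adjacent windows of rounds) is one possible reading of "the average reward value of consecutive predetermined rounds converges", not the criterion defined by the application.

```python
import random
from collections import deque, namedtuple

# One transition tuple (s, a, r, s', done) as described in claim 5.
Transition = namedtuple("Transition", "s a r s_next done")

class ReplayPool:
    """Fixed-capacity experience replay pool with uniform random sampling."""

    def __init__(self, capacity=10000, seed=0):
        self.buf = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buf.append(Transition(s, a, r, s_next, bool(done)))

    def sample(self, batch_size):
        """Return a mini-batch drawn without replacement."""
        return self.rng.sample(list(self.buf), min(batch_size, len(self.buf)))

    def __len__(self):
        return len(self.buf)

def should_stop(episode_rewards, max_episodes, window=10, tol=1e-2):
    """Stop at the round limit, or when two adjacent window means nearly agree."""
    if len(episode_rewards) >= max_episodes:
        return True
    if len(episode_rewards) < 2 * window:
        return False
    recent = sum(episode_rewards[-window:]) / window
    previous = sum(episode_rewards[-2 * window:-window]) / window
    return abs(recent - previous) < tol

pool = ReplayPool(capacity=3)
for t in range(5):
    pool.push(s=(t,), a=(0.1, 0.0), r=-1.0, s_next=(t + 1,), done=(t == 4))
print(len(pool), pool.sample(2))
```

A bounded pool keeps memory use constant during long exploration runs, at the cost of discarding the oldest transitions.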

6. The method for generating underground pipelines according to claim 5, characterized in that randomly sampling mini-batches of the transition tuples from the experience replay pool, and alternately training and updating the value network and the policy network so that the policy network continuously optimizes the navigation policy of the initial policy model, comprises: initializing a corresponding policy target network and value target network for the policy network and the value network, respectively; calculating a temporal-difference target from the immediate reward r, the termination flag done, and the evaluation value of the value target network; updating the network parameters of the value network by minimizing the mean squared error between the temporal-difference target and the prediction of the current value network; updating the network parameters of the policy network by a deterministic policy gradient mechanism that maximizes the current value network's evaluation of the action output by the policy network; and synchronously updating the parameters of the value target network and the policy target network by soft update, so that the policy network continuously optimizes the navigation policy of the initial policy model.
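The arithmetic behind three of the updates in claim 6 can be sketched with plain numbers, following the standard DDPG formulation. The discount factor `gamma` and the soft-update rate `tau` are not named in the claim and are introduced here as assumptions, as are the function names; the policy gradient step is omitted because it requires differentiating through the value network.

```python
def td_target(r, done, q_next, gamma=0.99):
    """Temporal-difference target: y = r + gamma * (1 - done) * Q_target(s', a')."""
    return r + gamma * (0.0 if done else 1.0) * q_next

def critic_loss(targets, predictions):
    """Mean squared error between TD targets and current value network predictions."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)

def soft_update(target_params, online_params, tau=0.005):
    """Soft update: theta_target <- (1 - tau) * theta_target + tau * theta_online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Hypothetical numbers for one mini-batch of two transitions.
targets = [td_target(1.0, False, 2.0, gamma=0.5), td_target(1.0, True, 5.0)]
print(targets)
print(critic_loss(targets, [1.5, 1.0]))
print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1))
```

When `done` is true the bootstrap term vanishes, so the target reduces to the immediate reward; a small `tau` makes the target networks track the online networks slowly, which is the usual motivation for soft updates.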

7. The method for generating underground pipelines according to claim 1, characterized in that the method further comprises: exporting the three-dimensional continuous coordinate points of the target three-dimensional underground pipeline path as a standardized data table, and simultaneously generating an HTML scene file for intuitive review.
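One possible way to realize the export of claim 7 is sketched below using only the Python standard library. The column layout (`index`, `x`, `y`, `z`), the three-decimal formatting, and the function names are assumptions, and the HTML page here is a plain coordinate table standing in for the HTML scene file of the application, whose actual contents are not specified in this section.

```python
import csv
import html
import io

def path_to_csv(points):
    """Serialize the path coordinates as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["index", "x", "y", "z"])
    for i, (x, y, z) in enumerate(points):
        writer.writerow([i, f"{x:.3f}", f"{y:.3f}", f"{z:.3f}"])
    return buf.getvalue()

def path_to_html(points, title="Pipeline path"):
    """Render the same coordinates as a minimal standalone HTML page."""
    rows = "\n".join(
        f"<tr><td>{i}</td><td>{x:.3f}</td><td>{y:.3f}</td><td>{z:.3f}</td></tr>"
        for i, (x, y, z) in enumerate(points)
    )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head><body>\n"
        "<table>\n<tr><th>index</th><th>x</th><th>y</th><th>z</th></tr>\n"
        f"{rows}\n</table>\n</body></html>\n"
    )

# Hypothetical two-point path.
path = [(0, 0, -2), (1.5, 0, -2)]
print(path_to_csv(path))
print(path_to_html(path))
```

Both functions return strings, so the caller decides where to write them; generating the two outputs from the same point list keeps the table and the review page consistent.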

8. An underground pipeline generation device based on virtual scene visual navigation, characterized in that the device comprises: a data acquisition module, configured to acquire raw pipeline data and construct a three-dimensional virtual scene of underground pipelines and corresponding ground surface information; an offline dataset construction module, configured to invoke a virtual camera to sample continuously along the pipeline path in the three-dimensional virtual scene at a predetermined step size, and to collect the local state data and corresponding action data at each step to form the state-action data pairs required for supervised learning, which constitute an offline training dataset; an offline learning module, configured to perform offline supervised learning on the state-action data pairs using a policy network and a value network to generate an initial policy model with basic obstacle avoidance and target approach capabilities; an online learning module, configured to autonomously explore the three-dimensional virtual scene using the policy network and the value network based on the DDPG learning system, and to iteratively generate a pipeline navigation model from the initial policy model based at least on the ground surface information; and a path generation module, configured to input, based on a target start point and a target end point, the real-time perception information collected by the virtual camera within the current predetermined range into the pipeline navigation model, and to generate a coherent target three-dimensional underground pipeline path that avoids all potential obstacles.

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, the steps of the method according to any one of claims 1 to 7 are implemented.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.