Deep learning-based method, device, and medium for displaying three-dimensional spatial information of panoramic images
By using a deep learning model with a spherical Transformer rearrangement module and a structural parameter feedback update module, the problems of spherical distortion and multi-scale feature consistency in the display of panoramic 3D spatial information are solved, achieving high-precision 3D reconstruction and dynamic interactive display, and improving the expressive power and interactive response capability of the display system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI YUGE INTELLIGENT TECHNOLOGY CO LTD
- Filing Date
- 2026-03-11
- Publication Date
- 2026-06-12
AI Technical Summary
Existing panoramic image processing techniques struggle to generate 3D spatial information: spherical distortion is hard to model accurately, multi-scale features lack consistency constraints, and 3D structural information is hard to associate with 2D features. The results are noisy depth estimates, unstable semantic segmentation, and 3D displays that cannot update dynamically and offer little interactive feedback.
A deep learning-based spherical depth vision and structural feedback enhancement model is adopted. Through the spherical Transformer rearrangement module and the structural parameter feedback update module, the model generates a depth map set, a semantic label map set, and a spatial structure parameter set, from which a 3D mesh model is constructed. Combined with a 3D rendering engine, this enables dynamic interactive display.
The method achieves high-precision 3D spatial reconstruction and interactive display. The 3D mesh model is structurally complete and semantically well associated, and the display responds to interaction and updates dynamically as the user's viewpoint changes, improving the realism and comprehensibility of the display.
Figure CN122199866A_ABST
Description
Technical Field
[0001] This invention relates to the field of three-dimensional spatial display technology, and in particular to a method, device and medium for displaying panoramic three-dimensional spatial information based on deep learning.
Background Technology
[0002] Existing panoramic image processing techniques typically rely on 2D image projection and traditional convolutional neural networks for feature extraction, generating 3D point clouds or coarse models through depth estimation or simple geometric mapping. However, these methods generally face challenges when processing panoramic images, such as difficulty in accurately modeling spherical distortion, lack of consistency constraints on multi-scale features, and difficulty in effectively associating 3D structural information with 2D features. This results in high noise in depth estimation, unstable semantic segmentation, and ultimately incomplete or significantly deformed 3D scene structures, making it difficult to support high-quality 3D display and interactive applications.
[0003] Meanwhile, existing 3D mesh-based display systems often rely solely on geometric reconstruction results, lacking deep integration with semantic information, object identification, and interactive event types. This results in the 3D display failing to dynamically update according to changes in the user's viewpoint and failing to provide fine-grained prompts and interactive feedback for scene elements. Furthermore, traditional rendering workflows typically lack structural feedback mechanisms and cannot utilize 3D structures to update network features, leading to inconsistencies between the displayed effect and the actual structure.
[0004] Therefore, providing a deep learning-based method, device, and medium for displaying panoramic 3D spatial information is a problem that those skilled in the art urgently need to solve.
Summary of the Invention
[0005] One objective of this invention is to propose a method, device, and medium for displaying panoramic 3D spatial information based on deep learning. This invention is based on a spherical depth vision and structural feedback enhancement model to achieve panoramic 3D reconstruction and interactive display, and has the advantages of high reconstruction accuracy, strong semantic association, and excellent interactive response.
[0006] A method for displaying 3D spatial information of a panoramic image based on deep learning, according to an embodiment of the present invention, includes the following steps:
[0007] Acquire panoramic images of the target scene and preprocess them to generate a standardized panoramic image set;
[0008] Input the standardized panoramic image set into an improved MultiNet network for panoramic feature extraction and structural association processing, generating a depth map set, a semantic label map set, and a spatial structure parameter set;
[0009] Construct an initial 3D point set from the depth map set, the semantic label map set, and the spatial structure parameter set, and reconstruct the spatial structure to generate a 3D mesh model of the target scene;
[0010] In the 3D mesh model, bind spatial information labels to each mesh cell and each preset region of interest to construct a 3D spatial information data structure;
[0011] Input the 3D spatial information data structure into the 3D rendering engine, project and rasterize the 3D mesh model based on the current viewpoint parameters, and superimpose the interface graphics corresponding to the spatial information labels to generate a 3D spatial display screen;
[0012] Receive user interaction commands for the 3D spatial display screen, update the viewpoint parameters and the spatial information label set accordingly, update the 3D spatial display screen based on the adjusted 3D spatial information data structure, and output it to the display terminal.
[0013] Optionally, the preprocessing includes geometric distortion correction, color normalization, and preset resolution adjustment.
[0014] Optionally, the generation of the depth map set, semantic label map set, and spatial structure parameter set specifically includes:
[0015] A spherical projection mapping is performed on a standardized panoramic image set. The projected panoramic image data is then input into the spherical Transformer rearrangement module of the improved MultiNet network. The input data is then globally rearranged and structurally adjusted according to the spherical feature distribution to generate image data rearranged by spherical features. The improved MultiNet network is based on the original MultiNet multi-task network structure with the addition of a spherical Transformer rearrangement module and a structural parameter feedback update module.
[0016] In the backbone encoder of the improved MultiNet network, multi-scale convolution processing is performed sequentially on the image data rearranged by spherical features to generate a multi-scale feature map set including multiple scales, multiple channels and spatial distribution structure;
[0017] Read the corresponding spatial structure parameter set from the 3D mesh model that was generated by back projection in the previous round and refined by topological processing, and, indexed by the pixel indices of the current panoramic image, build a structural mapping table holding the spatial coordinates of the mesh vertices, the topological relationships of the mesh patches, and the related spatial structure information.
[0018] In the structural parameter feedback update module, the spatial structural parameters of the three-dimensional mesh model in the structural mapping table are projected item by item to the corresponding feature positions in the multi-scale feature map set according to the current imaging model, and the correspondence between the multi-scale feature positions and the projected three-dimensional structural information is established based on the projection results.
[0019] Based on the correspondence, in the structural parameter feedback update module, the difference between the predicted structural feature and the projected target structural feature is calculated for each feature location in the multi-scale feature map. The difference is used as the structural residual and converted into structural correction coefficients according to preset rules to generate a structural correction information set.
[0020] Based on the structural correction information set, structural consistency correction processing is performed on each feature position in the multi-scale feature map set. The corrected multi-scale feature map set is generated by weighted superposition of the original feature response and the structural correction component position by position.
[0021] The corrected multi-scale feature maps are input into the depth feature decoding branch, the semantic feature segmentation branch, and the structural feature encoding branch, respectively. After deconvolution regression, pixel classification, and structural encoding, the depth map, semantic label map, and spatial structure parameter set corresponding to the standardized panoramic image are generated.
[0022] Optionally, the generation of the 3D mesh model of the target scene specifically includes:
[0023] The depth value, semantic label, and spatial structure parameters of each pixel in the standardized panoramic image set are read from the depth map set, the semantic label map set, and the spatial structure parameter set. The planar coordinates of each pixel are combined with its depth value and converted into three-dimensional spatial coordinates according to the preset imaging geometry model, generating an initial three-dimensional point set that includes three-dimensional spatial coordinates, semantic labels, and texture information.
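As a non-limiting illustration of the pixel-to-point conversion described above, the following sketch assumes that the "preset imaging geometry model" is an equirectangular panorama and that each depth value is the radial distance from the camera centre. Neither assumption is stated in the specification, and the names `sphere_direction`, `back_project`, and `build_point_set` are hypothetical.

```python
import math

def sphere_direction(lon, lat):
    """Unit viewing direction for longitude/latitude in radians (y up, z forward)."""
    return (math.cos(lat) * math.sin(lon), math.sin(lat), math.cos(lat) * math.cos(lon))

def back_project(row, col, depth, height, width):
    """Convert one equirectangular pixel plus its radial depth into a 3D point.

    Assumption: columns span longitude [-pi, pi), rows span latitude [pi/2, -pi/2].
    """
    lon = ((col + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((row + 0.5) / height) * math.pi
    dx, dy, dz = sphere_direction(lon, lat)
    return (depth * dx, depth * dy, depth * dz)

def build_point_set(depth_map, label_map, color_map):
    """Build the initial point set: coordinates, semantic label, and texture per pixel."""
    height, width = len(depth_map), len(depth_map[0])
    points = []
    for r in range(height):
        for c in range(width):
            points.append({
                "xyz": back_project(r, c, depth_map[r][c], height, width),
                "label": label_map[r][c],
                "color": color_map[r][c],
            })
    return points
```

Because the depth is treated as a radial distance, every generated point lies at exactly that distance from the origin, which gives a quick sanity check on the conversion.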
[0024] The consistency of the depth values of each point in the initial 3D point set with the depth distribution and spatial position relationship of other points in the spatial neighborhood is checked. Points whose depth changes exceed the preset threshold and whose spatial relationship with the neighboring points does not meet the preset geometric constraints are marked as noise points and removed, thus obtaining the processed 3D point set.
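The neighborhood consistency check above can be sketched as follows, under stated assumptions. The specification does not define the "preset threshold" or the "preset geometric constraints"; here the neighborhood is the 8-connected pixel neighborhood with horizontal wrap-around, the depth change is measured against the neighborhood median, and the geometric constraint is approximated by requiring a minimum number of neighbors at a similar depth. The name `noise_mask` and the parameters `depth_tol` and `min_support` are hypothetical.

```python
from statistics import median

def noise_mask(depth_map, depth_tol=0.3, min_support=2):
    """Return a grid of booleans; True marks a pixel flagged as a noise point.

    A pixel is flagged only when BOTH conditions hold:
      1. its depth differs from the median of its neighbours by more than depth_tol;
      2. fewer than min_support neighbours lie within depth_tol of its depth
         (a simple stand-in for the unspecified geometric constraint).
    """
    height, width = len(depth_map), len(depth_map[0])
    mask = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            neighbours = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = r + dr
                    if (dr, dc) == (0, 0) or not 0 <= rr < height:
                        continue
                    # Columns wrap around because a panorama is periodic in longitude.
                    neighbours.append(depth_map[rr][(c + dc) % width])
            d = depth_map[r][c]
            jump = abs(d - median(neighbours)) > depth_tol
            support = sum(1 for n in neighbours if abs(n - d) <= depth_tol)
            mask[r][c] = jump and support < min_support
    return mask
```

Requiring both conditions mirrors the wording of the paragraph: a genuine depth edge usually keeps some neighbors at a similar depth, so it is not removed merely for deviating from the median.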
[0025] Based on the three-dimensional spatial coordinates of each point in the processed three-dimensional point set and the spatial distance between them, the three-dimensional point set is subjected to point set clustering processing according to the preset clustering scale and proximity relationship. The three-dimensional points are divided into multiple point clusters according to the spatial proximity relationship. Within each point cluster, the points are merged and sorted by combining the semantic labels and spatial structure parameters of the points to obtain the clustered three-dimensional point set.
[0026] Based on the distribution of point clusters in the clustered 3D point set, the connection relationship between each point and its neighboring points is determined according to the preset adjacency search strategy under the unified 3D spatial coordinate system. Based on the combination of multiple 3D points, the connection record between the mesh vertex and the mesh patch is generated. The connection record is then corrected and filtered for consistency by combining the topological relationship constraints contained in the spatial structure parameter set to obtain the mesh connection data.
[0027] A 3D mesh model is constructed in a unified 3D spatial coordinate system based on mesh connection data. The 3D points in the clustered 3D point set are used as mesh vertices, the connection records in the mesh connection data are used as mesh patches, and the corresponding semantic labels and texture information are attached to the corresponding mesh vertices and mesh patches respectively to generate a 3D mesh model of the target scene.
[0028] Optionally, the construction of the three-dimensional spatial information data structure specifically includes:
[0029] Read the mesh vertices, mesh patches and corresponding 3D spatial coordinate information of each mesh cell from the 3D mesh model, and divide the set of mesh cells according to the spatial range parameters of the preset regions of interest;
[0030] Based on the positional relationship of the mesh cells in three-dimensional space and the semantic labels and spatial structure parameters corresponding to the mesh vertices, the value of the semantic category field is determined for each mesh cell. The semantic identifier representing the object category and scene region category is written into the semantic category field, and the same semantic category field record is assigned to mesh cells that belong to the same object or the same scene region.
[0031] Based on the topological connection relationship of the mesh cells in the 3D mesh model and the spatial structure parameters associated with the mesh cells, an object identifier field is determined for each mesh cell, and within each preset region of interest the mesh cells are grouped according to the object identifier field.
[0032] Based on the interaction configuration parameters of the preset region of interest and the spatial position of each mesh cell within it, an interaction event type field is determined for each mesh cell and each preset region of interest. The semantic category field, object identifier field, and interaction event type field are then combined, organized by mesh cell index and preset region of interest index, to construct a three-dimensional spatial information data structure consisting of the mesh cell index, the preset region of interest index, and the corresponding spatial information labels.
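One possible in-memory layout for the data structure described above is sketched below. The specification names the three fields but not their types or the container layout, so the class names `SpatialLabel` and `SpatialInfoStructure`, the field types, and the example values are all illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class SpatialLabel:
    semantic_category: str   # object category or scene region category
    object_id: int           # identifies the object a mesh cell belongs to
    interaction_event: str   # interaction event type (values are hypothetical)

@dataclass
class SpatialInfoStructure:
    cell_labels: dict = field(default_factory=dict)  # mesh cell index -> SpatialLabel
    roi_cells: dict = field(default_factory=dict)    # ROI index -> list of mesh cell indices

    def bind_cell(self, cell_index, roi_index, label):
        """Bind a spatial information label to a mesh cell inside a region of interest."""
        self.cell_labels[cell_index] = label
        self.roi_cells.setdefault(roi_index, []).append(cell_index)

    def group_by_object(self, roi_index):
        """Group the mesh cells of one region of interest by object identifier."""
        groups = defaultdict(list)
        for cell in self.roi_cells.get(roi_index, []):
            groups[self.cell_labels[cell].object_id].append(cell)
        return dict(groups)
```

For example, binding cells 0 and 1 to object 7 and cell 2 to object 9 in the same region of interest yields the grouping `{7: [0, 1], 9: [2]}`.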
[0033] Optionally, the generation of the three-dimensional spatial display specifically includes:
[0034] Input the three-dimensional spatial information data structure into the three-dimensional rendering engine, establish the viewpoint position, line of sight and projection view plane according to the current viewpoint parameters, and perform rendering initialization processing on the three-dimensional mesh model;
[0035] In the 3D rendering engine, the 3D spatial coordinates of the mesh vertices in the 3D mesh model are transformed from 3D to 2D according to the projection view plane. The mesh patches are rasterized according to the topological relationship and patch filling rules. The obtained pixel color values and depth values are written into the frame buffer structure and the basic image frames that constitute the 3D scene frame image sequence are output in the rendering order.
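The 3D-to-2D transformation and frame buffer write described above can be illustrated with a minimal sketch. It assumes a pinhole camera at the origin looking along +z with a focal length given in pixels, and it shows only vertex projection and the depth test for a single fragment; patch filling (triangle rasterization) is omitted. The function names and the camera model are assumptions, not details taken from the specification.

```python
import math

def project_vertex(vertex, focal, width, height):
    """Project a camera-space vertex to integer pixel coordinates plus its depth.

    Returns None for vertices at or behind the camera plane.
    """
    x, y, z = vertex
    if z <= 0:
        return None
    u = math.floor(focal * x / z + width / 2)
    v = math.floor(height / 2 - focal * y / z)
    return u, v, z

def write_fragment(color_buf, depth_buf, u, v, z, color):
    """Write a fragment only if it is inside the frame and nearer than the stored depth."""
    inside = 0 <= v < len(depth_buf) and 0 <= u < len(depth_buf[0])
    if inside and z < depth_buf[v][u]:
        depth_buf[v][u] = z
        color_buf[v][u] = color
        return True
    return False
```

The depth buffer would be initialized to infinity for every pixel, so the first fragment at a pixel is always accepted and later fragments replace it only when they are closer to the viewpoint.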
[0036] Pixel data is read from the base image frame. Based on the spatial information labels in the 3D spatial information data structure, the superposition position and display style of the navigation prompt graphics, object outline marker graphics and information window interface graphics in the current image frame are determined. The above graphics are written into the pixel data of the current image frame according to the superposition rules corresponding to the pixel positions to obtain the 3D scene frame image.
[0037] The 3D scene frame images are organized according to the rendering order and combined into a continuous visual display stream in time sequence to generate a 3D spatial display screen.
[0038] Optionally, the update of the three-dimensional spatial display screen specifically includes:
[0039] Receive and parse user interaction commands for the 3D spatial display screen, and extract the corresponding viewing-angle change, viewpoint displacement, and spatial information trigger target identifier;
[0040] The viewpoint position, line of sight, and field of view in the current viewpoint parameters are updated based on the viewing-angle change. The current viewpoint position is translated based on the viewpoint displacement. The updated viewpoint position, line of sight, and field of view are recorded as the updated viewpoint parameters.
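A hedged sketch of the viewpoint parameter update follows. It assumes the line of sight is stored as yaw and pitch angles in radians and that pitch and field of view are clamped to fixed ranges; the specification does not fix a representation or any limits, so the `Viewpoint` fields, the clamp ranges, and `update_viewpoint` are hypothetical.

```python
import math
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Viewpoint:
    position: tuple  # (x, y, z)
    yaw: float       # rotation about the vertical axis, radians
    pitch: float     # elevation of the line of sight, radians
    fov: float       # field of view, radians

def update_viewpoint(vp, d_yaw=0.0, d_pitch=0.0, d_fov=0.0, displacement=(0.0, 0.0, 0.0)):
    """Return a new Viewpoint; the previous one is left unchanged."""
    limit = math.pi / 2 - 1e-3  # keep the line of sight away from the poles
    yaw = (vp.yaw + d_yaw + math.pi) % (2 * math.pi) - math.pi  # wrap into [-pi, pi)
    pitch = max(-limit, min(limit, vp.pitch + d_pitch))
    fov = max(math.radians(20), min(math.radians(120), vp.fov + d_fov))
    position = tuple(p + d for p, d in zip(vp.position, displacement))
    return replace(vp, position=position, yaw=yaw, pitch=pitch, fov=fov)
```

Returning a new immutable record rather than mutating the current one keeps the previous viewpoint parameters available, for example to compare the old and new rendering areas.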
[0041] Based on the spatial information trigger target identifier, search the three-dimensional spatial information data structure for the spatial information label record corresponding to the target identifier, add that record to the active spatial information label set, and, according to the updated viewpoint parameters and rendering area parameters, mask the spatial information label records that are unrelated to the current rendering area, so as to obtain the adjusted three-dimensional spatial information data structure.
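The label activation and masking step can be sketched as below. As a simplification, the "current rendering area" is approximated by a viewing cone whose half-angle is half the field of view; the specification leaves the rendering area parameters undefined. The record layout (`anchor`, `active`, `masked`) and the function names are assumptions for illustration.

```python
import math

def view_direction(yaw, pitch):
    """Unit line-of-sight vector for yaw/pitch in radians (y up, z forward)."""
    return (math.cos(pitch) * math.sin(yaw), math.sin(pitch), math.cos(pitch) * math.cos(yaw))

def update_labels(records, target_id, position, yaw, pitch, fov):
    """Activate the triggered label and mask labels outside the viewing cone.

    records: dict mapping label id -> {"anchor": (x, y, z), "active": bool, "masked": bool}.
    Returns a new dict; the input is not modified.
    """
    forward = view_direction(yaw, pitch)
    updated = {}
    for label_id, rec in records.items():
        offset = tuple(a - p for a, p in zip(rec["anchor"], position))
        dist = math.sqrt(sum(o * o for o in offset))
        # A label located exactly at the viewpoint is treated as visible.
        cos_angle = 1.0 if dist == 0 else sum(o * f for o, f in zip(offset, forward)) / dist
        visible = cos_angle >= math.cos(fov / 2)
        updated[label_id] = {
            "anchor": rec["anchor"],
            "active": rec["active"] or label_id == target_id,
            "masked": not visible,
        }
    return updated
```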
[0042] The adjusted 3D spatial information data structure, updated viewpoint parameters, and rendering area parameters are input into the 3D rendering engine. Based on the adjusted 3D spatial information data structure, a new 3D scene frame image sequence is generated, and the updated 3D spatial display screen is output to the display terminal.
[0043] A deep learning-based panoramic 3D spatial information display device according to an embodiment of the present invention includes: a processor and a memory; the memory is used to store a computer program, and the processor calls the computer program stored in the memory to execute a deep learning-based panoramic 3D spatial information display method.
[0044] According to an embodiment of the present invention, a computer-readable storage medium stores a computer program that, when executed by a processor, enables the processor to perform a method for displaying panoramic three-dimensional spatial information based on deep learning.
[0045] The beneficial effects of this invention are:
[0046] This invention introduces spherical projection mapping, a spherical Transformer rearrangement module, and a structural parameter feedback update mechanism into the panoramic image processing flow, enabling the network to obtain geometrically consistent, directionally consistent, and scale-consistent feature representations within a panoramic spatial coordinate system. Compared with traditional feature extraction methods that rely solely on planar projection or ignore spherical distortion, this invention significantly reduces the depth estimation bias and semantic misclassification caused by projection deformation, and maintains a stable spatial correspondence among the depth map set, the semantic label map set, and the spatial structure parameter set. Furthermore, through the back projection and residual correction of structural parameters, the network features gain structural consistency across multiple scales and locations. This addresses the difficulty in existing technologies of using 3D structure to constrain 2D features, which leads to insufficient scene reconstruction accuracy, and yields a more complete, more precise, and semantically coherent 3D mesh model.
[0047] In the 3D display stage, this invention constructs a 3D spatial information data structure containing semantic category fields, object identifier fields, and interaction event type fields. This achieves deep binding between 3D structure, semantic relationships, and interaction logic, ensuring that each grid cell and its associated region of interest possesses spatial information labels that can be recognized and invoked by the rendering engine. This data structure not only dynamically overlays navigation prompts, object outline markers, and information window interface graphics during rendering, but also maintains consistency and accuracy in label presentation across different viewpoints and regions of interest, resulting in a final 3D spatial display with higher information density and comprehensibility. Compared to existing 3D displays with only static annotations or limited prompts, this invention significantly enhances the expressive power and spatial information visualization capabilities of the display system.
[0048] In terms of interactive updates, this invention, by parsing the user's viewpoint rotation commands, viewpoint movement commands, and spatial information trigger commands, can update the viewpoint parameters and spatial information label set in real time. Based on the updated 3D spatial information data structure, it regenerates a sequence of 3D scene frame images, achieving dynamic adjustment and instant feedback of the 3D spatial display. This mechanism solves the pain points of existing systems, such as low response speed, weak interactive feedback, and the inability to adjust the displayed content in real time according to user operations, enabling users to explore 3D space in a more natural and intuitive way. In summary, this invention achieves significant improvements in panoramic 3D scene reconstruction accuracy, spatial information fusion depth, rendering consistency, and interactive response capabilities, resulting in a more realistic, continuous, and intelligent overall display effect, demonstrating significant technological advancement and application value.
Attached Figure Description
[0049] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
[0050] Figure 1 is a flowchart of the method for displaying 3D spatial information of panoramic images based on deep learning proposed in this invention.
[0051] Figure 2 is a schematic diagram of the improved MultiNet network structure used in the method for displaying 3D spatial information of panoramic images based on deep learning proposed in this invention.
Detailed Implementation
[0052] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.
[0053] Referring to Figures 1 and 2, a method for displaying 3D spatial information of panoramic images based on deep learning includes the following steps:
[0054] Acquire panoramic images of the target scene and preprocess them to generate a standardized panoramic image set;
[0055] Input the standardized panoramic image set into an improved MultiNet network for panoramic feature extraction and structural association processing, generating a depth map set, a semantic label map set, and a spatial structure parameter set;
[0056] Construct an initial 3D point set from the depth map set, the semantic label map set, and the spatial structure parameter set, and reconstruct the spatial structure to generate a 3D mesh model of the target scene;
[0057] In the 3D mesh model, bind spatial information labels to each mesh cell and each preset region of interest to construct a 3D spatial information data structure;
[0058] Input the 3D spatial information data structure into the 3D rendering engine, project and rasterize the 3D mesh model based on the current viewpoint parameters, and superimpose the interface graphics corresponding to the spatial information labels to generate a 3D spatial display screen;
[0059] Receive user interaction commands for the 3D spatial display screen, update the viewpoint parameters and the spatial information label set accordingly, update the 3D spatial display screen based on the adjusted 3D spatial information data structure, and output it to the display terminal.
[0060] In this embodiment, the preprocessing includes geometric distortion correction, color normalization, and preset resolution adjustment.
[0061] In this embodiment, the generation of the depth map set, semantic label map set, and spatial structure parameter set specifically includes:
[0062] A spherical projection mapping is performed on a standardized panoramic image set. The projected panoramic image data is then input into the spherical Transformer rearrangement module of the improved MultiNet network. The input data is then globally rearranged and structurally adjusted according to the spherical feature distribution to generate image data rearranged by spherical features. The improved MultiNet network is based on the original MultiNet multi-task network structure with the addition of a spherical Transformer rearrangement module and a structural parameter feedback update module.
[0063] The spherical projection mapping includes: performing coordinate transformation on the pixels of each panoramic image in the standardized panoramic image set according to a preset spherical projection relationship, converting the planar pixel positions represented by row and column indices into spherical coordinate positions composed of longitude and latitude angles, and writing the pixel values of each pixel into the corresponding spherical sampling positions according to the arrangement order of the spherical coordinates on the spherical sampling grid, forming spherical projection image data organized by the spherical sampling grid, so that each pixel has a unique spherical coordinate index;
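By way of illustration only, the row/column to longitude/latitude conversion described above might look as follows if the "preset spherical projection relationship" is taken to be the common equirectangular layout with pixel-centre sampling. That choice, and the function names `pixel_to_sphere` and `sphere_to_pixel`, are assumptions rather than details from the specification.

```python
import math

def pixel_to_sphere(row, col, height, width):
    """Map a pixel centre to (longitude, latitude) in radians.

    Assumption: columns span longitude [-pi, pi) left to right,
    rows span latitude [pi/2, -pi/2] top to bottom.
    """
    lon = ((col + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((row + 0.5) / height) * math.pi
    return lon, lat

def sphere_to_pixel(lon, lat, height, width):
    """Inverse mapping: find the pixel whose cell contains the given spherical position."""
    col = int((lon + math.pi) / (2.0 * math.pi) * width) % width
    row = min(height - 1, max(0, int((math.pi / 2.0 - lat) / math.pi * height)))
    return row, col
```

Under this layout every pixel maps to a distinct longitude and latitude pair, and the inverse mapping returns the original row and column, which corresponds to the requirement that each pixel have a unique spherical coordinate index.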
[0064] The generation of image data rearranged by spherical features includes: inputting the image data after spherical projection into the spherical Transformer rearrangement module; in the spherical partitioning unit, slicing the input data into spherical partitions according to the spherical coordinate index to obtain feature segments divided according to the spherical regions; in the local feature extraction unit, extracting local feature vectors for each spherical region according to the preset adjacency relationship and sending them to the local self-attention unit; in the local self-attention unit, reordering and structurally adjusting the local feature values according to the correlation between the local feature vectors in the region; in the cross-regional attention fusion unit, performing cross-regional attention calculation on the feature relationship between different spherical regions and forming a global feature representation after cross-regional rearrangement; and in the feature recombination unit, recombining the features after local rearrangement and cross-regional fusion according to the spatial order of the spherical sampling grid to form spherical feature rearranged image data.
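The spherical partitioning and local self-attention units can be approximated by the sketch below. It is a simplified stand-in, not the module of the invention: regions are uniform longitude/latitude bins, and the attention is plain scaled dot-product attention with identity query, key, and value projections (no learned weights). The specification does not define the partition shape or the attention formula, and the function names are hypothetical.

```python
import math

def partition_by_sphere(tokens, n_lat, n_lon):
    """Group token indices into spherical regions.

    tokens: list of (lon, lat, feature_vector) with angles in radians.
    Returns a dict mapping (lat_bin, lon_bin) -> list of token indices.
    """
    regions = {}
    for idx, (lon, lat, _) in enumerate(tokens):
        i = min(n_lat - 1, int((math.pi / 2 - lat) / math.pi * n_lat))
        j = min(n_lon - 1, int((lon + math.pi) / (2 * math.pi) * n_lon))
        regions.setdefault((i, j), []).append(idx)
    return regions

def self_attention(vectors):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        m = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, vectors)) / total for c in range(d)])
    return out

def local_attention(tokens, n_lat, n_lon):
    """Apply self-attention independently inside each spherical region."""
    result = [None] * len(tokens)
    for indices in partition_by_sphere(tokens, n_lat, n_lon).values():
        attended = self_attention([tokens[i][2] for i in indices])
        for i, vec in zip(indices, attended):
            result[i] = vec
    return result
```

Cross-regional fusion could reuse `self_attention` on one pooled vector per region; that step is left out here to keep the sketch short.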
[0065] In the backbone encoder of the improved MultiNet network, multi-scale convolution processing is performed sequentially on the image data rearranged by spherical features to generate a multi-scale feature map set including multiple scales, multiple channels and spatial distribution structure;
[0066] The generation of the multi-scale feature map set includes: inputting image data rearranged by spherical features into the backbone encoder; performing channel expansion and local feature extraction on the input data through multi-layer convolutional units; setting convolution operations with varying spatial strides between each convolutional unit to form feature representations at different resolutions; performing feature splicing or feature fusion on feature data from different resolutions in the feature fusion unit to form a cross-scale feature structure; and then organizing the convolutional features and feature fusion results at each scale according to a preset scale organization method to generate a multi-scale feature map set.
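The idea of producing feature representations at several resolutions and then fusing them across scales can be illustrated with a toy single-channel pyramid. In this sketch a 2x2 mean with stride 2 stands in for a learned strided convolution, and fusion is nearest-neighbour upsampling followed by averaging; both are simplifying assumptions, and the input height and width are assumed to be divisible by 2 at every level.

```python
def downsample(fmap):
    """2x2 mean with stride 2 (stand-in for a learned strided convolution)."""
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [[(fmap[2 * r][2 * c] + fmap[2 * r][2 * c + 1]
              + fmap[2 * r + 1][2 * c] + fmap[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(w)] for r in range(h)]

def build_pyramid(fmap, levels):
    """Return [finest, ..., coarsest] feature maps."""
    pyramid = [fmap]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

def fuse_top_down(pyramid):
    """Fuse each coarser level into the next finer one.

    The coarse map is upsampled by nearest neighbour and averaged with the fine map.
    """
    fused = [pyramid[-1]]
    for fine in reversed(pyramid[:-1]):
        coarse = fused[0]
        fused.insert(0, [[(fine[r][c] + coarse[r // 2][c // 2]) / 2.0
                          for c in range(len(fine[0]))] for r in range(len(fine))])
    return fused
```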
[0067] Read the corresponding spatial structure parameter set from the 3D mesh model that was generated by back projection in the previous round and refined by topological processing, and, indexed by the pixel indices of the current panoramic image, build a structural mapping table holding the spatial coordinates of the mesh vertices, the topological relationships of the mesh patches, and the related spatial structure information.
[0068] In the structural parameter feedback update module, the spatial structural parameters of the three-dimensional mesh model in the structural mapping table are projected item by item to the corresponding feature positions in the multi-scale feature map set according to the current imaging model, and the correspondence between the multi-scale feature positions and the projected three-dimensional structural information is established based on the projection results.
[0069] The establishment of the correspondence includes: in the structural projection unit of the structural parameter feedback update module, the spatial coordinates of the mesh vertices, the topological relationships of the mesh patches and the related spatial structural information in the structural mapping table are projected item by item to the corresponding feature positions in the multi-scale feature map set according to the current imaging model to obtain the projected three-dimensional structural information; in the structural association construction unit, the projected three-dimensional structural information is organized according to the spatial index of the feature position, and each feature position, together with its projected three-dimensional structural information, is recorded under that index as a structural association record; in the structural association merging unit, all structural association records are merged in feature position index order to establish the correspondence between the multi-scale feature positions and the projected three-dimensional structural information.
[0070] Based on the correspondence, in the structural parameter feedback update module, the difference between the predicted structural feature and the projected target structural feature is calculated for each feature location in the multi-scale feature map. The difference is used as the structural residual and converted into structural correction coefficients according to preset rules to generate a structural correction information set.
[0071] The structural correction coefficients are obtained by: in the structural difference calculation unit of the structural parameter feedback update module, reading the predicted structural features and the projected target structural features item by item at each feature position in the multi-scale feature map set according to the correspondence, and performing difference calculation on the two to generate the corresponding structural residuals; in the structural residual sorting unit, sorting and merging all structural residuals according to the feature position index; in the structural correction conversion unit, converting each structural residual into the corresponding structural correction coefficient according to the preset structural correction rules, and assembling all structural correction coefficients arranged by feature position index into a structural correction information set.
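A minimal sketch of the residual-to-coefficient conversion follows. The "preset structural correction rules" are not disclosed, so a bounded `tanh` mapping with a hypothetical `gain` parameter is used purely as a placeholder; feature positions are assumed to be indexed by `(scale, row, col)` tuples and structural features are reduced to scalars for brevity.

```python
import math

def structural_residuals(predicted, target):
    """Residual = projected target structural feature minus predicted structural feature.

    Both arguments map a feature position (scale, row, col) to a scalar value.
    Positions without a projected target are skipped.
    """
    return {pos: target[pos] - predicted[pos] for pos in predicted if pos in target}

def correction_coefficients(residuals, gain=1.0):
    """Convert residuals to bounded coefficients, ordered by feature position index.

    tanh is an assumed placeholder for the unspecified preset rule.
    """
    return {pos: math.tanh(gain * residuals[pos]) for pos in sorted(residuals)}
```

A bounded mapping keeps a single large residual, for example from a mis-projected vertex, from dominating the later feature correction; other rules such as clipping or a learned mapping would fit the same interface.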
[0072] Based on the structural correction information set, structural consistency correction processing is performed on each feature position in the multi-scale feature map set. The corrected multi-scale feature map set is generated by weighted superposition of the original feature response and the structural correction component position by position.
[0073] The generation of the corrected multi-scale feature map set includes: in the structural correction application unit of the structural parameter feedback update module, the original feature response and the corresponding structural correction component are read item by item at each feature position in the multi-scale feature map set according to the structural correction information set, and the two are weighted and superimposed position by position according to the preset weighting rules to generate corrected feature values; in the feature correction organization unit, the corrected feature values are grouped by scale, following the scale organization structure of the multi-scale feature map set, to form corrected scale sets that together include all corrected feature values; in the feature map set construction unit, the corrected scale sets are recombined into the corrected multi-scale feature map set according to the spatial arrangement of the multi-scale feature map set.
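The position-by-position weighted superposition can be written as corrected = w_f * original + w_c * correction. The sketch below assumes scalar features, a dict of per-scale 2D grids, and fixed global weights; the actual "preset weighting rules" are not given in the specification, and the names used here are hypothetical.

```python
def apply_correction(features, corrections, feature_weight=1.0, correction_weight=0.5):
    """Apply structural corrections to a multi-scale feature map set.

    features:    dict mapping scale -> 2D grid of original feature responses.
    corrections: dict mapping (scale, row, col) -> structural correction component.
    Positions without a correction entry keep feature_weight * original.
    Returns a new dict with the same scale organization as the input.
    """
    corrected = {}
    for scale, grid in features.items():
        corrected[scale] = [
            [feature_weight * value + correction_weight * corrections.get((scale, r, c), 0.0)
             for c, value in enumerate(row)]
            for r, row in enumerate(grid)
        ]
    return corrected
```

Rebuilding the output scale by scale mirrors the organization and construction units in the paragraph above: the corrected values keep the same scale grouping and spatial arrangement as the original feature map set.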
[0074] The corrected multi-scale feature map set is input into the depth feature decoding branch, the semantic feature segmentation branch, and the structural feature encoding branch. These branches perform deconvolution regression, pixel classification, and structural encoding, respectively, to generate the depth map set, the semantic label map set, and the spatial structure parameter set corresponding to the standardized panoramic image set.
[0075] The deconvolutional regression includes: inputting the corrected multi-scale feature map set into the depth feature decoding branch, passing it sequentially through a depth regression structure composed of multi-level deconvolution units and upsampling units, performing deconvolution operations on each level of feature map to restore spatial resolution, performing feature integration operations between the deconvolution output and the features of the previous scale to supplement depth information, and then sequentially passing the integrated features through the upsampling units for scale enlargement and spatial rearrangement, and finally arranging all the upsampled features into a depth map set according to the depth feature output format;
[0076] The pixel classification includes: inputting the corrected multi-scale feature map set into the semantic feature segmentation branch, passing it sequentially through the semantic segmentation structure composed of multi-level feature classification units and pixel-level classification layers, performing channel classification and pixel-level feature reading on the feature maps input from each layer, inputting the feature vector of each pixel position into the pixel-level classification layer for pixel-by-pixel category determination, and combining the classification labels of each pixel into a semantic label map set according to a specific pixel arrangement order after the classification result is generated;
[0077] The structural encoding specifically includes: inputting the corrected multi-scale feature map set into the structural feature encoding branch, sequentially passing through the structural encoding structure composed of structural feature extraction units and structural encoding layers, performing position-by-position structural feature extraction processing on the input feature map, inputting the extracted structural features into the structural encoding layer for encoding according to the preset structural parameter dimensions, generating structural parameter records according to the feature position order of the encoded structural parameters, and organizing and assembling all structural parameter records according to spatial indexes to finally form a spatial structural parameter set.
[0078] In this embodiment, the generation of the 3D mesh model of the target scene specifically includes:
[0079] The depth value, semantic label, and spatial structure parameters of each pixel in the standardized panoramic image set are read from the depth map set, the semantic label map set, and the spatial structure parameter set. The planar pixel coordinates of each pixel are combined with its depth value according to the preset imaging geometry model and converted into three-dimensional spatial coordinates, generating an initial three-dimensional point set in which each point carries three-dimensional spatial coordinates, a semantic label, and texture information. Generating the initial three-dimensional point set includes performing the coordinate transformation on each pixel and writing the corresponding position record in a unified three-dimensional spatial coordinate system.
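As an illustration of the pixel-to-point conversion, the sketch below assumes that the "preset imaging geometry model" is an equirectangular panorama and that the depth value is the radial distance from the camera centre. Neither assumption is stated in the source, and the function names and the axis convention (y up, z forward) are hypothetical.

```python
import math


def pixel_to_point(u, v, depth, width, height):
    """Back-project an equirectangular pixel (u, v) with radial depth to (x, y, z)."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi    # longitude in [-pi, pi)
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi   # latitude, +pi/2 at the top row
    x = depth * math.cos(lat) * math.sin(lon)
    y = depth * math.sin(lat)
    z = depth * math.cos(lat) * math.cos(lon)
    return (x, y, z)


def build_point_set(depth_map, label_map, texture_map):
    """Assemble the initial point set: coordinates, semantic label, and texture per pixel."""
    height, width = len(depth_map), len(depth_map[0])
    points = []
    for v in range(height):
        for u in range(width):
            points.append({
                "xyz": pixel_to_point(u, v, depth_map[v][u], width, height),
                "label": label_map[v][u],
                "texture": texture_map[v][u],
                "pixel": (u, v),
            })
    return points
```

Every point produced this way lies at distance `depth` from the origin, which gives a quick sanity check on the conversion.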
[0080] The consistency of the depth values of each point in the initial 3D point set with the depth distribution and spatial position relationship of other points in the spatial neighborhood is checked. Points whose depth changes exceed the preset threshold and whose spatial relationship with the neighboring points does not meet the preset geometric constraints are marked as noise points and removed, thus obtaining the processed 3D point set.
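The neighbourhood consistency check can be sketched on a depth grid as follows. The two tests used here are stand-ins chosen for the example: the depth change is measured against the median of the 8-neighbourhood, and the "preset geometric constraint" is approximated by requiring a minimum number of neighbours at a similar depth. The thresholds and the function name are hypothetical.

```python
from statistics import median


def find_noise(depth, jump_threshold=0.5, min_support=2):
    """Return the set of (row, col) positions marked as noise.

    A position is noise only if BOTH conditions hold, mirroring the text:
    its depth deviates from the neighbourhood median by more than
    jump_threshold, and fewer than min_support neighbours share its depth.
    """
    h, w = len(depth), len(depth[0])
    noise = set()
    for r in range(h):
        for c in range(w):
            neigh = [depth[rr][cc]
                     for rr in range(max(0, r - 1), min(h, r + 2))
                     for cc in range(max(0, c - 1), min(w, c + 2))
                     if (rr, cc) != (r, c)]
            if not neigh:
                continue
            d = depth[r][c]
            jump = abs(d - median(neigh)) > jump_threshold
            support = sum(1 for n in neigh if abs(n - d) <= jump_threshold)
            if jump and support < min_support:
                noise.add((r, c))
    return noise


if __name__ == "__main__":
    grid = [[2.0, 2.0, 2.0],
            [2.0, 9.0, 2.0],
            [2.0, 2.0, 2.0]]
    print(find_noise(grid))  # only the isolated centre spike is flagged
```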
[0081] Based on the three-dimensional spatial coordinates of each point in the processed three-dimensional point set and the spatial distance between them, the three-dimensional point set is subjected to point set clustering processing according to the preset clustering scale and proximity relationship. The three-dimensional points are divided into multiple point clusters according to the spatial proximity relationship. Within each point cluster, the points are merged and sorted by combining the semantic labels and spatial structure parameters of the points to obtain the clustered three-dimensional point set.
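One simple way to realise proximity-based point clustering is single-linkage region growing with a fixed radius, sketched below. The source does not name a clustering algorithm, so the radius parameter, the quadratic-time neighbour search, and the ordering of points inside each cluster by semantic label are all assumptions for illustration.

```python
def cluster_points(points, radius):
    """Label each 3D point with a cluster id; points closer than radius are linked."""
    r2 = radius * radius
    labels = [-1] * len(points)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] != -1:
                    continue
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                if d2 <= r2:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def order_clusters(labels, semantic_labels):
    """Group point indices by cluster, sorted by (semantic label, index) inside each."""
    clusters = {}
    for idx, cid in enumerate(labels):
        clusters.setdefault(cid, []).append(idx)
    return {cid: sorted(members, key=lambda i: (semantic_labels[i], i))
            for cid, members in clusters.items()}
```

For large point sets a spatial index (for example a voxel grid) would replace the inner loop; the exhaustive search is kept here only for brevity.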
[0082] Based on the distribution of point clusters in the clustered 3D point set, the connection relationship between each point and its neighboring points is determined according to the preset adjacency search strategy under the unified 3D spatial coordinate system. Based on the combination of multiple 3D points, the connection record between the mesh vertex and the mesh patch is generated. The connection record is then corrected and filtered for consistency by combining the topological relationship constraints contained in the spatial structure parameter set to obtain the mesh connection data.
[0083] A 3D mesh model is constructed in a unified 3D spatial coordinate system based on mesh connection data. The 3D points in the clustered 3D point set are used as mesh vertices, the connection records in the mesh connection data are used as mesh patches, and the corresponding semantic labels and texture information are attached to the corresponding mesh vertices and mesh patches respectively to generate a 3D mesh model of the target scene.
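The sketch below shows one possible way to derive mesh vertices and patches from the point set. It assumes that the "preset adjacency search strategy" is plain image-grid adjacency (each 2x2 block of pixels yields two triangles) and uses a maximum edge length as a stand-in for the topological consistency filtering. Both choices, and the name `build_faces`, are hypothetical.

```python
import math


def build_faces(grid, max_edge):
    """Triangulate a grid of 3D points.

    grid[r][c] is an (x, y, z) tuple, or None if the point was removed as noise.
    Returns (vertices, faces), where each face is a tuple of three vertex indices.
    """
    h, w = len(grid), len(grid[0])
    index, vertices = {}, []
    for r in range(h):
        for c in range(w):
            if grid[r][c] is not None:
                index[(r, c)] = len(vertices)
                vertices.append(grid[r][c])
    faces = []
    for r in range(h - 1):
        for c in range(w - 1):
            a, b, d, e = (r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)
            for tri in ((a, d, b), (b, d, e)):
                if not all(k in index for k in tri):
                    continue  # a corner was removed earlier
                ids = tuple(index[k] for k in tri)
                p = [vertices[i] for i in ids]
                edges = ((0, 1), (1, 2), (2, 0))
                # Drop triangles that bridge a depth discontinuity.
                if all(math.dist(p[i], p[j]) <= max_edge for i, j in edges):
                    faces.append(ids)
    return vertices, faces
```

Semantic labels and texture values can then be attached per vertex index, since `vertices` preserves the row-major order of the surviving grid points.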
[0084] In this embodiment, the construction of the three-dimensional spatial information data structure specifically includes:
[0085] The mesh vertices, mesh patches, and corresponding 3D spatial coordinate information of each mesh cell are read from the 3D mesh model. The set of mesh cells is divided according to the spatial range parameters of the preset regions of interest. A region of interest is a 3D spatial sub-region defined in the 3D mesh model according to preset spatial range parameters; its record includes a region index, the spatial range parameters, and interaction configuration parameters. The spatial range parameters include the region spatial boundary and region boundary description data.
[0086] Based on the positional relationship of the mesh cells in three-dimensional space and the semantic labels and spatial structure parameters corresponding to the mesh vertices, the value of the semantic category field is determined for each mesh cell. The semantic identifier representing the object category and scene region category is written into the semantic category field, and the same semantic category field record is assigned to mesh cells that belong to the same object or the same scene region.
[0087] Based on the topological connection relationship of the mesh cells in the 3D mesh model and the spatial structure parameters associated with the mesh cells, an object identifier field is determined for each mesh cell, and the mesh cells are grouped according to the object identifier field within each preset region of interest.
[0088] Based on the interaction configuration parameters of the preset region of interest and the spatial position of the mesh cell within it, an interaction event type field is determined for each mesh cell and each preset region of interest. The semantic category field, object identifier field, and interaction event type field are combined, organized by mesh cell index and preset region of interest index, to construct a three-dimensional spatial information data structure composed of mesh cells, preset region of interest indexes, and the corresponding spatial information labels. Each spatial information label is composed of a semantic category field, an object identifier field, and an interaction event type field.
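A possible in-memory layout for the spatial information data structure is sketched below. The field names follow the three label fields named in the text; everything else (the class names, the axis-aligned box used as the region spatial boundary, and the use of a cell centroid for the containment test) is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SpatialLabel:
    semantic_category: str
    object_id: int
    interaction_event: str


@dataclass
class RegionOfInterest:
    index: int
    box_min: Vec3            # assumed axis-aligned spatial boundary
    box_max: Vec3
    interaction_event: str   # taken from the interaction configuration parameters

    def contains(self, p: Vec3) -> bool:
        return all(lo <= v <= hi for lo, v, hi in zip(self.box_min, p, self.box_max))


@dataclass
class SpatialInfoStructure:
    labels: Dict[int, SpatialLabel] = field(default_factory=dict)
    cells_by_region: Dict[int, List[int]] = field(default_factory=dict)


def build_structure(cells, regions, default_event="none"):
    """cells: list of (centroid, semantic_category, object_id), indexed by position."""
    structure = SpatialInfoStructure()
    for idx, (centroid, category, object_id) in enumerate(cells):
        event = default_event
        for roi in regions:
            if roi.contains(centroid):
                structure.cells_by_region.setdefault(roi.index, []).append(idx)
                event = roi.interaction_event
        structure.labels[idx] = SpatialLabel(category, object_id, event)
    return structure
```

Keeping the labels keyed by mesh cell index and the membership lists keyed by region index mirrors the two indexes the text uses to organize the structure.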
[0089] In this embodiment, the generation of the three-dimensional spatial display screen specifically includes:
[0090] Input the three-dimensional spatial information data structure into the three-dimensional rendering engine, establish the viewpoint position, line of sight and projection view plane according to the current viewpoint parameters, and perform rendering initialization processing on the three-dimensional mesh model;
[0091] In the 3D rendering engine, the 3D spatial coordinates of the mesh vertices in the 3D mesh model are transformed from 3D to 2D according to the projection view plane. The mesh patches are rasterized according to the topological relationship and patch filling rules. The obtained pixel color values and depth values are written into the frame buffer structure and the basic image frames that constitute the 3D scene frame image sequence are output in the rendering order.
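The 3D-to-2D transformation and the frame buffer write with a depth test can be illustrated with the reduced sketch below. It projects individual vertices with a pinhole perspective model and omits triangle filling, so it shows only the projection and depth-buffer logic, not a full rasterizer. The camera convention (x right, y up, z forward) and all names are assumptions.

```python
import math


def project(point, fov_deg, width, height):
    """Project a camera-space point to (px, py, depth), or None if not visible."""
    x, y, z = point
    if z <= 0:
        return None  # behind the view plane
    f = (width / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    px = math.floor(width / 2.0 + f * x / z)
    py = math.floor(height / 2.0 - f * y / z)
    if 0 <= px < width and 0 <= py < height:
        return px, py, z
    return None


def render_points(points, colors, fov_deg, width, height, background=(0, 0, 0)):
    """Write colour and depth into frame buffers, keeping the nearest sample per pixel."""
    color_buf = [[background] * width for _ in range(height)]
    depth_buf = [[math.inf] * width for _ in range(height)]
    for p, col in zip(points, colors):
        hit = project(p, fov_deg, width, height)
        if hit is None:
            continue
        px, py, z = hit
        if z < depth_buf[py][px]:   # depth test: nearer sample wins
            depth_buf[py][px] = z
            color_buf[py][px] = col
    return color_buf, depth_buf
```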
[0092] Pixel data is read from the base image frame. Based on the spatial information labels in the 3D spatial information data structure, the superposition position and display style of the navigation prompt graphics, object outline marker graphics and information window interface graphics in the current image frame are determined. The above graphics are written into the pixel data of the current image frame according to the superposition rules corresponding to the pixel positions to obtain the 3D scene frame image.
[0093] The 3D scene frame images are organized according to the rendering order and combined into a continuous visual display stream in time sequence to generate a 3D spatial display screen.
[0094] In this embodiment, the updating of the three-dimensional spatial display screen specifically includes:
[0095] User interaction commands for the 3D spatial display screen are received and parsed, and the corresponding viewing-angle change, viewpoint displacement, and spatial information trigger target identifier are extracted. The interaction commands comprise viewpoint rotation commands, viewpoint movement commands, and spatial information trigger commands.
[0096] The viewpoint position, line of sight, and field of view in the current viewpoint parameters are updated based on the change in viewing angle. The current viewpoint position is translated and updated based on the viewpoint displacement. The updated viewpoint position, line of sight, and field of view are recorded as the updated viewpoint parameters.
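A minimal sketch of the viewpoint update follows. It assumes the viewpoint parameters are stored as a position, yaw and pitch angles, and a field of view, with the line of sight derived from the two angles. This parameterization, the pitch clamp, and the names `Viewpoint` and `update_viewpoint` are illustrative choices and are not taken from the source.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Viewpoint:
    position: tuple
    yaw: float     # radians, rotation about the vertical axis
    pitch: float   # radians, elevation above the horizontal plane
    fov: float     # degrees

    def line_of_sight(self):
        cp = math.cos(self.pitch)
        return (cp * math.sin(self.yaw), math.sin(self.pitch), cp * math.cos(self.yaw))


def update_viewpoint(view, d_yaw=0.0, d_pitch=0.0, displacement=(0.0, 0.0, 0.0), fov=None):
    """Apply a viewing-angle change and a displacement; return the updated parameters."""
    limit = math.pi / 2 - 1e-3                      # keep pitch away from the poles
    yaw = (view.yaw + d_yaw) % (2 * math.pi)
    pitch = max(-limit, min(limit, view.pitch + d_pitch))
    position = tuple(p + d for p, d in zip(view.position, displacement))
    return replace(view, position=position, yaw=yaw, pitch=pitch,
                   fov=view.fov if fov is None else fov)
```

Returning a new immutable record, instead of mutating the current one, keeps the previous viewpoint available for comparison when deciding what needs to be re-rendered.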
[0097] Based on the spatial information trigger target identifier, the spatial information label record corresponding to the target identifier is looked up in the three-dimensional spatial information data structure and added to the active spatial information label set. Spatial information label records that are unrelated to the current rendering area are then masked according to the updated viewpoint parameters and rendering area parameters, yielding the adjusted three-dimensional spatial information data structure.
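The activation and masking step can be sketched as below. The rendering area is modelled as a sphere of a given radius around the viewpoint position, which is an assumption; the source only refers to "rendering area parameters". The record layout and the function name are likewise hypothetical.

```python
import math


def adjust_labels(records, active_ids, target_id, view_position, render_radius):
    """Activate the triggered record and mask records outside the rendering area.

    records: dict mapping record id -> {"position": (x, y, z), ...}
    Returns (new_active_ids, adjusted_records).
    """
    new_active = set(active_ids)
    if target_id in records:
        new_active.add(target_id)
    adjusted = {}
    for rid, rec in records.items():
        if math.dist(rec["position"], view_position) <= render_radius:
            adjusted[rid] = {**rec, "active": rid in new_active}
    return new_active, adjusted
```

Because masked records are simply left out of the adjusted structure, the rendering engine only iterates over labels that can appear in the current frame.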
[0098] The adjusted 3D spatial information data structure, updated viewpoint parameters, and rendering area parameters are input into the 3D rendering engine. Based on the adjusted 3D spatial information data structure, a new 3D scene frame image sequence is generated, and the updated 3D spatial display screen is output to the display terminal.
[0099] A deep learning-based panoramic 3D spatial information display device includes a processor and a memory; the memory stores computer programs, and the processor calls the computer programs stored in the memory to execute a deep learning-based panoramic 3D spatial information display method.
[0100] A computer-readable storage medium storing a computer program that, when executed by a processor, enables the processor to perform a method for displaying panoramic 3D spatial information based on deep learning.
[0101] Example 1:
[0102] To verify the feasibility of this invention in practice, it was applied to a 3D inspection and navigation display system for an underground integrated utility tunnel in a smart city. This scenario consists of multiple interconnected equipment compartments, cable compartments, and hydraulic pumping stations, with a complex internal structure, diverse spatial levels, and dense monitoring points. It requires presenting a 3D visualization with navigation guidance, equipment identification, and hazard warning functions to different maintenance personnel. However, traditional inspection systems often rely on single-view camera footage or manually drawn simplified 3D models, which are insufficient in terms of spatial reconstruction accuracy, semantic annotation accuracy, and interactive experience to meet the needs of real-time inspection and dynamic navigation. This is especially true under panoramic information acquisition conditions, where issues such as image distortion, depth estimation errors, and unstable structural correlations are particularly prominent. This invention provides a complete panoramic image-driven 3D spatial display solution to address these problems.
[0103] In this application scenario, multiple panoramic images collected by the inspection robot are first preprocessed to ensure they possess uniform geometric structure and illumination distribution characteristics. Then, the standardized panoramic images are input into an improved MultiNet network for spherical projection and feature rearrangement, enabling the model to more stably understand spatial relationships within the panoramic images. Through multi-scale feature extraction in the backbone network and structural projection and correction processing in the structural parameter feedback update module, the joint generation of depth maps, semantic label maps, and spatial structural parameters is achieved. Next, a high-precision 3D point set is constructed in a unified 3D coordinate system, and a complete 3D mesh model is obtained through point set cleaning, clustering, and mesh reconstruction. This 3D mesh model can effectively reflect the actual spatial layout of the underground utility tunnel, including the routing of equipment pipelines, compartment distribution, and passageway structure.
[0104] After the 3D mesh model is constructed, regions of interest are defined for pipeline areas, equipment node areas, and safety maintenance locations. Semantic category, object identifier, and interaction event type fields are then bound to each mesh cell to generate a 3D spatial information data structure. When this data structure is input into the 3D rendering engine, the system can perform projection and rasterization rendering based on the current viewpoint, ensuring the image realistically reflects spatial depth and occlusion relationships. During rendering, navigation arrows, equipment outlines, and information window interfaces are automatically overlaid, allowing inspection personnel to directly see the identification name of a device, its current status label, and related operation entry points, significantly improving inspection efficiency.
[0105] During use, maintenance personnel can interact with the 3D spatial display in real time via tablet terminals. For example, they can rotate the viewpoint to observe pipeline layout, move the viewpoint to inspect the structure behind equipment, or trigger a specific area of interest to obtain detailed information. The system automatically updates viewpoint parameters based on user actions and filters the spatial information data structure, ensuring the display is quickly regenerated and remains consistent. Practical application has proven that this interactive method effectively helps personnel understand spatial relationships and achieve rapid location in complex environments.
[0106] Feedback from field use indicates that, in this scenario, the invention improves the stability of panoramic image-driven 3D display, improves the consistency between depth estimation and structural reconstruction, and makes the correspondence between the rendered image and the real space more reliable. The overlay display of navigation prompts, semantic labels, and interactive information also allows inspection personnel to complete the inspection and information retrieval of multiple compartments within a limited time, improving work efficiency and overall task throughput. In summary, the invention demonstrates high application value in complex spatial environments and provides a general, scalable technical approach for future 3D visualization inspection and navigation scenarios.
[0107] Table 1. Performance Comparison of the Invention and Traditional Panoramic 3D Spatial Information Display Methods
[0108]
Indicator | Traditional methods | Method of the present invention
Panoramic depth estimation error (cm) | 12.4 | 3.1
3D structure reconstruction accuracy (%) | 82.7 | 96.4
Semantic label accuracy (%) | 75.2 | 93.8
Interactive response latency (ms) | 185 | 72
Spatial information label recognition rate (%) | 68.9 | 95.1
3D rendering frame rate (frames/s) | 28 | 54
User target location time (s) | 16.8 | 6.3
[0109] As Table 1 shows, the method of the present invention outperforms the traditional method on all seven indicators listed.
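The relative size of each difference in Table 1 can be checked with a short calculation. The values below are copied from Table 1; the dictionary keys and the helper name are introduced only for this example.

```python
# (traditional, proposed) values copied from Table 1.
TABLE_1 = {
    "depth_error_cm": (12.4, 3.1),
    "reconstruction_accuracy_pct": (82.7, 96.4),
    "semantic_label_accuracy_pct": (75.2, 93.8),
    "response_latency_ms": (185, 72),
    "label_recognition_rate_pct": (68.9, 95.1),
    "frame_rate_fps": (28, 54),
    "target_location_time_s": (16.8, 6.3),
}


def relative_change(traditional, proposed):
    """Signed change relative to the traditional value, in percent."""
    return (proposed - traditional) / traditional * 100.0


if __name__ == "__main__":
    for name, (old, new) in TABLE_1.items():
        print(f"{name}: {old} -> {new} ({relative_change(old, new):+.1f}%)")
```

With these figures the depth error falls by 75% (a factor of four), the response latency by about 61%, and the target location time by 62.5%, while the frame rate rises by about 93%. The three accuracy indicators rise by 13.7, 18.6, and 26.2 percentage points respectively.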
[0110] In panoramic depth estimation, traditional methods have a depth error of 12.4 cm, while the method of this invention has an error of 3.1 cm, a fourfold reduction (75%). This invention employs spherical Transformer feature rearrangement and multi-scale structural parameter feedback correction, enabling the depth estimation network to maintain structural consistency under panoramic distortion conditions and reducing depth jumps and misjudgments of curved wall positions. The stability and reliability of depth estimation are improved accordingly.
[0111] Regarding the accuracy of 3D structure reconstruction, traditional methods achieve an accuracy of 82.7%, while this invention achieves 96.4%, an improvement of over 13 percentage points. This invention introduces a projection association mechanism for the set of 3D structural parameters during the structure reconstruction stage, binding each feature location to the real structural information from the previous 3D mesh. This avoids common problems such as skeleton breakage, patch drift, and incorrect adjacency relationships, making the 3D mesh model more consistent with the real spatial layout.
[0112] Regarding semantic label accuracy, traditional methods only achieve 75.2%, while this invention improves it to 93.8%. This is because, in the feature generation stage, this invention integrates spherical feature slicing with cross-regional attention, enabling the model to better identify spatial object boundaries in panoramic images. Furthermore, the spatial topological constraints provided by structural association records give the model "spatial awareness" during semantic classification, significantly reducing misclassification issues across object boundaries.
[0113] Regarding interactive response latency, traditional methods require 185 milliseconds to respond to interactive operations, while this invention requires only 72 milliseconds. This invention reduces the computational load required for reconstructing the entire image sequence by masking inactive tags in the spatial information tag set and performing local rendering updates within the rendering engine, thus significantly reducing the latency caused by interactive operations.
[0114] In terms of spatial information label recognition rate, this invention raises the rate from 68.9% with traditional methods to 95.1%. This is because this invention binds a semantic category field, an object identifier field, and an interaction event type field to each mesh cell and organizes these fields in a structured way within each region of interest. Label recognition therefore relies not only on image features but also on spatial structure parameters and mesh topology relationships, which maintains a high recognition rate even in complex scenes.
[0115] In terms of 3D rendering frame rate, this invention achieves 54 frames per second, while traditional methods only achieve 28 frames per second. Because this invention employs a frame buffer structure, a multi-resolution rendering strategy, and a local reconstruction mechanism based on viewpoint parameters, the rendering pipeline can reduce unnecessary patch processing, thereby improving overall rendering efficiency.
[0116] Finally, regarding user target location time, the traditional process takes an average of 16.8 seconds, while this invention only takes 6.3 seconds. Because this invention overlays navigation prompts, object outline markers, and information window interface graphics on the rendered screen, users can quickly identify the target device and direction of travel, thus significantly shortening search time and improving interaction efficiency.
[0117] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
Claims
1. A method for displaying 3D spatial information of panoramic images based on deep learning, characterized in that it includes the following steps:
acquiring panoramic images of the target scene and preprocessing them to generate a standardized panoramic image set;
inputting the standardized panoramic image set into an improved MultiNet network for panoramic feature extraction and structural association processing to generate a depth map set, a semantic label map set, and a spatial structure parameter set;
constructing an initial 3D point set based on the depth map set, the semantic label map set, and the spatial structure parameter set, and reconstructing the spatial structure to generate a 3D mesh model of the target scene;
in the 3D mesh model, binding spatial information labels to each mesh cell and each preset region of interest to construct a 3D spatial information data structure;
inputting the 3D spatial information data structure into a 3D rendering engine, projecting and rasterizing the 3D mesh model based on the current viewpoint parameters, and superimposing the interface graphics corresponding to the spatial information labels to generate a 3D spatial display screen;
receiving user interaction commands for the 3D spatial display screen, updating the viewpoint parameters and the spatial information label set, updating the 3D spatial display screen based on the adjusted 3D spatial information data structure, and outputting it to the display terminal.
2. The method for displaying panoramic three-dimensional spatial information based on deep learning according to claim 1, characterized in that the preprocessing includes geometric distortion correction, color normalization, and adjustment to a preset resolution.
3. The method for displaying panoramic 3D spatial information based on deep learning according to claim 1, characterized in that, The generation of the depth atlas, semantic tag atlas, and spatial structure parameter set specifically includes: A spherical projection mapping is performed on a standardized panoramic image set. The projected panoramic image data is then input into the spherical Transformer rearrangement module of the improved MultiNet network. The input data is then subjected to global feature rearrangement and structural adjustment based on the spherical feature distribution to generate image data rearranged by spherical features. The improved MultiNet network is based on the original MultiNet multi-task network structure with the addition of a spherical Transformer rearrangement module and a structural parameter feedback update module. In the backbone encoder of the improved MultiNet network, multi-scale convolution processing is performed sequentially on the image data rearranged by spherical features to generate a multi-scale feature map set including multiple scales, multiple channels and spatial distribution structure; Read the corresponding set of spatial structure parameters from the 3D mesh model generated by the previous round of inverse projection and obtained through topological processing, and establish a structure mapping table based on the pixel index of the current panoramic image for the spatial coordinates of the mesh vertices, the topological relationship of the mesh patches, and the relevant spatial structure information. 
In the structural parameter feedback update module, the spatial structural parameters of the three-dimensional mesh model in the structural mapping table are projected item by item to the corresponding feature positions in the multi-scale feature map set according to the current imaging model, and the correspondence between the multi-scale feature positions and the projected three-dimensional structural information is established based on the projection results. Based on the correspondence, in the structural parameter feedback update module, the difference between the predicted structural feature and the projected target structural feature is calculated for each feature position in the multi-scale feature map set. The difference is used as the structural residual and converted into structural correction coefficients according to preset rules to generate a structural correction information set. Based on the structural correction information set, structural consistency correction processing is performed on each feature position in the multi-scale feature map set. The corrected multi-scale feature map set is generated by weighted superposition of the original feature response and the structural correction component position by position. The corrected multi-scale feature maps are input into the depth feature decoding branch, the semantic feature segmentation branch, and the structural feature encoding branch, respectively. After deconvolution regression, pixel classification, and structural encoding, the depth map, semantic label map, and spatial structure parameter set corresponding to the standardized panoramic image are generated.
4. The method for displaying 3D spatial information of panoramic images based on deep learning according to claim 1, characterized in that, The generation of the 3D mesh model of the target scene specifically includes: The depth value, semantic label, and spatial structure parameter of each pixel in the standardized panoramic image set are read from the depth atlas, semantic label atlas, and spatial structure parameter set. The planar pixel coordinates of each pixel are combined with the depth value according to the preset imaging geometry model to convert them into three-dimensional spatial coordinates, generating an initial three-dimensional point set including three-dimensional spatial coordinates, semantic labels, and texture information. The consistency of the depth values of each point in the initial 3D point set with the depth distribution and spatial position relationship of other points in the spatial neighborhood is checked. Points whose depth changes exceed the preset threshold and whose spatial relationship with the neighboring points does not meet the preset geometric constraints are marked as noise points and removed, thus obtaining the processed 3D point set. Based on the three-dimensional spatial coordinates of each point in the processed three-dimensional point set and the spatial distance between them, the three-dimensional point set is subjected to point set clustering processing according to the preset clustering scale and proximity relationship. The three-dimensional points are divided into multiple point clusters according to the spatial proximity relationship. Within each point cluster, the points are merged and sorted by combining the semantic labels and spatial structure parameters of the points to obtain the clustered three-dimensional point set. 
Based on the distribution of point clusters in the clustered 3D point set, the connection relationship between each point and its neighboring points is determined according to the preset adjacency search strategy under the unified 3D spatial coordinate system. Based on the combination of multiple 3D points, the connection record between the mesh vertex and the mesh patch is generated. The connection record is then corrected and filtered for consistency by combining the topological relationship constraints contained in the spatial structure parameter set to obtain the mesh connection data. A 3D mesh model is constructed in a unified 3D spatial coordinate system based on mesh connection data. The 3D points in the clustered 3D point set are used as mesh vertices, the connection records in the mesh connection data are used as mesh patches, and the corresponding semantic labels and texture information are attached to the corresponding mesh vertices and mesh patches respectively to generate a 3D mesh model of the target scene.
5. The method for displaying panoramic three-dimensional spatial information based on deep learning according to claim 1, characterized in that, The construction of the three-dimensional spatial information data structure specifically includes: Read the mesh vertices, mesh patches and corresponding 3D spatial coordinate information of each mesh unit from the 3D mesh model, and divide the mesh unit set according to the spatial range parameters of the preset region of interest; Based on the positional relationship of the mesh units in three-dimensional space and the semantic labels and spatial structure parameters corresponding to the mesh vertices, the value of the semantic category field is determined for each mesh unit. The semantic identifier representing the object category and scene region category is written into the semantic category field, and the same semantic category field record is assigned to mesh units that are in the same object or the same scene object. Based on the topological connection relationship of the mesh cells in the 3D mesh model and the spatial structure parameters associated with the mesh cells, an object identification field is determined for each mesh cell, and the mesh cells are grouped according to the object identification field in each preset region of interest. Based on the interaction configuration parameters of the preset region of interest and the spatial position of the grid cell in the preset region of interest, an interaction event type field is determined for each grid cell and each preset region of interest. The semantic category field, object identifier field, and interaction event type field are combined according to the organization method of the grid cell index and the preset region of interest index to construct a three-dimensional spatial information data structure consisting of grid cells, preset region of interest index, and corresponding spatial information labels.
6. The method for displaying panoramic three-dimensional spatial information based on deep learning according to claim 1, characterized in that, The generation of the three-dimensional spatial display specifically includes: Input the three-dimensional spatial information data structure into the three-dimensional rendering engine, establish the viewpoint position, line of sight and projection view plane according to the current viewpoint parameters, and perform rendering initialization processing on the three-dimensional mesh model; In the 3D rendering engine, the 3D spatial coordinates of the mesh vertices in the 3D mesh model are transformed from 3D to 2D according to the projection view plane. The mesh patches are rasterized according to the topological relationship and patch filling rules. The obtained pixel color values and depth values are written into the frame buffer structure and the basic image frames that constitute the 3D scene frame image sequence are output in the rendering order. Pixel data is read from the base image frame. Based on the spatial information labels in the 3D spatial information data structure, the superposition position and display style of the navigation prompt graphics, object outline marker graphics and information window interface graphics in the current image frame are determined. The above graphics are written into the pixel data of the current image frame according to the superposition rules corresponding to the pixel positions to obtain the 3D scene frame image. The 3D scene frame images are organized according to the rendering order and combined into a continuous visual display stream in time sequence to generate a 3D spatial display screen.
7. The method for displaying panoramic three-dimensional spatial information based on deep learning according to claim 1, characterized in that the update of the three-dimensional spatial display screen specifically includes: receiving and parsing user interaction commands directed at the three-dimensional spatial display screen, and extracting the corresponding viewing angle change, viewpoint displacement, and spatial information trigger target identifier; updating the viewpoint position, line of sight, and field of view in the current viewpoint parameters based on the viewing angle change, translating the current viewpoint position based on the viewpoint displacement, and recording the updated viewpoint position, line of sight, and field of view as the updated viewpoint parameters; based on the spatial information trigger target identifier, searching the three-dimensional spatial information data structure for the spatial information label record corresponding to the target identifier, adding that label record to the set of spatial information labels in the active state, and masking, according to the updated viewpoint parameters and the rendering area parameters, the spatial information label records in the three-dimensional spatial information data structure that are unrelated to the current rendering area, to obtain the adjusted three-dimensional spatial information data structure; and inputting the adjusted three-dimensional spatial information data structure, the updated viewpoint parameters, and the rendering area parameters into the three-dimensional rendering engine, generating a new three-dimensional scene frame image sequence based on the adjusted three-dimensional spatial information data structure, and outputting the updated three-dimensional spatial display screen to the display terminal.
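The update flow of claim 7 can be sketched as follows in Python. The sketch makes several assumptions that the claim does not specify: the line of sight is parameterized by yaw and pitch angles, the "rendering area" is approximated by a viewing cone whose half-angle is half the field of view, each label record carries a hypothetical `anchor` position, and the pitch and field-of-view clamping limits are arbitrary illustrative values. The names `Viewpoint`, `update_viewpoint`, `is_visible`, and `adjust_labels` are likewise hypothetical.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Viewpoint:
    position: tuple   # (x, y, z)
    yaw: float        # radians, rotation about the +y axis
    pitch: float      # radians, elevation above the horizontal plane
    fov: float        # radians, full field-of-view angle

    def direction(self):
        """Unit line-of-sight vector; yaw = pitch = 0 looks down -z."""
        cp = math.cos(self.pitch)
        return (cp * math.sin(self.yaw), math.sin(self.pitch),
                -cp * math.cos(self.yaw))


def update_viewpoint(vp, d_yaw=0.0, d_pitch=0.0, d_fov=0.0,
                     displacement=(0.0, 0.0, 0.0)):
    """Apply a viewing angle change and a viewpoint displacement."""
    limit = math.pi / 2 - 1e-3            # keep pitch away from the poles
    return replace(
        vp,
        yaw=vp.yaw + d_yaw,
        pitch=max(-limit, min(limit, vp.pitch + d_pitch)),
        fov=max(math.radians(10), min(math.radians(120), vp.fov + d_fov)),
        position=tuple(p + d for p, d in zip(vp.position, displacement)),
    )


def is_visible(vp, anchor):
    """True if `anchor` lies inside the viewing cone (assumed rendering area)."""
    offset = [a - p for a, p in zip(anchor, vp.position)]
    dist = math.sqrt(sum(c * c for c in offset))
    if dist == 0:
        return True
    cos_angle = sum(o * d for o, d in zip(offset, vp.direction())) / dist
    return cos_angle >= math.cos(vp.fov / 2)


def adjust_labels(records, active, vp, target_id=None):
    """Activate the triggered label, then mask labels outside the view.

    records: {label_id: {"anchor": (x, y, z), ...}}
    Returns the adjusted records and the updated set of active label ids.
    """
    active = set(active)
    if target_id in records:
        active.add(target_id)
    adjusted = {
        key: {**rec, "active": key in active}
        for key, rec in records.items()
        if is_visible(vp, rec["anchor"])
    }
    return adjusted, active


# Usage: turn 90 degrees to the right, step one unit along +x, trigger "window".
records = {"door": {"anchor": (0, 0, -5)}, "window": {"anchor": (5, 0, 0)}}
vp = Viewpoint(position=(0, 0, 0), yaw=0.0, pitch=0.0, fov=math.radians(90))
vp = update_viewpoint(vp, d_yaw=math.pi / 2, displacement=(1, 0, 0))
adjusted, active = adjust_labels(records, set(), vp, target_id="window")
```

Returning new objects instead of mutating the inputs keeps the previous viewpoint and label set available, which is convenient if the renderer needs to interpolate between the old and the updated state.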
8. A panoramic three-dimensional spatial information display device based on deep learning, characterized in that it comprises a processor and a memory, wherein the memory is used to store a computer program, and the processor calls the computer program stored in the memory to execute the method for displaying panoramic three-dimensional spatial information based on deep learning according to any one of claims 1 to 7.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the method for displaying panoramic three-dimensional spatial information based on deep learning according to any one of claims 1 to 7.