A method and system for selecting rendering elements based on spatial scene adaptation

By generating a structured spatial scene description map, performing dynamic interaction impact analysis, and adaptively selecting rendering elements, the method addresses the poor rendering quality and low resource utilization efficiency of existing technologies and achieves high-quality 3D spatial scene rendering.

CN122312866A · Pending · Publication Date: 2026-06-30 · WEIFANG UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
WEIFANG UNIV OF SCI & TECH
Filing Date
2026-06-02
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing 3D spatial scene rendering methods lack dynamic interaction impact analysis and cannot identify the semantic focus areas of the scene or user interaction needs, resulting in poor rendering quality and low resource utilization efficiency.

Method used

The method acquires multi-dimensional spatial perception data, generates a structured spatial scene description map, extracts dynamic interaction event trigger points and user historical interaction trajectories, calculates the dynamic interaction influence intensity field, identifies the semantic focus areas of the scene, and adaptively filters rendering elements based on the attention weight distribution map. In combination with real-time resource constraints, priority sorting and secondary filtering are performed to drive the graphics processor to execute differentiated rendering instructions.

Benefits of technology

It improves the realism and interactivity of 3D spatial scene rendering, optimizes resource utilization efficiency, resolves the contradiction between rendering effect and resource consumption, and meets the needs of high-quality rendering.

✦ Generated by Eureka AI based on patent content.

Figure CN122312866A_ABST

Abstract

This invention relates to the field of scene rendering technology, and more particularly to a method and system for selecting rendering elements based on spatial scene adaptation. The method includes the following steps: acquiring multi-dimensional spatial perception data corresponding to a target 3D spatial scene and performing entity and virtual structure analysis and spatiotemporal correlation mapping of the spatial scene to generate a dynamic interaction influence intensity field corresponding to the target 3D spatial scene; simultaneously identifying scene semantic focus regions to obtain multiple candidate semantic focus sub-regions; calculating scene attention weights and adaptively filtering each candidate semantic focus sub-region based on the dynamic interaction influence intensity field to generate a preliminary set of scene-adapted rendering elements; acquiring real-time rendering resource constraints and dynamically sorting and secondary filtering based on rendering priorities to generate a spatial scene-adapted rendering element sequence, and driving the graphics processor to execute differentiated rendering instructions based on this sequence. This invention maximizes the efficiency of rendering resource utilization.

Description

Technical Field

[0001] This invention relates to the field of scene rendering technology, and in particular to a method and system for selecting rendering elements based on spatial scene adaptation.

Background Technology

[0002] With the rapid popularization of technologies such as 3D visualization, virtual reality (VR), and augmented reality (AR), 3D spatial scenes are increasingly widely used in fields such as gaming, digital twins, and virtual simulation, placing higher demands on the realism, interactivity, and resource utilization efficiency of scene rendering. Some methods and systems for selecting 3D spatial scene rendering elements have been proposed. These methods mostly rely on a preset, fixed list of rendering elements, apply simple filtering based on basic scene features, and then execute rendering operations according to a uniform rendering priority. However, existing methods lack dynamic interaction impact analysis and cannot identify the semantic focus areas of the scene or user interaction needs, which reduces the rendering quality and resource utilization efficiency of 3D spatial scenes.

Summary of the Invention

[0003] Therefore, the present invention needs to provide a rendering element selection method and system based on spatial scene adaptation to solve at least one of the above-mentioned technical problems.

[0004] To achieve the above objectives, a rendering element selection method based on spatial scene adaptation includes the following steps:
Step S1: Obtain multi-dimensional spatial perception data corresponding to the target three-dimensional spatial scene, and perform entity and virtual structure analysis of the spatial scene based on the multi-dimensional spatial perception data to generate a structured spatial scene description map corresponding to the target three-dimensional spatial scene.
Step S2: Extract the distribution of dynamic interaction event trigger points in the scene based on the structured spatial scene description map, and perform spatiotemporal correlation mapping on the user's historical interaction trajectory data in the scene according to the distribution of dynamic interaction event trigger points to generate a dynamic interaction influence intensity field corresponding to the target three-dimensional spatial scene.
Step S3: Identify the semantic focus region of the structured spatial scene description map to obtain multiple candidate semantic focus sub-regions; calculate the scene attention weight of each candidate semantic focus sub-region based on the dynamic interaction influence intensity field to generate a multi-level scene attention weight distribution map corresponding to the target 3D spatial scene.
Step S4: Based on the multi-level scene attention weight distribution map, adaptively filter the candidate library of rendering elements in the target 3D spatial scene, calculate the spatial semantic correlation and interaction necessity measure between each rendering element and the corresponding candidate semantic focus sub-region, and generate a preliminary set of rendering elements adapted to the scene.
Step S5: Obtain real-time rendering resource constraints, and based on the real-time rendering resource constraints and the multi-level scene attention weight distribution map, dynamically sort and filter the rendering priority of the initial rendering element set to generate a spatial scene-adaptive rendering element sequence, and drive the graphics processor to execute differentiated rendering instructions according to the sequence.

[0005] Furthermore, step S1 includes the following steps: The target three-dimensional spatial scene is captured by three-dimensional laser scanning and multi-view image fusion technology, and the original multi-dimensional spatial perception data including geometric point cloud, material texture reflectivity and ambient lighting gradient are obtained. The geometric point cloud in the original multi-dimensional spatial perception data is reconstructed with topological consistency and voxelized to generate a dense geometric voxel mesh model corresponding to the target three-dimensional spatial scene. High-frequency detail features and low-frequency primary color distribution are extracted from the material texture reflectivity data. The light propagation attenuation and indirect illumination contribution at each location in the scene are calculated based on the ambient light gradient data to generate a global illumination propagation path estimation map. By integrating dense geometric voxel mesh models, high-frequency detail features, low-frequency primary color distribution, and global illumination propagation path estimation maps, the geometric boundaries and material properties of solid objects are associated and bound together. At the same time, non-solid illumination areas, shadow areas, and reflection virtual imaging areas are identified and marked as virtual structures. Based on the association and binding results of entity objects and the virtual structure of the tags, a hierarchical graph structure containing geometry, material, lighting and virtual-real relationships is constructed to generate a structured spatial scene description graph corresponding to the target 3D spatial scene.

[0006] Furthermore, step S2 includes the following steps: Identify all interactive entities and triggerable virtual effect areas from the structured spatial scene description map, and define their spatial coordinate set as the distribution of dynamic interactive event trigger points. Obtain the user's historical interaction trajectory data recorded in historical sessions of the target 3D spatial scene, which includes the user's view movement path, object click sequence, and dwell timestamps. Construct a spatiotemporal decay kernel function centered on each point in the distribution of dynamic interactive event trigger points, and map each interactive event in the user's historical interaction trajectory data to the corresponding decay kernel function according to its spatiotemporal coordinates. Calculate the cumulative intensity value of each dynamic interactive event trigger point under the influence of all its historically related interactive events; this value is jointly determined by the interaction event type weight, the dwell time weight, and the spatiotemporal distance from the trigger point. Spatially interpolate, smooth, and normalize the cumulative intensity values of all dynamic interaction event trigger points to generate a continuous scalar field covering the entire target 3D spatial scene, namely the dynamic interaction influence intensity field.

[0007] Furthermore, step S3 includes the following steps: The structured spatial scene description map is subjected to graph neural network node feature propagation and aggregation. The map is then segmented by unsupervised clustering based on the semantic feature similarity and spatial proximity of the nodes, generating multiple semantically and spatially continuous sub-map clusters. Each sub-map cluster is defined as a candidate semantic focus sub-region. The inherent semantic importance score of all nodes in each candidate semantic focus sub-region is extracted from the structured spatial scene description map. This score is calculated by jointly considering the node type, its centrality in the map, and the complexity of its material optical properties. The dynamic interaction influence intensity field is obtained, and the average dynamic interaction influence intensity value over the spatial range covered by each candidate semantic focus sub-region is extracted from the field. Based on the inherent semantic importance score of each candidate semantic focus sub-region and its corresponding average dynamic interaction influence intensity value, the primary scene attention weight of each candidate semantic focus sub-region is calculated through attention fusion. Considering the spatial occlusion relationships and line-of-sight continuity between candidate semantic focus sub-regions, an attention transfer probability matrix between regions is constructed. This matrix is then used to iteratively optimize and diffuse the primary scene attention weights, generating a multi-level scene attention weight distribution map in which the weights are attached to the scene spatial grid at different resolution levels.
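
The diffusion of primary attention weights through the transfer probability matrix can be sketched as follows. The source does not give the update rule, so this sketch assumes a damped iteration in which each region keeps a share of its primary weight and receives the rest from regions that transfer attention to it. The function name `diffuse_attention`, the damping factor `alpha`, and the iteration count are illustrative assumptions.

```python
def diffuse_attention(primary, transfer, alpha=0.5, iterations=50):
    """Diffuse primary attention weights between regions.

    primary:  list of primary scene attention weights, one per region.
    transfer: transfer[i][j] is the assumed probability that attention
              moves from region i to region j (each row sums to 1).
    alpha:    hypothetical damping factor; 0 keeps the primary weights.
    """
    n = len(primary)
    weights = list(primary)
    for _ in range(iterations):
        # Attention arriving at region j from every region i.
        spread = [sum(weights[i] * transfer[i][j] for i in range(n))
                  for j in range(n)]
        # Blend the diffused weights with the primary weights.
        weights = [(1 - alpha) * primary[j] + alpha * spread[j]
                   for j in range(n)]
    return weights


if __name__ == "__main__":
    primary = [0.6, 0.3, 0.1]
    transfer = [[0.5, 0.5, 0.0],
                [0.25, 0.5, 0.25],
                [0.0, 0.5, 0.5]]
    print(diffuse_attention(primary, transfer))
```

Because every row of the transfer matrix sums to 1, the total attention mass is preserved by this update, which keeps the diffused weights comparable with the primary ones.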

[0008] Furthermore, step S4 includes the following steps: Obtain a candidate library of rendering elements for the target 3D spatial scene, where each rendering element is associated with a predefined geometric complexity descriptor, a material texture identifier, and its potential binding space range in the scene. Perform a spatial intersection test between the multi-level scene attention weight distribution map and the potential binding space range of each element in the candidate rendering element library; if they intersect, calculate the average scene attention weight of the element based on the pixel values of the multi-level scene attention weight distribution map covered by the range. Based on the structured spatial scene description map, calculate the cosine similarity between the geometric and material properties of the rendering element and the overall semantic features of the candidate semantic focus sub-region to which its binding space range belongs, and use it as the spatial semantic relevance measure. Based on the dynamic interaction influence intensity field, extract the frequency and type diversity of historical interaction events within the binding space range of the element, and calculate the interaction necessity measure, which reflects support for potential future interactions, through a trained interaction necessity prediction model. Input the average scene attention weight, spatial semantic relevance measure, and interaction necessity measure of each rendering element into a preset filtering decision function. This function outputs a binary decision flag, and elements flagged as true are retained. Package all retained rendering elements and their associated geometry and material data to generate a preliminary set of rendering elements adapted to the scene.
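
A minimal sketch of the filtering decision is given below. The cosine similarity follows its standard definition. The form of the preset filtering decision function is not stated in the source, so the weighted sum, the weights `(0.4, 0.3, 0.3)`, and the threshold `0.5` in `keep_element` are assumptions chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no direction to compare
    return dot / (norm_a * norm_b)


def keep_element(attention, relevance, necessity,
                 weights=(0.4, 0.3, 0.3), threshold=0.5):
    """Hypothetical decision function: weighted sum against a threshold.

    Returns the binary decision flag; True means the element is retained.
    """
    score = (weights[0] * attention
             + weights[1] * relevance
             + weights[2] * necessity)
    return score >= threshold


if __name__ == "__main__":
    element_features = [0.8, 0.1, 0.3]   # illustrative geometry/material features
    region_features = [0.7, 0.2, 0.4]    # illustrative region semantic features
    relevance = cosine_similarity(element_features, region_features)
    print(relevance, keep_element(0.9, relevance, 0.7))
```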

[0009] Furthermore, the generation of the spatial scene adaptation rendering element sequence in step S5 includes the following steps: The system obtains real-time rendering resource constraints from the graphics processor's real-time monitoring interface, including the available video memory budget for the current frame, the shader core computing power margin, and the maximum number of triangles per frame required by the target frame rate. For each rendering element in the initial rendering element set, the system estimates the video memory usage and computing overhead required for rendering based on its geometric complexity descriptor, and calculates its initial rendering priority score by combining the final weight of its candidate semantic focus sub-region in the multi-level scene attention weight distribution map. Based on the memory budget and computing power margin in the real-time rendering resource constraints, a dynamic resource allocation optimization model is constructed. This model aims to maximize the sum of the weighted priority scores of all selected elements, with the constraints that the total memory usage, total computing overhead, and total number of triangles do not exceed a certain limit. The model performs a 0-1 integer programming solution on the initial set of rendering elements. Elements with a value of 1 in the optimization model solution are included in the candidate sequence and sorted in descending order according to their initial rendering priority scores. The system detects whether there are spatially adjacent and semantically highly similar redundant element pairs in the candidate sequence. If so, it calculates the merging benefit based on their spatial distance and material difference, and performs geometric instantiation merging optimization on element pairs with merging benefits higher than the threshold, updating their rendering cost estimates. 
The candidate sequence optimized by redundancy merging is defined as the spatial scene-adaptive rendering element sequence.
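
The 0-1 selection under the three resource limits can be illustrated with a small exhaustive search. This is a sketch for a handful of elements only: enumeration grows exponentially, and a practical system would presumably hand the same model to an integer programming solver. The dictionary keys (`priority`, `mem`, `compute`, `tris`) and all numbers are hypothetical, and the weighted priority score is taken here to be the priority value itself.

```python
from itertools import product


def select_elements(elements, mem_budget, compute_budget, tri_budget):
    """Pick the 0-1 assignment that maximizes total priority within budgets."""
    best_score, best_choice = -1.0, ()
    for choice in product((0, 1), repeat=len(elements)):
        chosen = [e for e, x in zip(elements, choice) if x]
        feasible = (sum(e["mem"] for e in chosen) <= mem_budget
                    and sum(e["compute"] for e in chosen) <= compute_budget
                    and sum(e["tris"] for e in chosen) <= tri_budget)
        if feasible:
            score = sum(e["priority"] for e in chosen)
            if score > best_score:
                best_score, best_choice = score, choice
    selected = [e for e, x in zip(elements, best_choice) if x]
    # Candidate sequence: descending initial rendering priority score.
    return sorted(selected, key=lambda e: e["priority"], reverse=True)


if __name__ == "__main__":
    elements = [
        {"name": "A", "priority": 0.9, "mem": 40, "compute": 30, "tris": 5000},
        {"name": "B", "priority": 0.6, "mem": 30, "compute": 20, "tris": 3000},
        {"name": "C", "priority": 0.5, "mem": 35, "compute": 25, "tris": 4000},
        {"name": "D", "priority": 0.2, "mem": 10, "compute": 5, "tris": 1000},
    ]
    sequence = select_elements(elements, 80, 60, 10000)
    print([e["name"] for e in sequence])
```

In this illustrative data set, element C is dropped even though its priority exceeds that of D, because A, B, and D together fit the memory budget while any combination containing both A and C leaves less total priority.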

[0010] Furthermore, step S5, which involves driving the graphics processor to execute differentiated rendering instructions based on the sequence, includes: The final spatial scene adaptation rendering element sequence is analyzed, and each element in the sequence is divided into multiple rendering levels according to its initial rendering priority score. Different rendering quality preset schemes are assigned to each level. Based on the rendering quality preset schemes corresponding to the rendering level, a differentiated rendering resource description block is generated for each rendering element, which includes vertex buffer positioning information, texture sampling parameters and shader constant data. A frame-by-frame rendering task queue is constructed, and elements with high priority rendering levels are assigned to keyframes for full-quality rendering. Elements with medium and low priority rendering levels are dynamically assigned to subsequent non-keyframes or progressively enhanced rendering based on historical frames according to their spatial location and motion prediction. The rendering resource description block and the frame-by-frame rendering task queue are compiled into a list of instructions executable by the graphics processor and distributed to different shader core groups of the graphics processor through the compute shader. The system monitors the execution status of the graphics processor's rendering pipeline. When a significant change in the constraints of real-time rendering resources is detected, the system triggers the dynamic sorting and secondary filtering steps for rendering priorities to dynamically update the spatial scene adaptation rendering element sequence and execute the corresponding differentiated rendering instructions.
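
The division of the sequence into rendering levels with per-level quality presets might look like the sketch below. The source names neither the number of levels nor the thresholds or preset contents, so `LEVEL_THRESHOLDS`, `QUALITY_PRESETS`, and their values are assumptions for illustration.

```python
# Hypothetical thresholds and presets; the source does not specify them.
LEVEL_THRESHOLDS = (("high", 0.7), ("medium", 0.4))
QUALITY_PRESETS = {
    "high": {"lod": 0, "texture_size": 2048, "full_shading": True},
    "medium": {"lod": 1, "texture_size": 1024, "full_shading": True},
    "low": {"lod": 2, "texture_size": 512, "full_shading": False},
}


def render_level(priority_score):
    """Map an initial rendering priority score to a rendering level."""
    for level, minimum in LEVEL_THRESHOLDS:
        if priority_score >= minimum:
            return level
    return "low"


def assign_presets(sequence):
    """sequence: list of (element_name, priority_score) pairs.

    Returns (name, level, preset) triples in the original order.
    """
    result = []
    for name, score in sequence:
        level = render_level(score)
        result.append((name, level, QUALITY_PRESETS[level]))
    return result


if __name__ == "__main__":
    for row in assign_presets([("sofa", 0.85), ("lamp", 0.5), ("vase", 0.2)]):
        print(row)
```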

[0011] Furthermore, constructing the frame-by-frame rendering task queue includes the following steps: Define the scene region within the current user's view frustum as the critical rendering region. Select elements with high rendering levels that are located in the critical rendering region from the spatial scene adaptation rendering element sequence and mark them as the keyframe forced rendering element set. For the remaining elements in the spatial scene adaptation rendering element sequence, predict their probability and time of entering the critical rendering region in subsequent consecutive frames using a Kalman filter based on the movement speed of the center point of their bound spatial range and the user's view movement direction. Elements whose predicted probability of entering the critical rendering zone within the next N frames is higher than the first threshold are dynamically allocated to the frame in which they are expected to enter the critical rendering zone as the key frame forced rendering element of that frame and reserved rendering resources are used. Elements that are predicted not to enter the critical rendering area and have a low rendering level within the next M frames are marked as progressive enhancement rendering candidate sets, and a series of multi-resolution level detail models from low to high are generated for the elements in this set. In non-key frames, for elements in the progressive enhancement rendering candidate set, a suitable version is selected from the corresponding multi-resolution hierarchical detail model for rendering based on their dynamic distance from the viewpoint and the remaining rendering resources of the current frame. The rendering results can be temporally mixed with the rendering results of historical frames to construct a frame-by-frame rendering task queue.
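
The Kalman filter prediction can be sketched in one dimension. This sketch assumes the tracked quantity is a signed distance from the element's center point to the boundary of the critical rendering region (positive while outside), modeled with constant velocity per frame. The class `DistanceKalman`, the noise parameters, and the Gaussian estimate in `entry_probability` are illustrative assumptions; the estimate is the probability of being inside at frame N, a simplification of "entering within the next N frames".

```python
import math


class DistanceKalman:
    """Constant-velocity Kalman filter over [distance, velocity per frame]."""

    def __init__(self, distance, variance=100.0,
                 process_noise=0.01, measurement_noise=1.0):
        self.d, self.v = float(distance), 0.0
        self.p = [[variance, 0.0], [0.0, variance]]  # state covariance
        self.q, self.r = process_noise, measurement_noise

    def predict(self):
        """Advance the state by one frame (dt = 1)."""
        (p00, p01), (p10, p11) = self.p
        self.d += self.v
        self.p = [[p00 + p01 + p10 + p11 + self.q, p01 + p11],
                  [p10 + p11, p11 + self.q]]

    def update(self, measured_distance):
        """Correct the state with a new distance measurement."""
        (p00, p01), (p10, p11) = self.p
        s = p00 + self.r                 # innovation variance
        k0, k1 = p00 / s, p10 / s        # Kalman gain
        residual = measured_distance - self.d
        self.d += k0 * residual
        self.v += k1 * residual
        self.p = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]


def track(measurements):
    """Run the filter over a list of per-frame distance measurements."""
    kf = DistanceKalman(measurements[0])
    for z in measurements[1:]:
        kf.predict()
        kf.update(z)
    return kf


def entry_probability(kf, frames):
    """Probability that the predicted distance is <= 0 after `frames` frames."""
    saved = (kf.d, kf.v, [row[:] for row in kf.p])
    for _ in range(frames):
        kf.predict()
    mean, var = kf.d, kf.p[0][0]
    kf.d, kf.v, kf.p = saved             # leave the filter state untouched
    return 0.5 * (1 + math.erf((0 - mean) / math.sqrt(2 * var)))
```

An element whose estimated probability exceeds the first threshold would then be scheduled as a keyframe forced rendering element for the frame in which it is expected to arrive.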

[0012] Furthermore, in non-keyframes, the step of selecting a suitable version from the corresponding multi-resolution level detail model for rendering elements in the progressive enhancement rendering candidate set based on their dynamic distance from the viewpoint and the remaining rendering resources of the current frame includes the following steps: The system calculates the Euclidean distance from the center point of the bound space of each element in the progressive enhancement rendering candidate set to the current user viewpoint in real time, which is used as its dynamic view distance; it calculates the remaining triangle budget and video memory budget that can be used for progressive enhancement rendering based on the real-time rendering resource constraints of the current non-keyframe; it predefines the ideal rendering level related to the dynamic view distance for the multi-resolution hierarchical detail model of each element, as well as the number of triangles and video memory overhead required to achieve the level. With the goal of maximizing the overall visual quality perception score, a resource allocation optimization problem is constructed under the constraints of the remaining triangle budget and the video memory budget. For each element, an actual rendering level detail model version is selected, which is allowed to be lower than but whose quality is not lower than its ideal rendering level. Based on the optimization problem solution, an appropriate hierarchical detail model version is assigned to each element, corresponding drawing call commands are generated, and these commands are inserted into the rendering command stream of non-keyframes for execution and rendering to generate the corresponding rendering results.
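
The level-of-detail selection under the remaining budgets can be approximated with a greedy heuristic. This is a stand-in for the optimization problem described above, not a solver for it: every element starts at its coarsest version and is upgraded in order of quality gain per extra triangle while both budgets allow. The sketch assumes the chosen version never exceeds the ideal level, and the data layout and the name `select_lods` are hypothetical.

```python
def select_lods(elements, tri_budget, mem_budget):
    """Greedy level-of-detail choice under triangle and memory budgets.

    elements: {name: {"lods": [(triangles, memory, quality), ...],  # low to high
                      "ideal": index of the ideal version}}
    Returns {name: chosen version index}.
    """
    choice = {name: 0 for name in elements}
    tris = sum(e["lods"][0][0] for e in elements.values())
    mem = sum(e["lods"][0][1] for e in elements.values())
    if tris > tri_budget or mem > mem_budget:
        raise ValueError("budgets cannot hold even the coarsest versions")
    while True:
        best = None
        for name, e in elements.items():
            current = choice[name]
            if current >= e["ideal"]:
                continue  # never exceed the ideal level
            t0, m0, q0 = e["lods"][current]
            t1, m1, q1 = e["lods"][current + 1]
            extra_tris, extra_mem = t1 - t0, m1 - m0
            if tris + extra_tris > tri_budget or mem + extra_mem > mem_budget:
                continue
            gain = (q1 - q0) / max(extra_tris, 1)
            if best is None or gain > best[0]:
                best = (gain, name, extra_tris, extra_mem)
        if best is None:
            return choice
        _, name, extra_tris, extra_mem = best
        choice[name] += 1
        tris += extra_tris
        mem += extra_mem


if __name__ == "__main__":
    scene = {
        "chair": {"lods": [(100, 1, 0.3), (400, 2, 0.6), (1600, 4, 0.9)], "ideal": 2},
        "lamp": {"lods": [(50, 1, 0.2), (200, 2, 0.7)], "ideal": 1},
    }
    print(select_lods(scene, tri_budget=1000, mem_budget=10))
```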

[0013] Furthermore, the present invention also provides a rendering element selection system based on spatial scene adaptation, including a processor, a memory, and a computer program stored in the memory and executable on the processor, for performing the rendering element selection method based on spatial scene adaptation as described above.

[0014] The beneficial effects of this invention are: The rendering element selection method and system based on spatial scene adaptation proposed in this invention, compared with the prior art, has the following advantages: By acquiring multi-dimensional spatial perception data of the target three-dimensional spatial scene, analyzing the entity and virtual structure of the scene, and generating a structured spatial scene description map, the scattered scene data is integrated into a structured scene description map through the analysis of entity and virtual structure, clearly presenting the spatial position, relationship, and semantic features of various elements in the scene. This breaks through the shortcomings of traditional scene description fragmentation and inability to support accurate rendering, laying a solid foundation for optimizing rendering effects and improving resource utilization efficiency, and making rendering operations more in line with the actual characteristics of the scene and user needs. Secondly, by extracting dynamic interaction event trigger points based on the structured spatial scene description map and generating a dynamic interaction influence intensity field by combining user historical interaction trajectory data, this step accurately extracts dynamic interaction event trigger points from the scene description map, clarifying the key locations in the scene where users may interact, breaking through the limitations of traditional methods that ignore interaction needs and blindly select rendering elements. By mapping user historical interaction trajectories to trigger points in a spatiotemporal manner, the frequency and depth of interactions in different regions are quantitatively analyzed, generating a dynamic interaction influence intensity field. This clearly presents the degree of influence of each region in the scene on user interactions, intuitively reflecting user interaction preferences and key needs. 
By identifying semantic focus regions of the scene and calculating attention weights based on the dynamic interaction influence intensity field, a multi-level scene attention weight distribution map is generated. This step performs semantic analysis on the structured scene description graph, identifying multiple candidate semantic focus sub-regions and focusing on key areas in the scene with core semantics and significant impact on user experience, breaking the limitation of traditional methods in locating scene priorities. Based on the dynamic interaction influence intensity field, scene attention weights are calculated for each candidate semantic focus sub-region. Weight levels are assigned according to the degree of interaction influence and semantic importance, generating a multi-level attention weight distribution map. This clearly defines the rendering priority of different regions in the scene, clarifying which regions require focused rendering and which regions can be simplified, improving overall rendering quality and resource utilization efficiency. Then, based on a multi-level scene attention weight distribution map, the candidate library of rendering elements is adaptively filtered. Spatial semantic relevance and interaction necessity metrics are calculated, and rendering elements are adaptively filtered according to the attention weight distribution map, prioritizing those related to high-weight semantic focus sub-regions. This avoids irrelevant rendering elements consuming resources, overcoming the shortcomings of traditional filtering methods that lack flexibility and adaptability. By calculating the spatial semantic relevance of each rendering element to its corresponding semantic focus sub-region, the rendering elements are ensured to fit the semantic features of the scene. 
At the same time, the necessity of interaction is evaluated, retaining rendering elements that support user interaction and eliminating redundant elements, generating a preliminary set of rendering elements adapted to the scene. Finally, by acquiring real-time rendering resource constraints, the initial set of rendering elements is dynamically prioritized and subjected to secondary filtering. Rendering elements corresponding to high-weight semantic focus areas are prioritized to ensure the supply of rendering resources for key areas. Simultaneously, secondary filtering is performed based on resource constraints to eliminate redundant elements exceeding resource capacity, avoiding rendering stuttering and latency issues. The generated spatial scene-adaptive rendering element sequence clarifies the rendering order and resource allocation ratio of each element, driving the graphics processor to execute differentiated rendering instructions. High-priority elements are rendered with fine detail, while low-priority elements are rendered with simplified rendering, achieving a balance between rendering quality and resource efficiency. This ensures both the realism and interactivity of the scene rendering while maximizing rendering resource utilization efficiency, effectively resolving the contradiction between rendering effects and resource consumption, and meeting the high-quality requirements of 3D spatial scene rendering.

Attached Figure Description

[0015] Other features, objects, and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings: Figure 1 is a schematic diagram of the steps of the rendering element selection method based on spatial scene adaptation of the present invention; Figure 2 is a detailed flowchart of step S1 in Figure 1; Figure 3 is a detailed flowchart of step S2 in Figure 1.

Detailed Implementation

[0016] The technical method of the present invention will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.

[0017] To achieve the above objectives, and with reference to Figures 1-2, this invention provides a rendering element selection method based on spatial scene adaptation. In this example, the rendering element selection method based on spatial scene adaptation includes the following steps: Step S1: Obtain multi-dimensional spatial perception data corresponding to the target three-dimensional spatial scene, and perform entity and virtual structure analysis of the spatial scene based on the multi-dimensional spatial perception data to generate a structured spatial scene description map corresponding to the target three-dimensional spatial scene. In this embodiment of the invention, a three-dimensional indoor spatial scene is captured using a three-dimensional laser scanning device together with multi-view image acquisition devices. The scanning interval of the three-dimensional laser scanning device is set to 10 mm, covering all physical objects and environmental areas within the scene. Six multi-view image acquisition devices simultaneously capture images from azimuths of 0°, 60°, 120°, 180°, 240°, and 300°, at a capture resolution of 3000×2000 pixels. An image fusion algorithm is used to fuse the multi-view images with the laser scanning data to obtain multi-dimensional spatial perception data including geometric point clouds, material texture reflectivity, and ambient lighting gradients. The geometric point cloud contains the three-dimensional coordinates (x, y, z) of each point in the scene; the material texture reflectivity contains the red, green, and blue channel reflectivity values of each point (range 0-255); and the ambient lighting gradient contains the rate of change of light intensity in the horizontal and vertical directions for each point. A point cloud registration algorithm is used to eliminate deviations between the different scanning angles of the geometric point cloud, keeping the deviation within 5 mm.
Then, voxelization is applied to convert the point cloud into a voxel mesh with a side length of 5 mm, generating a dense geometric voxel mesh model. A Gaussian filtering algorithm (3×3 filter kernel) is used to filter high-frequency noise in the material reflectivity data, preserving low-frequency primary color distribution. A Laplacian operator (coefficient 0.2) is then used to extract high-frequency texture details. Based on the ambient lighting gradient, light propagation attenuation is calculated using the distance attenuation formula (attenuation coefficient 0.05, attenuation value = 1 / (1+0.05×distance)). An indirect lighting contribution is calculated using a radiometric algorithm, generating a global lighting propagation path estimation map. Material texture features are associated and bound to voxel meshes through feature matching algorithms. Non-physical lighting areas, shadow areas, and reflection virtual imaging areas are identified and marked as virtual structures through image segmentation algorithms. A three-layer hierarchical graph structure is constructed. The bottom layer stores the three-dimensional coordinates and topological relationships of voxel meshes, the middle layer stores the material reflectivity and lighting parameters, and the top layer stores the spatial positional relationship and mutual influence between entities and virtual bodies. Finally, a structured spatial scene description graph is generated.
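
Two of the numeric steps in this embodiment, the 5 mm voxelization and the distance attenuation formula, can be sketched as follows. The function names are illustrative, and the sketch assumes voxel indices are obtained by flooring each coordinate divided by the voxel side length. The source does not state the unit of `distance` in the attenuation formula, so the unit is left open here as well.

```python
import math

VOXEL_SIZE_MM = 5.0          # voxel side length from the embodiment
ATTENUATION_COEFF = 0.05     # attenuation coefficient from the embodiment


def voxelize(points_mm, voxel_size=VOXEL_SIZE_MM):
    """Map (x, y, z) points in millimeters to the set of occupied voxel indices."""
    return {
        (math.floor(x / voxel_size),
         math.floor(y / voxel_size),
         math.floor(z / voxel_size))
        for x, y, z in points_mm
    }


def attenuation(distance, coeff=ATTENUATION_COEFF):
    """Attenuation value = 1 / (1 + coeff * distance)."""
    return 1.0 / (1.0 + coeff * distance)


if __name__ == "__main__":
    cloud = [(1.0, 2.0, 3.0), (4.9, 4.9, 4.9), (6.0, 0.0, -1.0)]
    print(sorted(voxelize(cloud)))
    print(attenuation(20))
```

Points that fall into the same 5 mm cell collapse into a single voxel, which is what turns the registered point cloud into a regular grid.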

[0018] Step S2: Extract the distribution of dynamic interaction event trigger points in the scene based on the structured spatial scene description map, and perform spatiotemporal correlation mapping on the user's historical interaction trajectory data in the scene according to the distribution of dynamic interaction event trigger points to generate a dynamic interaction influence intensity field corresponding to the target three-dimensional spatial scene. In this embodiment of the invention, the geometric boundaries and attribute information of all entity objects are extracted from the structured spatial scene description map. An object detection algorithm is used to identify interactive entity objects and triggerable virtual body effect areas. The spatial coordinate set of these two categories is defined as the distribution of dynamic interactive event trigger points, with each trigger point corresponding to a unique three-dimensional coordinate (x, y, z). User historical interaction trajectory data recorded in the scene's historical sessions is obtained, including the user's viewpoint movement path, object click sequence, and dwell timestamps. A spatiotemporal decay kernel function is constructed centered on each trigger point, with a spatial decay radius of 500 mm and a temporal decay coefficient of 0.1. The kernel function is K(d, t) = exp(-d^2 / 500^2) × exp(-0.1 × t), where d is the straight-line distance (mm) between the interactive event and the trigger point, and t is the time difference (seconds). Each interactive event is mapped to the corresponding decay kernel function according to its spatiotemporal coordinates; if the spatial distance is less than 500 mm, the event is associated with the trigger point. The cumulative intensity value of each trigger point is then calculated. Click-type interactive events have a weight of 0.8, and view dwell-type events have a weight of 0.5. The dwell time weight is dwell time / 10, with the dwell time capped at 10 seconds. 
The cumulative intensity value = Σ[interactive event type weight × dwell time weight × K(d,t)]. For example, for a trigger point with three related events, the contribution value of a click event (8-second dwell time, 300mm distance, 5-second time difference) is approximately 0.271 (0.8 × 0.8 × (0.698 × 0.607)); the contribution value of a viewpoint dwell event (5-second dwell time, 200mm distance, 3-second time difference) is approximately 0.158 (0.5 × 0.5 × (0.852 × 0.741)); and the contribution value of a click event (10-second dwell time, 400mm distance, 1-second time difference) is approximately 0.382 (0.8 × 1 × (0.527 × 0.905)). The cumulative intensity value is approximately 0.811. The cumulative intensity value is smoothed using a bilinear interpolation algorithm (10mm interval), and then mapped to the range of 0-1 using a normalization formula (normalized intensity value = (original value - minimum value) / (maximum value - minimum value)), generating a dynamic interaction influence intensity field covering the entire scene.
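The kernel and the cumulative intensity sum described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the tuple layout of an event are assumptions made for the example, while the constants (500 mm, 0.1, 0.8, 0.5, the 10-second cap) come from the text.

```python
import math

SPATIAL_RADIUS_MM = 500.0  # spatial decay radius from the text
TEMPORAL_DECAY = 0.1       # temporal decay coefficient from the text
TYPE_WEIGHTS = {"click": 0.8, "dwell": 0.5}  # event type weights from the text


def decay_kernel(d_mm, t_s):
    """K(d, t) = exp(-d^2 / 500^2) * exp(-0.1 * t)."""
    spatial = math.exp(-(d_mm ** 2) / (SPATIAL_RADIUS_MM ** 2))
    temporal = math.exp(-TEMPORAL_DECAY * t_s)
    return spatial * temporal


def event_contribution(kind, dwell_s, d_mm, t_s):
    """Contribution of one event to a trigger point.

    Events at 500 mm or farther are not associated with the trigger point.
    """
    if d_mm >= SPATIAL_RADIUS_MM:
        return 0.0
    dwell_weight = min(dwell_s, 10.0) / 10.0  # dwell time capped at 10 s
    return TYPE_WEIGHTS[kind] * dwell_weight * decay_kernel(d_mm, t_s)


def cumulative_intensity(events):
    """Sum the contributions of all events related to one trigger point."""
    return sum(event_contribution(*event) for event in events)


# The three events of the worked example: (type, dwell s, distance mm, time difference s)
events = [("click", 8, 300, 5), ("dwell", 5, 200, 3), ("click", 10, 400, 1)]
total = cumulative_intensity(events)
print(round(total, 2))
```

Computed without intermediate rounding, the sum comes out very slightly below the 0.811 quoted in the text, which adds three already-rounded contributions.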

[0019] Step S3: Identify the semantic focus region of the structured spatial scene description map to obtain multiple candidate semantic focus sub-regions; calculate the scene attention weight of each candidate semantic focus sub-region based on the dynamic interaction influence intensity field to generate a multi-level scene attention weight distribution map corresponding to the target 3D spatial scene. In this embodiment of the invention, the structured spatial scene description map is subjected to graph neural network node feature propagation and aggregation, with 3 propagation steps. In each step, the node fusion feature is calculated as: (self-feature × 0.6) + Σ(adjacent node features × (1 - distance / 1000)) (distance in millimeters). The map is segmented using unsupervised clustering with a cluster radius of 300 millimeters. Semantic feature similarity is calculated using cosine similarity with a threshold of 0.7. Nodes with semantic similarity ≥ 0.7 and spatial distance ≤ 300 millimeters are grouped into the same cluster, generating multiple candidate semantic focus sub-regions. The inherent semantic importance score of each sub-region is extracted, with entity nodes having a weight of 0.7 and virtual nodes 0.3. Centrality is calculated as: (number of adjacent nodes / total number of nodes in the region). Material optical property complexity is calculated as: (standard deviation of reflectance / 255). The score is calculated as: (0.4 × node type weight + 0.3 × centrality + 0.3 × complexity). For example, if an entity node has 20 adjacent nodes, a total of 50 nodes in the region, and a reflectance standard deviation of 60, the score = 0.4 × 0.7 + 0.3 × 0.4 + 0.3 × 0.235 ≈ 0.471, and the region score is the average score of the nodes. The average dynamic interaction influence intensity value of each sub-region is extracted using a region integration algorithm. 
The average intensity value = the sum of the intensity values of all points within the sub-region / the total number of points. A primary weight (fusion weight 0.5) is calculated using attention fusion: primary weight = 0.5 × inherent score + 0.5 × average intensity value. The occlusion relationship of the region is determined by the ray detection algorithm. The occlusion coefficient is 0.3 when occluded and 1.0 when unoccluded. The line-of-sight continuity coefficient is calculated based on the angle between the line connecting the center of the region and the line of sight. The coefficient is 1.0 for an angle ≤30° and decreases by 0.1 for every 10° increase. A transition probability matrix is constructed (transition probability = 0.6 × occlusion coefficient + 0.4 × continuity coefficient). The initial weights are optimized by iterating 5 times (updated weight = current weight × 0.7 + Σ(other region weights × transition probability) × 0.3). A multi-level scene attention weight distribution map is generated attached to a 10 mm, 20 mm, and 50 mm resolution grid.
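The inherent semantic importance score and the primary weight fusion in this step reduce to a few weighted sums. The sketch below uses illustrative function names (assumptions, not from the source). The average intensity of 0.68 passed to the fusion is the figure used in the detailed example of paragraph [0028].

```python
NODE_TYPE_WEIGHT = {"entity": 0.7, "virtual": 0.3}  # node type weights from the text


def node_score(node_type, adjacent_nodes, region_nodes, reflectance_std):
    """Inherent semantic importance score of a single node."""
    centrality = adjacent_nodes / region_nodes
    complexity = reflectance_std / 255.0
    return 0.4 * NODE_TYPE_WEIGHT[node_type] + 0.3 * centrality + 0.3 * complexity


def region_score(node_scores):
    """The region score is the average of its node scores."""
    return sum(node_scores) / len(node_scores)


def primary_weight(inherent_score, average_intensity):
    """Attention fusion with a fusion weight of 0.5."""
    return 0.5 * inherent_score + 0.5 * average_intensity


# Entity node with 20 adjacent nodes, 50 nodes in the region, reflectance std 60
score = node_score("entity", 20, 50, 60)
weight = primary_weight(0.471, 0.68)
print(round(score, 3), round(weight, 4))
```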

[0020] Step S4: Based on the multi-level scene attention weight distribution map, adaptively filter the candidate library of rendering elements in the target 3D spatial scene, calculate the spatial semantic correlation and interaction necessity measure between each rendering element and the corresponding candidate semantic focus sub-region, and generate a preliminary set of rendering elements adapted to the scene. In this embodiment of the invention, a candidate library of rendering elements, including geometric models, material textures, and lighting effects, is obtained. Each element is associated with a geometric complexity descriptor (number of triangle faces), a material texture identifier, and a potential binding space range. A spatial intersection algorithm is used to calculate the intersection volume between the multi-level scene attention weight distribution map and the potential binding space range of each element. If the intersection volume is greater than 30% of the element's potential range volume, an intersection is determined. The average scene attention weight is calculated as the sum of pixel weights within the coverage area / the total number of pixels. Feature vectors of rendering elements and their corresponding sub-regions are extracted based on a structured graph. The spatial semantic relevance is calculated using the cosine similarity formula (similarity = dot product / (modulus product)). The frequency (total number of events / total duration, in times / hour) and type diversity (number of event types in the region / total number of event types) of historical interaction events in the element's binding region are extracted. The interaction necessity metric is calculated as 0.6 × frequency + 0.4 × diversity. The average attention weight, spatial semantic relevance, and interaction necessity are input into a filtering decision function. The decision value is calculated as 0.4 × average weight + 0.3 × relevance + 0.3 × necessity. 
Elements with a decision value ≥ 0.4 are retained. For example, if an element has an average weight of 0.55, a relevance of 0.008, and a necessity of 0.76, the decision value is approximately 0.4 × 0.55 + 0.3 × 0.008 + 0.3 × 0.76 ≈ 0.4504. This element is retained, and all retained elements, along with their associated geometry and material data, are packaged to generate a preliminary set of rendering elements.
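The filtering decision can be sketched as below. The helper names are assumptions for illustration. The text does not say how the event frequency (times / hour) is scaled before it enters the necessity metric, so the sketch uses it as given.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_a * norm_b)


def interaction_necessity(frequency, diversity):
    """Interaction necessity metric = 0.6 * frequency + 0.4 * diversity."""
    return 0.6 * frequency + 0.4 * diversity


def decision_value(avg_weight, relevance, necessity):
    """Decision value = 0.4 * weight + 0.3 * relevance + 0.3 * necessity."""
    return 0.4 * avg_weight + 0.3 * relevance + 0.3 * necessity


def keep_element(avg_weight, relevance, necessity, threshold=0.4):
    """Elements with a decision value at or above the threshold are retained."""
    return decision_value(avg_weight, relevance, necessity) >= threshold


# Worked example from the text: weight 0.55, relevance 0.008, necessity 0.76
value = decision_value(0.55, 0.008, 0.76)
print(round(value, 4), keep_element(0.55, 0.008, 0.76))
```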

[0021] Step S5: Obtain real-time rendering resource constraints, and based on the real-time rendering resource constraints and the multi-level scene attention weight distribution map, dynamically sort and filter the rendering priority of the initial rendering element set to generate a spatial scene-adaptive rendering element sequence, and drive the graphics processor to execute differentiated rendering instructions according to the sequence.

[0022] In this embodiment of the invention, real-time rendering resource constraints are obtained every frame through the graphics processor's real-time monitoring interface, including an available video memory budget of 8GB, a shader core computing power margin of 40%, and a maximum limit of 1 million triangles per frame. For each element in the initial rendering element set, the video memory usage (the sum of the number of bytes of geometry and texture data) and computational overhead (the product of the number of triangle faces and the number of material samples) are estimated based on the geometric complexity descriptor. Combined with the final weight of its sub-region, an initial rendering priority score is calculated: Priority score = 0.4 × (1 - video memory usage / 8GB) + 0.3 × (1 - computational overhead / 40%) + 0.3 × region weight. For example, if an element has a video memory usage of 0.8GB, a computational overhead of 15%, and a region weight of 0.826, its priority score = 0.4 × 0.9 + 0.3 × 0.625 + 0.3 × 0.826 ≈ 0.7953. A dynamic resource allocation optimization model is constructed with the goal of maximizing the weighted sum of the priorities of selected elements. Constraints include total VRAM ≤ 8GB, total computational overhead ≤ 40%, and total number of triangles ≤ 1 million. A 0-1 integer programming problem is solved on the initial set. Selected elements are included in a candidate sequence and arranged in descending order of priority. Redundant element pairs in the candidate sequence with a spatial distance ≤ 200 mm and a semantic similarity ≥ 0.8 are detected. The merging benefit is calculated as 0.6 × (1 - spatial distance / 200) + 0.4 × (1 - material difference), with a threshold of 0.7. If the value exceeds the threshold, geometric instantiation merging is performed, updating the rendering overhead. 
After merging, VRAM and computational overhead are reduced by 20% and 15% respectively, resulting in a spatial scene-adapted rendering element sequence. The sequence is parsed and divided into high, medium, and low rendering levels according to priority ≥ 0.7, 0.5 ≤ priority < 0.7, and priority < 0.5. Different rendering quality schemes are assigned, differentiated rendering resource description blocks are generated, compiled into graphics processor executable instructions, and distributed to the corresponding shader core group for execution. Resource constraints are monitored every frame, and if changes exceed the threshold, secondary filtering and sequence updates are triggered.
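The priority score, the 0-1 selection under resource constraints, and the three rendering levels can be sketched as follows. Several points are assumptions made for the example: the dictionary layout of an element, the use of a plain sum of priorities as the objective, and exhaustive enumeration standing in for an integer programming solver (practical only for small element sets).

```python
from itertools import combinations

VRAM_BUDGET_GB = 8.0       # available video memory budget from the text
COMPUTE_MARGIN = 0.40      # shader core computing power margin from the text
MAX_TRIANGLES = 1_000_000  # per-frame triangle limit from the text


def priority_score(vram_gb, compute_fraction, region_weight):
    """0.4*(1 - VRAM/8GB) + 0.3*(1 - overhead/40%) + 0.3*region weight."""
    return (0.4 * (1 - vram_gb / VRAM_BUDGET_GB)
            + 0.3 * (1 - compute_fraction / COMPUTE_MARGIN)
            + 0.3 * region_weight)


def render_level(priority):
    """High: >= 0.7, medium: 0.5 to < 0.7, low: < 0.5."""
    if priority >= 0.7:
        return "high"
    if priority >= 0.5:
        return "medium"
    return "low"


def select_elements(elements):
    """Exhaustive 0-1 selection maximizing total priority within all three limits."""
    best, best_value = (), 0.0
    for size in range(1, len(elements) + 1):
        for subset in combinations(elements, size):
            feasible = (sum(e["vram"] for e in subset) <= VRAM_BUDGET_GB
                        and sum(e["compute"] for e in subset) <= COMPUTE_MARGIN
                        and sum(e["triangles"] for e in subset) <= MAX_TRIANGLES)
            value = sum(e["priority"] for e in subset)
            if feasible and value > best_value:
                best, best_value = subset, value
    # The candidate sequence is arranged in descending order of priority
    return sorted(best, key=lambda e: e["priority"], reverse=True)


example = priority_score(0.8, 0.15, 0.826)  # worked example from the text

# Hypothetical elements, invented for illustration only
candidates = [
    {"name": "A", "vram": 5.0, "compute": 0.20, "triangles": 400_000, "priority": 0.8},
    {"name": "B", "vram": 4.0, "compute": 0.20, "triangles": 500_000, "priority": 0.6},
    {"name": "C", "vram": 3.0, "compute": 0.15, "triangles": 300_000, "priority": 0.5},
]
chosen = [e["name"] for e in select_elements(candidates)]
print(round(example, 4), render_level(example), chosen)
```

In this toy set, A and B together exceed the 8 GB budget, so the selection falls back to A and C, which has the highest total priority among the feasible combinations.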

[0023] Furthermore, as an embodiment of the present invention, Figure 2 shows a detailed flowchart of step S1 in Figure 1. In this embodiment, step S1 includes the following steps: S11: Capture the target three-dimensional spatial scene through three-dimensional laser scanning and multi-view image fusion technology, and obtain raw multi-dimensional spatial perception data including geometric point clouds, material texture reflectivity and ambient lighting gradient; S12: Perform topological consistency reconstruction and voxelization on the geometric point cloud in the original multi-dimensional spatial perception data to generate a dense geometric voxel mesh model corresponding to the target three-dimensional spatial scene, and extract high-frequency detail features and low-frequency primary color distribution from the material texture reflectivity data; calculate the light propagation attenuation and indirect illumination contribution at each location in the scene based on the ambient lighting gradient data, and generate a global illumination propagation path estimation map; S13: Integrate the dense geometric voxel mesh model, the high-frequency detail features and low-frequency primary color distribution, and the global illumination propagation path estimation map to associate and bind the geometric boundaries and material properties of entity objects; at the same time, identify and mark non-physical illumination areas, shadow areas, and reflection virtual imaging areas as virtual structures; S14: Based on the association and binding results of entity objects and the marked virtual structures, construct a hierarchical graph structure containing geometry, material, lighting and virtual-real relationships, and generate a structured spatial scene description graph corresponding to the target 3D spatial scene.

[0024] In this embodiment of the invention, a three-dimensional laser scanning device is used to perform an all-round scan of the target indoor space scene, covering all physical objects and environmental areas within the scene. The scanning interval is set to 10 mm. Simultaneously, six multi-view image acquisition devices are used to synchronously capture scene images from different directions at shooting angles of 0°, 60°, 120°, 180°, 240°, and 300°. The shooting resolution of each device is set to 3000×2000 pixels. The multi-view images are fused with the laser scanning data using an image fusion algorithm to obtain original multi-dimensional spatial perception data including geometric point clouds, material texture reflectivity, and ambient lighting gradient. The geometric point cloud data includes the three-dimensional coordinates (x, y, z) of each point in the scene. The material texture reflectivity data includes the reflectivity values of each point in the red, green, and blue channels (with a value range of 0-255). The ambient lighting gradient data includes the rate of change of light intensity of each point in the horizontal and vertical directions. Topological consistency reconstruction is performed on the geometric point cloud in the original multi-dimensional spatial perception data. Point cloud registration algorithms are used to eliminate point cloud deviations caused by different scanning angles, controlling the deviation within 5 mm. Then, voxelization is performed to convert the geometric point cloud into a voxel mesh with a voxel side length of 5 mm, generating a dense geometric voxel mesh model corresponding to the target 3D spatial scene. This model can clearly present the 3D contours and spatial relationships of entities within the scene. High-frequency detail features and low-frequency primary color distributions are extracted from the material texture reflectivity data. 
A Gaussian filtering algorithm is used to filter the reflectivity data with a kernel size of 3×3 to remove high-frequency noise and retain the low-frequency primary color distribution. Then, the Laplacian operator is used to extract high-frequency detail features with an operator coefficient of 0.2 to obtain the texture details of the material surface, such as the texture of a wooden tabletop or minor scratches on a wall. Based on ambient lighting gradient data, the light propagation attenuation and indirect lighting contribution at each location in the scene are calculated. The light propagation attenuation is calculated using the distance attenuation formula, with the attenuation coefficient set to 0.05. The formula is: attenuation value = 1 / (1 + 0.05 × distance), where the distance is the straight-line distance between the current location and the light source. The indirect lighting contribution is calculated using the radiosity algorithm. Each surface in the scene is taken as a radiation source, and its lighting contribution to other surfaces is calculated to generate a global lighting propagation path estimation map. This map can show the propagation path of light and the distribution of light intensity in the scene. By integrating dense geometric voxel mesh models, high-frequency detail features, low-frequency primary color distribution, and global illumination propagation path estimation maps, a feature matching algorithm is used to associate and bind material texture features with corresponding positions in the geometric voxel mesh model, ensuring that each voxel mesh corresponds to a unique material attribute. At the same time, an image segmentation algorithm is used to identify and label non-physical illuminated areas, shadow areas, and reflective virtual imaging areas as virtual structures. 
Non-physical illuminated areas are areas without physical objects but with illumination distribution, shadow areas are dark areas formed by physical objects blocking light, and reflective virtual imaging areas are virtual image areas formed by reflections from materials such as mirrors and glass. Based on the association and binding results of entity objects and the marked virtual structure, a hierarchical graph structure is constructed, which includes geometry, material, lighting, and virtual-real relationships. The graph structure is divided into three layers: the bottom layer is the geometry structure layer, which stores the 3D coordinates and topological relationships of the voxel mesh; the middle layer is the material and lighting layer, which stores the material reflectivity and lighting parameters corresponding to each geometry; and the top layer is the virtual-real relationship layer, which stores the spatial positional relationships and mutual influences between entity objects and virtual structures. Finally, a structured spatial scene description map corresponding to the target 3D spatial scene is generated. This map can fully present all the spatial features and attribute information of the scene.
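Two of the numerical operations in this paragraph, voxelization at a 5 mm side length and the distance attenuation formula, can be sketched briefly. The function names are assumptions. Representing the voxel model as a set of occupied integer indices is also an assumption, since the text does not specify a storage layout, and the unit of the distance in the attenuation formula is not stated in the text.

```python
import math

VOXEL_SIZE_MM = 5.0       # voxel side length from the text
ATTENUATION_COEFF = 0.05  # attenuation coefficient from the text


def voxelize(points_mm, voxel_size=VOXEL_SIZE_MM):
    """Map each (x, y, z) point in mm to the integer index of its voxel.

    Returns the set of occupied voxel indices (assumed representation).
    """
    return {
        tuple(math.floor(coord / voxel_size) for coord in point)
        for point in points_mm
    }


def attenuation(distance):
    """Attenuation value = 1 / (1 + 0.05 * distance)."""
    return 1.0 / (1.0 + ATTENUATION_COEFF * distance)


# Hypothetical points: the first two fall into the same 5 mm voxel
cloud = [(0.0, 0.0, 0.0), (4.9, 4.9, 4.9), (5.0, 0.0, 0.0)]
occupied = voxelize(cloud)
print(sorted(occupied), attenuation(20.0))
```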

[0025] Furthermore, as an embodiment of the present invention, Figure 3 shows a detailed flowchart of step S2 in Figure 1. In this embodiment, step S2 includes the following steps: S21: Identify all interactive entity objects and triggerable virtual body effect areas from the structured spatial scene description map, and define their spatial coordinate set as the distribution of dynamic interactive event trigger points; obtain user historical interaction trajectory data recorded in the historical sessions of the target 3D spatial scene, which includes the user's viewpoint movement path, object click sequence and dwell time stamp; S22: Construct a spatiotemporal decay kernel function with each point in the distribution of dynamic interactive event trigger points as the center, and map each interactive event in the user's historical interaction trajectory data to the corresponding decay kernel function according to its spatiotemporal coordinates; calculate the cumulative intensity value of each dynamic interactive event trigger point affected by all its historical related interactive events, where this cumulative intensity value is jointly determined by the interaction event type weight, dwell time weight, and spatiotemporal distance from the trigger point; S23: Perform spatial interpolation smoothing and normalization on the cumulative intensity values of all dynamic interactive event trigger points to generate a continuous scalar field covering the entire target 3D spatial scene, namely the dynamic interaction influence intensity field.

[0026] In this embodiment of the invention, the geometric boundaries and attribute information of all entity objects are extracted from the structured spatial scene description map. All interactive entity objects are identified using a target detection algorithm. Interactive entity objects are entities that can be triggered by user actions, such as switches, tables, chairs, and monitors within the scene. Simultaneously, triggerable virtual body effect regions are identified. These are regions where virtual body changes occur after user interaction, such as specular reflection regions and shadow change regions. The set of spatial coordinates of these interactive entity objects and triggerable virtual body effect regions is defined as the dynamic interaction event trigger point distribution, with each trigger point corresponding to a unique three-dimensional coordinate (x, y, z). User historical interaction trajectory data recorded in the target three-dimensional spatial scene's historical sessions is acquired. This data is synchronously acquired through a location tracking device and an interaction recording device within the scene, including the user's viewpoint movement path, object click sequence, and dwell time stamp. The user's viewpoint movement path is the sequence of changes in the user's three-dimensional viewpoint coordinates within the scene; the object click sequence is the coordinates of the trigger point clicked by the user and the click type; and the dwell time stamp is the duration (in seconds) the user spends near each trigger point. A spatiotemporal decay kernel function is constructed, centered on each point in the distribution of dynamic interactive event trigger points. The spatial decay radius of the kernel function is set to 500 mm, and the temporal decay coefficient is set to 0.1. 
The kernel function expression is: K(d,t) = exp(-d² / 500²) × exp(-0.1t), where d is the spatial straight-line distance between the current interaction event and the trigger point (unit: mm), and t is the time difference between the current interaction event and the current time (unit: seconds). Each interaction event in the user's historical interaction trajectory data is mapped to the corresponding decay kernel function according to its spatiotemporal coordinates. If the spatial distance between the spatial coordinates of an interaction event and a trigger point is less than 500 mm, then the interaction event is mapped to the kernel function of that trigger point. The cumulative intensity value of each dynamic interactive event trigger point's influence from all its historical related interactive events is then calculated. The cumulative intensity value is jointly determined by the interaction event type weight, dwell time weight, and spatiotemporal distance from the trigger point. Click-type interactive events have a weight of 0.8, view-dwelling interactive events have a weight of 0.5, and the dwell time weight is dwell time / 10 (dwell time is a maximum of 10 seconds). The formula for calculating the cumulative intensity value is: Cumulative Intensity Value = Σ[Interaction Event Type Weight × Dwell Time Weight × K(d,t)]. For example, suppose a trigger point has 3 related interactive events. The first is a click-type event with a dwell time of 8 seconds, a spatial distance of 300 mm, and a time difference of 5 seconds; its contribution value is 0.8 × (8 / 10) × [exp(-300² / 500²) × exp(-0.1 × 5)] = 0.8 × 0.8 × (0.698 × 0.607) ≈ 0.271. The second is a viewpoint dwell event with a dwell time of 5 seconds, a spatial distance of 200 mm, and a time difference of 3 seconds; its contribution value is 0.5 × (5 / 10) × [exp(-200² / 500²) × exp(-0.1 × 3)] = 0.5 × 0.5 × (0.852 × 0.741) ≈ 0.158. The third is a click event with a dwell time of 10 seconds, a spatial distance of 400 mm, and a time difference of 1 second; its contribution value is 0.8 × (10 / 10) × [exp(-400² / 500²) × exp(-0.1 × 1)] = 0.8 × 1 × (0.527 × 0.905) ≈ 0.382. The cumulative intensity value of the trigger point is 0.271 + 0.158 + 0.382 ≈ 0.811. The cumulative intensity values of all dynamic interactive event trigger points are spatially interpolated and smoothed. A bilinear interpolation algorithm is used to interpolate the cumulative intensity values of adjacent trigger points, with an interpolation interval of 10 mm to eliminate abrupt changes in intensity values. Then, normalization is performed to map all cumulative intensity values to the range of 0-1. The normalization formula is: Normalized Intensity Value = (Original Cumulative Intensity Value - Minimum Cumulative Intensity Value) / (Maximum Cumulative Intensity Value - Minimum Cumulative Intensity Value). For example, if the original cumulative intensity value of a trigger point is 0.811, the minimum cumulative intensity value among all trigger points is 0.123, and the maximum cumulative intensity value is 0.956, its normalized intensity value = (0.811 - 0.123) / (0.956 - 0.123) ≈ 0.826. Through the above processing, a continuous scalar field covering the entire target 3D spatial scene is generated, namely the dynamic interaction influence intensity field. The scalar value at each location in this scalar field corresponds to the interaction influence intensity at that location, providing data support for subsequent rendering element selection.
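The normalization step is ordinary min-max scaling. A minimal sketch follows, with an assumed function name and an assumed convention for the degenerate case where all values are equal, which the text does not address.

```python
def normalize_intensities(values):
    """Min-max normalization: (value - min) / (max - min), mapped to 0-1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a flat field is mapped to all zeros
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Minimum, example value, and maximum from the worked example in the text
raw = [0.123, 0.811, 0.956]
normalized = normalize_intensities(raw)
print([round(v, 3) for v in normalized])
```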

[0027] Furthermore, step S3 includes the following steps: The structured spatial scene description map is subjected to graph neural network node feature propagation and aggregation. The map is then segmented by unsupervised clustering based on the semantic feature similarity and spatial proximity of the nodes, generating multiple semantically and spatially continuous sub-map clusters. Each sub-map cluster is defined as a candidate semantic focus sub-region. The inherent semantic importance score of all nodes of each candidate semantic focus sub-region in the structured spatial scene description map is extracted. This score is calculated by comprehensively considering the node type, its centrality in the map, and its material optical property complexity. The dynamic interaction influence intensity field is obtained, and the average dynamic interaction influence intensity value of the spatial range covered by each candidate semantic focus sub-region is extracted from the field. Based on the inherent semantic importance score of each candidate semantic focus sub-region and its corresponding average dynamic interaction influence intensity value, the primary scene attention weight of each candidate semantic focus sub-region is calculated through attention fusion. Considering the spatial occlusion relationship and line-of-sight continuity between candidate semantic focus sub-regions, an attention transfer probability matrix between regions is constructed. This matrix is then used to iteratively optimize and diffuse the primary scene attention weights, generating a multi-level scene attention weight distribution map. The weights are attached to the scene spatial grid at different resolution levels.

[0028] In this embodiment of the invention, graph neural network node features are propagated and aggregated on a structured spatial scene description graph. The propagation steps of the graph neural network are set to 3 steps. During each propagation step, each node weightedly fuses its own features with the features of its neighboring nodes. The weights are determined by the spatial distance between nodes; the closer the distance, the greater the weight. The fusion formula is: Node fused feature = own feature × 0.6 + Σ(neighboring node features × (1 - distance / 1000)), where the distance unit is millimeters. The graph is segmented by unsupervised clustering based on the semantic feature similarity and spatial proximity of nodes. The clustering radius is set to 300 millimeters. Semantic feature similarity is calculated using cosine similarity, and the similarity threshold is set to 0.7. When the semantic feature cosine similarity of two nodes is ≥ 0.7 and the spatial distance is ≤ 300 millimeters, they are classified into the same cluster, generating multiple semantically and spatially continuous sub-graph clusters. Each sub-graph cluster is defined as a candidate semantic focus sub-region. For example, the desktop area, wall area, and mirror reflection area in the scene each form an independent candidate semantic focus sub-region. The inherent semantic importance score of all nodes in each candidate semantic focus sub-region in the structured spatial scene description graph is extracted. This score is calculated by combining the node type, its centrality in the graph, and its material optical property complexity. In the node type weight, the weight of entity nodes is 0.7 and the weight of virtual nodes is 0.3. The centrality is calculated by the number of adjacent nodes connected to the node, centrality = number of adjacent nodes / total number of nodes in the region. 
The material optical property complexity is calculated by the standard deviation of material reflectivity, complexity = standard deviation of reflectivity / 255. The formula for the inherent semantic importance score is: score = 0.4 × node type weight + 0.3 × centrality + 0.3 × material optical property complexity. For example, for an entity node in a candidate semantic focus sub-region, the number of adjacent nodes is 20, the total number of nodes in the region is 50, and the standard deviation of material reflectivity is 60, its score = 0.4 × 0.7 + 0.3 × (20 / 50) + 0.3 × (60 / 255) ≈ 0.471. The inherent semantic importance score of this region is the average score of all nodes in the region. The dynamic interaction influence intensity field is obtained. The average dynamic interaction influence intensity value of the spatial range covered by each candidate semantic focus sub-region is extracted by the regional integration algorithm. The integration range is the three-dimensional spatial boundary of the sub-region. The average intensity value = the sum of the intensity values of all points in the sub-region / the total number of points in the sub-region. For example, if there are 1000 points in a sub-region and the sum of the intensity values is 680, its average dynamic interaction influence intensity value = 680 / 1000 = 0.68. Based on the inherent semantic importance score of each candidate semantic focus sub-region and its corresponding average dynamic interaction influence intensity value, the primary scene attention weight is calculated through attention fusion. The fusion weight is 0.5, and the formula is: Primary weight = 0.5 × inherent semantic importance score + 0.5 × average dynamic interaction influence intensity value. For example, if the inherent semantic importance score of a certain sub-region is 0.471 and the average dynamic interaction influence intensity value is 0.68, its primary weight = 0.5 × 0.471 + 0.5 × 0.68 = 0.2355 + 0.34 = 0.5755. 
Considering the spatial occlusion relationship and line-of-sight coherence between candidate semantic focus sub-regions, a ray detection algorithm is used to determine the occlusion relationship between regions. If a ray is blocked by another region when moving from one region to another, the occlusion coefficient is 0.3; otherwise, it is 1.0. Line-of-sight coherence is calculated by the angle between the line connecting the center of the regions and the scene's line-of-sight direction. When the angle is ≤30°, the coherence coefficient is 1.0, and the coefficient decreases by 0.1 for every 10° increase in the angle. An inter-region attention transfer probability matrix is constructed, where the matrix elements are the transfer probabilities between two regions, and the transfer probability = 0.6 × occlusion coefficient + 0.4 × coherence coefficient. This matrix is used to iteratively optimize and diffuse the initial scene attention weights. The number of iterations is set to 5, and the weight update formula after each iteration is: updated weight = current weight × 0.7 + Σ(other region weights × transfer probability) × 0.3. Finally, a multi-level scene attention weight distribution map is generated. The weights are attached to the scene space grid at different resolution levels, with grid resolutions of 10 mm, 20 mm, and 50 mm, corresponding to different rendering precision requirements.
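The transfer probability and the iterative weight diffusion can be sketched as below. Three points are assumptions: the coherence coefficient is taken to fall linearly beyond 30° and is clamped at 0, the matrix entry `transfer[i][j]` is read as the probability between regions i and j, and no renormalization is applied after each iteration because the text does not describe one.

```python
def occlusion_coefficient(occluded):
    """0.3 when the ray between two regions is blocked, otherwise 1.0."""
    return 0.3 if occluded else 1.0


def coherence_coefficient(angle_deg):
    """1.0 up to 30 degrees, then 0.1 less per additional 10 degrees."""
    if angle_deg <= 30.0:
        return 1.0
    # Assumption: linear decrease, clamped so it never goes below 0
    return max(0.0, 1.0 - 0.1 * (angle_deg - 30.0) / 10.0)


def transfer_probability(occluded, angle_deg):
    """Transfer probability = 0.6 * occlusion + 0.4 * coherence."""
    return 0.6 * occlusion_coefficient(occluded) + 0.4 * coherence_coefficient(angle_deg)


def diffuse(weights, transfer, iterations=5):
    """updated = current * 0.7 + sum(other weights * transfer probability) * 0.3."""
    current = list(weights)
    count = len(current)
    for _ in range(iterations):
        current = [
            0.7 * current[i]
            + 0.3 * sum(current[j] * transfer[i][j] for j in range(count) if j != i)
            for i in range(count)
        ]
    return current


# Two hypothetical regions that see each other unoccluded at a small angle
p = transfer_probability(False, 20.0)
matrix = [[0.0, p], [p, 0.0]]
one_step = diffuse([0.5755, 0.4], matrix, iterations=1)
print([round(w, 5) for w in one_step])
```

Because the update adds incoming weight without renormalizing, weights can exceed 1 after several iterations when transfer probabilities are high. A practical implementation would probably rescale them, but that step is not part of the described method.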

[0029] Furthermore, step S4 includes the following steps: Obtain a candidate library of rendering elements for the target 3D spatial scene, where each rendering element is associated with a predefined geometric complexity descriptor, material texture identifier, and its potential binding space range in the scene; perform a spatial intersection test between the multi-level scene attention weight distribution map and the potential binding space range of each element in the candidate rendering element library; if they intersect, calculate the average scene attention weight of the element based on the pixel values of the multi-level scene attention weight distribution map covered by the range. Based on the structured spatial scene description map, the cosine similarity between the geometric and material properties of the rendered element and the overall semantic features of the candidate semantic focus sub-region to which its bound spatial range belongs is calculated as a measure of spatial semantic relevance. Based on the dynamic interaction influence intensity field, the frequency and type diversity of historical interaction events in the bound spatial range of the element are extracted, and the interaction necessity measure that supports future potential interactions is calculated through a trained interaction necessity prediction model. The average scene attention weight, spatial semantic relevance metric, and interaction necessity metric of each rendered element are input into a preset filtering decision function. This function outputs a binary decision flag, and elements flagged as true are retained. All retained rendered elements and their associated geometry and material data are packaged to generate a preliminary set of rendered elements adapted to the scene.

[0030] In this embodiment of the invention, a candidate library of rendering elements for the target 3D spatial scene is obtained. The candidate library contains three types of rendering elements: geometric models, material textures, and lighting effects. Each rendering element is associated with a predefined geometric complexity descriptor, a material texture identifier, and its potential binding space range in the scene. The geometric complexity descriptor is measured by the number of triangle faces, the material texture identifier corresponds to a unique set of texture parameters, and the potential binding space range is the range of 3D spatial coordinates that the element can adapt to. A spatial intersection test is performed between the multi-level scene attention weight distribution map and the potential binding space range of each element in the candidate rendering element library. The intersection volume of the two is calculated using a spatial intersection algorithm. If the intersection volume is greater than 30% of the volume of the element's potential binding space range, the two are determined to intersect. The average scene attention weight of the element is calculated based on the pixel values of the multi-level scene attention weight distribution map covered by the range. The calculation method is to divide the sum of the weight values of all pixels within the coverage area by the total number of pixels. For example, if the potential binding space range of a rendering element covers 800 pixels in the weight distribution map and the sum of their weight values is 440, its average scene attention weight = 440 / 800 = 0.55. Based on the structured spatial scene description map, the geometric and material attribute feature vectors of the rendered elements, as well as the overall semantic feature vectors of the candidate semantic focus sub-regions to which their bound spatial ranges belong, are extracted.
The cosine similarity between the two is calculated as the measure of spatial semantic relevance. The similarity value ranges from 0 to 1 when the feature components are non-negative, and the formula is: similarity = dot product of the two feature vectors / (product of the magnitudes of the two feature vectors). For example, if the dot product of the element's feature vector and the sub-region's feature vector is 120, the magnitude of the element's feature vector is 150, and the magnitude of the sub-region's feature vector is 100, its spatial semantic relevance measure = 120 / (150 × 100) = 120 / 15000 = 0.008. Based on the dynamic interaction influence intensity field, the frequency and type diversity of historical interaction events within the spatial range bound to the element are extracted. The frequency of historical interaction events = total number of historical interaction events in the region / total duration (unit: times / hour), and the type diversity = number of interaction event types in the region / total number of interaction event types. The interaction necessity metric supporting future potential interactions is calculated using a trained interaction necessity prediction model. The model input is frequency and diversity, and the output is a metric value between 0 and 1. The calculation method is: interaction necessity metric = 0.6 × frequency + 0.4 × diversity. For example, if the frequency of historical interaction events in the region bound to an element is 0.8 times / hour and the type diversity is 0.7, its interaction necessity metric = 0.6 × 0.8 + 0.4 × 0.7 = 0.48 + 0.28 = 0.76. The average scene attention weight, spatial semantic relevance metric, and interaction necessity metric of each rendered element are input into a preset filtering decision function. The function expression is: Decision value = 0.4 × Average scene attention weight + 0.3 × Spatial semantic relevance metric + 0.3 × Interaction necessity metric.
If the decision value ≥ 0.4, the binary decision is marked as true and the element is retained; if the decision value < 0.4, it is marked as false and the element is removed. For example, if a rendered element has an average scene attention weight of 0.55, a spatial semantic relevance metric of 0.008, and an interaction necessity metric of 0.76, its decision value = 0.4 × 0.55 + 0.3 × 0.008 + 0.3 × 0.76 = 0.22 + 0.0024 + 0.228 = 0.4504 ≥ 0.4, and it is marked as true. All retained rendering elements and their associated geometry and material data are packaged together. The geometry data includes voxel mesh coordinates and topological relationships, and the material data includes parameters such as reflectivity and texture details. This generates a preliminary set of rendering elements adapted to the scene, providing a foundation for subsequent rendering optimization.
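The screening arithmetic of this embodiment can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the patented implementation: the function names are hypothetical, the zero-vector guard in the cosine similarity is an added assumption, and the interaction necessity model is represented only by the linear formula quoted in the text.

```python
import math


def average_attention_weight(pixel_weights):
    """Sum of covered pixel weights divided by the number of covered pixels."""
    return sum(pixel_weights) / len(pixel_weights)


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0


def interaction_necessity(frequency, diversity):
    """Linear form quoted in the text: 0.6 * frequency + 0.4 * diversity."""
    return 0.6 * frequency + 0.4 * diversity


def decision_value(attention, relevance, necessity):
    """0.4 * attention + 0.3 * relevance + 0.3 * necessity."""
    return 0.4 * attention + 0.3 * relevance + 0.3 * necessity


def keep_element(attention, relevance, necessity, threshold=0.4):
    """Binary decision flag: True means the element is retained."""
    return decision_value(attention, relevance, necessity) >= threshold


# Worked example from the text: 0.55, 0.008 and 0.76 give 0.4504, so the element is kept.
example = decision_value(0.55, 0.008, interaction_necessity(0.8, 0.7))
```

Because the relevance term contributes only 0.0024 in the worked example, the retention decision there is driven almost entirely by the attention weight and the interaction necessity metric.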

[0031] Furthermore, the generation of the spatial scene adaptation rendering element sequence in step S5 includes the following steps: The system obtains real-time rendering resource constraints from the graphics processor's real-time monitoring interface, including the available video memory budget for the current frame, the shader core computing power margin, and the maximum number of triangles per frame required by the target frame rate. For each rendering element in the initial rendering element set, the system estimates the video memory usage and computing overhead required for rendering based on its geometric complexity descriptor, and calculates its initial rendering priority score by combining the final weight of its candidate semantic focus sub-region in the multi-level scene attention weight distribution map. Based on the memory budget and computing power margin in the real-time rendering resource constraints, a dynamic resource allocation optimization model is constructed. This model aims to maximize the sum of the weighted priority scores of all selected elements, with the constraints that the total memory usage, total computing overhead, and total number of triangles do not exceed a certain limit. The model performs a 0-1 integer programming solution on the initial set of rendering elements. Elements with a value of 1 in the optimization model solution are included in the candidate sequence and sorted in descending order according to their initial rendering priority scores. The system detects whether there are spatially adjacent and semantically highly similar redundant element pairs in the candidate sequence. If so, it calculates the merging benefit based on their spatial distance and material difference, and performs geometric instantiation merging optimization on element pairs with merging benefits higher than the threshold, updating their rendering cost estimates. 
The candidate sequence optimized by redundancy merging is defined as the spatial scene-adaptive rendering element sequence.

[0032] In this embodiment of the invention, real-time rendering resource constraints are obtained through the graphics processor's real-time monitoring interface. The monitoring frequency is set to once per frame. The obtained constraints include the available video memory budget for the current frame, the shader core computing power margin, and the maximum number of triangles per frame required by the target frame rate. The available video memory budget is set to 8GB, the shader core computing power margin is set to 40%, the target frame rate is 60 frames per second, and the maximum number of triangles per frame is limited to 1 million. For each rendering element in the initial rendering element set, its required video memory usage and computational overhead are estimated based on its geometric complexity descriptor. Video memory usage is calculated based on the sum of the number of bytes of geometric data and the number of bytes of texture data. Computational overhead is calculated based on the product of the number of triangle faces and the number of material samplings. Combined with the final weight of its candidate semantic focus sub-region in the multi-level scene attention weight distribution map, its initial rendering priority score is calculated. The priority score formula is: Priority Score = 0.4 × (1 - Video Memory Usage / Video Memory Budget) + 0.3 × (1 - computational overhead / computational margin) + 0.3 × final weight of the region. For example, if a rendering element has a video memory usage of 0.8GB, a computational overhead of 15%, and a final weight of 0.826 for its region, its initial rendering priority score = 0.4 × (1 - 0.8 / 8) + 0.3 × (1 - 15% / 40%) + 0.3 × 0.826 = 0.4 × 0.9 + 0.3 × 0.625 + 0.3 × 0.826 = 0.36 + 0.1875 + 0.2478 ≈ 0.7953. Based on the memory budget and computational margin constraints in real-time rendering resources, a dynamic resource allocation optimization model is constructed. 
This model aims to maximize the sum of weighted priority scores for all selected elements. Constraints include: total memory usage ≤ 8GB, total computational overhead ≤ 40%, and total number of triangles ≤ 1 million. A 0-1 integer programming problem is solved on the initial set of rendering elements. Each element corresponds to a decision variable, with a value of 1 indicating selection and a value of 0 indicating non-selection. During the solution process, iterative calculations are used to select the element combination that satisfies all constraints and has the largest sum of priorities. Elements with a value of 1 in the optimization model's solution are included in the candidate sequence and sorted in descending order according to their initial rendering priority scores. For example, if the candidate sequence contains 5 elements with priority scores of 0.7953, 0.7621, 0.6894, 0.5512, and 0.4876, the sorted order is 0.7953, 0.7621, 0.6894, 0.5512, and 0.4876. The system detects whether there are spatially adjacent and semantically highly similar redundant element pairs in the candidate sequence. This is achieved through spatial distance calculation and semantic similarity calculation. Elements with a spatial distance ≤ 200 mm and a semantic similarity ≥ 0.8 are considered redundant. If such pairs exist, the merging benefit is calculated based on their spatial distance and material difference. The merging benefit formula is: Merging Benefit = 0.6 × (1 - Spatial Distance / 200) + 0.4 × (1 - Material Difference). Material difference is calculated from the difference in the standard deviation of material reflectivity between the two elements; the larger that difference, the greater the material difference. The merging benefit threshold is set to 0.7.
For element pairs with a merging benefit higher than the threshold, geometric instantiation merging optimization is performed, merging the geometric models of the two elements into one instantiated model, sharing material parameters, and updating its rendering cost estimate. After merging, memory usage is reduced by 20%, and computational cost is reduced by 15%. The candidate sequence optimized by redundancy merging is defined as a spatial scene-adaptive rendering element sequence, ensuring that the sequence meets rendering resource constraints and has no redundancy.
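The priority scoring, 0-1 selection, and merging benefit described above can be sketched in Python as follows. This is an illustrative sketch only. The `Element` fields and function names are hypothetical, and the 0-1 program is solved by exhaustive enumeration, which is an assumption made for brevity and is practical only for a handful of elements; the text does not name a particular solver.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Element:
    name: str
    vram_gb: float        # estimated video memory usage
    compute_pct: float    # estimated computational overhead, in percent
    triangles: int
    region_weight: float  # final attention weight of the element's sub-region


def priority_score(e, vram_budget_gb=8.0, compute_margin_pct=40.0):
    """0.4 * (1 - vram/budget) + 0.3 * (1 - compute/margin) + 0.3 * weight."""
    return (0.4 * (1 - e.vram_gb / vram_budget_gb)
            + 0.3 * (1 - e.compute_pct / compute_margin_pct)
            + 0.3 * e.region_weight)


def select_elements(elements, vram_budget_gb=8.0, compute_margin_pct=40.0,
                    max_triangles=1_000_000):
    """Brute-force 0-1 selection maximizing the summed priority score."""
    def score(e):
        return priority_score(e, vram_budget_gb, compute_margin_pct)

    best, best_total = [], float("-inf")
    for mask in product((0, 1), repeat=len(elements)):
        chosen = [e for e, bit in zip(elements, mask) if bit]
        if (sum(e.vram_gb for e in chosen) > vram_budget_gb
                or sum(e.compute_pct for e in chosen) > compute_margin_pct
                or sum(e.triangles for e in chosen) > max_triangles):
            continue  # violates at least one resource constraint
        total = sum(score(e) for e in chosen)
        if total > best_total:
            best, best_total = chosen, total
    # Candidate sequence: selected elements in descending priority order.
    return sorted(best, key=score, reverse=True)


def merge_benefit(distance_mm, material_difference):
    """0.6 * (1 - distance / 200) + 0.4 * (1 - material difference)."""
    return 0.6 * (1 - distance_mm / 200.0) + 0.4 * (1 - material_difference)


# Worked example from the text: 0.8 GB, 15 %, region weight 0.826.
example_score = priority_score(Element("example", 0.8, 15.0, 1000, 0.826))
```

For realistic element counts, a dedicated integer programming solver or a knapsack heuristic would replace the enumeration; that choice is outside what the text specifies.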

[0033] Furthermore, step S5, which involves driving the graphics processor to execute differentiated rendering instructions based on the sequence, includes: The final spatial scene adaptation rendering element sequence is analyzed, and each element in the sequence is divided into multiple rendering levels according to its initial rendering priority score. Different rendering quality preset schemes are assigned to each level. Based on the rendering quality preset schemes corresponding to the rendering level, a differentiated rendering resource description block is generated for each rendering element, which includes vertex buffer positioning information, texture sampling parameters and shader constant data. A frame-by-frame rendering task queue is constructed, and elements with high priority rendering levels are assigned to keyframes for full-quality rendering. Elements with medium and low priority rendering levels are dynamically assigned to subsequent non-keyframes or progressively enhanced rendering based on historical frames according to their spatial location and motion prediction. The rendering resource description block and the frame-by-frame rendering task queue are compiled into a list of instructions executable by the graphics processor and distributed to different shader core groups of the graphics processor through the compute shader. The system monitors the execution status of the graphics processor's rendering pipeline. When a significant change in the constraints of real-time rendering resources is detected, the system triggers the dynamic sorting and secondary filtering steps for rendering priorities to dynamically update the spatial scene adaptation rendering element sequence and execute the corresponding differentiated rendering instructions.

[0034] In this embodiment of the invention, by parsing the final spatial scene adaptation rendering element sequence, the elements in the sequence are divided into three rendering levels according to their rendering priority score: a priority score ≥ 0.7 is high priority, 0.5 ≤ priority score < 0.7 is medium priority, and a priority score < 0.5 is low priority. Different rendering quality preset schemes are assigned to each level. The high priority level uses the highest rendering quality, with a texture resolution of 4096×4096 pixels, an anti-aliasing level of 8x, and the highest shadow quality; the medium priority level uses medium rendering quality, with a texture resolution of 2048×2048 pixels, an anti-aliasing level of 4x, and medium shadow quality; and the low priority level uses basic rendering quality, with a texture resolution of 1024×1024 pixels, an anti-aliasing level of 2x, and basic shadow quality. Based on the rendering quality preset scheme corresponding to the rendering level, a differentiated rendering resource description block is generated for each rendering element. This block includes vertex buffer positioning information, texture sampling parameters, and shader constant data. The vertex buffer positioning information specifies the storage address of the geometric data in video memory, the texture sampling parameters specify the texture sampling method and filtering mode, and the shader constant data specifies rendering parameters such as lighting and materials. A frame-based rendering task queue is constructed, and high-priority rendering level elements are assigned to keyframes for full-quality rendering, with a keyframe interval of 3 frames. Medium and low-priority rendering level elements are dynamically assigned to subsequent non-keyframes based on their spatial position and motion prediction. 
Medium-priority elements are assigned to non-keyframes with an interval of 1 frame, and low-priority elements are assigned to non-keyframes with an interval of 2 frames. Some low-priority elements use progressive enhancement rendering based on historical frames, supplementing the details of the current frame with historical frame rendering data to reduce real-time computation. The rendering resource description block and the frame-by-frame rendering task queue are compiled into a list of instructions executable by the graphics processing unit (GPU). During compilation, the instructions are optimized to eliminate redundant instructions. These instructions are then distributed to different shader core groups of the GPU through the compute shaders. High-priority element instructions are distributed to higher-performance core groups, while medium- and low-priority element instructions are distributed to ordinary core groups, achieving load balancing. The execution status of the rendering pipeline is monitored through the GPU's real-time monitoring interface. Real-time rendering resource constraints are checked every frame. When a change in available video memory budget exceeds 1GB, a change in shader core compute margin exceeds 10%, or a fluctuation in the target frame rate exceeds 5 frames per second, it is determined that the constraints have changed significantly. This triggers a dynamic sorting and secondary filtering process for rendering priorities. The priorities of each element are recalculated, elements that meet the new constraints are selected, the spatial scene adaptation rendering element sequence is dynamically updated, and corresponding differentiated rendering instructions are executed to ensure a balance between rendering smoothness and rendering quality.
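The level assignment and the re-screening trigger described above can be expressed compactly. The sketch below is a hedged Python illustration: the dictionary keys, preset labels, and function names are hypothetical, while the thresholds and preset values are the ones stated in this embodiment.

```python
# Quality presets per rendering level, using the values from this embodiment.
QUALITY_PRESETS = {
    "high": {"texture_px": 4096, "anti_aliasing": 8, "shadow": "highest"},
    "medium": {"texture_px": 2048, "anti_aliasing": 4, "shadow": "medium"},
    "low": {"texture_px": 1024, "anti_aliasing": 2, "shadow": "basic"},
}


def render_level(priority_score):
    """Score >= 0.7 is high, 0.5 <= score < 0.7 is medium, below 0.5 is low."""
    if priority_score >= 0.7:
        return "high"
    if priority_score >= 0.5:
        return "medium"
    return "low"


def constraints_changed(previous, current):
    """True when any monitored constraint moved past its stated threshold.

    Thresholds: more than 1 GB of video memory budget, more than 10
    percentage points of shader compute margin, or more than 5 frames
    per second of target frame rate.
    """
    return (abs(current["vram_gb"] - previous["vram_gb"]) > 1.0
            or abs(current["compute_margin_pct"]
                   - previous["compute_margin_pct"]) > 10.0
            or abs(current["fps"] - previous["fps"]) > 5.0)


# The five example scores from the candidate sequence map to these levels.
levels = [render_level(s) for s in (0.7953, 0.7621, 0.6894, 0.5512, 0.4876)]
```

When `constraints_changed` returns true, the priority sorting and secondary filtering would be rerun against the new constraint values.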

[0035] Furthermore, constructing the frame-by-frame rendering task queue includes the following steps: Define the scene region within the current user's view frustum as the critical rendering region. Select elements with high rendering levels that are located in the critical rendering region from the spatial scene adaptation rendering element sequence and mark them as the keyframe forced rendering element set. For the remaining elements in the spatial scene adaptation rendering element sequence, predict their probability and time of entering the critical rendering region in subsequent consecutive frames using a Kalman filter based on the movement speed of the center point of their bound spatial range and the user's view movement direction. Elements whose predicted probability of entering the critical rendering zone within the next N frames is higher than the first threshold are dynamically allocated to the frame in which they are expected to enter the critical rendering zone as the key frame forced rendering element of that frame and reserved rendering resources are used. Elements that are predicted not to enter the critical rendering area and have a low rendering level within the next M frames are marked as progressive enhancement rendering candidate sets, and a series of multi-resolution level detail models from low to high are generated for the elements in this set. In non-key frames, for elements in the progressive enhancement rendering candidate set, a suitable version is selected from the corresponding multi-resolution hierarchical detail model for rendering based on their dynamic distance from the viewpoint and the remaining rendering resources of the current frame. The rendering results can be temporally mixed with the rendering results of historical frames to construct a frame-by-frame rendering task queue.

[0036] In this embodiment of the invention, the scene region within the current user's view frustum is defined as the critical rendering region using a view frustum clipping algorithm. The horizontal field of view of the view frustum is set to 60°, the vertical field of view to 45°, the near clipping plane distance to 500 mm, and the far clipping plane distance to 5000 mm. Elements with high rendering levels and spatial locations within the critical rendering region are selected from the spatial scene adaptation rendering element sequence. This selection is achieved through spatial coordinate comparison. If the center point coordinates of the element's bound spatial range are within the view frustum, it is determined to be located within the critical rendering region and marked as a keyframe forced rendering element set. For the remaining elements in the spatial scene adaptation rendering element sequence, the motion velocity of the center point of each element's bound spatial range is obtained using a motion trajectory tracking algorithm. The velocity unit is mm / frame. Simultaneously, the unit vector of the user's viewpoint movement direction is obtained. A Kalman filter is used to predict the probability and time of entering the critical rendering region in subsequent consecutive frames. The state equation of the Kalman filter is set as position = previous frame position + velocity × time, and the observation equation is set as observed position = actual position + observation error. The observation error is set to 5 mm. During the prediction process, the predicted value is iteratively corrected to reduce the error. With N set to 5 frames and the first threshold set to 0.7, elements predicted to have a probability higher than 0.7 of entering the critical rendering zone within the next 5 frames are dynamically allocated to the frame in which they are expected to enter the critical rendering zone. 
In those frames, the elements are treated as keyframe forced rendering elements, and rendering resources are reserved for them. The reserved resource ratio is set to 20% of the total rendering resources for that frame, ensuring that elements can be rendered in full quality immediately upon entering the critical rendering zone. For example, if an element is predicted to have a probability of 0.85 of entering the critical rendering zone in the 3rd frame, which is higher than 0.7, it is allocated to the 3rd frame, with 20% of the video memory and computing resources reserved for that frame. With M set to 8 frames, elements that are predicted not to enter the critical rendering area within the next 8 frames and that have a low rendering level are marked as the progressive enhancement rendering candidate set. A series of multi-resolution level detail models, from low to high, are generated for these elements using a multi-resolution modeling algorithm. Each element corresponds to four levels: Level 1 has 100 triangle faces, Level 2 has 500, Level 3 has 1000, and Level 4 has 2000. Higher levels offer richer model details but also higher memory usage and computational overhead. In non-keyframes, for elements in the progressive enhancement rendering candidate set, a distance calculation algorithm is used to obtain their dynamic distance from the viewpoint in real time. This distance is combined with the remaining rendering resources for the current frame (obtained through the graphics processor's real-time monitoring interface), and an appropriate version is selected from the corresponding multi-resolution level detail models for rendering. Simultaneously, a temporal blending algorithm is used to fuse the current frame's rendering result with the rendering results of historical frames, with a blending weight of 0.6 for the current frame and 0.4 for historical frames, reducing rendering stutters.
Through these operations, a frame-by-frame rendering task queue is constructed to achieve reasonable allocation of rendering resources.
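Parts of the queue construction can be sketched in Python. The sketch below covers only the view-frustum containment test, a constant-velocity extrapolation that follows the stated state equation (position = previous position + velocity × time), and the 0.6 / 0.4 temporal blend. It is deliberately not a full Kalman filter: there is no covariance update, no observation correction, and no entry probability. It also assumes a static camera at the origin looking along +z, and all names are hypothetical.

```python
import math

H_FOV_DEG, V_FOV_DEG = 60.0, 45.0   # horizontal and vertical field of view
NEAR_MM, FAR_MM = 500.0, 5000.0     # near and far clipping plane distances


def in_frustum(point):
    """True if a camera-space point (x, y, z) in mm lies inside the frustum.

    Assumption: camera at the origin, looking along the +z axis.
    """
    x, y, z = point
    if not (NEAR_MM <= z <= FAR_MM):
        return False
    half_width = z * math.tan(math.radians(H_FOV_DEG / 2))
    half_height = z * math.tan(math.radians(V_FOV_DEG / 2))
    return abs(x) <= half_width and abs(y) <= half_height


def predict_entry_frame(position, velocity, n_frames=5):
    """First future frame (1..n_frames) whose extrapolated center is inside.

    velocity is in mm per frame; returns None if no entry is predicted.
    """
    for k in range(1, n_frames + 1):
        predicted = tuple(p + v * k for p, v in zip(position, velocity))
        if in_frustum(predicted):
            return k
    return None


def temporal_blend(current, history, current_weight=0.6):
    """Blend a current-frame value with a historical one (0.6 / 0.4 default)."""
    return current_weight * current + (1.0 - current_weight) * history
```

In a fuller treatment, the extrapolated position would come from the Kalman prediction step, and the entry probability would be derived from the predicted position uncertainty before comparison with the 0.7 threshold.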

[0037] Furthermore, in non-keyframes, the step of selecting a suitable version from the corresponding multi-resolution level detail model for rendering elements in the progressive enhancement rendering candidate set based on their dynamic distance from the viewpoint and the remaining rendering resources of the current frame includes the following steps: The system calculates the Euclidean distance from the center point of the bound space of each element in the progressive enhancement rendering candidate set to the current user viewpoint in real time, which is used as its dynamic view distance; it calculates the remaining triangle budget and video memory budget that can be used for progressive enhancement rendering based on the real-time rendering resource constraints of the current non-keyframe; it predefines the ideal rendering level related to the dynamic view distance for the multi-resolution hierarchical detail model of each element, as well as the number of triangles and video memory overhead required to achieve the level. With the goal of maximizing the overall visual quality perception score, a resource allocation optimization problem is constructed under the constraints of the remaining triangle budget and the video memory budget. For each element, an actual rendering level detail model version is selected, which is allowed to be lower than but whose quality is not lower than its ideal rendering level. Based on the optimization problem solution, an appropriate hierarchical detail model version is assigned to each element, corresponding drawing call commands are generated, and these commands are inserted into the rendering command stream of non-keyframes for execution and rendering to generate the corresponding rendering results.

[0038] In this embodiment of the invention, the Euclidean distance from the center point of the bound spatial range of each element in the progressive enhancement rendering candidate set to the current user's viewpoint is calculated in real time and used as its dynamic view distance. The calculation formula is: Dynamic view distance = √[(x1 - x2)² + (y1 - y2)² + (z1 - z2)²], where (x1, y1, z1) are the coordinates of the element's center point and (x2, y2, z2) are the coordinates of the user's viewpoint. For example, if the center point coordinates of an element are (1000, 800, 1200) and the user's viewpoint coordinates are (500, 500, 1000), its dynamic view distance = √(500² + 300² + 200²) = √380000 ≈ 616.44 mm. Based on the current real-time rendering resource constraints of non-keyframes, the remaining triangle budget and video memory budget available for progressive enhancement rendering are calculated using a resource statistics algorithm. The remaining triangle budget = maximum number of triangles per frame - number of triangles occupied by keyframe forced rendering elements. The remaining video memory budget = available video memory budget - video memory occupied by keyframe forced rendering elements. For example, if the maximum number of triangles per frame is limited to 1 million, and keyframe forced rendering elements occupy 300,000, the remaining triangle budget is 700,000; if the available video memory budget is 8GB, and keyframe forced rendering elements occupy 2GB, the remaining video memory budget is 6GB. For each element's multi-resolution hierarchical detail model, an ideal rendering level related to dynamic view distance is predefined. The ideal level is 4 when the dynamic view distance is ≤1000 mm, 3 when the dynamic view distance is ≤2000 mm, 2 when the dynamic view distance is ≤3000 mm, and 1 when the dynamic view distance is >3000 mm. The number of triangles and video memory required for each level are also specified.
Level 4 requires 2000 triangles and 0.1 GB of video memory, Level 3 requires 1000 triangles and 0.05 GB of video memory, Level 2 requires 500 triangles and 0.02 GB of video memory, and Level 1 requires 100 triangles and 0.01 GB of video memory. The goal is to maximize the overall visual quality perception score, calculated as Σ[element visual quality score × element weight]. The element visual quality score is determined based on the hierarchical model level: level 4 scores 1.0, level 3 scores 0.8, level 2 scores 0.6, and level 1 scores 0.4. The element weight is its initial rendering priority score. Under constraints of remaining triangle budget and video memory budget, a resource allocation optimization problem is constructed. For each element, an actual rendering hierarchical detail model version is selected, which can be lower than, but whose quality is not lower than, its ideal rendering level. An iterative algorithm solves the optimization problem. Each iteration selects the element model version with the highest visual quality perception score that does not exceed the resource constraints. For example, if an element has a dynamic view distance of 616.44 mm, an ideal level of 4, and requires 2000 triangles and 0.1 GB of video memory, level 4 is chosen if there are sufficient remaining resources; otherwise, level 3 can be chosen as the highest level that still fits the remaining budget. Based on the optimization problem solution, an appropriate hierarchical detail model version is assigned to each element, and corresponding drawing call commands are generated. The model data address, rendering parameters and other information are specified. These commands are inserted into the rendering command stream of non-keyframes for execution and rendering, generating corresponding rendering results, ensuring a balance between rendering quality and rendering smoothness in non-keyframes.
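The level selection in this embodiment can be illustrated with a greedy Python sketch. It rests on several assumptions that the text does not fix: elements are processed in descending weight order, an element never receives a level above its ideal level, it falls back to the next lower level when the remaining budgets are insufficient, and it is left unassigned (None) when even level 1 does not fit. The names are hypothetical, and a greedy pass is not guaranteed to reach the optimum of the stated optimization problem.

```python
import math

# level -> (triangles, video memory in GB, visual quality score)
LOD_TABLE = {
    4: (2000, 0.10, 1.0),
    3: (1000, 0.05, 0.8),
    2: (500, 0.02, 0.6),
    1: (100, 0.01, 0.4),
}


def dynamic_view_distance(center, viewpoint):
    """Euclidean distance in mm between element center and user viewpoint."""
    return math.dist(center, viewpoint)


def ideal_level(distance_mm):
    """Ideal level by distance: <=1000 is 4, <=2000 is 3, <=3000 is 2, else 1."""
    if distance_mm <= 1000:
        return 4
    if distance_mm <= 2000:
        return 3
    if distance_mm <= 3000:
        return 2
    return 1


def assign_levels(elements, triangle_budget, vram_budget_gb):
    """Greedy assignment; elements are (name, distance_mm, weight) tuples."""
    assignment = {}
    for name, distance_mm, _weight in sorted(
            elements, key=lambda e: e[2], reverse=True):
        assignment[name] = None  # stays None if no level fits
        for level in range(ideal_level(distance_mm), 0, -1):
            triangles, vram_gb, _quality = LOD_TABLE[level]
            if triangles <= triangle_budget and vram_gb <= vram_budget_gb:
                assignment[name] = level
                triangle_budget -= triangles
                vram_budget_gb -= vram_gb
                break
    return assignment


def overall_quality(assignment, weights):
    """Sum of (level quality score * element weight) over assigned elements."""
    return sum(LOD_TABLE[level][2] * weights[name]
               for name, level in assignment.items() if level is not None)


# Worked example from the text: distance between the two coordinate triples.
example_distance = dynamic_view_distance((1000, 800, 1200), (500, 500, 1000))
```

With a remaining budget of 2500 triangles, for instance, a first element at 616.44 mm would take level 4 and a second nearby element would drop to level 2, since only 500 triangles remain.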

[0039] Furthermore, the present invention also provides a rendering element selection system based on spatial scene adaptation, including a processor, a memory, and a computer program stored in the memory and executable on the processor, for performing the rendering element selection method based on spatial scene adaptation as described above.

[0040] The above description is merely a specific embodiment of the present invention, enabling those skilled in the art to understand or implement the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features of the invention herein.

Claims

1. A method for selecting a rendering element based on spatial scene adaptation, characterized in that, The method comprises the following steps: Step S1: Obtain multi-dimensional spatial perception data corresponding to the target three-dimensional space scene, and perform entity and virtual structure analysis on the space scene based on the multi-dimensional spatial perception data to generate a structured space scene description graph corresponding to the target three-dimensional space scene; Step S2: Extract the distribution of dynamic interaction event trigger points in the scene based on the structured space scene description graph, and perform spatio-temporal correlation mapping on the user historical interaction trajectory data in the scene according to the distribution of dynamic interaction event trigger points to generate a dynamic interaction influence intensity field corresponding to the target three-dimensional space scene; Step S3: Perform scene semantic focus area identification on the structured space scene description graph to obtain a plurality of candidate semantic focus sub-areas; perform scene attention weight calculation on each candidate semantic focus sub-area based on the dynamic interaction influence intensity field to generate a multi-level scene attention weight distribution graph corresponding to the target three-dimensional space scene; Step S4: Perform adaptive screening on the rendering element candidate library in the target three-dimensional space scene based on the multi-level scene attention weight distribution graph, calculate the spatial semantic relevance and interaction necessity measure of each rendering element and the corresponding candidate semantic focus sub-area, and generate a preliminary rendering element set adapted to the scene; Step S5: Obtain real-time rendering resource constraints, and perform rendering priority dynamic sorting and secondary screening on the preliminary rendering element set based on the 
real-time rendering resource constraints and the multi-level scene attention weight distribution graph to generate a space scene adaptation rendering element sequence, and drive the graphics processor to execute differential rendering instructions according to the sequence.

2. The method of claim 1, wherein, Step S1 comprises the following steps: Capture the target three-dimensional space scene by three-dimensional laser scanning and multi-view image fusion technology, and obtain original multi-dimensional spatial perception data including geometric point cloud, material texture reflectivity and environmental illumination gradient; Perform topological consistency reconstruction and voxelization processing on the geometric point cloud in the original multi-dimensional spatial perception data to generate a dense geometric voxel grid model corresponding to the target three-dimensional space scene, and extract high-frequency detail features and low-frequency base color distribution from the material texture reflectivity data; calculate the light propagation attenuation and indirect illumination contribution of each position in the scene based on the environmental illumination gradient data to generate a global illumination propagation path estimation graph; Fuse the dense geometric voxel grid model, high-frequency detail features and low-frequency base color distribution, and global illumination propagation path estimation graph to associate and bind the geometric boundary and material attribute of the entity object, and identify and mark non-entity illumination area, shadow area and reflection virtual imaging area as virtual structure; According to the association and binding result of the entity object and the marked virtual structure, construct a hierarchical graph structure containing geometry, material, illumination and virtual-real relationship to generate a structured space scene description graph corresponding to the target three-dimensional space scene.

3. The method of claim 2, wherein Step S2 comprises the following steps:
identify all interactive entities and triggerable virtual effect areas from the structured spatial scene description map, and define their spatial coordinate set as the distribution of dynamic interaction event trigger points; obtain the user historical interaction trajectory data recorded in historical sessions of the target 3D spatial scene, which includes the user's view movement path, object click sequence and dwell timestamps;
construct a spatiotemporal decay kernel function centered on each point in the distribution of dynamic interaction event trigger points, and map each interaction event in the user historical interaction trajectory data to the corresponding decay kernel function according to its spatiotemporal coordinates; calculate, for each dynamic interaction event trigger point, the cumulative intensity value contributed by all of its related historical interaction events, the cumulative intensity value being jointly determined by the interaction event type weight, the dwell time weight, and the spatiotemporal distance from the trigger point;
spatially interpolate, smooth and normalize the cumulative intensity values of all dynamic interaction event trigger points to generate a continuous scalar field covering the entire target 3D spatial scene, namely the dynamic interaction influence intensity field.
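The cumulative intensity calculation of claim 3 can be sketched as follows. The claim does not fix the kernel shape, so a Gaussian spatial term multiplied by an exponential temporal term is assumed here; `sigma`, `tau`, the event dictionary layout and the type weights are all hypothetical, and spatial interpolation and smoothing are omitted.

```python
import math


def cumulative_intensity(trigger, events, type_weights, now, sigma=1.0, tau=60.0):
    """Sum kernel-weighted contributions of historical events at one trigger point."""
    total = 0.0
    for ev in events:
        dist_sq = sum((a - b) ** 2 for a, b in zip(trigger, ev["pos"]))
        spatial = math.exp(-dist_sq / (2.0 * sigma ** 2))   # assumed Gaussian decay
        temporal = math.exp(-(now - ev["t"]) / tau)         # assumed exponential decay
        total += type_weights[ev["type"]] * ev["dwell"] * spatial * temporal
    return total


def normalize(values):
    """Min-max normalize to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical interaction history and trigger points.
events = [
    {"pos": (0.0, 0.0, 0.0), "t": 100.0, "type": "click", "dwell": 2.0},
    {"pos": (3.0, 0.0, 0.0), "t": 40.0, "type": "gaze", "dwell": 5.0},
]
type_weights = {"click": 1.5, "gaze": 0.5}
triggers = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
raw = [cumulative_intensity(t, events, type_weights, now=100.0) for t in triggers]
field_samples = normalize(raw)
```

The normalized samples would then be interpolated over the scene grid to obtain the continuous scalar field.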

4. The method of claim 1, wherein Step S3 comprises the following steps:
perform graph neural network node feature propagation and aggregation on the structured spatial scene description map, and segment the map by unsupervised clustering based on the semantic feature similarity and spatial proximity of the nodes to generate multiple semantically and spatially continuous sub-map clusters, each sub-map cluster being defined as a candidate semantic focus sub-region;
extract the inherent semantic importance scores of all nodes of each candidate semantic focus sub-region in the structured spatial scene description map, each score being calculated from the node type, the node's centrality in the map, and the complexity of its material optical properties;
obtain the dynamic interaction influence intensity field, and extract from the field the average dynamic interaction influence intensity value over the spatial range covered by each candidate semantic focus sub-region;
based on the inherent semantic importance score of each candidate semantic focus sub-region and its corresponding average dynamic interaction influence intensity value, calculate the primary scene attention weight of each candidate semantic focus sub-region through attention fusion;
construct an attention transfer probability matrix between regions according to the spatial occlusion relationships and line-of-sight continuity between candidate semantic focus sub-regions, and use the matrix to iteratively optimize and diffuse the primary scene attention weights, generating the multi-level scene attention weight distribution map, in which the weights are attached to the scene spatial grid at different resolution levels.
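The fusion and diffusion steps of claim 4 could look like the sketch below. It assumes that attention fusion is a convex combination controlled by `alpha` and that diffusion is a damped iteration over a row-stochastic transfer matrix (similar in form to personalized PageRank); neither choice is stated in the claim, and all numbers are hypothetical. Attaching the weights to multi-resolution grids is not shown.

```python
def fuse(semantic, interaction, alpha=0.5):
    """Primary attention weight per sub-region (assumed convex combination)."""
    return [alpha * s + (1.0 - alpha) * i for s, i in zip(semantic, interaction)]


def diffuse(primary, transfer, damping=0.3, iterations=50):
    """Iteratively spread attention between sub-regions.

    transfer[i][j] is the assumed probability that attention moves from
    sub-region i to sub-region j; each row should sum to 1.
    """
    n = len(primary)
    w = list(primary)
    for _ in range(iterations):
        spread = [sum(w[i] * transfer[i][j] for i in range(n)) for j in range(n)]
        w = [(1.0 - damping) * primary[j] + damping * spread[j] for j in range(n)]
    return w


# Hypothetical scores for three candidate semantic focus sub-regions.
semantic = [0.9, 0.4, 0.2]
interaction = [0.2, 0.8, 0.1]
transfer = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
]
primary = fuse(semantic, interaction)
final = diffuse(primary, transfer)
```

With a row-stochastic matrix the total attention mass is preserved by each iteration, so the diffusion only redistributes weight between sub-regions.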

5. The method of claim 1, wherein Step S4 comprises the following steps:
obtain the rendering element candidate library of the target 3D spatial scene, where each rendering element is associated with a predefined geometric complexity descriptor, a material texture identifier, and its potential binding spatial range in the scene; perform a spatial intersection test between the multi-level scene attention weight distribution map and the potential binding spatial range of each element in the rendering element candidate library, and, if they intersect, calculate the average scene attention weight of the element from the pixel values of the multi-level scene attention weight distribution map covered by the range;
based on the structured spatial scene description map, calculate the cosine similarity between the geometric and material properties of the rendering element and the overall semantic features of the candidate semantic focus sub-region to which its binding spatial range belongs, as the spatial semantic relevance metric;
based on the dynamic interaction influence intensity field, extract the frequency and type diversity of historical interaction events within the binding spatial range of the element, and calculate, through a trained interaction necessity prediction model, the interaction necessity metric indicating support for potential future interactions;
input the average scene attention weight, spatial semantic relevance metric and interaction necessity metric of each rendering element into a preset filtering decision function that outputs a binary decision flag, and retain the elements flagged as true;
package all retained rendering elements and their associated geometry and material data to generate the preliminary rendering element set adapted to the scene.
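A minimal sketch of the relevance metric and the filtering decision in claim 5 follows. The cosine similarity is as named in the claim; the decision function is not specified there, so a weighted sum compared against a threshold is assumed, and `weights` and `threshold` are hypothetical. The trained interaction necessity prediction model is replaced by a plain input value.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def keep_element(attention, relevance, necessity,
                 weights=(0.5, 0.3, 0.2), threshold=0.5):
    """Binary decision flag (assumed weighted-sum rule)."""
    score = (weights[0] * attention
             + weights[1] * relevance
             + weights[2] * necessity)
    return score >= threshold


# Hypothetical element features versus sub-region semantic features.
relevance = cosine_similarity([0.8, 0.1, 0.3], [0.7, 0.2, 0.4])
retained = keep_element(attention=0.9, relevance=relevance, necessity=0.7)
```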

6. The method of claim 1, wherein generating the spatial scene adaptation rendering element sequence in Step S5 comprises the following steps:
obtain the real-time rendering resource constraints from the real-time monitoring interface of the graphics processor, including the available video memory budget for the current frame, the shader core computing power margin, and the maximum number of triangles per frame required by the target frame rate;
for each rendering element in the preliminary rendering element set, estimate the video memory usage and computing overhead required for rendering based on its geometric complexity descriptor, and calculate its initial rendering priority score by combining the final weight of its candidate semantic focus sub-region in the multi-level scene attention weight distribution map;
based on the real-time rendering resource constraints, construct a dynamic resource allocation optimization model whose objective is to maximize the sum of the weighted priority scores of all selected elements, subject to the constraints that the total video memory usage, total computing overhead and total number of triangles do not exceed the corresponding limits, and solve the model as a 0-1 integer program over the preliminary rendering element set;
include the elements whose decision variable equals 1 in the solution in a candidate sequence, sorted in descending order of initial rendering priority score;
detect whether the candidate sequence contains redundant element pairs that are spatially adjacent and semantically highly similar; if so, calculate a merging benefit based on their spatial distance and material difference, perform geometric instancing merge optimization on element pairs whose merging benefit exceeds a threshold, and update their rendering cost estimates;
define the candidate sequence after redundancy merging as the spatial scene adaptation rendering element sequence.
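The 0-1 integer program of claim 6 can be illustrated with an exhaustive search, which is only practical for a handful of elements; a production system would presumably use an integer programming solver or a heuristic. The element fields (`mem`, `compute`, `tris`, `priority`) and the budgets are hypothetical, and redundancy merging is not shown.

```python
from itertools import product


def select_elements(elements, mem_budget, compute_budget, tri_budget):
    """Choose the subset maximizing total priority within three budgets.

    Brute force over all 0-1 assignments: O(2^n), for illustration only.
    """
    best_score, best_pick = -1.0, ()
    for pick in product((0, 1), repeat=len(elements)):
        chosen = [e for e, x in zip(elements, pick) if x]
        if (sum(e["mem"] for e in chosen) > mem_budget
                or sum(e["compute"] for e in chosen) > compute_budget
                or sum(e["tris"] for e in chosen) > tri_budget):
            continue  # violates at least one resource constraint
        score = sum(e["priority"] for e in chosen)
        if score > best_score:
            best_score, best_pick = score, pick
    selected = [e for e, x in zip(elements, best_pick) if x]
    # Candidate sequence: descending initial rendering priority score.
    return sorted(selected, key=lambda e: e["priority"], reverse=True)


# Hypothetical preliminary rendering element set.
elements = [
    {"id": "a", "mem": 4, "compute": 2, "tris": 100, "priority": 0.9},
    {"id": "b", "mem": 3, "compute": 2, "tris": 80, "priority": 0.6},
    {"id": "c", "mem": 3, "compute": 1, "tris": 50, "priority": 0.5},
]
sequence = select_elements(elements, mem_budget=6, compute_budget=4, tri_budget=200)
```

In this example the single highest-priority element `a` is dropped because `b` and `c` together fit the memory budget and yield a larger total score, which is the behavior a greedy sort by priority would miss.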

7. The method of claim 6, wherein driving the graphics processor to execute differentiated rendering instructions according to the sequence in Step S5 comprises the following steps:
analyze the final spatial scene adaptation rendering element sequence, divide the elements in the sequence into multiple rendering levels according to their initial rendering priority scores, and assign a different rendering quality preset scheme to each level;
based on the rendering quality preset scheme corresponding to its rendering level, generate for each rendering element a differentiated rendering resource description block that includes vertex buffer location information, texture sampling parameters and shader constant data;
construct a frame-by-frame rendering task queue, in which elements at high-priority rendering levels are assigned to keyframes for full-quality rendering, and elements at medium- and low-priority rendering levels are dynamically assigned, according to their spatial location and motion prediction, either to subsequent non-keyframes or to progressive enhancement rendering based on historical frames;
compile the rendering resource description blocks and the frame-by-frame rendering task queue into a list of instructions executable by the graphics processor, and distribute the instructions to different shader core groups of the graphics processor through a compute shader;
monitor the execution status of the rendering pipeline of the graphics processor, and, when a significant change in the real-time rendering resource constraints is detected, trigger the rendering priority dynamic sorting and secondary filtering step again to dynamically update the spatial scene adaptation rendering element sequence and execute the corresponding differentiated rendering instructions.
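The level division at the start of claim 7 might be sketched as below. The claim does not say how many levels there are or how they are delimited; three levels with fixed score thresholds are assumed, and the preset fields (`texture_mip_bias`, `shadow_samples`) are invented placeholders rather than parameters named in the source.

```python
def assign_render_levels(sequence, high=0.7, low=0.3):
    """Map (element_id, priority_score) pairs to a rendering level and preset."""
    presets = {  # hypothetical rendering quality preset schemes
        "high": {"texture_mip_bias": 0, "shadow_samples": 16},
        "medium": {"texture_mip_bias": 1, "shadow_samples": 4},
        "low": {"texture_mip_bias": 2, "shadow_samples": 1},
    }
    plan = []
    for element_id, score in sequence:
        if score >= high:
            level = "high"
        elif score >= low:
            level = "medium"
        else:
            level = "low"
        plan.append({"id": element_id, "level": level, "preset": presets[level]})
    return plan


plan = assign_render_levels([("a", 0.9), ("b", 0.5), ("c", 0.1)])
```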

8. The method of claim 7, wherein constructing the frame-by-frame rendering task queue comprises the following steps:
define the scene region within the current user's view frustum as the critical rendering region, select from the spatial scene adaptation rendering element sequence the elements with a high rendering level that are located in the critical rendering region, and mark them as the keyframe forced rendering element set;
for the remaining elements in the spatial scene adaptation rendering element sequence, predict, using a Kalman filter based on the movement speed of the center point of their binding spatial range and the user's view movement direction, the probability and time of their entering the critical rendering region in subsequent consecutive frames;
dynamically allocate each element whose predicted probability of entering the critical rendering region within the next N frames is higher than a first threshold to the frame in which it is expected to enter the critical rendering region, as a keyframe forced rendering element of that frame, with rendering resources reserved for it;
mark the elements that are predicted not to enter the critical rendering region within the next M frames and that have a low rendering level as the progressive enhancement rendering candidate set, and generate for each element in this set a series of multi-resolution level-of-detail models from low to high;
in non-keyframes, for each element in the progressive enhancement rendering candidate set, select a suitable version from the corresponding multi-resolution level-of-detail models for rendering based on its dynamic distance from the viewpoint and the remaining rendering resources of the current frame, the rendering results optionally being temporally blended with the rendering results of historical frames, thereby constructing the frame-by-frame rendering task queue.
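The Kalman-filter prediction in claim 8 is sketched below in a strongly reduced form. It assumes the element's motion relative to the view frustum can be summarized by one scalar, the signed distance to the critical rendering region (entry when the distance reaches zero), tracked with a constant-velocity model. The noise parameters `q` and `r`, the per-frame time step and the measurement series are hypothetical, and process noise is ignored in the look-ahead variance.

```python
import math


def kalman_track(measurements, dt=1.0, q=1e-3, r=0.05):
    """Constant-velocity Kalman filter over noisy distance measurements.

    State is (distance d, velocity v); only d is observed.
    Returns the final state and its 2x2 covariance.
    """
    d, v = measurements[0], 0.0
    p = [[1.0, 0.0], [0.0, 1.0]]
    for z in measurements[1:]:
        # Predict with F = [[1, dt], [0, 1]] and Q = q * I.
        d = d + v * dt
        p00 = p[0][0] + dt * (p[0][1] + p[1][0]) + dt * dt * p[1][1] + q
        p01 = p[0][1] + dt * p[1][1]
        p10 = p[1][0] + dt * p[1][1]
        p11 = p[1][1] + q
        # Update with H = [1, 0].
        s = p00 + r
        k0, k1 = p00 / s, p10 / s
        resid = z - d
        d += k0 * resid
        v += k1 * resid
        p = [[(1.0 - k0) * p00, (1.0 - k0) * p01],
             [p10 - k1 * p00, p11 - k1 * p01]]
    return (d, v), p


def entry_probability(state, p, n_frames, dt=1.0):
    """P(distance <= 0 after n_frames) under the Gaussian state estimate."""
    d, v = state
    h = n_frames * dt
    mean = d + v * h
    var = p[0][0] + 2.0 * h * p[0][1] + h * h * p[1][1]
    std = math.sqrt(max(var, 1e-12))
    return 0.5 * (1.0 + math.erf(-mean / (std * math.sqrt(2.0))))


# Hypothetical element approaching the frustum by about one unit per frame.
state, cov = kalman_track([10.0, 9.0, 8.0, 7.0, 6.0, 5.0])
p_soon = entry_probability(state, cov, n_frames=1)
p_later = entry_probability(state, cov, n_frames=10)
```

An element would be promoted to a keyframe forced rendering element when the probability for the chosen N exceeds the first threshold; the expected entry frame can be approximated by `-d / v` when `v` is negative.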

9. The method of claim 8, wherein, in non-keyframes, selecting for each element in the progressive enhancement rendering candidate set a suitable version from the corresponding multi-resolution level-of-detail models for rendering, based on its dynamic distance from the viewpoint and the remaining rendering resources of the current frame, comprises the following steps:
calculate in real time the Euclidean distance from the center point of the binding spatial range of each element in the progressive enhancement rendering candidate set to the current user viewpoint, as its dynamic view distance; calculate the remaining triangle budget and video memory budget available for progressive enhancement rendering based on the real-time rendering resource constraints of the current non-keyframe; predefine, for the multi-resolution level-of-detail models of each element, the ideal rendering level associated with the dynamic view distance, as well as the number of triangles and the video memory overhead required to reach each level;
construct a resource allocation optimization problem whose objective is to maximize the overall perceived visual quality score under the constraints of the remaining triangle budget and video memory budget, in which an actual level-of-detail model version is selected for each element, the selected version being allowed to be lower than, but not higher than, its ideal rendering level;
based on the solution of the optimization problem, assign the selected level-of-detail model version to each element, generate the corresponding draw call commands, and insert these commands into the rendering command stream of the non-keyframe for execution to produce the corresponding rendering results.
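The level-of-detail allocation in claim 9 can be illustrated as a small multiple-choice selection problem. The source wording on the permitted levels is ambiguous; this sketch assumes each element may be rendered at any level up to, but not above, its ideal level. Exhaustive search is used for clarity only, and the `quality`, `tris` and `mem` values are hypothetical.

```python
from itertools import product


def choose_lods(elements, tri_budget, mem_budget):
    """Pick one LOD index per element, maximizing summed quality within budgets.

    Returns a tuple of LOD indices, or None if even the lowest levels do not fit.
    """
    options = [range(e["ideal"] + 1) for e in elements]  # levels 0..ideal
    best_quality, best_choice = -1.0, None
    for choice in product(*options):
        picked = [e["lods"][k] for e, k in zip(elements, choice)]
        if (sum(l["tris"] for l in picked) > tri_budget
                or sum(l["mem"] for l in picked) > mem_budget):
            continue  # exceeds the remaining triangle or video memory budget
        quality = sum(l["quality"] for l in picked)
        if quality > best_quality:
            best_quality, best_choice = quality, choice
    return best_choice


# Hypothetical LOD chain shared by two elements, ordered from low to high detail.
lods = [
    {"tris": 100, "mem": 1, "quality": 0.3},
    {"tris": 400, "mem": 2, "quality": 0.7},
    {"tris": 1000, "mem": 4, "quality": 1.0},
]
elements = [
    {"id": "near", "ideal": 2, "lods": lods},
    {"id": "far", "ideal": 1, "lods": lods},
]
tight = choose_lods(elements, tri_budget=1200, mem_budget=6)
loose = choose_lods(elements, tri_budget=10000, mem_budget=100)
```

Under the tight budget both elements are rendered at the middle level, which scores higher overall than giving the near element full detail and the far element the lowest level; with ample resources each element simply receives its ideal level.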

10. A rendering element selection system based on spatial scene adaptation, characterized in that it comprises a processor, a memory, and a computer program stored in the memory and executable on the processor, the computer program, when executed by the processor, performing the method for selecting rendering elements based on spatial scene adaptation as described in any one of claims 1-9.