Three-dimensional model generation method and system based on single 2D image AI reinforcement learning

By combining single-image AI reinforcement learning with multi-dimensional feature extraction and a geometric evolution strategy based on deep reinforcement learning, the method addresses the poor topological rationality and lack of physical consistency of existing 3D reconstruction techniques. This enables efficient and practical 3D model generation for scenarios such as industrial manufacturing and VR/AR.

CN122243728A · Pending · Publication Date: 2026-06-19 · TAIAN HANLIN NETWORK TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TAIAN HANLIN NETWORK TECHNOLOGY CO LTD
Filing Date
2026-03-30
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for 3D reconstruction of single 2D images suffer from poor geometric topological rationality, lack of physical consistency, weak reasoning ability in occluded areas, low hardware resource utilization, and difficulty in meeting the real-time requirements of large-scale, high-precision reconstruction.

Method used

We employ an AI reinforcement learning method based on a single 2D image, using multi-dimensional feature extraction, a geometric evolution strategy of deep reinforcement learning, and a dynamic reward evaluation mechanism, combined with a distributed parallel computing system, to generate a physically reasonable and topologically complete 3D model.

Benefits of technology

It enables the rapid and high-precision generation of topologically complete 3D models from a single 2D image, improving reconstruction efficiency and practicality, adapting to various application scenarios, and supporting large-scale, high-precision reconstruction needs.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the fields of computer vision, deep learning, and 3D reconstruction technology, and particularly to a method and system for generating 3D models based on AI reinforcement learning from a single 2D image. The method includes multi-dimensional feature extraction and initial state space construction, geometric evolution strategy design based on deep reinforcement learning, construction of a three-in-one dynamic reward evaluation mechanism, strategy update and iterative optimization, geometric refinement and model output, which enables the rapid and high-precision generation of physically reasonable and topologically complete 3D models from a single 2D image, thereby improving the efficiency and practicality of 3D reconstruction.

Description

Technical Field

[0001] This invention relates to the fields of computer vision, deep learning and 3D reconstruction technology, and in particular to a method and system for generating 3D models based on AI reinforcement learning from a single 2D image.

Background Technology

[0002] 3D models have wide applications in various fields such as industrial design, virtual reality (VR), augmented reality (AR), digital twins, and game development. Traditional 3D model generation methods usually rely on professional modeling software (such as 3ds Max and Blender) for manual modeling. This not only requires operators to have professional skills, but also suffers from low modeling efficiency, high cost, and difficulty in quickly adapting to batch scenes.

[0003] With the development of deep learning, 3D reconstruction from 2D images has gradually become a research hotspot. Existing 3D reconstruction methods based on a single 2D image fall mainly into two categories. The first is based on implicit representations, such as signed distance fields (SDF) and neural radiance fields (NeRF), which learn the mapping between the image and the 3D implicit field through neural networks. While this type of method can generate high-precision surface details, it suffers from poor geometric topological rationality and a lack of physical consistency. The generated model may have defects such as suspended structures and unstable supports, making it difficult to apply directly to scenarios with high physical performance requirements, such as industrial manufacturing. The second category is based on explicit mesh evolution, which iteratively optimizes mesh vertex positions to fit image features. However, this type of method is prone to getting trapped in local optima, infers occluded areas poorly, and lacks an adaptive mechanism for adjusting mesh resolution, making it difficult to capture the global contour and local details of the model simultaneously.

[0004] Furthermore, existing methods generally lack effective dynamic optimization mechanisms, failing to adjust the generation strategy in real time based on feedback during the reconstruction process. This results in reconstruction accuracy relying heavily on training data and poor adaptability to different image types (such as images with sparse textures and images with multiple object occlusions). Additionally, existing reconstruction systems are mostly deployed on a single machine, leading to low hardware resource utilization and difficulty in meeting the real-time requirements of large-scale, high-precision reconstruction.

[0005] Therefore, there is an urgent need for a method and system for generating 3D models from a single 2D image that balances geometric accuracy, topological integrity and physical consistency, offers efficient iterative optimization, and adapts to various application scenarios.

Summary of the Invention

[0006] To address the shortcomings of the existing technologies, a method and system for generating 3D models based on AI reinforcement learning from a single 2D image are provided. This method enables the rapid and high-precision generation of physically reasonable and topologically complete 3D models from a single 2D image, thereby improving the efficiency and practicality of 3D reconstruction.

[0007] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is a method for generating three-dimensional models based on AI reinforcement learning of a single 2D image, comprising the following steps:

[0008] S1: Multi-dimensional feature extraction and initial state space construction: receive a single 2D image to be processed and extract deep semantic and geometric prior features using a predefined feature extraction network based on a Vision Transformer architecture. The feature extraction network is pre-trained on the ImageNet-21k dataset and initialized using a Xavier normal distribution. It segments the input 2D image into fixed-size 16×16 image patches and applies linear projection and positional encoding. Multi-layer stacked Transformer encoder blocks then output a global feature vector $F_g$ and a local spatial feature map $F_l$. The core formulas for feature extraction are:

[0009] $z_i = E(x_i) + p_i$;

[0010] $Z = \mathrm{Enc}([z_{cls}; z_1; \dots; z_N])$;

[0011] $F_g = \mathrm{GAP}(Z)$;

[0012] $F_l = \mathrm{Reshape}(Z)$;

[0013] where: $z_i$ is the encoded feature vector of the $i$-th image patch (fusing linear projection and positional encoding); $E(\cdot)$ is the linear projection function, which maps the pixel values of an image patch to a high-dimensional vector; $x_i$ is the pixel matrix of the $i$-th fixed-size 16×16 image patch; $p_i$ is the positional encoding vector of the $i$-th image patch, which provides spatial location information to the Transformer; $Z$ is the fused feature tensor output by the Transformer encoder; $\mathrm{Enc}(\cdot)$ is the multi-layer stacked Transformer encoder block function; $z_1, \dots, z_N$ are the encoded feature vectors of all image patches; $N$ is the total number of image patches; $z_{cls}$ is the learnable class embedding vector, used to aggregate semantic context information across the entire image; $F_g$ is the global feature vector, representing the high-level semantic features of the entire 2D image; $\mathrm{GAP}(\cdot)$ is the global average pooling function, which reduces the feature tensor to a vector; $F_l$ is the local spatial feature map, which preserves the spatial structure and edge details of the image; $\mathrm{Reshape}(\cdot)$ is the dimension reshaping function, which converts the encoder output tensor into a feature map matching the spatial dimensions of the original image;

[0014] Subsequently, based on the global feature vector, an implicit signed distance field (SDF) of preset resolution is initialized in three-dimensional coordinate space as the initial state of the reinforcement learning agent. The initial state includes the occupancy probability distribution of the 3D mesh nodes and preliminary surface normal vector information. The initial state construction formulas are:

[0015] ;

[0016] ;

[0017] ;

[0018] ;

[0019] ;

[0020] where the quantities in the formulas above are: the initial signed distance field (SDF) value at a three-dimensional point, i.e. the signed distance from that point to the object surface; a multilayer perceptron dedicated to SDF prediction; a point in three-dimensional coordinate space; the initial unit surface normal vector at that point; a multilayer perceptron dedicated to normal vector prediction; the normal vector normalization function, which converts the output vector into a unit normal vector; the gradient operator, used to solve for the surface normal vector; and the initial unnormalized normal vector at that point. The initial state of the reinforcement learning agent comprises the initial SDF, the unit surface normal vector, the unnormalized normal vector, and the occupancy probability distribution of the three-dimensional mesh nodes.

[0021] S2: Geometric evolution strategy design based on deep reinforcement learning: the generation process of the 3D model is modeled as a Markov decision process. Based on the current state and the extracted features, the reinforcement learning agent outputs a set of geometric evolution actions through the policy network. The action space is defined as the adjustment parameters of the voxel density in the implicit field, the displacement vectors of the mesh vertices, and the control factors of the surface subdivision operation. Each mesh vertex displacement vector is decomposed into a direction vector and a step size factor, calculated as:

[0022] ;

[0023] ;

[0024] ;

[0025] To improve search efficiency, a prior probability distribution constraint on the action space is introduced, which restricts the range of action values to the feasible geometric envelope derived from the contour lines of the two-dimensional image. The constraint formulas are:

[0026] ;

[0027] ;

[0028] where the quantities in the formulas above are: the displacement vector of a mesh vertex at the current time step; the unit direction vector, obtained by equal-angle sampling of the unit sphere; the step size factor at the current time step, dynamically adjusted according to the current mesh resolution; the mesh resolution function; the current state of the agent; the policy network and its parameters; the fused feature combining global and local features; the geometric envelope constraint function, which outputs 0 or 1 to indicate whether an action lies inside the envelope; an individual action component; the indicator function, which returns 1 if its condition is true and 0 otherwise; the initial geometric boundary envelope of the three-dimensional model, formed as the closed set of points obtained by projecting the contour lines of the two-dimensional image into three-dimensional space through the camera's intrinsic and extrinsic parameters; and the envelope scaling factor, default value 1.2;

[0029] S3: Construct a three-in-one dynamic reward evaluation mechanism: after the agent performs action $a_t$ and reaches the new state $s_{t+1}$, the generated geometry is evaluated through a multimodal reward function $R_t$. The multimodal reward function is the weighted combination of a visual projection reward $R_{vis}$, a geometric topology reward $R_{geo}$ and a physical consistency reward $R_{phy}$, calculated as:

[0030] $R_t = w_1 R_{vis} + w_2 R_{geo} + w_3 R_{phy}$;

[0031] where: $R_t$ is the total reward at time step $t$, which drives the reinforcement learning agent to optimize its policy; $w_1$, $w_2$, $w_3$ are the reward weight coefficients for the visual projection, geometric topology and physical consistency rewards, respectively; $R_{vis}$ is the visual projection reward, representing the similarity between the 3D model projected onto the 2D plane and the original image; $R_{geo}$ is the geometric topology reward, characterizing the surface smoothness and topological integrity of the 3D model; $R_{phy}$ is the physical consistency reward, characterizing the physical rationality and structural stability of the 3D model;

[0032] For the visual projection reward, the current 3D model is projected back onto the 2D plane using a differentiable renderer, and the intersection-over-union of the projection mask and the pixel-level color deviation are calculated using the following formulas:

[0033] ;

[0034] ;

[0035] where the quantities in the formulas above are: the reprojection loss function (the smaller the loss, the larger the visual projection reward); the mean squared error function, which calculates the pixel-level squared-difference loss between the original image and the rendered image; the input original 2D image; the 2D projected rendering of the 3D model obtained through differentiable rendering; the L1 loss function, which calculates the pixel-wise absolute error between the original image segmentation mask and the rendering mask; the segmentation mask of the original image; and the segmentation mask of the rendered image;

[0036] The geometric topology reward evaluates surface smoothness and topology errors using Laplacian smoothness constraints and mesh Jacobian determinants. The calculation formulas are:

[0037] ;

[0038] ;

[0039] ;

[0040] where the quantities in the formulas above are: the Laplacian smoothing loss (the smaller the value, the smoother the surface); the Laplacian operator; the vertex set of the 3D model; the topology loss (the smaller the value, the fewer the topology errors); the face set of the 3D model; the Jacobian determinant of a face, which characterizes local topological rationality; and two loss weighting coefficients, both positive real numbers (their default value ranges are not preserved in this text);

[0041] The physical consistency reward introduces rigid body dynamics simulation to examine the model's center-of-mass equilibrium and surface normal consistency under a preset gravity field. It also combines surface stress analysis based on a spring-damper model with a manifold closure check based on the Euler characteristic. The calculation formulas are:

[0042] ;

[0043] ;

[0044] ;

[0045] ;

[0046] where the quantities in the formulas above are: the center-of-mass balance reward; the centroid of the 3D model; the polygonal region of the model's support surface; the surface stress reward; the deformation energy of the model surface; the deformation energy threshold, default value 50; the Euler characteristic of the model, $\chi = V - E + F$, where $V$ is the number of vertices, $E$ the number of edges and $F$ the number of faces; the genus of the target object; the manifold closure reward; and three weighting coefficients for the physical reward components, all positive real numbers, with default values 0.3, 0.4 and 0.3;
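As a rough illustration of how the three physical checks could be combined, the following sketch computes the Euler characteristic and a weighted physical consistency reward. It assumes that each component reward is 1 when its check passes and 0 otherwise, that the default weights 0.3, 0.4 and 0.3 apply to the center-of-mass, stress and closure components in that order, and that the ideal Euler characteristic of a closed surface of genus g is 2 − 2g. The function and parameter names are hypothetical and do not come from the patent.

```python
def euler_characteristic(num_vertices, num_edges, num_faces):
    """Euler characteristic chi = V - E + F of a polygonal mesh."""
    return num_vertices - num_edges + num_faces


def physical_reward(com_supported, deformation_energy, chi, genus,
                    energy_threshold=50.0, weights=(0.3, 0.4, 0.3)):
    """Weighted sum of three pass/fail physical checks (illustrative only)."""
    # Assumption: a passing check scores 1.0; the text only states that a
    # failing check scores 0.
    r_com = 1.0 if com_supported else 0.0
    r_stress = 1.0 if deformation_energy <= energy_threshold else 0.0
    # Assumption: closed orientable surface of genus g has chi = 2 - 2g.
    r_closed = 1.0 if chi == 2 - 2 * genus else 0.0
    w_com, w_stress, w_closed = weights
    return w_com * r_com + w_stress * r_stress + w_closed * r_closed


# A cube has 8 vertices, 12 edges and 6 faces, so chi = 2 (genus 0).
cube_chi = euler_characteristic(8, 12, 6)
print(cube_chi, physical_reward(True, 10.0, cube_chi, genus=0))
```

Under these assumptions a stable, closed, low-stress mesh receives the full reward, and each failed check removes its weight from the total.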

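[0047] S4: Policy update and iterative optimization: a proximal policy optimization (PPO) algorithm is used to update the parameters of the policy network. During training, an experience replay pool is introduced to store the state transition sequences. The action output distribution is adjusted by computing the advantage function with generalized advantage estimation (GAE). The formulas are:

[0048] $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$;

[0049] $\delta_t = R_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$;

[0050] $L^{PPO}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\right)\right]$;

[0051] $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) \,/\, \pi_{\theta_{old}}(a_t \mid s_t)$;

[0052] Simultaneously, a dynamic learning rate decay mechanism is introduced, so that different training stages focus on global contour exploration and on local detail repair, respectively. The learning rate formula is:

[0053] ;

[0054] where: $\hat{A}_t$ is the generalized advantage estimate at time step $t$; $\gamma$ is the discount factor, default value 0.99; $\lambda$ is the GAE decay coefficient, default value 0.95; $\delta_t$ is the temporal-difference residual, computed from the total reward $R_t$ at step $t$ and the value estimates; $V_\phi$ is the value network with parameters $\phi$; $L^{PPO}$ is the PPO policy loss function; $\mathbb{E}_t$ is the expectation over time steps $t$; $\rho_t$ is the policy probability ratio; $\mathrm{clip}(\cdot)$ is the clipping function, which limits the range of the probability ratio; $\epsilon$ is the clipping factor, default value 0.2; $\theta_{old}$ denotes the policy network parameters before the update. For the learning rate formula, the quantities are: the learning rate at the current step; the initial learning rate (its default value is not preserved in this text); the starting step for learning rate decay, default value 10000; and the maximum number of training iterations, default value 50000;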
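The generalized advantage estimation and clipped policy update described above can be sketched in plain Python as follows. This is a minimal illustration of the standard GAE recursion and the PPO clipped surrogate, using the default values stated in the text (discount 0.99, GAE decay 0.95, clipping factor 0.2). The function names and list-based interface are assumptions made for this sketch.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    `values` must hold len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Temporal-difference residual at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean clipped surrogate over a batch of (probability ratio, advantage)."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, rho))
        terms.append(min(rho * adv, clipped * adv))
    return sum(terms) / len(terms)


adv = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0])
print(adv, ppo_clip_objective([1.5, 0.5], adv))
```

The clipping keeps a single update from moving the policy too far: a ratio of 1.5 with a positive advantage contributes as if the ratio were 1.2.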

[0055] S5: Geometric refinement and model output: after the reinforcement learning agent converges or reaches the preset number of iterations, the final implicit field representation is converted into an explicit polygonal mesh model using the marching cubes algorithm. The formula is:

[0056] ;

[0057] The model surface texture mapping is optimized using coordinate-based neural radiance field techniques. Texture optimization employs frequency-encoded coordinate mapping, with the following formulas:

[0058] ;

[0059] ;

[0060] Finally, perform mesh smoothing based on Laplacian coordinates and output a 3D model file. The mesh smoothing formula is:

[0061] ;

[0062] where the quantities in the formulas above are: the initial polygonal mesh model; the marching cubes algorithm, which converts the implicit SDF into an explicit mesh; the final signed distance field after training convergence; the SDF isosurface threshold, set to 0; the frequency-encoded features of a 3D point; the order of the frequency encoding, default value 10; the texture color value at a 3D point; a multilayer perceptron dedicated to texture color prediction; the set of smoothed mesh vertices; the set of original mesh vertices; the smoothing coefficient; and the Laplacian coordinate transformation of the vertex set.
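The frequency-encoded coordinate mapping can be illustrated with a short sketch. It assumes the sine/cosine encoding commonly used with neural radiance fields, in which each coordinate is scaled by powers of two times π. The exact scaling used by the patent is not recoverable from this text, so that choice and the function name are assumptions. The default of 10 frequency bands follows the stated default order.

```python
import math


def frequency_encode(point, num_bands=10):
    """Map a 3D point to sin/cos features at geometrically growing frequencies."""
    features = []
    for coord in point:
        for k in range(num_bands):
            # Assumption: frequencies 2^k * pi, as in common NeRF-style encodings.
            angle = (2.0 ** k) * math.pi * coord
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


encoded = frequency_encode((0.1, 0.2, 0.3))
print(len(encoded))  # 3 coordinates x 10 bands x 2 functions
```

With 10 bands, a 3D point becomes a 60-dimensional feature vector, which a texture-prediction network can use to represent fine surface detail.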

[0063] In the aforementioned method for generating 3D models based on AI reinforcement learning from a single 2D image, the multi-dimensional feature extraction process employs an overlapping sampling strategy when segmenting the 2D image: the image is divided into 16×16 patches with 50% overlap between adjacent patches to enhance the continuity of edge features. The Transformer encoder block contains 12 attention heads, with a hidden layer dimension of 768. The feature extraction network is also configured with a learnable class embedding vector, used to aggregate semantic context information from the entire image. In addition, a feature pyramid structure is introduced to fuse deep semantic information with shallow texture information through lateral connections, forming a multi-scale feature descriptor. The fusion formulas are:

[0064] ;

[0065] ;

[0066] where the quantities in the formulas above are: the multi-scale fused feature of each pyramid layer; the original shallow or deep feature of each layer; the upsampling function, which maps the feature of a deeper layer to the scale of the shallower layer; the final feature descriptor obtained by concatenating the multi-scale features; the feature concatenation function; and the total number of layers in the feature pyramid, which is 4.
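To make the overlapping sampling strategy concrete, the sketch below lists the top-left corners of 16×16 patches taken with 50% overlap, i.e. a stride of 8 pixels. The 224×224 input size used in the example is an assumption chosen for illustration, as is the decision to drop partial patches at the image border.

```python
def patch_origins(height, width, patch=16, overlap=0.5):
    """Top-left (row, col) corners of overlapping square patches."""
    # 50% overlap on 16-pixel patches gives a stride of 8 pixels.
    stride = max(1, int(patch * (1.0 - overlap)))
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


overlapping = patch_origins(224, 224)
non_overlapping = patch_origins(224, 224, overlap=0.0)
print(len(overlapping), len(non_overlapping))
```

Under these assumptions, overlap raises the token count from 14×14 to 27×27 patches, which is the cost paid for smoother edge features across patch boundaries.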

[0067] In the above-described method for generating 3D models based on AI reinforcement learning from a single 2D image, in S2 the policy network adopts a multilayer perceptron structure with residual connections. The geometric evolution actions follow a hierarchical architecture: the first layer of actions applies global scaling and displacement for rapid alignment with the 3D envelope, and the second layer applies adaptive offsets to local vertices for fine-tuning details. The displacement vector of each vertex is decomposed into a direction vector and a step size factor. The direction vector is obtained by equal-angle sampling on the unit sphere, and the step size factor is dynamically adjusted according to the current mesh resolution. In each action iteration, the agent simultaneously outputs the current optimal displacement and an action uncertainty indicator. The action uncertainty is calculated using entropy, with the following formula:

[0068] ;

[0069] When the uncertainty exceeds a preset threshold (default value 0.8; 0.5 is used for high-precision reconstruction scenarios), the system automatically increases the sampling density in that area. The sampling density update formula is:

[0070] ;

[0071] After an agent performs an action, a state transition occurs. The transition formula is as follows:

[0072] ;

[0073] ;

[0074] ;

[0075] ;

[0076] where the quantities in the formulas above are: the action uncertainty indicator at the current time step; the information entropy function (the larger the entropy, the higher the uncertainty in action selection); the action uncertainty threshold; the new sampling density (mesh resolution); the original sampling density (mesh resolution), which doubles when the uncertainty exceeds the threshold; the SDF value at a point at the current time step; the SDF increment at that point caused by the action taken at the current time step; the unit surface normal vector at the point at the current time step; the normal vector increment at the point caused by the action; the unnormalized normal vector at the point; and the new state of the agent at the next time step.
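The entropy-based uncertainty check and the density doubling rule can be sketched as follows. The sketch assumes Shannon entropy with the natural logarithm over a discrete action distribution. The patent text does not state the logarithm base or any normalization, so the scale on which the 0.8 threshold is applied is an assumption, and the function names are hypothetical.

```python
import math


def action_entropy(probabilities):
    """Shannon entropy (natural log) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def update_sampling_density(density, uncertainty, threshold=0.8):
    """Double the local sampling density when uncertainty exceeds the threshold."""
    return density * 2 if uncertainty > threshold else density


confident = action_entropy([0.97, 0.01, 0.01, 0.01])
uncertain = action_entropy([0.25, 0.25, 0.25, 0.25])
print(update_sampling_density(32, confident), update_sampling_density(32, uncertain))
```

A peaked distribution leaves the resolution unchanged, while a near-uniform distribution (entropy close to ln 4) triggers refinement of that region.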

[0077] In the aforementioned method for generating 3D models based on AI reinforcement learning from a single 2D image, the visual projection reward in S3 is computed with a rasterization-based differentiable rendering technique that allows gradients to propagate directly from 2D image pixels back to the coordinates of the 3D vertices. The reprojection loss function $L_{reproj}$ is defined as the sum of the pixel-level squared-difference loss between the original image and the rendered image and the L1 loss between the original image mask and the rendered image mask, i.e.:

[0078] $L_{reproj} = \mathrm{MSE}(I, \hat{I}) + \mathrm{L1}(M, \hat{M})$, where $I$ and $\hat{I}$ are the original and rendered images and $M$ and $\hat{M}$ are their segmentation masks;

[0079] Minimizing the reprojection loss $L_{reproj}$ raises the visual projection reward, which drives the model to approximate the original image in contour.

[0080] In the aforementioned method for generating 3D models based on AI reinforcement learning from a single 2D image, the physical consistency reward in S3 involves a surface stress analysis process based on a spring-damper model, which treats the generated 3D mesh as a set of points connected by elastic edges and calculates the mesh deformation energy under specific external loads. If the model's structure contains suspended or unreasonably elongated connections that cause the deformation energy to exceed the threshold (default value 50), the surface stress reward is set to 0 (treated as a negative reward). The physical consistency reward also involves manifold closure detection: the Euler characteristic is calculated to detect hanging edges, non-manifold vertices and open boundaries. If the result does not match the ideal Euler characteristic implied by the preset genus of the object, the manifold closure reward is set to 0 (a penalty value is given).

[0081] In the aforementioned method for generating 3D models based on AI reinforcement learning from a single 2D image, S4 introduces a cross-dimensional cross-attention mechanism to enhance the reasoning ability for occluded regions by associating 2D image features with the coordinates of sampling points in three-dimensional space. The formulas are:

[0082] ;

[0083] ;

[0084] ;

[0085] ;

[0086] By establishing a lookup table, the coordinates of the center point of each voxel in 3D space are projected onto 2D image coordinates according to a preset camera extrinsic matrix. Local feature blocks are then extracted centered on these coordinates, and the association weights between the 3D point and its corresponding 2D pixel and its context are calculated. When the agent determines that the current viewpoint information is insufficient to support deterministic deformation, the diffusion model prior is activated, and the closest geometric distribution probability is extracted from the pre-trained 3D shape library as a supplementary guide for reinforcement learning exploration. The fusion formula is as follows:

[0087] ;

[0088] where the quantities in the formulas above are: the coordinates of a 3D point projected onto the two-dimensional image; the camera projection function, which takes the 3D point together with the camera intrinsic and extrinsic parameters as input; the two-dimensional local feature block corresponding to the 3D point; the feature extraction function, which extracts local features centered on the projected coordinates; the size of the local feature block, default value 3; the cross-dimensional cross-attention feature of the 3D point; the normalized exponential (softmax) function; the attention feature dimension, default value 64, used to prevent gradient explosion; the final policy distribution after fusing the diffusion prior; the policy fusion coefficient, default value 0.8; and the geometric prior probability distribution of the pre-trained diffusion model.
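The following sketch illustrates one way the attention weighting and the prior fusion could work. It assumes standard scaled dot-product attention, with the 3D point supplying the query and the local 2D feature block supplying keys and values, and it assumes the diffusion prior is blended with the policy as a convex combination using the stated fusion coefficient of 0.8. Neither form is confirmed by the text, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable normalized exponential."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Attend from one 3D-point query over 2D local features."""
    scale = math.sqrt(len(query))  # divide by sqrt(d) to keep scores bounded
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def fuse_policies(policy_probs, prior_probs, alpha=0.8):
    """Assumed convex blend of the learned policy with the diffusion prior."""
    return [alpha * p + (1.0 - alpha) * q for p, q in zip(policy_probs, prior_probs)]


feature = cross_attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0], [4.0]])
print(feature, fuse_policies([1.0, 0.0], [0.0, 1.0]))
```

With identical keys the attention weights are equal, so the output is the mean of the values. The blend keeps the result a valid probability distribution whenever both inputs are.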

[0089] In the aforementioned 3D model generation method based on AI reinforcement learning from a single 2D image, S5 employs a frequency-encoded coordinate mapping technique in the texture optimization stage. This technique maps 3D point coordinates to a high-dimensional space spanned by sine and cosine functions of increasing frequency, in order to learn high-frequency surface texture details. In addition, a symmetry prior is introduced: when symmetrical features are detected in the target object, the texture of the visible area is automatically mapped to the non-visible side. The formula is:

[0090] ;

[0091] Furthermore, a reinforcement learning fine-tuning mechanism is used to eliminate the visual discontinuity at the seams. The formula for the seam fine-tuning loss is:

[0092] ;

[0093] Before outputting the final model, a mesh deformation algorithm based on Laplacian coordinates is executed to smooth the mesh surface by solving a system of linear equations; geometric thickening is then performed on the weak regions of the model obtained from the finite element analysis, using the following formula:

[0094] ;

[0095] ;

[0096] ;

[0097] Geometric thickening is achieved by having the agent output a displacement increment along the normal direction in the weak region, which adjusts the isosurface threshold of the implicit field. The quantities in the formulas above are: the symmetric counterpart of a 3D point; the plane or axis of symmetry of the object; the texture seam loss function (the smaller the value, the smoother the seam); the set of three-dimensional points at the texture seam; the geometric thickening factor, adaptively adjusted according to the stress magnitude in the weak region (0.1 for stress values of 0 to 10, 0.3 for stress values of 10 to 30); the unnormalized normal vector at a point, along which the increment is applied; the final signed distance field after thickening; and the final 3D mesh model after geometric thickening and smoothing.
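The symmetry prior relies on finding the symmetric counterpart of a visible surface point so that its texture can be copied to the hidden side. A minimal sketch for the case of a symmetry plane is shown below. It assumes the plane is given by a point on it and a normal vector, and the function name is hypothetical. The axis-of-symmetry case is not covered.

```python
import math


def reflect_point(point, plane_point, plane_normal):
    """Mirror a 3D point across the plane through plane_point with the given normal."""
    length = math.sqrt(sum(c * c for c in plane_normal))
    unit = [c / length for c in plane_normal]
    # Signed distance from the point to the plane along the unit normal.
    distance = sum((p - q) * n for p, q, n in zip(point, plane_point, unit))
    return tuple(p - 2.0 * distance * n for p, n in zip(point, unit))


# Mirror across the plane x = 0: only the x coordinate changes sign.
print(reflect_point((1.0, 2.0, 3.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)))
```

The texture color sampled at a visible point can then be assigned to its mirrored position, after which the seam between copied and observed texture still needs the fine-tuning step described above.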

[0098] The aforementioned system for generating 3D models based on AI reinforcement learning from a single 2D image includes: an image input and preprocessing module for receiving, scaling, and normalizing a 2D image, and generating a segmentation mask for the target using a deep convolutional neural network (U-Net) pre-trained on the COCO dataset; the module includes an adaptive contrast enhancement submodule for improving the edge sharpness of the target object, wherein the contrast enhancement formula is:

[0099] ;

[0100] where the quantities in the formula above are: the contrast-enhanced image; the contrast-limited adaptive histogram equalization (CLAHE) function; the original image after normalization; the contrast limit threshold; and the histogram tile size;

[0101] The high-dimensional feature encoding subsystem includes a pre-trained visual feature extractor that uses a hierarchical feature fusion architecture to capture low-level texture information and high-level semantic category information of images and transform them into vector representations.

[0102] The core engine of the reinforcement learning agent integrates a policy network, a value network and a target network, and supports distributed parallel computing. Internally, it maintains a dynamic topology graph structure in which the number of mesh vertices dynamically increases or decreases according to the magnitude of local curvature. The formulas for calculating local curvature are:

[0103] ;

[0104] ;

[0105] ;

[0106] where the quantities in the formulas above are: the mean curvature at a 3D point; the two principal curvatures at that point; the curvature threshold, default value 0.5 (unit: 1/pixel); and the local subdivision count at the point after resampling. Resampling is automatically triggered in regions of high curvature to increase the local subdivision count.

[0107] The environment simulation and multi-dimensional evaluation module includes a differentiable rendering engine and a physics simulation engine. The physics simulation engine uses a force analysis model based on a particle system (particle count ≥ 1000) to simulate the static equilibrium state of an object on a horizontal support surface. If the support polygon does not include the vertical projection of the object's center of gravity, a penalty term is fed back (the center-of-mass balance reward is set to 0). The environment simulation and multi-dimensional evaluation module is also equipped with a real-time ray tracing unit to simulate ambient occlusion and global illumination when evaluating visual rewards;
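The static equilibrium test reduces to checking whether the vertical projection of the center of gravity falls inside the support polygon. The sketch below uses a standard ray-casting point-in-polygon test and assumes gravity acts along the negative z axis, so the projection simply drops the z coordinate. It also assumes the reward is 1 when the check passes, since the text only specifies the failing value of 0. Names are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is a 2D point strictly inside a simple polygon?"""
    x, y = point
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def center_of_mass_reward(center_of_mass, support_polygon):
    """1.0 if the projected center of mass is supported, else 0.0 (assumed values)."""
    x, y, _z = center_of_mass  # assumption: gravity along -z, support plane is xy
    return 1.0 if point_in_polygon((x, y), support_polygon) else 0.0


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(center_of_mass_reward((0.5, 0.5, 2.0), square),
      center_of_mass_reward((2.0, 0.5, 1.0), square))
```

A model whose center of gravity overhangs its footprint would tip over, so it receives the zero reward regardless of how well it matches the image.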

[0108] The geometric reconstruction and post-processing unit is responsible for converting the implicit representation into a standard format triangular mesh, and performing mesh simplification, hole filling, and normal correction to generate a 3D asset file that can be called by industrial software. Before outputting the final model, the unit calculates the stress distribution of the model under its own weight and preset external loads based on the finite element analysis principle, and feeds back the analysis results to the agent for geometric thickening processing. Geometric thickening processing is achieved by adjusting the isosurface threshold of the implicit field by outputting the displacement increment along the normal direction in the weak region by the agent.

[0109] The aforementioned system for generating 3D models based on AI reinforcement learning from a single 2D image employs an asynchronous update architecture in its core engine. This engine includes multiple environment samplers running in independent processes and a central parameter server. The environment samplers collect trajectory data. The data is then pushed to the central parameter server to update the global policy weights. The system runs in a cluster environment consisting of multiple computing nodes. Each computing node is equipped with a graphics processing unit (GPU) with tensor core acceleration capabilities. Distributed parameter synchronization is performed through the InfiniBand high-speed communication bus (bandwidth ≥100G / s), and matrix operations are performed using hardware acceleration units. The minimum hardware configuration of the system is: equipped with an NVIDIA RTX 3090 or higher GPU (with ≥24G VRAM), an Intel Xeon Gold 6330 or higher CPU, ≥64G of memory, and ≥2 computing nodes.

[0110] The aforementioned system for generating 3D models based on AI reinforcement learning from a single 2D image also includes a multi-agent cooperation mechanism. When processing a single image containing multiple mutually occluded objects, a reinforcement learning agent is assigned to each independent target, and the agents share a global scene feature map. They communicate and coordinate boundary ranges to prevent spatial overlap or physical penetration. The implicit representation employs a multi-level continuous hierarchy, allowing users to specify different sampling frequencies according to the application scenario (16 for quick preview, 32 for standard reconstruction, 64 for high-precision manufacturing) to extract 3D assets of different complexities.

[0111] The beneficial effects of the method and system of the present invention for generating three-dimensional models based on AI reinforcement learning from a single 2D image are as follows. The invention deeply integrates reinforcement learning with three-dimensional reconstruction, models the generation process as a Markov decision process, and combines it with a dynamic policy update mechanism to achieve efficient iterative optimization from a single 2D image to a three-dimensional model. This addresses the tendency of traditional methods to become trapped in local optima and improves the model's adaptability in both global contour and local details.

[0112] A multi-dimensional feature extraction architecture is proposed that combines a vision transformer with a feature pyramid structure to capture both the high-level semantic features and the shallow texture features of the image. An overlapping sampling strategy enhances the continuity of edge features, providing rich feature support for 3D reconstruction and improving the model's ability to restore image details.

[0113] A three-in-one dynamic reward evaluation mechanism is constructed that jointly considers visual projection similarity, geometric topological integrity, and physical consistency. The generated 3D model therefore not only closely matches the original image visually but also has a reasonable topological structure and physical stability, and can be applied directly in scenarios that require practically usable models, such as industrial manufacturing and VR/AR.

[0114] By introducing a cross-dimensional cross-attention mechanism and a diffusion model prior fusion strategy, the reasoning ability for occluded regions is enhanced, the reconstruction defects caused by insufficient viewpoint information in a single image are solved, and the adaptability of the model to complex scenes (such as multi-object occlusion and sparse texture) is improved.

[0115] The system is designed as a distributed parallel computing system. It adopts an asynchronous update architecture and a high-speed communication bus, combined with a dynamic mesh resolution adjustment mechanism, which significantly improves the efficiency of 3D reconstruction and the utilization of hardware resources, and supports large-scale, high-precision reconstruction requirements. It also supports multi-agent collaboration and multi-level sampling frequency configuration to adapt to the needs of different application scenarios.

Description of the Drawings

[0116] Figure 1 is an overall flowchart of the method for generating 3D models from a single 2D image based on AI reinforcement learning;

[0117] Figure 2 is a system architecture diagram for the method for generating 3D models from a single 2D image based on AI reinforcement learning.

Detailed Description

[0118] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0119] A method for generating 3D models based on AI reinforcement learning from a single 2D image includes the following steps:

[0120] S1: Multi-dimensional feature extraction and initial state space construction. A single 2D image to be processed is received, and deep semantic and geometric prior features are extracted using a predefined feature extraction network based on a vision transformer architecture. The feature extraction network is pre-trained on the ImageNet-21k dataset and initialized using a Xavier normal distribution. It segments the input 2D image into fixed-size 16×16 image patches and applies linear projection and positional encoding. Multi-layer stacked Transformer encoder blocks then output a global feature vector $f_g$ and a local spatial feature map $F_l$. The core formulas for feature extraction are:

[0121] $z_i = \mathrm{Proj}(x_i) + e_i$;

[0122] $F = \mathrm{Enc}([z_{cls};\, z_1, \dots, z_N])$;

[0123] $f_g = \mathrm{GAP}(F)$;

[0124] $F_l = \mathrm{Reshape}(F)$;

[0125] where: $z_i$ is the encoded feature vector of the $i$-th image patch (fusing linear projection and positional encoding); $\mathrm{Proj}(\cdot)$ is the linear projection function, which maps the pixel values of an image patch to a high-dimensional vector; $x_i$ is the pixel matrix of the $i$-th fixed-size 16×16 image patch; $e_i$ is the positional encoding vector of the $i$-th image patch, which provides spatial location information to the Transformer; $F$ is the fused feature tensor output by the Transformer encoder; $\mathrm{Enc}(\cdot)$ is the multi-layer stacked Transformer encoder block function; $z_1, \dots, z_N$ are the encoded feature vectors of all image patches; $N$ is the total number of image patches; $z_{cls}$ is the learnable class embedding vector, used to aggregate semantic context information across the entire image; $f_g$ is the global feature vector, representing the high-level semantic features of the entire 2D image; $\mathrm{GAP}(\cdot)$ is the global average pooling function, which reduces the feature tensor to a vector; $F_l$ is the local spatial feature map, which preserves the spatial structure and edge details of the image; and $\mathrm{Reshape}(\cdot)$ is the dimension reshaping function, which converts the encoder output tensor into a feature map matching the spatial dimensions of the original image;

[0126] Subsequently, based on the global feature vector, an implicit signed distance field (SDF) of preset resolution is initialized in three-dimensional coordinate space as the initial state of the reinforcement learning agent. The initial state includes the occupancy probability distribution of the 3D mesh nodes and preliminary surface normal vector information. The initial state construction formulas are:

[0127] ;

[0128] ;

[0129] ;

[0130] ;

[0131] ;

[0132] where the quantities in the above formulas are, in order: the initial signed distance field (SDF) value at a three-dimensional point, the SDF being the signed distance from the point to the object surface; a multilayer perceptron dedicated to SDF prediction; a point in three-dimensional coordinate space; the initial unit surface normal vector at that point; a multilayer perceptron dedicated to normal vector prediction; a normalization function that converts its output vector into a unit normal vector; the gradient operator, used to solve for the surface normal vector; the initial unnormalized normal vector at that point; and the initial state of the reinforcement learning agent, which comprises the initial SDF, the unit surface normal vector, the unnormalized normal vector, and the occupancy probability distribution of the three-dimensional mesh nodes.

[0133] In the multi-dimensional feature extraction process, the feature extraction network employs an overlapping sampling strategy when segmenting and sampling the two-dimensional image: the image is divided into 16×16 patches with 50% overlap retained, to enhance the continuity of edge features. The Transformer encoder block contains 12 attention heads, and the hidden layer dimension is set to 768. The feature extraction network is also configured with a learnable class embedding vector, which is used to aggregate semantic context information from the entire image. A feature pyramid structure is introduced to fuse deep semantic information with shallow texture information through lateral connections, forming a multi-scale feature descriptor. The fusion formulas are as follows:

[0134] ;

[0135] ;

[0136] where the quantities in the above formulas are, in order: the multi-scale fused feature of a given pyramid layer; the original shallow or deep feature of that layer; the upsampling function, which maps higher-level features to the scale of the lower level; the final feature descriptor obtained by concatenating the multi-scale features; the feature concatenation function; and the total number of layers in the feature pyramid, which is 4.

[0137] S2: Geometric evolution strategy design based on deep reinforcement learning. The generation process of the 3D model is modeled as a Markov decision process. Based on the current state and the extracted features, the reinforcement learning agent outputs a set of geometric evolution actions through its policy network. The action space is defined as the adjustment parameters of the voxel density in the implicit field, the displacement vectors of the mesh vertices, and the control factors of the surface subdivision operation. The mesh vertex displacement vector is decomposed into a direction vector and a step size factor, calculated as follows:

[0138] ;

[0139] ;

[0140] ;

[0141] To improve search efficiency, a prior probability distribution constraint on the action space is introduced, which restricts the range of action values to the feasible geometric envelope derived from the contour lines of the two-dimensional image. The constraint formulas are:

[0142] ;

[0143] ;

[0144] where the quantities in the above formulas are, in order: the displacement vector of a mesh vertex at the current time step; the unit direction vector, obtained by equal-angle sampling of the unit sphere; the step size factor at the current time step, dynamically adjusted according to the current mesh resolution; the mesh resolution function; the current state of the agent; the policy network with its parameters; the feature that fuses global and local features; the geometric envelope constraint function, which outputs 0 or 1 to indicate whether an action lies inside the envelope; an individual action component; the indicator function, which returns 1 if its condition holds and 0 otherwise; the initial geometric boundary envelope of the 3D model, which is the closed set of points obtained by projecting the two-dimensional image contour lines into three-dimensional space using the camera's intrinsic and extrinsic parameters; and the envelope scaling factor, whose default value is 1.2;
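As an illustration of the displacement decomposition and the 0/1 envelope constraint described above, the following is a minimal sketch, not the patented implementation. It assumes that "equal-angle sampling" means uniform steps in the polar and azimuth angles, uses a hypothetical `step_factor` rule that shrinks the step in proportion to the grid spacing, and stands in an axis-aligned box for the contour-derived envelope; all function names are illustrative.

```python
import numpy as np

def unit_directions(n_theta=8, n_phi=16):
    """Equal-angle sampling of the unit sphere (uniform steps in both angles)."""
    theta = (np.arange(n_theta) + 0.5) * np.pi / n_theta   # polar angle
    phi = np.arange(n_phi) * 2.0 * np.pi / n_phi           # azimuth
    t, p = np.meshgrid(theta, phi, indexing="ij")
    d = np.stack([np.sin(t) * np.cos(p), np.sin(t) * np.sin(p), np.cos(t)], axis=-1)
    return d.reshape(-1, 3)

def step_factor(resolution, base=0.1, ref_resolution=64):
    """Hypothetical rule: the step shrinks as the mesh resolution grows."""
    return base * ref_resolution / resolution

def inside_envelope(points, lo, hi, scale=1.2):
    """Indicator (0/1): is the point inside the box envelope scaled about its center?"""
    center = (lo + hi) / 2.0
    half = (hi - lo) / 2.0 * scale
    return np.all(np.abs(points - center) <= half, axis=-1).astype(int)

dirs = unit_directions()
displacement = step_factor(128) * dirs[0]        # step size factor times unit direction
vertex = np.array([0.9, 0.0, 0.0])
lo, hi = np.array([-1.0, -1.0, -1.0]), np.array([1.0, 1.0, 1.0])
print(inside_envelope(vertex + displacement, lo, hi))
```

In a fuller version the box test would be replaced by a test against the closed point set projected from the image contour; the scaling by 1.2 would apply in the same way.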

[0145] The policy network adopts a multilayer perceptron structure with residual connections. The geometric evolution actions follow a hierarchical architecture: the first level consists of global scaling and displacement, for rapid alignment with the 3D envelope, and the second level consists of adaptive offsets of local vertices, for fine-tuning details. The displacement vector of each vertex is decomposed into a direction vector and a step size factor; the direction vector is obtained by equal-angle sampling on the unit sphere, and the step size factor is dynamically adjusted according to the current mesh resolution. In each action iteration, the agent simultaneously outputs the current optimal displacement and an action uncertainty index. The action uncertainty is calculated using entropy, as follows:

[0146] ;

[0147] When the uncertainty exceeds a preset threshold (default value 0.8; 0.5 is used for high-precision reconstruction scenarios), the system automatically increases the sampling density in that region. The sampling density update formula is:

[0148] ;

[0149] After an agent performs an action, a state transition occurs. The transition formula is as follows:

[0150] ;

[0151] ;

[0152] ;

[0153] ;

[0154] where the quantities in the above formulas are, in order: the action uncertainty index at the current time step; the information entropy function (the larger the entropy, the higher the uncertainty in action selection); the action uncertainty threshold; the new sampling density (mesh resolution); the original sampling density (mesh resolution), where the resolution doubles when the uncertainty exceeds the threshold; the SDF value at a point at the current time step; the SDF increment at that point caused by the action taken at the current time step; the unit surface normal vector at the point at the current time step; the increment of the normal vector at that point caused by the action; the unnormalized normal vector at the point at the current time step; and the new state of the agent at the next time step.
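The entropy-based uncertainty test and the resolution doubling rule can be sketched as follows. This is an illustrative sketch only: it assumes a discrete action distribution and normalizes the Shannon entropy by its maximum so that the value is comparable with the 0.8 threshold; the source does not state how the entropy is scaled, and the function names are hypothetical.

```python
import numpy as np

def action_uncertainty(probs):
    """Shannon entropy of a discrete action distribution, normalized to [0, 1]."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    nonzero = p[p > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    return entropy / np.log(len(p))          # assumption: divide by maximum entropy

def update_resolution(resolution, uncertainty, threshold=0.8):
    """Double the sampling density when the uncertainty exceeds the threshold."""
    return resolution * 2 if uncertainty > threshold else resolution

confident = action_uncertainty([0.97, 0.01, 0.01, 0.01])
undecided = action_uncertainty([0.25, 0.25, 0.25, 0.25])
print(update_resolution(64, confident), update_resolution(64, undecided))
```

With these inputs the confident distribution keeps the 64 grid, while the uniform distribution triggers the step to 128.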

[0155] S3: Construct a three-in-one dynamic reward evaluation mechanism. After the agent performs action $a_t$ and reaches the new state $s_{t+1}$, the generated geometry is evaluated by a multimodal reward function $R_t$, which is a weighted combination of the visual projection reward $R_{vis}$, the geometric topology reward $R_{geo}$, and the physical consistency reward $R_{phy}$:

[0156] $R_t = w_1 R_{vis} + w_2 R_{geo} + w_3 R_{phy}$;

[0157] where: $R_t$ is the total reward value at time step $t$, which drives the reinforcement learning agent to optimize its policy; $w_1$, $w_2$, and $w_3$ are the reward weight coefficients for the visual projection, geometric topology, and physical consistency rewards, respectively; $R_{vis}$ is the visual projection reward, representing the similarity between the 2D projection of the 3D model and the original image; $R_{geo}$ is the geometric topology reward, characterizing the surface smoothness and topological integrity of the 3D model; and $R_{phy}$ is the physical consistency reward, characterizing the physical rationality and structural stability of the 3D model;
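The weighted combination of the three rewards can be sketched in a few lines. The default weights 0.4, 0.3, and 0.3 are the values given in Example 1 of this document; the function name and the assumption that each component reward lies in [0, 1] are illustrative only.

```python
def total_reward(r_vis, r_geo, r_phy, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of visual, geometric-topology, and physical-consistency rewards."""
    w_vis, w_geo, w_phy = weights
    return w_vis * r_vis + w_geo * r_geo + w_phy * r_phy

# A model that matches the image well but fails the physical checks entirely.
print(total_reward(r_vis=0.9, r_geo=0.8, r_phy=0.0))
```

Because the weights sum to 1, a model that scores 1 on every component receives a total reward of 1, which makes a convergence criterion such as "reward above 0.9" easy to interpret.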

[0158] Visual projection reward: the current 3D model is projected back onto the 2D plane using a differentiable renderer, and the intersection-over-union of the projection mask and the pixel-level color deviation are calculated using the following formulas:

[0159] ;

[0160] ;

[0161] where the quantities in the above formulas are, in order: the reprojection loss function (the smaller the loss, the larger the visual projection reward); the mean squared error function, which computes the pixel-level squared-difference loss between the original image and the rendered image; the original input 2D image; the 2D projected rendering of the 3D model obtained through differentiable rendering; the L1 loss function, which computes the pixel-wise absolute error between the original image segmentation mask and the rendering mask; the segmentation mask of the original image; and the segmentation mask of the rendered image;

[0162] Geometric topology reward: surface smoothness and topology error are evaluated using a Laplacian smoothness constraint and the Jacobian determinants of the mesh. The calculation formulas are as follows:

[0163] ;

[0164] ;

[0165] ;

[0166] where the quantities in the above formulas are, in order: the Laplacian smoothing loss (the smaller the value, the smoother the surface); the Laplace operator; the vertex set of the 3D model; the topology loss (the smaller the value, the fewer the topology errors); the face set of the 3D model; the Jacobian determinant of a face, which represents local topological rationality; and two loss weighting coefficients, both positive real numbers with preset default value ranges;

[0167] Physical consistency reward: a rigid body dynamics simulation is introduced to examine the model's center-of-mass equilibrium and surface normal consistency under a preset gravity field. This is combined with surface stress analysis based on a spring-damper model and a manifold closure check based on the Euler characteristic. The calculation formulas are as follows:

[0168] ;

[0169] ;

[0170] ;

[0171] ;

[0172] where the quantities in the above formulas are, in order: the center-of-mass balance reward; the center of mass of the 3D model; the polygonal region of the model's support surface; the surface stress reward; the deformation energy of the model surface; the deformation energy threshold, whose default value is 50; the Euler characteristic of the model, computed as the number of vertices minus the number of edges plus the number of faces; the genus of the target object; the manifold closure reward; and three weighting coefficients for the physical reward components, all positive real numbers, with default values of 0.3, 0.4, and 0.3;

[0173] The computation of the visual projection reward uses a rasterization-based differentiable rendering technique that allows gradients to propagate directly from 2D image pixels back to the coordinates of the 3D vertices. The reprojection loss function $L_{reproj}$ is defined as the sum of the pixel-level squared-difference loss between the original image $I$ and the rendered image $\hat{I}$ and the L1 loss between the original image mask $M$ and the rendered image mask $\hat{M}$, i.e.:

[0174] $L_{reproj} = \mathrm{MSE}(I, \hat{I}) + \mathrm{L1}(M, \hat{M})$;

[0175] Minimizing the reprojection loss function increases the visual projection reward, which drives the model to approximate the original image in terms of contour.
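The reprojection loss described above (squared-difference loss on pixels plus L1 loss on masks) can be sketched with NumPy arrays standing in for the renderer output. This is a sketch under assumptions: both terms are averaged over pixels, the optional weights follow the 0.5/0.5 setting of Example 1, and the mapping from loss to reward is not shown because the source formula for it is not available. The helper `mask_iou` illustrates the mask intersection-over-union mentioned in the text.

```python
import numpy as np

def reprojection_loss(image, rendered, mask, rendered_mask, w_mse=1.0, w_l1=1.0):
    """MSE between images plus L1 between segmentation masks (both pixel means)."""
    mse = np.mean((image - rendered) ** 2)
    l1 = np.mean(np.abs(mask - rendered_mask))
    return w_mse * mse + w_l1 * l1

def mask_iou(mask, rendered_mask):
    """Intersection-over-union of two binary masks."""
    a, b = mask > 0.5, rendered_mask > 0.5
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

image = np.zeros((4, 4, 3))
rendered = np.ones((4, 4, 3))
mask = np.ones((4, 4))
rendered_mask = np.zeros((4, 4))
rendered_mask[:2] = 1.0                      # renderer covers only half the object

print(reprojection_loss(image, rendered, mask, rendered_mask, w_mse=0.5, w_l1=0.5))
print(mask_iou(mask, rendered_mask))
```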

[0176] The computation of the physical consistency reward involves a surface stress analysis based on a spring-damper model: the generated 3D mesh is treated as a set of points connected by elastic edges, and the deformation energy of the mesh under specific external loads is calculated. If the model's structure contains suspended or unreasonably elongated connections that cause the deformation energy to exceed the threshold (default value 50), the surface stress reward is set to 0 (a penalty). The physical consistency reward also involves manifold closure detection: the Euler characteristic is calculated to detect hanging edges, non-manifold vertices, and open boundaries. If the result does not match the ideal Euler characteristic implied by the preset genus of the object, the manifold closure reward is set to 0 (a penalty).
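The manifold closure check can be illustrated on a small triangle mesh. The sketch below computes the Euler characteristic as vertices minus edges plus faces and compares it with 2 − 2g, the standard value for a closed orientable surface of genus g. The reward value of 1.0 on success is an assumption (the text only specifies 0 on failure), and the function names are illustrative.

```python
from collections import Counter

def edge_counts(faces):
    """Count how many faces share each undirected edge."""
    counts = Counter()
    for face in faces:
        for i in range(len(face)):
            counts[frozenset((face[i], face[(i + 1) % len(face)]))] += 1
    return counts

def euler_characteristic(faces):
    """V - E + F for a mesh given as a list of vertex-index tuples."""
    vertices = {v for face in faces for v in face}
    return len(vertices) - len(edge_counts(faces)) + len(faces)

def manifold_closure_reward(faces, genus=0):
    """1.0 if every edge has exactly two faces and V - E + F == 2 - 2g, else 0.0."""
    closed = all(n == 2 for n in edge_counts(faces).values())
    ideal = 2 - 2 * genus
    return 1.0 if closed and euler_characteristic(faces) == ideal else 0.0

tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (0, 2, 3)]
print(euler_characteristic(tetrahedron))         # 2
print(manifold_closure_reward(tetrahedron))      # 1.0
print(manifold_closure_reward(tetrahedron[:3]))  # 0.0 (one face removed: open boundary)
```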

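[0177] S4: Policy update and iterative optimization. A proximal policy optimization (PPO) algorithm is used to update the parameters of the policy network. During training, an experience replay pool stores the state transition sequences, and the action output distribution is adjusted by computing the advantage function via generalized advantage estimation (GAE). The formulas are as follows:

[0178] $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$;

[0179] $\delta_t = R_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$;

[0180] $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right]$;

[0181] $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$;

[0182] Simultaneously, a dynamic learning rate decay mechanism is introduced, so that different training stages focus on global contour exploration and local detail repair, respectively. The learning rate formula is:

[0183] ;

[0184] where: $\hat{A}_t$ is the generalized advantage estimate at time step $t$; $\gamma$ is the discount factor, with a default value of 0.99; $\lambda$ is the GAE decay coefficient, with a default value of 0.95; $\delta_t$ is the temporal-difference residual, computed from the reward $R_t$ at time step $t$; $V_\phi$ is the value network with parameters $\phi$; $L^{CLIP}(\theta)$ is the PPO policy loss function; $\mathbb{E}_t$ is the expectation over time steps $t$; $r_t(\theta)$ is the policy probability ratio; $\mathrm{clip}(\cdot)$ is the clipping function, which limits the range of the probability ratio; $\epsilon$ is the clipping factor, with a default value of 0.2; and $\theta_{old}$ denotes the policy network parameters before the update. For the learning rate formula, the quantities are: the learning rate at the current training step; the initial learning rate, which has a preset default value; the starting step for learning rate decay, with a default value of 10000; and the maximum number of training iterations, with a default value of 50000;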
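Generalized advantage estimation and the PPO clipped objective are standard techniques, and a minimal NumPy sketch is given here with the default values stated above (discount 0.99, GAE coefficient 0.95, clipping factor 0.2). The function names and the finite-horizon backward recursion are illustrative choices; the function returns the negated objective so that it can be minimized by gradient descent.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates; `values` holds one extra bootstrap entry."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def ppo_clip_loss(ratio, advantages, eps=0.2):
    """Negative PPO clipped surrogate objective (to be minimized)."""
    ratio = np.asarray(ratio, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

adv = gae(rewards=[1.0, 1.0], values=[0.0, 0.0, 0.0])
print(adv)                                  # [1.9405 1.    ]
print(ppo_clip_loss([1.5], [1.0]))          # -1.2: the ratio is clipped at 1 + eps
```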

[0185] To enhance reasoning capabilities for occluded regions, a cross-dimensional cross-attention mechanism is introduced that associates two-dimensional image features with the coordinates of sampling points in three-dimensional space. The formulas are as follows:

[0186] ;

[0187] ;

[0188] ;

[0189] ;

[0190] By establishing a lookup table, the coordinates of the center point of each voxel in 3D space are projected onto 2D image coordinates according to a preset camera extrinsic matrix. Local feature blocks are then extracted centered on these coordinates, and the association weights between the 3D point and its corresponding 2D pixel and its context are calculated. When the agent determines that the current viewpoint information is insufficient to support deterministic deformation, the diffusion model prior is activated, and the closest geometric distribution probability is extracted from the pre-trained 3D shape library as a supplementary guide for reinforcement learning exploration. The fusion formula is as follows:

[0191] ;

[0192] where the quantities in the above formulas are, in order: the coordinates of a 3D point projected onto the two-dimensional image; the camera projection function, which takes the 3D point and the camera's intrinsic and extrinsic parameters as input; the two-dimensional local feature block corresponding to the 3D point; the feature extraction function, which extracts local features centered on the projected coordinates; the size of the local feature block, whose default value is 3; the cross-dimensional cross-attention feature of the 3D point; the normalized exponential (softmax) function; the attention feature dimension, whose default value is 64 and which is used to prevent gradient explosion; the final policy distribution after fusing the diffusion prior; the policy fusion coefficient, whose default value is 0.8; and the geometric prior probability distribution of the pre-trained diffusion model.
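The projection of a 3D point into the image and the attention over its local feature block can be sketched as follows. The sketch assumes a pinhole camera model and standard scaled dot-product attention with feature dimension 64 over a 3×3 block (9 feature vectors); the camera matrices, the random features, and the way the query is formed from the 3D point are all hypothetical placeholders.

```python
import numpy as np

def project(point, K, R, t):
    """Pinhole projection of a 3D point using intrinsics K and extrinsics (R, t)."""
    cam = R @ point + t
    uvw = K @ cam
    return uvw[:2] / uvw[2]

def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query over a local feature block."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)       # scaling by sqrt(d) keeps scores bounded
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax
    return weights @ values, weights

K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
print(project(np.array([1.0, 0.0, 2.0]), K, R, t))   # [100.  50.]

rng = np.random.default_rng(0)
block = rng.normal(size=(9, 64))             # 3x3 local feature block, dimension 64
query = rng.normal(size=64)                  # hypothetical embedding of the 3D point
feature, weights = cross_attention(query, block, block)
print(feature.shape, weights.sum())
```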

[0193] S5: Geometric refinement and model output. After the reinforcement learning agent converges or reaches the preset number of iterations, the final implicit field representation is converted into an explicit polygonal mesh model using the marching cubes algorithm, as follows:

[0194] ;

[0195] The surface texture mapping of the model is optimized using coordinate-based neural radiance field technology. The texture optimization employs frequency-encoded coordinate mapping, as follows:

[0196] ;

[0197] ;

[0198] Finally, perform mesh smoothing based on Laplacian coordinates and output a 3D model file. The mesh smoothing formula is:

[0199] ;

[0200] where the quantities in the above formulas are, in order: the initial polygonal mesh model; the marching cubes algorithm, which converts the implicit SDF into an explicit mesh; the final signed distance field after training convergence; the SDF isosurface threshold, set to 0; the frequency-encoded feature of a 3D point; the order of the frequency encoding, whose default value is 10; the texture color value at the 3D point; a multilayer perceptron dedicated to texture color prediction; the set of smoothed mesh vertices; the original mesh vertex set; the smoothing coefficient; and the Laplacian coordinate transformation of the vertex set.
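Frequency encoding of 3D coordinates can be sketched as below. The sketch follows the convention commonly used with neural radiance fields, in which each coordinate is passed through sine and cosine functions at frequencies 2^k·π for k = 0 … order − 1; whether the source formula uses exactly these frequencies is an assumption, and the function name is illustrative. With the default order of 10, a 3D point becomes a 60-dimensional feature.

```python
import numpy as np

def frequency_encode(points, order=10):
    """Map (n, 3) coordinates to (n, 3 * 2 * order) sine/cosine features."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    freqs = (2.0 ** np.arange(order)) * np.pi            # assumed frequency ladder
    angles = points[:, :, None] * freqs                   # shape (n, 3, order)
    encoded = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return encoded.reshape(points.shape[0], -1)

features = frequency_encode([[0.0, 0.0, 0.0], [0.25, -0.5, 0.1]])
print(features.shape)                                     # (2, 60)
```

The encoded vector would then be the input to the texture-color perceptron in place of the raw coordinates, which is what lets a small network represent high-frequency surface detail.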

[0201] The texture optimization stage employs a frequency-encoded coordinate mapping technique, which maps 3D point coordinates to a high-dimensional space spanned by sine and cosine functions of increasing order, in order to learn high-frequency surface texture details. A symmetry prior is also introduced: when symmetrical features are detected in the target object, the texture of the visible area is automatically mapped to the non-visible side, as follows:

[0202] ;

[0203] Furthermore, a reinforcement learning fine-tuning mechanism is used to eliminate the visual discontinuity at the seams. The formula for the seam fine-tuning loss is:

[0204] ;

[0205] Before outputting the final model, a mesh deformation algorithm based on Laplacian coordinates is executed to smooth the mesh surface by solving a system of linear equations; geometric thickening is then performed on the weak regions of the model obtained from the finite element analysis, using the following formula:

[0206] ;

[0207] ;

[0208] ;

[0209] Geometric thickening is achieved by the agent outputting displacement increments along the normal direction in the weak region, thereby adjusting the isosurface threshold of the implicit field. In the above formulas, the quantities are, in order: the symmetric counterpart of a 3D point; the plane or axis of symmetry of the object; the texture seam loss function (the smaller the value, the smoother the seam); the set of three-dimensional points at the texture seam; the geometric thickening factor, adaptively adjusted according to the stress magnitude in the weak region (0.1 for stress values of 0 to 10, and 0.3 for stress values of 10 to 30); the increment applied along the normal direction, based on the unnormalized normal vector at the point; the final signed distance field after thickening; and the final 3D mesh model after geometric thickening and smoothing.

[0210] The aforementioned system for generating 3D models based on AI reinforcement learning from a single 2D image includes: an image input and preprocessing module for receiving, scaling, and normalizing a 2D image, and generating a segmentation mask for the target using a deep convolutional neural network (U-Net) pre-trained on the COCO dataset; the module includes an adaptive contrast enhancement submodule for improving the edge sharpness of the target object, wherein the contrast enhancement formula is:

[0211] ;

[0212] where the quantities in the above formula are, in order: the contrast-enhanced image; the contrast-limited adaptive histogram equalization (CLAHE) function; the normalized original image; the contrast limit threshold; and the histogram tile size;

[0213] The high-dimensional feature encoding subsystem includes a pre-trained visual feature extractor that uses a hierarchical feature fusion architecture to capture low-level texture information and high-level semantic category information of images and transform them into vector representations.

[0214] The core engine of the reinforcement learning agent integrates a policy network, a value network, and a target network, and supports distributed parallel computing. Internally, it maintains a dynamic topology graph structure in which the number of mesh vertices dynamically increases or decreases according to the magnitude of the local curvature. The formulas for calculating local curvature are:

[0215] ;

[0216] ;

[0217] ;

[0218] where the quantities in the above formulas are, in order: the mean curvature at a 3D point; the two principal curvatures at that point; the curvature threshold, whose default value is 0.5 (unit: 1/pixel); and the increase in the local subdivision count produced by resampling at the point. Resampling is triggered automatically in regions of high curvature to increase the local subdivision count.

[0219] The environment simulation and multi-dimensional evaluation module includes a differentiable rendering engine and a physics simulation engine. The physics simulation engine uses a force analysis model based on a particle system (particle count ≥ 1000) to simulate the static equilibrium of an object on a horizontal support surface. If the support polygon does not contain the vertical projection of the object's center of gravity, a penalty is fed back (the center-of-mass balance reward is set to 0). The module is also equipped with a real-time ray tracing unit to simulate ambient occlusion and global illumination when evaluating visual rewards;
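The static-equilibrium test described above reduces to checking whether the vertical projection of the center of gravity falls inside the support polygon. The following sketch uses a ray-casting point-in-polygon test and approximates the center of gravity as the mean of equally weighted particles, with the z axis as the vertical direction. These modeling choices, the function names, and the reward value of 1.0 for a balanced object are assumptions; the text only specifies a reward of 0 on failure.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is a 2D point strictly inside a simple polygon?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge straddles the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def centroid_balance_reward(particles, support_polygon):
    """1.0 if the projected center of gravity lies in the support polygon, else 0.0."""
    n = len(particles)
    cx = sum(p[0] for p in particles) / n
    cy = sum(p[1] for p in particles) / n
    return 1.0 if point_in_polygon((cx, cy), support_polygon) else 0.0

cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
base = [(0, 0), (1, 0), (1, 1), (0, 1)]
shifted_base = [(2, 2), (3, 2), (3, 3), (2, 3)]
print(centroid_balance_reward(cube, base))             # 1.0
print(centroid_balance_reward(cube, shifted_base))     # 0.0
```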

[0220] The geometric reconstruction and post-processing unit is responsible for converting the implicit representation into a standard format triangular mesh, and performing mesh simplification, hole filling, and normal correction to generate a 3D asset file that can be called by industrial software. Before outputting the final model, the unit calculates the stress distribution of the model under its own weight and preset external loads based on the finite element analysis principle, and feeds back the analysis results to the agent for geometric thickening processing. Geometric thickening processing is achieved by adjusting the isosurface threshold of the implicit field by outputting the displacement increment along the normal direction in the weak region by the agent.

[0221] In the aforementioned system for generating 3D models from a single 2D image based on AI reinforcement learning, the core engine of the reinforcement learning agent employs an asynchronous update architecture. The engine comprises multiple environment samplers running in independent processes and a central parameter server: the environment samplers collect trajectory data and push it to the central parameter server, which updates the global policy weights. The system runs in a cluster of multiple computing nodes, each equipped with a graphics processing unit (GPU) with tensor-core acceleration. Distributed parameter synchronization is performed over an InfiniBand high-speed communication bus (bandwidth ≥100G/s), and matrix operations are performed on hardware acceleration units. The minimum hardware configuration of the system is: an NVIDIA RTX 3090 or higher GPU (≥24 GB VRAM), an Intel Xeon Gold 6330 or higher CPU, ≥64 GB of memory, and ≥2 computing nodes.

[0222] The system also includes a multi-agent collaboration mechanism. When processing a single image containing multiple mutually occluding objects, one reinforcement learning agent is assigned to each independent target, and the agents share a global scene feature map. The agents communicate to coordinate their boundary ranges, preventing spatial overlap or physical interpenetration. The implicit representation employs a multi-level continuous hierarchy, allowing users to specify different sampling frequencies according to the application scenario (16 for quick preview, 32 for standard reconstruction, 64 for high-precision manufacturing) to extract 3D assets of different complexity.

[0223] Example 1:

[0224] This embodiment provides a method for generating 3D models from a single 2D image based on AI reinforcement learning. As shown in Figure 1, it includes the following steps:

[0225] S1: Multidimensional Feature Extraction and Initial State Space Construction

[0226] The system receives a 2D image of an industrial part at 1920×1080 resolution and performs feature extraction using a feature extraction network based on a vision transformer architecture. The network is pre-trained on the ImageNet-21k dataset, initialized using a Xavier normal distribution, and divides the image into 16×16 image patches (120×67.5, approximately 8100 patches). Linear projection (projection dimension 768) and positional encoding (sinusoidal positional encoding) are applied, and 12 stacked Transformer encoder blocks (12 attention heads, hidden layer dimension 768) output a global feature vector (dimension 768) and a local spatial feature map (dimensions 120×67×768).
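The patch counts quoted above can be checked with a short calculation. Because 1080 is not a multiple of 16, the vertical direction holds 67.5 patch heights, which is why the figure of about 8100 patches coexists with a 120×67 feature map. The sketch below assumes that only patches lying entirely inside the image are kept (no padding), which is not stated in the source; the `stride` argument also shows what a 50% overlap (stride 8) would imply under the same assumption.

```python
def patch_grid(width, height, patch=16, stride=16):
    """Number of whole patches along each axis for a given stride."""
    cols = (width - patch) // stride + 1
    rows = (height - patch) // stride + 1
    return cols, rows

cols, rows = patch_grid(1920, 1080)                 # non-overlapping patches
print(cols, rows, cols * rows)                      # 120 67 8040
print((1920 / 16) * (1080 / 16))                    # 8100.0 (fractional count)

o_cols, o_rows = patch_grid(1920, 1080, stride=8)   # 50% overlap
print(o_cols, o_rows, o_cols * o_rows)              # 239 134 32026
```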

[0227] The feature extraction network employs a 50% overlap sampling strategy and introduces a 4-layer feature pyramid structure for multi-scale feature fusion. The fused multi-scale feature descriptor has dimension 768 × 4 = 3072.

[0228] Based on the global feature vector, an implicit signed distance field with an initial resolution of 64×64×64 is initialized in three-dimensional coordinate space. The initial SDF value is predicted by a multilayer perceptron with 3 hidden layers of 256 neurons each, and the unnormalized initial normal vector is predicted by a multilayer perceptron with 3 hidden layers of 128 neurons each; the latter is normalized to obtain the unit normal vector. Together these form the initial state of the reinforcement learning agent.

[0229] S2: Geometric evolution strategy design based on deep reinforcement learning. The 3D model generation process is modeled as a Markov decision process. The policy network of the reinforcement learning agent adopts a 4-layer MLP structure (hidden layer dimension 512, with residual connections). The action space includes the voxel density adjustment parameter (range [-0.1, 0.1]), the mesh vertex displacement vector (direction vector obtained by equal-angle sampling of the unit sphere; step size factor in the range [0.01, 0.1]), and the surface subdivision control factor (0 or 1, where 0 means no subdivision and 1 means subdivision).

[0230] The geometric envelope constraint function restricts the range of action values to the projection envelope of the 2D image contour lines, with an envelope scaling factor of 1.2. The action uncertainty is calculated using the information entropy function; when the uncertainty exceeds the threshold of 0.8, the mesh resolution of the corresponding region is doubled (e.g., 64×64×64 → 128×128×128).

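[0231] S3: Construct a three-in-one dynamic reward evaluation mechanism. The reward weight coefficients for the visual projection, geometric topology, and physical consistency rewards are set to 0.4, 0.3, and 0.3, respectively, where: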

[0232] Visual projection reward: the 3D model is projected back onto the 2D plane using a differentiable renderer (1920×1080 resolution), and the reprojection loss is calculated as the weighted sum of the MSE loss (weight 0.5) and the L1 loss (weight 0.5);

[0233] Geometric topology reward: the two loss weighting coefficients are set to 0.3 and 0.7, and the reward is calculated from the Laplacian smoothing loss and the topology loss. Physical consistency reward: the three component weights are set to 0.3, 0.4, and 0.3 and the deformation energy threshold to 50, and the reward is calculated through center-of-mass balance detection, surface stress analysis, and manifold closure detection.

[0234] S4: Strategy Update and Iterative Optimization

[0235] The policy network is updated using the PPO algorithm. The experience replay pool size is set to 10000, the discount factor to 0.99, the GAE decay coefficient to 0.95, and the clipping factor to 0.2. The dynamic learning rate starts from its initial value and begins to decay at step 10,000; training ends at step 50,000.

[0236] A cross-dimensional attention mechanism is introduced, with a preset local feature block size and attention feature dimension. When viewpoint information is insufficient, the diffusion model prior fusion strategy is activated with a preset fusion coefficient.

[0237] S5: Geometric Refinement and Model Output

[0238] After the agent's training converges (the reward value stabilizes above 0.9), the marching cubes algorithm (isosurface threshold 0) converts the implicit SDF into an explicit triangular mesh model. Frequency encoding of a preset order and a multilayer perceptron (3 hidden layers, 128 neurons per layer) are used to optimize the texture mapping, and a symmetry prior is introduced to eliminate texture seams. Mesh smoothing is performed with the smoothing coefficient set to 0.3. For weak regions where the stress value obtained from finite element analysis is greater than 20, geometric thickening is performed with the thickening factor set to 0.3. Finally, a 3D model file in STL format is output.

[0239] Example 2:

[0240] This embodiment provides a system for implementing the above method. As shown in Figure 2, it includes an image input and preprocessing module, a high-dimensional feature encoding subsystem, a reinforcement learning agent core engine, an environment simulation and multi-dimensional evaluation module, and a geometric reconstruction and post-processing unit, with the specific deployment as follows:

[0241] System hardware configuration: Two compute nodes (6) are used, each equipped with an NVIDIA RTX 3090 GPU (24 GB VRAM), an Intel Xeon Gold 6330 CPU (24 cores), and 128 GB of memory. The nodes are connected through an InfiniBand high-speed communication bus (8) with 100G/s bandwidth, and one central parameter server (7) is provided, configured identically to the compute nodes.

[0242] Image input and preprocessing module: Receives 2D images uploaded by users, scales them to 1920×1080 resolution, normalizes them to the range [0,1], generates target segmentation masks using a U-Net network pre-trained on the COCO dataset, and enhances image contrast using the CLAHE algorithm (clipLimit=2.0, tileGridSize=(8,8)).

[0243] High-dimensional feature encoding subsystem: Deploys a pre-trained vision transformer feature extraction network to achieve multi-dimensional feature extraction and multi-scale feature fusion, outputting the global feature vector, the local spatial feature map, and the multi-scale feature descriptor.

[0244] Reinforcement learning agent core engine: Employs an asynchronous update architecture, deploying four environment samplers (independent processes) that collect trajectory data and push it to the central parameter server to update the global policy weights. Internally, the engine maintains a dynamic topology graph structure with a curvature threshold of […] (unit: 1/pixel), automatically triggering resampling in areas of high curvature.

[0245] Environment Simulation and Multidimensional Evaluation Module: Deploys a differentiable rendering engine (supporting real-time ray tracing) and a physical simulation engine (particle count = 1000) to simulate ambient occlusion and global illumination, perform centroid balance detection, surface stress analysis and manifold closure detection, and output a three-in-one reward value.

[0246] Geometric reconstruction and post-processing unit: Deploys the marching cubes algorithm, a texture optimization module, and a mesh smoothing module. Based on the finite element analysis results, it performs geometric thickening and outputs 3D model files in common industrial formats such as STL and OBJ.

[0247] When processing images in which multiple objects occlude one another, the system activates a multi-agent cooperation mechanism, assigning a reinforcement learning agent to each independent target; the agents share a global scene feature map to coordinate their boundary ranges. Users can select the sampling frequency according to their needs: quick preview (16), conventional reconstruction (32), or high-precision manufacturing (64).

[0248] Testing of the system in this embodiment shows that the average time to generate a high-precision 3D model from a single 2D image is 45 minutes, the visual similarity between the model and the original image is ≥95%, the topological error rate is ≤0.5%, and the physical stability meets the requirements of industrial manufacturing. The system can be widely used in industrial design, VR/AR, digital twins, and other fields.

[0249] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for generating 3D models based on AI reinforcement learning from a single 2D image, characterized in that it includes the following steps:

S1: Multi-dimensional feature extraction and initial state space construction: Receive a single two-dimensional image to be processed as input, and use a predefined feature extraction network based on a vision transformer architecture to extract deep semantic features and geometric prior features. The feature extraction network is pre-trained on the ImageNet-21k dataset and initialized using a Xavier normal distribution. It segments the input 2D image into fixed-size 16×16 image patches and performs linear projection mapping and position encoding. The global feature vector $F_g$ and the local spatial feature map $F_l$ are output through multi-layer stacked Transformer encoder blocks. The core formulas for feature extraction are: $z_i = E(x_i) + p_i$; $Z = \mathrm{Enc}([z_{cls}; z_1; \ldots; z_N])$; $F_g = \mathrm{GAP}(Z)$; $F_l = \mathrm{Reshape}(Z)$; where: $z_i$: encoded feature vector of the $i$-th image patch; $E(\cdot)$: linear projection function, which maps the pixel values of an image patch to a high-dimensional vector; $x_i$: pixel matrix of the $i$-th fixed-size 16×16 image patch; $p_i$: position encoding vector of the $i$-th image patch, which provides spatial location information to the Transformer; $Z$: fused feature tensor output by the Transformer encoder; $\mathrm{Enc}(\cdot)$: multi-layer stacked Transformer encoder block function; $z_1, \ldots, z_N$: encoded feature vectors of all image patches, where $N$ is the total number of image patches; $z_{cls}$: learnable class embedding vector, used to aggregate semantic context information across the entire image; $F_g$: global feature vector, representing the high-level semantic features of the entire 2D image; $\mathrm{GAP}(\cdot)$: global average pooling function, which reduces the feature tensor to a vector; $F_l$: local spatial feature map, which preserves the spatial structure and edge details of the image; $\mathrm{Reshape}(\cdot)$: dimension reshaping function, which converts the encoder output tensor into a feature map that matches the spatial dimensions of the original image.

Subsequently, based on the global feature vector $F_g$, an implicit signed distance field (SDF) of preset resolution is initialized in three-dimensional coordinate space as the initial state $s_0$ of the reinforcement learning agent. The initial state $s_0$ includes the occupancy probability distribution of the 3D mesh nodes and preliminary surface normal vector information. The initial state construction formulas are: […]; where the formulas involve: the initial SDF value at a three-dimensional point, the SDF being the signed distance from the point to the surface of the object; a multilayer perceptron dedicated to SDF prediction; a point in three-dimensional coordinate space; the initial unit surface normal vector at the point; a multilayer perceptron dedicated to normal vector prediction; a normal vector normalization function, which converts the output vector into a unit normal vector; the gradient operator, which is used to solve for the surface normal vector; and the raw (unnormalized) initial value of the normal vector at the point. The initial state of the reinforcement learning agent comprises the initial SDF, the unit surface normal vector, the raw value of the normal vector, and the occupancy probability distribution of the three-dimensional mesh nodes.

S2: Geometric evolution strategy design based on deep reinforcement learning: The generation process of the 3D model is modeled as a Markov decision process. The reinforcement learning agent determines the geometric evolution strategy based on the current state $s_t$ and the extracted features, and outputs a set of geometric evolution actions $a_t$ through the policy network $\pi_\theta$. The action space is defined as the adjustment parameters of the voxel density in the implicit field, the displacement vectors of the mesh vertices, and the control factors of the surface subdivision operation; the mesh vertex displacement vector is decomposed into a direction vector and a step size factor, and the calculation formulas are: […]; To improve search efficiency, a prior probability distribution constraint on the action space is introduced, which restricts the value range of the action $a_t$ to the feasible geometric envelope derived from the two-dimensional image contour lines, and the constraint formulas are: […]; where the formulas involve: the displacement vector of a mesh vertex at time $t$; the unit direction vector, obtained by equal-angle sampling of a unit sphere; the step size factor at time $t$, dynamically adjusted based on the current mesh resolution; the mesh resolution function; $s_t$: the current state of the agent at time $t$; $\pi_\theta$: the policy network with parameters $\theta$; the fused feature combining global and local features; the geometric envelope constraint function, which outputs 0/1 to indicate whether an action lies inside the envelope; the individual action components; the indicator function, which returns 1 if the condition is true and 0 otherwise; the geometric envelope, in which the two-dimensional image contour lines are projected into three-dimensional space through the camera's intrinsic and extrinsic parameters to form a closed set of projection points that constitutes the initial geometric boundary envelope of the three-dimensional model; and the envelope scaling factor, default value 1.2.

S3: Construct a three-in-one dynamic reward evaluation mechanism: When the agent performs an action $a_t$ and obtains a new state $s_{t+1}$, the generated geometry is evaluated through a multimodal reward function $R_t$. The multimodal reward function $R_t$ is the weighted composition of the visual projection reward $R_{vis}$, the geometric topology reward $R_{geo}$, and the physical consistency reward $R_{phy}$, calculated using the following formula: $R_t = w_1 R_{vis} + w_2 R_{geo} + w_3 R_{phy}$; where: $R_t$: the total reward value at time step $t$, which drives the reinforcement learning agent to optimize its strategy; $w_1$, $w_2$, $w_3$: reward weight coefficients, which are the weights of the visual projection, geometric topology, and physical consistency rewards, respectively; $R_{vis}$: visual projection reward, representing the similarity between the 3D model projected onto the 2D plane and the original image; $R_{geo}$: geometric topology reward, characterizing the surface smoothness and topological integrity of the 3D model; $R_{phy}$: physical consistency reward, characterizing the physical rationality and structural stability of the 3D model.

For the visual projection reward $R_{vis}$, the current 3D model is projected back to the 2D plane using a differentiable renderer, and the intersection-over-union ratio of the projection mask and the pixel-level color deviation are calculated using the following formulas: […]; where the formulas involve: the reprojection loss function, where a smaller loss gives a larger $R_{vis}$; the mean squared error function, which calculates the pixel-level squared difference loss between the original image and the rendered image; the original input 2D image; the 2D projected rendering of the 3D model obtained through differentiable rendering; the L1 loss function, which calculates the pixel-wise absolute error between the original image segmentation mask and the rendering mask; the original image segmentation mask; and the segmentation mask of the rendered image.

For the geometric topology reward $R_{geo}$, surface smoothness and topology error are evaluated using Laplace smoothness constraints and mesh Jacobian determinants. The calculation formulas are as follows: […]; where the formulas involve: the Laplace smoothing loss, where a smaller value indicates a smoother surface; the Laplace operator; the vertex set of the 3D model; the topology loss, where a smaller value indicates fewer topology errors; the face set of the 3D model; the Jacobian determinant of a face, which represents the local topological rationality; and the two loss weighting coefficients, both positive real numbers, with default value ranges […].

For the physical consistency reward $R_{phy}$, rigid body dynamics simulation is introduced to examine the model's center-of-mass equilibrium state and surface normal consistency under a preset gravity field. This is combined with surface stress analysis based on a spring-damper model and manifold closure checks using the Euler characteristic. The calculation formulas are as follows: […]; where the formulas involve: the center-of-mass balance reward; the centroid of the 3D model; the polygonal region of the model support surface; the surface stress reward; the deformation energy of the model surface; the deformation energy threshold, default value 50; the Euler characteristic of the model, $\chi = V - E + F$, where $V$ is the number of vertices, $E$ is the number of edges, and $F$ is the number of faces; the genus of the target object; the manifold closure reward; and the weighting coefficients of the three physical reward components, all positive real numbers, with default values 0.3, 0.4, and 0.3.

S4: Policy update and iterative optimization: A proximal policy optimization (PPO) algorithm is used to update the parameters of the policy network. During training, an experience replay pool is introduced to store the state transition sequence, and the action output distribution is adjusted by calculating the advantage function using generalized advantage estimation (GAE). The operation formulas are as follows: $\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}$; $\delta_t = R_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$; $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$; $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$; Simultaneously, a dynamic learning rate decay mechanism is introduced, focusing on global contour exploration and local detail repair at different training stages, respectively. The learning rate formula is: […]; where: $\hat{A}_t$: generalized advantage estimate at time $t$; $\gamma$: discount factor, default value 0.99; $\lambda$: GAE attenuation coefficient, default value 0.95; $\delta_t$: temporal-difference residual; $V_\phi$: value network with parameters $\phi$; $L^{CLIP}(\theta)$: PPO policy loss function; $\mathbb{E}_t$: expectation over time step $t$; $\rho_t(\theta)$: policy probability ratio; $\mathrm{clip}(\cdot)$: clipping function, which limits the range of the probability ratio; $\epsilon$: clipping factor, default value 0.2; $\theta_{old}$: policy network parameters before the update; the learning rate formula involves: the learning rate at time $t$; the initial learning rate, default value […]; the starting step for learning rate decay, default value 10000; and the maximum number of training iterations, default value 50000.

S5: Geometric refinement and model output: After the reinforcement learning agent converges or reaches the preset number of iterations, the final implicit field representation is converted into an explicit polygonal mesh model using the marching cubes algorithm. The operation formula is as follows: […]; The model surface texture mapping is optimized using coordinate-based neural radiance field technology. The texture optimization employs frequency-encoded coordinate mapping, and the operation formulas are as follows: […]; Finally, mesh smoothing based on Laplacian coordinates is performed and a 3D model file is output. The mesh smoothing formula is: […]; where the formulas involve: the initial polygonal mesh model; the marching cubes algorithm, which converts the implicit SDF into an explicit mesh; the final signed distance field after training convergence; the SDF isosurface threshold, set to 0; the frequency encoding feature of a 3D point; the order of the frequency encoding, default value 10; the texture color value at a 3D point; a multilayer perceptron dedicated to texture color prediction; the smoothed mesh vertex set; the original mesh vertex set; the smoothing coefficient […]; and the Laplacian coordinate transformation of the vertex set.
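The initial state construction in S1 combines a signed distance field with surface normals obtained from its gradient. The sketch below illustrates only that relationship. It substitutes an analytic sphere for the SDF-prediction multilayer perceptron (an assumption made so the example is self-contained) and estimates the gradient by central differences; all function names are hypothetical.

```python
import numpy as np


def sphere_sdf(points, radius=1.0):
    """Signed distance to a sphere at the origin (stand-in for the SDF MLP).

    Negative inside the surface, positive outside.
    """
    return np.linalg.norm(points, axis=-1) - radius


def sdf_unit_normals(sdf_fn, points, h=1e-4):
    """Unit normals as the normalized gradient of the SDF (central differences)."""
    points = np.asarray(points, dtype=float)
    grad = np.zeros_like(points)
    for axis in range(3):
        offset = np.zeros(3)
        offset[axis] = h
        grad[..., axis] = (sdf_fn(points + offset) - sdf_fn(points - offset)) / (2.0 * h)
    length = np.linalg.norm(grad, axis=-1, keepdims=True)
    # Guard against division by zero where the gradient vanishes.
    return grad / np.maximum(length, 1e-12)


def occupancy_from_sdf(sdf_values, sharpness=10.0):
    """Soft occupancy probability via a sigmoid of the negated SDF (an assumption)."""
    return 1.0 / (1.0 + np.exp(sharpness * np.asarray(sdf_values, dtype=float)))
```

The raw gradient and its normalized form correspond to the raw and unit normal vectors listed in the initial state; the sigmoid occupancy mapping is one plausible choice, not one stated in the claim.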

2. The method for generating 3D models based on AI reinforcement learning of a single 2D image according to claim 1, characterized in that, in S1, in the multi-dimensional feature extraction process, the feature extraction network employs an overlapping sampling strategy when segmenting and sampling the two-dimensional image, dividing the image into 16×16 patches and retaining 50% overlap to enhance the continuity of edge features; the Transformer encoder block contains 12 attention heads, and the hidden layer dimension is set to 768; the feature extraction network is also configured with a learnable class embedding vector, which is used to aggregate semantic context information from the entire image, and a feature pyramid structure is introduced to fuse deep semantic information with shallow texture information through lateral connections, forming a multi-scale feature descriptor. The fusion formulas are as follows: […]; where the formulas involve: the multi-scale fused feature of each layer; the original shallow/deep feature of each layer; the upsampling function, which maps deep-layer features to the scale of the shallower layer; the final feature descriptor after multi-scale feature concatenation; the feature concatenation function; and the total number of layers in the feature pyramid, which is 4.
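The overlapping sampling strategy of claim 2 (16×16 patches with 50% overlap) corresponds to a sliding window with a stride of 8 pixels. A minimal NumPy sketch follows; the function name is hypothetical, and dropping partial patches at the image border is an assumption, since border handling is not specified.

```python
import numpy as np


def extract_overlapping_patches(image, patch=16, overlap=0.5):
    """Split an (H, W, C) image into patch x patch blocks with fractional overlap.

    Partial patches at the right and bottom borders are dropped (assumption).
    """
    stride = max(1, int(round(patch * (1.0 - overlap))))
    height, width = image.shape[:2]
    patches = []
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            patches.append(image[y:y + patch, x:x + patch])
    return np.stack(patches)
```

With 50% overlap, each interior pixel is covered by four patches, which is what provides the continuity of edge features across patch boundaries.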

3. The method for generating a 3D model based on AI reinforcement learning of a single 2D image according to claim 2, characterized in that, in S2, the policy network adopts a multilayer perceptron structure with residual connections; the geometric evolution action adopts a hierarchical architecture, where the first layer of actions is global scaling and displacement for rapid alignment of the 3D envelope, and the second layer of actions is adaptive offset of local vertices for fine-tuning details. The displacement vector of each vertex is decomposed into a direction vector and a step size factor. The direction vector is obtained by equal-angle sampling on a unit sphere, and the step size factor is dynamically adjusted according to the resolution of the current mesh. In each action iteration, the agent simultaneously outputs the current optimal displacement and an action uncertainty index. The action uncertainty is calculated using entropy, and the operation formula is as follows: […]; When the uncertainty exceeds a preset threshold, the system automatically increases the sampling density in that area. The sampling density update formula is: […]; After the agent performs an action, a state transition occurs. The transition formulas are as follows: […]; where the formulas involve: the action uncertainty index at time $t$; the information entropy function, where a larger entropy value indicates higher uncertainty in action selection; the action uncertainty threshold; the new sampling density/mesh resolution; the original sampling density/mesh resolution, where the resolution doubles when the uncertainty exceeds the threshold; the SDF value of a point at time $t$; the SDF increment at the point caused by the action at time $t$; the unit surface normal vector of the point at time $t$; the normal vector increment at the point caused by the action at time $t$; the raw value of the normal vector of the point at time $t$; and the new state of the agent at the next time step.
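The uncertainty-driven refinement in claim 3 can be sketched as an entropy computation followed by a threshold test that doubles the resolution. The discrete action distribution and the function names are assumptions for illustration; the claim specifies only that entropy is used and that the resolution doubles above the threshold.

```python
import numpy as np


def action_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum())


def update_resolution(resolution, probs, threshold):
    """Double the local sampling resolution when uncertainty exceeds the threshold."""
    if action_entropy(probs) > threshold:
        return resolution * 2
    return resolution
```

A near-deterministic policy has entropy close to zero and leaves the resolution unchanged, while a near-uniform policy over many candidate actions triggers refinement.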

4. The method for generating a 3D model based on AI reinforcement learning from a single 2D image according to claim 3, characterized in that, in S3, the computation of the visual projection reward involves a rasterization-based differentiable rendering technique that allows gradients to be propagated directly from 2D image pixels back to the coordinates of 3D vertices; the reprojection loss function is defined as the sum of the pixel-level squared difference loss between the original image and the rendered image and the L1 loss between the original image mask and the rendered image mask, i.e.: $L_{reproj} = \mathrm{MSE}(I, \hat{I}) + \mathrm{L1}(M, \hat{M})$, where $I$ is the original image, $\hat{I}$ is the rendered image, $M$ is the original image mask, and $\hat{M}$ is the rendered image mask; minimizing the reprojection loss function increases the visual projection reward, which drives the model to approximate the original image in terms of contour.
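The reprojection loss of claim 4 can be written in a few lines once the rendered image and mask are available. The sketch below assumes both are already produced by some differentiable renderer (not shown) and uses mean reduction over pixels, which is an assumption; the optional weights allow the 0.5/0.5 weighting of the embodiment. The function name is hypothetical.

```python
import numpy as np


def reprojection_loss(image, rendered, mask, rendered_mask, w_mse=1.0, w_l1=1.0):
    """Pixel-level MSE between images plus L1 between segmentation masks."""
    image = np.asarray(image, dtype=float)
    rendered = np.asarray(rendered, dtype=float)
    mask = np.asarray(mask, dtype=float)
    rendered_mask = np.asarray(rendered_mask, dtype=float)
    mse = np.mean((image - rendered) ** 2)
    l1 = np.mean(np.abs(mask - rendered_mask))
    return float(w_mse * mse + w_l1 * l1)
```

In an actual training loop this would be computed with an autodiff framework so that gradients reach the 3D vertices; NumPy is used here only to show the arithmetic.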

5. The method for generating a 3D model based on AI reinforcement learning from a single 2D image according to claim 4, characterized in that, in S3, the calculation of the physical consistency reward involves a surface stress analysis process based on a spring-damper model, which treats the generated 3D mesh as a set of points connected by elastic edges and calculates the mesh deformation energy under specific external loads; if the model's structure contains suspended or unreasonably elongated connections that cause the deformation energy to exceed the threshold, the surface stress reward is set to 0; the physical consistency reward also involves manifold closure detection, in which the Euler characteristic is calculated to detect the presence of hanging edges, non-manifold vertices, and open boundaries; if the result does not conform to the preset genus of the object, the manifold closure reward is set to 0.
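The manifold closure check of claim 5 can be illustrated by computing the Euler characteristic V − E + F directly from a face list and comparing it with 2 − 2g for a closed orientable surface of genus g. The function names are hypothetical, and the binary 1/0 reward follows the rule that a mismatch yields a reward of 0.

```python
def euler_characteristic(faces):
    """Return V - E + F for a polygon mesh given as vertex-index tuples."""
    vertices = set()
    edges = set()
    for face in faces:
        count = len(face)
        vertices.update(face)
        for i in range(count):
            a, b = face[i], face[(i + 1) % count]
            # Store each undirected edge once.
            edges.add((min(a, b), max(a, b)))
    return len(vertices) - len(edges) + len(faces)


def manifold_closure_reward(faces, genus=0):
    """1.0 if the mesh matches the Euler characteristic of a closed genus-g surface."""
    return 1.0 if euler_characteristic(faces) == 2 - 2 * genus else 0.0
```

The Euler characteristic alone does not catch every defect, so a fuller check would typically also verify that each edge is shared by exactly two faces.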

6. The method for generating a 3D model based on AI reinforcement learning from a single 2D image according to claim 5, characterized in that, in S4, to enhance the reasoning ability for occluded regions, a cross-dimensional cross-attention mechanism is introduced to associate two-dimensional image features with the coordinates of sampling points in three-dimensional space. The formulas are as follows: […]; By establishing a lookup table, the coordinates of the center point of each voxel in 3D space are projected onto 2D image coordinates according to a preset camera extrinsic matrix. Local feature blocks are then extracted centered on these coordinates, and the association weights between the 3D point and its corresponding 2D pixel and its context are calculated. When the agent determines that the current viewpoint information is insufficient to support deterministic deformation, the diffusion model prior is activated, and the closest geometric distribution probability is extracted from the pre-trained 3D shape library as a supplementary guide for reinforcement learning exploration. The fusion formula is as follows: […]; where the formulas involve: the coordinates of a 3D point projected onto the two-dimensional image; the camera projection function, whose inputs are the 3D point, the camera intrinsic parameters, and the camera extrinsic parameters; the two-dimensional local feature block corresponding to the 3D point; the feature extraction function, which extracts local features centered on the projected coordinates; the size of the local feature block, default value 3; the cross-dimensional cross-attention feature of the 3D point; the normalized exponential (softmax) function; the attention feature dimension, default value 64, used to prevent gradient explosion; the final policy distribution after fusing the diffusion prior; the strategy fusion coefficient, default value 0.8; and the geometric prior probability distribution of the pre-trained diffusion model.
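The cross-attention step of claim 6 can be sketched with standard scaled dot-product attention between one 3D point query and the features of a 3×3 local block. This is an illustration under assumptions: the scaled dot-product form, the linear mixture used to fuse the policy with the diffusion prior, and all function names are assumed, since the formulas themselves are not given here. Only the defaults (block size 3, feature dimension 64, fusion coefficient 0.8) come from the claim.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_attention(query, keys, values):
    """Attend from one 3D point query (d,) to k*k local 2D features (k*k, d)."""
    dim = query.shape[-1]
    # Scaling by sqrt(d) keeps the scores in a range where softmax is well behaved.
    scores = keys @ query / np.sqrt(dim)
    weights = softmax(scores)
    return weights @ values, weights


def fuse_with_prior(policy_probs, prior_probs, alpha=0.8):
    """ASSUMED linear mixture of the policy and the diffusion prior distribution."""
    mixed = alpha * np.asarray(policy_probs, dtype=float) \
        + (1.0 - alpha) * np.asarray(prior_probs, dtype=float)
    return mixed / mixed.sum()
```

With a 3×3 block there are nine keys per 3D point, so the attention weights form a nine-element distribution over the pixel and its immediate context.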

7. The method for generating 3D models based on AI reinforcement learning of a single 2D image according to claim 6, characterized in that, in S5, the texture optimization stage employs a frequency-encoded coordinate mapping technique, which maps 3D point coordinates to a high-dimensional space composed of higher-order sine and cosine functions in order to learn high-frequency surface texture details. Simultaneously, a symmetry prior is introduced; when the target object is detected to have symmetrical features, the texture of the visible area is automatically mapped to the non-visible side. The operation formula is as follows: […]; Furthermore, a reinforcement learning fine-tuning mechanism is used to eliminate the visual discontinuity at the seams. The formula for the seam fine-tuning loss is: […]; Before outputting the final model, a mesh deformation algorithm based on Laplacian coordinates is executed to smooth the mesh surface by solving a system of linear equations; geometric thickening is then performed on the weak regions of the model obtained from the finite element analysis, using the following formulas: […]; Geometric thickening is achieved by the agent outputting a displacement increment along the normal direction in the weak region, thereby adjusting the isosurface threshold of the implicit field; where the formulas involve: the symmetrical point of a 3D point; the plane of symmetry/axis of symmetry of the object; the texture seam loss function, where a smaller value indicates a smoother seam; the three-dimensional point set at the texture seam; the geometric thickening factor, which is adaptively adjusted according to the stress magnitude in the weak area; the raw value of the normal vector at the point, along which the thickening is applied; the final signed distance field after thickening; and the final 3D mesh model after geometric thickening and smoothing.
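The two geometric ingredients of claim 7, frequency encoding of coordinates and mirroring a point across a symmetry plane, can be sketched as follows. The encoding shown is the common sine/cosine form with frequencies that double at each order, which is an assumption about the exact formula; order 10 is the default stated in claim 1. The function names and the plane parameterization (normal plus offset) are hypothetical.

```python
import numpy as np


def frequency_encode(points, order=10):
    """Map (..., 3) coordinates to (..., 3 * 2 * order) sine/cosine features."""
    p = np.asarray(points, dtype=float)
    freqs = (2.0 ** np.arange(order)) * np.pi
    angles = p[..., None] * freqs                       # (..., 3, order)
    encoded = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return encoded.reshape(*p.shape[:-1], -1)


def reflect_across_plane(points, plane_normal, plane_offset=0.0):
    """Mirror points across the plane {x : n . x = offset}."""
    n = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n)
    p = np.asarray(points, dtype=float)
    signed_dist = p @ n - plane_offset
    return p - 2.0 * signed_dist[..., None] * n
```

Under the symmetry prior, the texture colour predicted at a visible point would be reused at its mirrored counterpart on the non-visible side.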

8. The system for generating 3D models based on AI reinforcement learning of a single 2D image according to any one of claims 1-7, characterized in that it includes:

An image input and preprocessing module, which receives, scales, and normalizes two-dimensional images, and generates a segmentation mask for the target using a deep convolutional neural network (U-Net) pre-trained on the COCO dataset. This module includes an adaptive contrast enhancement submodule to improve the edge sharpness of the target object; the contrast enhancement formula is as follows: […]; where the formula involves: the contrast-enhanced image; the contrast-limited adaptive histogram equalization (CLAHE) function; the original image after normalization; the contrast limit threshold; and the histogram block size.

A high-dimensional feature encoding subsystem, which includes a pre-trained visual feature extractor that uses a hierarchical feature fusion architecture to capture low-level texture information and high-level semantic category information of images and transform them into vector representations.

A reinforcement learning agent core engine, which integrates a policy network, a value network, and a target network, and supports distributed parallel computing; internally, it maintains a dynamic topology graph structure, in which the number of mesh vertices dynamically increases or decreases based on the magnitude of local curvature. The formulas for calculating local curvature are: $H = (\kappa_1 + \kappa_2)/2$; […]; where: $H$: the mean curvature at a 3D point; $\kappa_1$, $\kappa_2$: the principal curvatures at the point; the remaining formulas involve the curvature threshold, default value 0.5, and the local subdivision count at the point, where resampling is automatically triggered in regions of high curvature to increase the local subdivision count.

An environment simulation and multidimensional evaluation module, which includes a differentiable rendering engine and a physical simulation engine. The physical simulation engine uses a force analysis model based on particle systems to simulate the static equilibrium state of an object on a horizontal support surface. If the support polygon does not include the vertical projection of the object's center of gravity, a penalty term is fed back. The environment simulation and multidimensional evaluation module is equipped with a real-time ray tracing unit to simulate ambient occlusion and global illumination when evaluating visual rewards.

A geometric reconstruction and post-processing unit, which is responsible for converting the implicit representation into a standard-format triangular mesh, and for performing mesh simplification, hole filling, and normal correction to generate a 3D asset file that can be called by industrial software. Before outputting the final model, the unit calculates the stress distribution of the model under its own weight and preset external loads based on the finite element analysis principle, and feeds back the analysis results to the agent for geometric thickening. Geometric thickening is achieved by the agent outputting a displacement increment along the normal direction in the weak region, thereby adjusting the isosurface threshold of the implicit field.
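The static equilibrium test described for the physical simulation engine reduces to checking whether the vertical projection of the centre of gravity falls inside the support polygon. The sketch below uses a standard ray-casting point-in-polygon test. It assumes gravity acts along the negative z axis and uses a hypothetical penalty value of -1.0, since the size of the penalty term is not specified.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is the 2D point strictly inside the polygon?"""
    x, y = point
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def centroid_balance_reward(centroid, support_polygon, penalty=-1.0):
    """1.0 if the projected centroid is supported, else the (assumed) penalty."""
    cx, cy, _ = centroid  # drop z: project along the gravity direction
    return 1.0 if point_in_polygon((cx, cy), support_polygon) else penalty
```

For a unit-style check, a centroid above the middle of a square footprint is rewarded, while one that overhangs the footprint receives the penalty.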

9. The system for generating 3D models based on AI reinforcement learning of a single 2D image according to claim 8, characterized in that the core engine of the reinforcement learning agent adopts an asynchronous update architecture, which includes multiple environment samplers running in independent processes and a central parameter server. The environment samplers collect trajectory data and push it to the central parameter server to update the global policy weights. The system operates in a cluster environment consisting of multiple computing nodes. Each computing node is equipped with a graphics processing unit (GPU) with tensor core acceleration capabilities. Distributed parameter synchronization is achieved through the InfiniBand high-speed communication bus, and matrix operations are performed using hardware acceleration units. The minimum hardware configuration of the system is: an NVIDIA RTX 3090 or higher GPU, an Intel Xeon Gold 6330 or higher CPU, ≥64 GB of memory, and ≥2 computing nodes.

10. The system for generating 3D models based on AI reinforcement learning of a single 2D image according to claim 9, characterized in that it also includes a multi-agent collaboration mechanism. When processing a single image containing multiple mutually occluded objects, a reinforcement learning agent is assigned to each independent target, and the multiple agents share a global scene feature map. They communicate and coordinate boundary ranges to prevent spatial overlap or physical penetration; the implicit representation employs a multi-level continuous hierarchy, allowing users to specify different sampling frequencies according to the application scenario in order to extract 3D assets of varying complexity.