Target object modeling method

By combining multi-view optical images and auxiliary data with the SfM+MVS algorithm to generate an initial model and integrating morphological and multimodal visual features, the problem of high requirements for lighting and texture in existing technologies is solved, and high-precision 3D modeling is achieved.

CN122199862A · Pending · Publication Date: 2026-06-12 · SECOND AFFILIATED HOSPITAL OF COLLEGE OF MEDICINE OF XIAN JIAOTONG UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SECOND AFFILIATED HOSPITAL OF COLLEGE OF MEDICINE OF XIAN JIAOTONG UNIV
Filing Date: 2026-02-06
Publication Date: 2026-06-12

Application Information


IPC: G06T17/20; G06T5/70; G06V10/72; G06V10/44; G06V10/80; G06V10/74; G06T7/10; G06V10/26; G06T7/13; G06V10/46; G06V10/50; G06N3/0464; G06N3/045; G06N3/0455; G06N3/0499; G06N3/084


AI Technical Summary

⚠Technical Problem

Existing 3D reconstruction technologies place high demands on lighting and texture and struggle with weakly textured or reflective surfaces, resulting in insufficient modeling accuracy.

⚗Method used

Using multi-view optical images and auxiliary data such as point cloud data and infrared images, an initial model is generated through the SfM+MVS algorithm. Combining morphological features and multimodal visual features, the final target object model is generated.

🎯Benefits of technology

The generated model has both high-precision geometric structure and rich surface features, meeting the requirements of industrial tolerances and medical anatomy, and is suitable for practical scenarios such as quality inspection, simulation and replication.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to the technical field of modeling, in particular to a target object modeling method. A plurality of target optical images of a target object at different angles and target auxiliary data corresponding to the target object are acquired; the target auxiliary data comprises target point cloud data and / or a target infrared image; the target object is modeled based on the target optical images and the target auxiliary data, and an initial target object model is obtained; the initial target object model is identified to determine target morphological features corresponding to the target object; based on the target optical images and the target auxiliary data, multi-modal visual features corresponding to the target object are determined; and based on the target morphological features and the multi-modal visual features, a final target object model corresponding to the target object is generated. The generated final model has high-precision geometric structure and rich surface features, and meets the requirements of rules in the fields of industrial tolerance and medical dissection, and can be directly applied to actual scenes such as quality inspection, simulation, replication and the like.


Description

Technical Field

[0001] This invention relates to the field of modeling technology, and more specifically to a method for modeling target objects.

Background Technology

[0002] 3D reconstruction technology, which transforms 2D images into 3D digital models, serves as a crucial bridge connecting the physical and digital worlds, with wide applications in fields such as industrial manufacturing, medical diagnosis, cultural heritage preservation, and virtual reality. Existing 3D reconstruction techniques typically employ geometric vision methods, such as Structure from Motion (SfM) and Multi-View Stereo (MVS). These methods can generate dimensionally accurate and geometrically correct models, but they place high demands on lighting and texture, struggle to handle weakly textured or reflective surfaces, and often produce models that lack surface detail and texture realism. Therefore, ensuring the accuracy of modeling target objects remains a pressing challenge in the current technological field.

Summary of the Invention

[0003] This invention provides a method for modeling target objects to solve the problem of how to ensure the accuracy of modeling target objects.

[0004] In a first aspect, the present invention provides a target object modeling method, the method comprising: acquiring multiple optical images of the target object from different perspectives and target auxiliary data corresponding to the target object; the target auxiliary data including target point cloud data and / or target infrared images; modeling the target object based on the target optical images and target auxiliary data to obtain an initial target object model; identifying the initial target object model to determine the target morphological features corresponding to the target object; determining the multimodal visual features corresponding to the target object based on the target optical images and target auxiliary data; and generating a final target object model corresponding to the target object based on the target morphological features and multimodal visual features.

[0005] In one optional implementation, acquiring multiple target optical images of the target object from different perspectives includes: acquiring multiple initial optical images of the target object from different perspectives, calculating the average image gradient magnitude of each initial optical image, and determining the sharpness of each initial optical image; calculating the image grayscale histogram distribution of each initial optical image, and determining the illumination uniformity of each initial optical image; calculating the camera pose corresponding to each initial optical image based on the SfM algorithm, and determining the viewpoint coverage integrity of each initial optical image; and selecting the target optical image from each initial optical image based on the sharpness, illumination uniformity, and viewpoint coverage integrity of each initial optical image.

[0006] In one optional implementation, the initial auxiliary data includes at least one of initial point cloud data, initial infrared image, initial ultrasound image, and initial temporal video frame; obtaining target auxiliary data corresponding to the target object includes: for the initial point cloud data, performing denoising processing on the initial point cloud data using statistical filtering and radius filtering to obtain target point cloud data; for the initial infrared image, performing filtering processing on the initial infrared image using a bilateral filtering algorithm to generate a target infrared image; for the initial ultrasound image, performing filtering processing on the initial ultrasound image using a wavelet threshold filtering algorithm to generate a target ultrasound image; and for the initial temporal video frame, removing motion-blurred video frames using the inter-frame difference method to obtain the target temporal video frame.

[0007] In one optional implementation, modeling the target object based on the target optical image and target auxiliary data to obtain an initial target object model includes: identifying the target optical image and determining the target camera parameters corresponding to the target optical image; performing dense matching on the target optical image using the SfM+MVS algorithm to output a depth map; back-projecting each pixel in the depth map to a unified world coordinate system based on the target camera parameters to generate initial dense point cloud data; optimizing the initial dense point cloud data using a region growing algorithm to generate target dense point cloud data; converting the target dense point cloud data into a triangular mesh model using a Poisson reconstruction algorithm; and optimizing the triangular mesh model based on the target auxiliary data to obtain the initial target object model.
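The back-projection step in this implementation can be illustrated with a minimal sketch. It assumes a pinhole camera with intrinsics fx, fy, cx, cy and extrinsics defined by X_cam = R · X_world + t; these conventions and all function names are illustrative assumptions, since the text does not fix a camera model.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Back-project pixel (u, v) with depth (along the camera z-axis) to world coordinates.

    Assumes a pinhole model and extrinsics defined by X_cam = R @ X_world + t.
    """
    # Pixel -> camera coordinates.
    xc = (u - cx) / fx * depth
    yc = (v - cy) / fy * depth
    zc = depth
    # Camera -> world: X_world = R^T (X_cam - t).
    d = (xc - t[0], yc - t[1], zc - t[2])
    return tuple(sum(R[r][c] * d[r] for r in range(3)) for c in range(3))


def depth_map_to_points(depth_map, fx, fy, cx, cy, R, t):
    """Back-project every pixel with a positive depth into one world-frame point list."""
    points = []
    for v, row in enumerate(depth_map):
        for u, depth in enumerate(row):
            if depth > 0:  # treat non-positive depth as "no measurement"
                points.append(backproject_pixel(u, v, depth, fx, fy, cx, cy, R, t))
    return points
```

Calling depth_map_to_points once per view, each with that view's R and t, places all points in the same world coordinate system, which is what allows the per-view depth maps to be merged into one dense point cloud.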

[0008] In one optional implementation, optimizing the triangular mesh model based on the target auxiliary data to obtain the initial target object model includes: sampling the triangular mesh model to obtain sampled point cloud data corresponding to the triangular mesh model; performing an initial matching between the target point cloud data and the sampled point cloud data to obtain an initial registration result; using the initial registration result as the initial value and, in each iteration, employing a kd-tree nearest point search algorithm to match the nearest point on the triangular mesh model to each point in the target point cloud data, thereby constructing corresponding point pairs; solving for the optimal rigid transformation matrix using the least squares method based on the corresponding point pairs; updating the coordinates of the target point cloud data based on the optimal rigid transformation matrix, and calculating the root mean square error (RMSE) of the current iteration; and, if the RMSE is less than a preset convergence threshold or the maximum number of iterations is reached, stopping the iteration to obtain the initial target object model; and / or, using the ORB algorithm to extract edge feature points from the target infrared image and the target optical image, respectively, and generating the binary descriptors corresponding to each image; calculating the Hamming distance between the binary descriptors of the target infrared image and those of the target optical image; determining valid matching pairs among the binary descriptors based on the Hamming distance; solving the homography matrix using the RANSAC algorithm based on the valid matching pairs; performing coordinate transformation on the target infrared image based on the homography matrix to obtain the transformed infrared image in the target optical image coordinate system; and mapping the edge features of the transformed infrared image to the triangular mesh model according to the pixel coordinate correspondence to obtain the initial target object model.
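The Hamming-distance matching of binary descriptors can be sketched as follows. Descriptors are represented as Python integers, and the cross-check and the max_distance threshold of 64 are assumptions chosen for illustration; the text only states that valid pairs are determined based on the Hamming distance.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_descriptors(desc_ir, desc_opt, max_distance=64):
    """Cross-checked nearest-neighbour matching.

    Returns a list of (infrared_index, optical_index, distance) tuples.
    max_distance is a hypothetical acceptance threshold.
    """
    if not desc_ir or not desc_opt:
        return []
    matches = []
    for i, d_ir in enumerate(desc_ir):
        # Closest optical descriptor for this infrared descriptor.
        j = min(range(len(desc_opt)), key=lambda n: hamming(d_ir, desc_opt[n]))
        # Cross-check: that optical descriptor must also pick this infrared one.
        i_back = min(range(len(desc_ir)), key=lambda m: hamming(desc_ir[m], desc_opt[j]))
        dist = hamming(d_ir, desc_opt[j])
        if i_back == i and dist <= max_distance:
            matches.append((i, j, dist))
    return matches
```

The surviving pairs would then be passed to a RANSAC homography estimator, which discards any remaining mismatches as outliers.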

[0009] In one optional implementation, the initial target object model is identified to determine the target morphological features corresponding to the target object, including: acquiring the target domain corresponding to the target object; determining a preset feature extraction model corresponding to the target object based on the target domain; the preset feature extraction model includes a size-based feature extraction module, a curvature-based feature extraction module, a symmetry-based feature extraction module, a topological structure-based feature extraction module, and a specific feature extraction module corresponding to the target object; acquiring the feature extraction weights corresponding to the size-based feature extraction module, the curvature-based feature extraction module, the symmetry-based feature extraction module, the topological structure-based feature extraction module, and the specific feature extraction module; allocating computing resources to each feature extraction module based on each feature extraction weight; controlling each feature extraction module to extract features based on the computing resources corresponding to each feature extraction module, thereby obtaining the size-based features, curvature-based features, symmetry-based features, topological structure-based features, and specific features corresponding to the target object; and determining the target morphological features corresponding to the target object based on the size-based features, curvature-based features, symmetry-based features, topological structure-based features, and specific features.
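One possible reading of "allocating computing resources to each feature extraction module based on each feature extraction weight" is a proportional split of a fixed budget. The sketch below makes that assumption explicit; the unit of resource (for example worker threads), the module names, and the largest-remainder rounding are illustrative choices, not taken from the text.

```python
def allocate_resources(weights, total_units):
    """Split total_units (e.g. worker threads) across modules in proportion to weights.

    Uses the largest-remainder method so the integer shares sum to total_units.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    exact = {name: total_units * w / total_weight for name, w in weights.items()}
    shares = {name: int(x) for name, x in exact.items()}
    leftover = total_units - sum(shares.values())
    # Hand the remaining units to the modules with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda n: exact[n] - shares[n], reverse=True)
    for name in by_fraction[:leftover]:
        shares[name] += 1
    return shares
```

For instance, weights of 4, 3, 1, 1, 1 for the size, curvature, symmetry, topology, and specific modules with 8 units would give the two dominant modules 3 and 2 units and each remaining module 1 unit.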

[0010] In one optional implementation, the target auxiliary data includes a target infrared image; based on the target optical image and the target auxiliary data, determining the multimodal visual features corresponding to the target object includes: extracting features from each target optical image at different levels to generate features corresponding to each level of the target optical image; determining the image weights corresponding to each target optical image according to the viewpoint type corresponding to each target optical image; fusing the features at each level based on the image weights to obtain a basic visual feature vector; extracting edge features from the target infrared image in the target auxiliary data to obtain infrared edge features corresponding to the target infrared image; and fusing the basic visual feature vector and the infrared edge features to generate multimodal visual features.

[0011] In one optional implementation, the features at each level include a first-level feature map, a second-level feature vector, and a third-level feature vector. Performing feature extraction on each target optical image at different levels to generate the corresponding features at each level includes: performing first-level feature extraction on the target optical image to obtain first-level edge features, first-level texture features, and first-level color features; fusing the first-level edge features, first-level texture features, and first-level color features to generate a first-level feature map; downsampling the first-level feature map using max pooling to obtain a downsampled feature map; applying a multi-channel convolutional kernel to the downsampled feature map for sliding convolution aggregation to obtain an aggregated feature map; performing a dilated convolution operation on the aggregated feature map to obtain an expanded receptive field feature map; comparing the expanded receptive field feature map with a preset response threshold to determine candidate shape regions; extracting structural features based on these candidate shape regions to generate a structured feature map; performing global average pooling on the structured feature map to obtain a second-level feature vector; performing global average pooling on the second-level feature vector to obtain a feature vector of a preset dimension; and applying L2 normalization to the feature vector of the preset dimension to map the feature values of each dimension to a uniform scale, resulting in a third-level feature vector.
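The two closing operations of this implementation, global average pooling and L2 normalization, can be sketched in a few lines. The feature map layout [channels][height][width] and the function names are assumptions for illustration.

```python
import math


def global_average_pool(feature_map):
    """Collapse a feature map of shape [channels][height][width] to one value per channel."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length; eps guards against division by zero."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]
```

After normalization every feature vector has unit length, so vectors from different images can be compared or fused without one image dominating purely because of its response magnitude.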

[0012] In one optional implementation, a final target object model is generated based on the target morphological features and multimodal visual features, including: fusing the target morphological features and multimodal visual features to obtain target fused features; performing dimensionality reduction processing on the target fused features based on principal component analysis to generate target dimensionality reduction features; determining the rule constraint vector corresponding to the target object based on the target application scenario; multiplying the rule constraint vector and the target dimensionality reduction features to obtain an enhanced feature vector; determining a preset 3D latent generation model according to the target application scenario; and inputting the enhanced feature vector into the preset 3D latent generation model to generate the final target object model corresponding to the target object.

[0013] In one optional implementation, the enhanced feature vector is input into a preset 3D latent generation model to generate a final target object model corresponding to the target object. This includes: splitting the enhanced feature vector into a key vector and a value vector; using initial latent features generated by the preset 3D latent generation model as query vectors; calculating the association weights between latent features and enhanced feature vectors in each injection layer using a cross-attention mechanism; weighting the key vectors and value vectors based on the association weights to obtain a weighted feature vector; updating the initial latent features based on the weighted feature vector to obtain updated latent features; outputting a virtual target object model based on the updated latent features from the preset 3D latent generation model; calculating the target loss value between the virtual target object model and the initial target object model based on a preset loss function; the preset loss function includes geometric consistency loss, feature matching loss, and domain constraint loss; and correcting the parameters in the preset 3D latent generation model based on the target loss value until the target loss value is less than the preset loss function value, thus obtaining the final target object model.
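The cross-attention injection can be sketched as a single-head scaled dot-product attention step. The sketch follows the standard formulation, in which the association weights are computed from query-key similarity and applied to the value vectors; the residual update of the latent feature and the representation of keys and values as lists of vectors are assumptions, since the text does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_update(latent, keys, values):
    """One cross-attention step with the latent feature as the query.

    keys and values are lists of vectors derived from the enhanced feature vector.
    Returns (updated_latent, association_weights).
    """
    d = len(latent)
    scores = [sum(q * k for q, k in zip(latent, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    attended = [sum(w * val[i] for w, val in zip(weights, values)) for i in range(dim)]
    # Residual update of the latent feature (an assumption; the text only says "updating").
    updated = [l + a for l, a in zip(latent, attended)]
    return updated, weights
```

In a full model this step would be repeated in each injection layer, with learned projections producing the queries, keys, and values.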

[0014] The target object modeling method provided in this embodiment utilizes multi-view optical images to fully cover the surface details of the object, avoiding information blind spots from a single viewpoint. Target point cloud data provides high-precision three-dimensional geometric coordinates, and infrared images capture hidden features that are difficult to identify in optical images, providing multi-dimensional data support for subsequent modeling. Then, optical visual features are fused with the geometric and thermal features of the point cloud / infrared image to compensate for the accuracy deficiencies of single-data modeling, quickly generating an initial target object model with basic geometric shape and surface details. Core morphological features such as size, curvature, symmetry, and topological structure are extracted from the initial target object model to accurately depict the geometric essence of the object. Integrating the texture and color features of the optical image with the edge features of the infrared image forms a "visual + infrared" multimodal feature set, which enhances the recognition of key object details and improves the accuracy of subsequent model fusion. Finally, the target morphological features and multimodal visual features are fused to generate the final target object model, resolving the issues of detail deviations and rule inconsistencies in the initial target object model. The generated final model has both high-precision geometric structure and rich surface features, while meeting the requirements of industrial tolerances, medical anatomy and other fields. It can be directly applied to practical scenarios such as quality inspection, simulation and replication.

Attached Figure Description

[0015] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0016] Figure 1 is a schematic diagram of an application scenario according to an embodiment of the present invention; Figure 2 is a schematic diagram of the first type of target object modeling method according to an embodiment of the present invention.

Detailed Implementation

[0017] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0018] This embodiment provides a target object modeling method that can be used in electronic devices. Figure 1 is a flowchart of a target object modeling method according to an embodiment of the present invention. As shown in Figure 1, the process includes the following steps: Step S101: Acquire multiple optical images of the target object from different perspectives and the corresponding auxiliary data of the target object.

[0019] The target auxiliary data includes target point cloud data and / or target infrared images.

[0020] Specifically, the electronic device can receive multiple optical images of the target object from different perspectives and the corresponding target auxiliary data of the target object input by the user.

[0021] This step will be explained in detail below.

[0022] Step S102: Model the target object based on the target optical image and target auxiliary data to obtain an initial target object model.

[0023] Specifically, electronic devices can use a preset method to model the target object based on the target optical image and target auxiliary data to obtain an initial target object model.

[0024] This step will be explained in detail below.

[0025] Step S103: Identify the initial target object model and determine the target morphological features corresponding to the target object.

[0026] Specifically, electronic devices can scan the initial target object model to determine the target morphological features corresponding to the target object.

[0027] This step will be explained in detail below.

[0028] Step S104: Based on the target optical image and target auxiliary data, determine the multimodal visual features corresponding to the target object.

[0029] Specifically, electronic devices can identify target optical images and target auxiliary data to determine the multimodal visual features corresponding to the target object.

[0030] This step will be explained in detail below.

[0031] Step S105: Based on the target morphological features and multimodal visual features, generate the final target object model corresponding to the target object.

[0032] Specifically, the electronic device trains a preset 3D latent generative model based on the target's morphological features and multimodal visual features to generate the final target object model corresponding to the target object.

[0033] The target object modeling method provided in this embodiment acquires multiple optical images of the target object from different perspectives, along with auxiliary data. The multi-view optical images can fully cover the surface details of the object, avoiding information blind spots from a single perspective. The target point cloud data provides high-precision three-dimensional geometric coordinates, and the infrared images can capture hidden features that are difficult to identify in optical images, providing multi-dimensional data support for subsequent modeling. Then, based on the target optical images and auxiliary data, an initial target object model is obtained. This model integrates optical visual features with the geometric and thermal features of the point cloud / infrared image, compensating for the accuracy deficiencies of single-data modeling and quickly generating an initial target object model with basic geometric shape and surface details, providing a benchmark framework for subsequent feature extraction and optimization. The initial target object model is identified, and its morphological features are determined. Core morphological features such as size, curvature, symmetry, and topological structure are extracted from the initial target object model to accurately depict the geometric essence of the object. Based on the target optical images and auxiliary data, multimodal visual features are determined. The texture and color features of the optical images are integrated with the edge features of the infrared images to form a "visual + infrared" multimodal feature set, which enhances the recognition of key details of the object and improves the accuracy of subsequent model fusion. By integrating target morphological features with multimodal visual features, a final target object model is generated, resolving the issues of detail deviations and rule inconsistencies in the initial target object model. 
The generated final model possesses both high-precision geometric structure and rich surface features, while simultaneously meeting the rule requirements of fields such as industrial tolerances and medical anatomy, and can be directly applied to practical scenarios such as quality inspection, simulation, and replication.

[0034] This embodiment provides a method for modeling a target object. As shown in Figure 2, the process includes the following steps: Step S201: Acquire multiple optical images of the target object from different perspectives and corresponding auxiliary data of the target object.

[0035] The target auxiliary data includes target point cloud data and / or target infrared images.

[0036] Specifically, step S201 above may include the following steps: Step S2011: Acquire multiple initial optical images of the target object from different perspectives.

[0037] Specifically, the electronic device can receive multiple initial optical images of the target object from different perspectives input by the user, or it can receive multiple initial optical images of the target object from different perspectives sent by the camera device.

[0038] Step S2012: Calculate the average image gradient magnitude of each initial optical image to determine the sharpness of each initial optical image.

[0039] Specifically, the electronic device can convert the initial optical image into a grayscale image $I(x,y)$, where $x, y$ are pixel coordinates. The grayscale image is then convolved with the horizontal Sobel operator $G_x$ and the vertical Sobel operator $G_y$, respectively, and the electronic device calculates the gradient magnitude $G(x,y)$ for each pixel:

$$G(x,y) = \sqrt{\big[(G_x * I)(x,y)\big]^2 + \big[(G_y * I)(x,y)\big]^2}$$

[0040] The electronic device calculates the average gradient magnitude $\bar{G}$ of the entire initial optical image from the gradient magnitude of each pixel:

$$\bar{G} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} G(x,y)$$

where $W$ is the image width and $H$ is the image height. The electronic device can compare the calculated average gradient magnitude $\bar{G}$ with a preset average gradient magnitude threshold TG. If $\bar{G}$ is greater than or equal to TG, the image is considered clear and is retained as a candidate optical image; otherwise the image is considered blurry and is directly discarded.
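The sharpness check can be sketched in plain Python as follows. The sketch assumes the image is a list of rows of grayscale values, uses the standard 3×3 Sobel kernels, replicates edge pixels so that all W × H pixels contribute, and takes the Euclidean magnitude of the two responses; the kernel values, border handling, and function names are assumptions not stated in the text.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def mean_gradient_magnitude(gray):
    """Average Sobel gradient magnitude over all W*H pixels of a grayscale image."""
    h, w = len(gray), len(gray[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # Replicate edge pixels so border pixels also get a gradient value.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    p = gray[yy][xx]
                    gx += SOBEL_X[dy + 1][dx + 1] * p
                    gy += SOBEL_Y[dy + 1][dx + 1] * p
            total += math.hypot(gx, gy)
    return total / (w * h)


def is_sharp(gray, threshold):
    """Keep the image as a candidate when the mean gradient reaches the threshold TG."""
    return mean_gradient_magnitude(gray) >= threshold
```

The kernels are applied as a correlation rather than a flipped convolution, which changes only the sign of the responses and leaves the magnitude unchanged. A suitable threshold depends on image resolution and content and would need to be calibrated on sample images.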

[0041] Step S2013: Calculate the image grayscale histogram distribution of each initial optical image to determine the illumination uniformity of each initial optical image.

[0042] Specifically, the electronic device can convert the initial optical image into an 8-bit grayscale image (grayscale value range 0-255), count the number of pixels appearing at each grayscale level, and generate a grayscale histogram $H(k)$, where $k \in [0,255]$ is the grayscale level. The electronic device can then calculate the proportion $P_{\text{high}}$ of the highlight region based on the grayscale histogram. The highlight region is defined as the set of pixels with a gray value $k \ge 220$, and its proportion in the entire image is calculated as:

$$P_{\text{high}} = \frac{\sum_{k=220}^{255} H(k)}{\sum_{k=0}^{255} H(k)}$$

The electronic device can also calculate the proportion $P_{\text{low}}$ of the shaded area. The shaded area is defined as the set of pixels with a gray value $k \le 35$, and its proportion in the entire image is calculated as:

$$P_{\text{low}} = \frac{\sum_{k=0}^{35} H(k)}{\sum_{k=0}^{255} H(k)}$$

In both expressions the denominator is the total number of pixels in the image.

[0043] The electronic device can also calculate the standard deviation $\sigma_H$ of the grayscale distribution. The standard deviation of the grayscale histogram reflects the dispersion of grayscale values; a smaller standard deviation indicates more uniform illumination. The calculation formula is:

$$\sigma_H = \sqrt{\frac{\sum_{k=0}^{255} (k - \mu_H)^2 \, H(k)}{\sum_{k=0}^{255} H(k)}}$$

where the grayscale mean is

$$\mu_H = \frac{\sum_{k=0}^{255} k \, H(k)}{\sum_{k=0}^{255} H(k)}$$

[0044] If $P_{\text{high}} > 30\%$, the initial optical image is determined to be overexposed, and the highlight areas lack texture information; therefore, it is discarded. If $P_{\text{low}} > 40\%$, the initial optical image is considered underexposed, and details in shadow areas are lost; therefore, it is discarded. If $P_{\text{high}} \le 30\%$, $P_{\text{low}} \le 40\%$, and $\sigma_H \le 50$, the illumination is considered uniform and the image is retained as a candidate optical image.
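The illumination check can be sketched directly from the histogram definitions and the 220 / 35 / 30% / 40% / 50 thresholds given above. The function names and the "non-uniform" label for images that pass both exposure tests but exceed the standard deviation limit are assumptions; the text does not name that case.

```python
import math


def illumination_stats(gray):
    """Return (p_high, p_low, sigma_h) for an 8-bit grayscale image given as a list of rows."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    n = sum(hist)                      # total number of pixels
    p_high = sum(hist[220:]) / n       # share of pixels with k >= 220
    p_low = sum(hist[:36]) / n         # share of pixels with k <= 35
    mean = sum(k * hist[k] for k in range(256)) / n
    sigma_h = math.sqrt(sum((k - mean) ** 2 * hist[k] for k in range(256)) / n)
    return p_high, p_low, sigma_h


def classify_illumination(gray):
    """Apply the exposure and uniformity thresholds to one image."""
    p_high, p_low, sigma_h = illumination_stats(gray)
    if p_high > 0.30:
        return "overexposed"
    if p_low > 0.40:
        return "underexposed"
    if sigma_h <= 50:
        return "uniform"
    return "non-uniform"  # assumed label for the case the text leaves unnamed
```

Only images classified as "uniform" would be retained as candidate optical images under this reading.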

[0045] Step S2014: Based on the SfM algorithm, calculate the camera pose corresponding to each initial optical image and determine the viewpoint coverage integrity of each initial optical image.

[0046] Specifically, the electronic device can use the SIFT algorithm to extract local feature points from each initial optical image, calculate feature point descriptors, and achieve cross-image feature point matching through a FLANN matcher. Then, an initial optical image is selected as a reference image, and its camera coordinate system is used as the world coordinate system. The camera extrinsic parameters (rotation matrix R + translation vector t) of other initial optical images are solved through essential matrix factorization to obtain the camera pose of each initial optical image. Next, the electronic device uses the matched feature points and camera extrinsic parameters to generate a sparse point cloud of the target object through a triangulation algorithm to verify the rationality of the camera pose.

[0047] The electronic device can sort the azimuth angles (horizontal angles) of the camera poses from 0° to 360° and calculate the azimuth angle difference between adjacent viewpoints. If there is a continuous interval with a difference greater than 60°, this interval is determined to be a viewpoint gap, and the gap angle is $\theta_{\text{gap}}$. The electronic device can also count the number of viewpoints from which key complex areas of the target object are visible and calculate their proportion of the total number of viewpoints: $C_{\text{key}}$ = number of visible viewpoints in key areas / total number of initial viewpoints × 100%.

[0048] If $\theta_{\text{gap}} \le 60°$ and $C_{\text{key}} \ge 80\%$, the viewing angle coverage is considered complete, and the candidate optical images are retained; if $\theta_{\text{gap}} > 60°$, images of the missing viewpoints need to be acquired; if $C_{\text{key}} < 80\%$, close-up viewpoint images of the key areas need to be acquired.
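The coverage check can be sketched as follows, assuming each camera pose has already been reduced to an azimuth in degrees and a flag saying whether the key area is visible from it. Treating the interval across the 0°/360° boundary as an adjacent pair is an assumption, as are the function and key names.

```python
def largest_azimuth_gap(azimuths_deg):
    """Largest angular gap (degrees) between adjacent camera azimuths, including wrap-around."""
    angles = sorted(a % 360.0 for a in azimuths_deg)
    if not angles:
        return 360.0
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(angles[0] + 360.0 - angles[-1])  # gap across the 0/360 boundary
    return max(gaps)


def coverage_report(azimuths_deg, key_region_visible, max_gap=60.0, min_key_ratio=0.80):
    """Evaluate theta_gap and C_key against the 60 degree and 80% limits."""
    theta_gap = largest_azimuth_gap(azimuths_deg)
    c_key = sum(key_region_visible) / len(key_region_visible)
    return {
        "theta_gap": theta_gap,
        "c_key": c_key,
        "needs_more_viewpoints": theta_gap > max_gap,
        "needs_key_closeups": c_key < min_key_ratio,
        "complete": theta_gap <= max_gap and c_key >= min_key_ratio,
    }
```

Eight views spaced 45° apart, with the key area visible in seven of them, would pass both checks; three views at 0°, 90°, and 180° would be flagged for additional viewpoints because of the 180° gap.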

[0049] Step S2015: Select the target optical image from each initial optical image based on the sharpness, illumination uniformity, and viewing angle coverage integrity of each initial optical image.

[0050] Specifically, the electronic device can acquire the weights w1, w2, and w3 for sharpness, illumination uniformity, and viewing angle coverage integrity corresponding to each initial optical image. The sharpness score S1 is derived from the average gradient magnitude and normalized to the range 0 to 1. The illumination uniformity score S2 is 1 if the illumination is uniform and 0 otherwise. The viewing angle coverage contribution score S3 is the contribution of a single initial optical image to the viewing angle coverage: for images without missing areas, S3=1; for images with missing areas, S3=1.2 (bonus points). The overall score of a single initial optical image is S=w1S1+w2S2+w3S3. The electronic device compares the overall score S of each initial optical image with the overall score threshold TS (e.g., TS=0.8) and retains as target optical images only those with S≥TS.
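The weighted scoring and thresholding can be sketched as below. The weights (0.4, 0.3, 0.3) are hypothetical, since the text gives no values, and the covers_gap flag reflects one interpretation of "images with missing areas", namely images whose viewpoint helps cover a gap.

```python
def overall_score(s1, uniform, covers_gap, w1, w2, w3):
    """S = w1*S1 + w2*S2 + w3*S3 with S2 in {0, 1} and S3 in {1.0, 1.2}."""
    s2 = 1.0 if uniform else 0.0
    s3 = 1.2 if covers_gap else 1.0  # bonus for images that help cover a viewpoint gap
    return w1 * s1 + w2 * s2 + w3 * s3


def select_target_images(candidates, weights=(0.4, 0.3, 0.3), ts=0.8):
    """Keep the names of candidates whose overall score reaches the threshold TS.

    candidates: iterable of (name, s1, uniform, covers_gap) tuples.
    weights: hypothetical (w1, w2, w3); the text does not specify values.
    """
    w1, w2, w3 = weights
    return [name for name, s1, uniform, covers_gap in candidates
            if overall_score(s1, uniform, covers_gap, w1, w2, w3) >= ts]
```

With these example weights, a sharp image with non-uniform illumination scores at most 0.4 + 0 + 0.36 = 0.76 and is rejected, so the illumination term effectively acts as a gate.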

[0051] Step S2016: For the initial point cloud data, statistical filtering and radius filtering are used to denoise the initial point cloud data to obtain the target point cloud data.

[0052] Specifically, for each point $P_i$ in the initial point cloud data, the electronic device searches for its k nearest neighbors and calculates the average distance $d_i$ from $P_i$ to those k neighbors. It then calculates the mean $\mu_d$ and standard deviation $\sigma_d$ of the average distances over all points and sets an anomaly detection threshold $T_d = \mu_d + 3\sigma_d$ (the 3-standard-deviation principle). If $d_i > T_d$, the point is identified as a noise point and removed, yielding the candidate point cloud data. This efficiently eliminates discrete, isolated noise points while preserving the overall structure of the point cloud and the geometric contour of the target object.
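
The statistical filter can be illustrated with a brute-force sketch in plain Python. It assumes a small cloud (the neighbor search is quadratic); a real pipeline would likely use a spatial index. The function name and the default k are hypothetical.

```python
import math


def statistical_filter(points, k=8, n_sigma=3.0):
    """Drop points whose mean k-NN distance exceeds mean + n_sigma * std."""
    mean_dists = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        mean_dists.append(sum(nearest) / len(nearest))

    mu = sum(mean_dists) / len(mean_dists)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in mean_dists) / len(mean_dists))
    threshold = mu + n_sigma * sigma  # the 3-standard-deviation rule by default

    return [p for p, d in zip(points, mean_dists) if d <= threshold]
```

On a 5×5 planar grid with one far-away point added, only the far-away point is removed.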

[0053] Next, the electronic device sets a search radius r and a neighbor count threshold $T_n$. For each point $P_i$ in the candidate point cloud data, it counts the number of neighboring points $n_i$ within radius r. If $n_i < T_n$, the point is determined to lie in a low-density noise area and is removed; if $n_i \ge T_n$, it is retained as a valid point, yielding the target point cloud data. This accurately retains the high-density valid point cloud on the surface of the target object, removes low-density noise areas caused by occlusion or environmental interference, and improves the purity of the point cloud data.
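
A matching sketch of the radius filter, again brute force and with a hypothetical function name. The radius and neighbor threshold are left to the caller because the text does not give values.

```python
import math


def radius_filter(points, radius, min_neighbors):
    """Keep points that have at least min_neighbors other points within radius."""
    kept = []
    for i, p in enumerate(points):
        n_i = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if n_i >= min_neighbors:
            kept.append(p)
    return kept
```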

[0054] And / or, Step S2017, for the initial infrared image, use the bilateral filtering algorithm to filter the initial infrared image to generate the target infrared image.

[0055] Specifically, the electronic device can set the core parameters of the bilateral filter; the choice of parameters directly affects the filtering result. For example, for detecting hidden patterns on cultural heritage objects, the neighborhood window size (Ω) is 5×5, the spatial domain standard deviation ($\sigma_s$) is 2.0, and the gray domain standard deviation ($\sigma_r$) is 30.0; for detecting thermal defects in industrial parts, Ω is 3×3, $\sigma_s$ is 1.5, and $\sigma_r$ is 20.0; for infrared imaging of human tissue in medical applications, Ω is 7×7, $\sigma_s$ is 2.5, and $\sigma_r$ is 35.0.

[0056] The electronic device can process each target pixel $(x_i, y_i)$ in the initial infrared image I(x, y). Taking $(x_i, y_i)$ as the center, it obtains the neighborhood window Ω set above (such as 5×5); all pixels within the window are denoted $(x_j, y_j)$.

[0057] The electronic device can calculate the spatial domain weight $w_s$, which reflects the spatial distance between $(x_j, y_j)$ and $(x_i, y_i)$; in the standard Gaussian form of bilateral filtering the formula is $w_s = \exp\left(-\frac{(x_i - x_j)^2 + (y_i - y_j)^2}{2\sigma_s^2}\right)$, where the closer the spatial distance, the larger the value of $w_s$ (range: $0 < w_s \le 1$). The electronic device can also calculate the gray domain weight $w_r$, which reflects the difference in grayscale value between $(x_j, y_j)$ and $(x_i, y_i)$: $w_r = \exp\left(-\frac{\left(I(x_i, y_i) - I(x_j, y_j)\right)^2}{2\sigma_r^2}\right)$, where the smaller the difference in grayscale values, the larger the value of $w_r$ (range: $0 < w_r \le 1$).

[0058] The electronic device then multiplies the spatial domain weight and the gray domain weight to obtain the final weighting coefficient: $w_{ij} = w_s \cdot w_r$. Then, the electronic device performs a weighted average of the gray values of all pixels in the neighborhood to obtain the filtered gray value of the target pixel: $I'(x_i, y_i) = \frac{\sum_{(x_j, y_j) \in \Omega} w_{ij}\, I(x_j, y_j)}{\sum_{(x_j, y_j) \in \Omega} w_{ij}}$. After traversing all pixels of the initial infrared image, the electronic device assembles the filtered gray values at their original positions to obtain the complete target infrared image.
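
A plain-Python sketch of the bilateral filter for a small grayscale image stored as a list of rows. It assumes Gaussian spatial and gray weights (the standard form of the filter) and clips the window at the image border; both are assumptions, since the text does not state how borders are handled. The defaults follow the cultural heritage example (5×5, 2.0, 30.0).

```python
import math


def bilateral_filter(image, window=5, sigma_s=2.0, sigma_r=30.0):
    """Edge-preserving smoothing of a 2D grayscale image (list of rows)."""
    height, width = len(image), len(image[0])
    half = window // 2
    out = [[0.0] * width for _ in range(height)]
    for yi in range(height):
        for xi in range(width):
            num = den = 0.0
            # Window is clipped at the image border (assumption).
            for yj in range(max(0, yi - half), min(height, yi + half + 1)):
                for xj in range(max(0, xi - half), min(width, xi + half + 1)):
                    w_s = math.exp(-((xi - xj) ** 2 + (yi - yj) ** 2)
                                   / (2.0 * sigma_s ** 2))
                    diff = image[yi][xi] - image[yj][xj]
                    w_r = math.exp(-(diff ** 2) / (2.0 * sigma_r ** 2))
                    num += w_s * w_r * image[yj][xj]
                    den += w_s * w_r
            out[yi][xi] = num / den  # den >= 1 because the center pixel has weight 1
    return out
```

A flat image is returned unchanged, and a sharp step edge much larger than `sigma_r` is preserved because the gray weight across the edge is close to zero.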

[0059] Step S202: Model the target object based on the target optical image and target auxiliary data to obtain an initial target object model.

[0060] Specifically, step S202 above may include the following steps: Step S2021: Identify the target optical image and determine the target camera parameters corresponding to the target optical image.

[0061] Specifically, the electronic device can extract feature points from each target optical image using the SIFT (Scale Invariant Feature Transform) algorithm. The process involves constructing a difference-of-Gaussians (DoG) pyramid (6 layers by default), detecting local extrema, and completing sub-pixel localization without additional filtering (all feature points are retained). A 128-dimensional SIFT descriptor is generated for each feature point and normalized to eliminate the influence of illumination.

[0062] Then, one target optical image is selected as the reference image, and the rest are images to be matched. The electronic device calculates the Euclidean distance between each feature point in the image to be matched and all feature points in the reference image, and selects the one with the smallest distance as the initial matching pair. A distance ratio threshold of 0.7 is set, and matching pairs with a "nearest neighbor distance / second nearest neighbor distance" ≤ 0.7 are retained, while obvious mismatches are eliminated to obtain candidate matching pairs.
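
The distance ratio test described here can be sketched with brute-force nearest-neighbor search. Descriptors are assumed to be equal-length numeric tuples; the function name is hypothetical.

```python
import math


def ratio_test_matches(desc_query, desc_ref, ratio=0.7):
    """Return (query_index, ref_index) pairs that pass the ratio test.

    A match is kept when nearest / second-nearest distance <= ratio.
    """
    matches = []
    for qi, q in enumerate(desc_query):
        dists = sorted((math.dist(q, r), ri) for ri, r in enumerate(desc_ref))
        if len(dists) < 2:
            continue  # the ratio test needs two neighbors
        (d1, r1), (d2, _) = dists[0], dists[1]
        if d2 > 0 and d1 <= ratio * d2:
            matches.append((qi, r1))
    return matches
```

A query descriptor that is equally close to two reference descriptors has a ratio of 1 and is rejected as ambiguous.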

[0063] Next, the RANSAC iteration count is set to 1000, and 8 candidate matching pairs are randomly selected in each iteration. The fundamental matrix F, which satisfies $x_2^T F x_1 = 0$ (where $x_1$ and $x_2$ are the homogeneous coordinates of the matching points), is then solved. The electronic device calculates the reprojection error of all candidate matching pairs and retains matching pairs with an error ≤ 2 pixels as valid matching pairs. The fundamental matrix F with the largest number of valid matching pairs is selected to complete the matching pair purification. Then, the electronic device transforms the fundamental matrix F into the essential matrix E: $E = K^T F K$ (where K is the initial intrinsic parameter matrix of the camera). The electronic device then performs singular value decomposition (SVD) on the essential matrix E, obtaining 4 possible pose solutions. Based on "depth consistency", the unique valid rotation matrix R and translation vector t are selected as the camera extrinsic parameters. Finally, with the goal of "minimizing the reprojection error of all valid matching points", the electronic device uses the Levenberg-Marquardt (LM) algorithm to fine-tune the intrinsic parameters (focal length, principal point coordinates) and extrinsic parameters (R, t). Optimization stops when the average reprojection error is <0.5 pixels, yielding the final target camera parameters (intrinsic parameter matrix K and extrinsic parameters R and t).

[0064] Step S2022: The SfM+MVS algorithm is used to perform dense matching on the target optical image and output a depth map.

[0065] Specifically, the electronic device can call the SfM+MVS fusion algorithm to perform dense matching on each of the target optical images, output the initial depth image corresponding to each of the target optical images, and generate an initial depth image set. (Recommended tool: COLMAP+PMVS2)

[0066] The MVS algorithm, based on an optimized target camera parameter set, performs pixel-level dense matching of breast regions from adjacent viewpoints, calculating the disparity (the difference in pixel position under different viewpoints) of each pixel, with a matching accuracy ≤1 pixel. Then, based on the relationship between disparity and camera focal length, the disparity is converted into depth values, with the depth value range set to 0.5~2.0m (consistent with the actual spatial depth of the human breast). The electronic device can focus on optimizing the depth calculation of key areas such as the nipple and breast edge, ensuring that the depth error in these areas is ≤0.3mm. Finally, a set of initial depth images from multiple viewpoints is obtained (each optical image corresponds to one depth map, with depth values corresponding one-to-one with pixels).
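
The disparity-to-depth conversion can be illustrated with the usual rectified-stereo relation, depth = focal length × baseline / disparity. The baseline between adjacent viewpoints is an assumption introduced here; the text mentions only disparity and focal length. The range check uses the 0.5~2.0 m interval from the paragraph.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m,
                       depth_range=(0.5, 2.0)):
    """Convert a disparity (pixels) to depth (meters) for a rectified pair.

    baseline_m is a hypothetical parameter: the distance between the two
    camera centers. Returns None for invalid or out-of-range depths.
    """
    if disparity_px <= 0:
        return None
    depth = focal_px * baseline_m / disparity_px
    low, high = depth_range
    return depth if low <= depth <= high else None
```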

[0067] Step S2023: Based on the target camera parameters, back-project each pixel in the depth map to the unified world coordinate system to generate initial dense point cloud data. Specifically, the electronic device can select the geometric center of the target object as the origin of the world coordinate system and the principal axis of symmetry of the object as the Z-axis to construct a right-handed coordinate system. Each pixel (u,v) in the depth map, combined with the camera intrinsic matrix K and extrinsic parameters R and t, is back-projected to the world coordinate system (X,Y,Z) using the following formula (with the extrinsic parameters taken to map world coordinates to camera coordinates): $[X, Y, Z]^T = R^{-1}\left(d \cdot K^{-1}[u, v, 1]^T - t\right)$, where d is the depth value corresponding to pixel (u,v). The electronic device retains only pixels with depth values within the valid range (e.g., 0.1~5m) and removes invalid points that exceed the reconstruction range to generate the initial dense point cloud data.
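
A sketch of back-projecting one depth pixel to world coordinates. It assumes a pinhole camera with zero skew and extrinsic parameters that map world to camera coordinates (camera = R · world + t), so that world = Rᵀ(d · K⁻¹[u, v, 1]ᵀ − t). If the extrinsics follow the opposite convention, the last step changes accordingly. The function name is hypothetical.

```python
def backproject_pixel(u, v, d, K, R, t, valid_range=(0.1, 5.0)):
    """Back-project pixel (u, v) with depth d to world coordinates.

    K: 3x3 intrinsic matrix (zero skew assumed), R: 3x3 rotation,
    t: translation, with camera = R * world + t (assumed convention).
    Returns None when d is outside the valid depth range.
    """
    low, high = valid_range
    if not low <= d <= high:
        return None
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    # Camera coordinates: d * K^-1 [u, v, 1]^T
    cam = ((u - cx) * d / fx, (v - cy) * d / fy, d)
    diff = tuple(cam[i] - t[i] for i in range(3))
    # World coordinates: R^T * (cam - t); R^-1 equals R^T for a rotation
    return tuple(sum(R[r][c] * diff[r] for r in range(3)) for c in range(3))
```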

[0068] Step S2024: Optimize the initial dense point cloud data using a region growing algorithm to generate the target dense point cloud data.

[0069] Specifically, the electronic device can call the region growing algorithm in the PCL point cloud library and set the following core screening conditions. Normal vector deviation threshold ≤ 15°: this condition ensures that the normal vector directions of adjacent points within the growing region are consistent, so that the generated point cloud surface is smooth. Curvature threshold ≤ 0.02 mm⁻¹: this condition removes isolated points with excessive curvature, which are often noise or artifacts.

[0070] Electronic devices can select salient feature points on the surface of the target object as seed points. Then, starting from the seed points, the region growing algorithm iteratively adds neighboring points that meet the above two selection criteria to the current region until no further growth is possible. After growth is complete, points outside the region are considered outliers and are removed. The outlier percentage is required to be ≤1%. Then, the grown point cloud undergoes local smoothing, for example, using moving least squares (MLS), to eliminate jagged noise while preserving 0.5mm-level physiological protrusion details, finally generating dense point cloud data of the target. This point cloud is noise-free, has a regular topological structure, and accurately reflects the true surface morphology of the breast.
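
The growth rule can be sketched in plain Python. This is not the PCL implementation: it is a brute-force illustration that assumes unit-length normals and precomputed per-point curvatures, and it compares normals regardless of sign because normal orientation is only unified later in the pipeline. The neighbor radius is a hypothetical parameter.

```python
import math
from collections import deque


def region_grow(points, normals, curvatures, seed, radius,
                max_angle_deg=15.0, max_curvature=0.02):
    """Grow a region from a seed index; return sorted indices in the region."""
    cos_limit = math.cos(math.radians(max_angle_deg))
    region = {seed}
    queue = deque([seed])
    while queue:
        i = queue.popleft()
        for j in range(len(points)):
            if j in region or math.dist(points[i], points[j]) > radius:
                continue
            # abs(): normal orientation may not be unified yet (assumption)
            cos_angle = abs(sum(a * b for a, b in zip(normals[i], normals[j])))
            if cos_angle >= cos_limit and curvatures[j] <= max_curvature:
                region.add(j)
                queue.append(j)
    return sorted(region)
```

Points whose index is not returned would be treated as outliers and removed, as the paragraph describes.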

[0071] Step S2025: The Poisson reconstruction algorithm is used to convert the dense point cloud data of the target into a triangular mesh model.

[0072] Specifically, the electronic device can fit a tangent plane based on neighborhood points, calculate the normal vector of each point in the dense point cloud data of the target, and unify the direction of the normal vectors (towards the outside of the object). Then, it automatically adjusts the key parameters of Poisson reconstruction according to the complexity of the target object, for example, octree depth (8~12 levels) and reconstruction accuracy (0.01~0.1mm). The electronic device can treat the dense point cloud data of the target as sampling points in three-dimensional space and fit a continuous implicit surface (describing the object surface) using the Poisson equation. The fitted implicit surface is then subjected to isosurface extraction (Marching Cubes algorithm) to generate a triangular mesh model.

[0073] Step S2026: Based on the target auxiliary data, optimize the triangular mesh model to obtain the initial target object model.

[0074] Specifically, step S2026 above may include the following steps: Step a1: Perform point cloud sampling on the triangular mesh model to obtain the sampled point cloud data corresponding to the triangular mesh model.

[0075] Specifically, the electronic device can perform area-weighted sampling of the triangular mesh model, with the number of sampling points for each triangular facet proportional to its area, avoiding missed sampling of small faces and oversampling of large faces. This ensures that the sampled point cloud completely covers the mesh surface (including edges and texture detail areas), generating sampled point cloud data consistent with the geometry of the triangular mesh model. The point cloud format is XYZ three-dimensional coordinates, with no redundant noise points.
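
Area-weighted sampling can be sketched as follows. Faces are drawn at random with probability proportional to area, and a uniform point is drawn inside each chosen triangle using barycentric coordinates. Note the difference from the text: random selection makes the count per face proportional to area only in expectation, so very small faces can be missed at low sample counts. Function names and the fixed seed are illustrative assumptions.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a 3D triangle from the cross product of two edges."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (ab[1] * ac[2] - ab[2] * ac[1],
             ab[2] * ac[0] - ab[0] * ac[2],
             ab[0] * ac[1] - ab[1] * ac[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_mesh(vertices, faces, n_points, rng=None):
    """Sample n_points XYZ points on the mesh surface, weighted by face area."""
    rng = rng or random.Random(0)
    areas = [triangle_area(*(vertices[i] for i in f)) for f in faces]
    chosen = rng.choices(range(len(faces)), weights=areas, k=n_points)
    samples = []
    for fi in chosen:
        a, b, c = (vertices[i] for i in faces[fi])
        s = math.sqrt(rng.random())
        r2 = rng.random()
        u, v, w = 1.0 - s, s * (1.0 - r2), s * r2  # uniform barycentric weights
        samples.append(tuple(u * a[i] + v * b[i] + w * c[i] for i in range(3)))
    return samples
```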

[0076] Step a2: Perform an initial matching between the target point cloud data and the sampled point cloud data to obtain the initial registration result.

[0077] Specifically, the electronic device calculates a 33-dimensional Fast Point Feature Histogram (FPFH) descriptor for each point in both the target point cloud data and the sampled point cloud data. The FPFH descriptor captures local geometric features by statistically analyzing the angle and distance distribution of normal vectors within the point's neighborhood (search radius = 2 mm), thus adapting to the three-dimensional geometric characteristics of the target point cloud data.

[0078] Then, the electronic device uses the FLANN algorithm to construct a KD-Tree index for the FPFH descriptors of the sampled point cloud data, sets a matching threshold (similarity ≥ 0.85), quickly filters out high-quality matching pairs between two sets of point clouds, removes matching pairs with descriptor similarity < 0.85, and retains corresponding points with similar geometric structures.

[0079] Next, the electronic device calculates a rigid transformation matrix (rotation + translation) for the selected matching pairs using singular value decomposition (SVD). This rigid transformation matrix maps the target point cloud data to the unified world coordinate system of the triangular mesh model. Finally, based on the rigid transformation matrix, the electronic device performs initial registration between the target point cloud data and the sampled point cloud data. The Euclidean distance between corresponding points after registration is calculated to ensure that the overall alignment error is ≤5mm, providing a reliable initial value for subsequent fine registration.

[0080] Step a3: Using the initial registration result as the initial value, the kd-tree nearest point search algorithm is used to match the nearest point on the triangular mesh model for each point in the target point cloud data in each iteration, and construct the corresponding point pair.

[0081] Specifically, the electronic device can construct a kd-tree index on the coarsely registered mesh-sampled point cloud data to find nearest points quickly. For each point in the target point cloud data, the kd-tree is used to find its nearest point in the mesh-sampled point cloud data, forming an initial corresponding point pair. Outlier pairs with a Euclidean distance greater than 3mm are removed, and only geometrically close valid corresponding point pairs are retained to reduce iteration errors.
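
The correspondence step can be illustrated with a brute-force nearest-point search standing in for the kd-tree (an intentional simplification; the result is the same, only slower). Coordinates are assumed to be in millimeters so that the 3 mm cutoff applies directly; the function name is hypothetical.

```python
import math


def build_correspondences(target_points, sampled_points, max_dist=3.0):
    """Pair each target point with its nearest sampled point.

    Pairs farther apart than max_dist (3 mm in the text) are discarded.
    """
    pairs = []
    for p in target_points:
        q = min(sampled_points, key=lambda s: math.dist(p, s))
        if math.dist(p, q) <= max_dist:
            pairs.append((p, q))
    return pairs
```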

[0082] Step a4: Based on the corresponding point pairs, solve for the optimal rigid transformation matrix using the least squares method.

[0083] Specifically, the electronic device can construct a least-squares optimization function with the objective of minimizing the sum of squared Euclidean distances between corresponding point pairs: $E(R, t) = \sum_{i=1}^{N} \left\| R P_i + t - Q_i \right\|^2$, where $P_i$ is the coordinates of a point in the target point cloud data, $Q_i$ is the coordinates of the corresponding point in the mesh-sampled point cloud data, N is the number of corresponding point pairs, R is the rotation matrix, and t is the translation vector. The optimal R and t are obtained by iteratively solving with the least squares method, yielding the rigid transformation matrix that minimizes the error between corresponding points.
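
One common way to minimize the sum of squared distances between corresponding points under a rigid transform is the closed-form SVD solution (often called the Kabsch method). The sketch below uses it in place of the iterative solver mentioned in the text, which is a substitution made for brevity; it assumes NumPy is available, and the function name is hypothetical.

```python
import numpy as np


def best_rigid_transform(P, Q):
    """Return (R, t) minimizing sum ||R @ P_i + t - Q_i||^2.

    P, Q: (N, 3) arrays of corresponding points, N >= 3 and not collinear.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)       # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    det = np.linalg.det(Vt.T @ U.T)
    # Flip the last axis if needed so R is a rotation, not a reflection
    D = np.diag([1.0, 1.0, 1.0 if det >= 0 else -1.0])
    R = Vt.T @ D @ U.T
    t = q_mean - R @ p_mean
    return R, t
```

Applying the returned R and t to the target points gives the updated coordinates used in the next step.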

[0084] Step a5: Based on the optimal rigid transformation matrix, update the coordinates of the target point cloud data and calculate the root mean square error of the current iteration. If the root mean square error is less than the preset convergence threshold or the maximum number of iterations is reached, stop the iteration and obtain the initial target object model.

[0085] Specifically, the electronic device updates the coordinates of all points in the target point cloud data using the optimal rigid transformation matrix obtained in step a4, to obtain the aligned target point cloud data coordinates, as shown in the formula: $P_i' = R P_i + t$.

[0086] Here $P_i'$ is the updated coordinates of the target point cloud data. Then, the electronic device can calculate the root mean square error (RMSE) of the corresponding point pairs after the update, using the following formula: $RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| P_i' - Q_i \right\|^2}$, where $Q_i$ is the corresponding point and N is the number of corresponding point pairs.

[0087] If RMSE < preset convergence threshold (usually set to 0.5mm), or the number of iterations reaches the maximum value (default 50), the iteration stops; if convergence fails, return to step a3 to reconstruct the corresponding point pairs and repeat the iteration process. Finally, the electronic device merges the converged and aligned target point cloud data with the original triangular mesh model to obtain the initial target object model with corrected geometric accuracy.
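
The stopping rule can be sketched as a small control loop. The `step` argument is a hypothetical callable that performs one pass of steps a3 to a5 and returns the RMSE of that pass; only the RMSE computation and the termination logic are shown here.

```python
import math


def rmse(aligned_points, matched_points):
    """Root mean square distance between corresponding points."""
    n = len(aligned_points)
    total = sum(math.dist(p, q) ** 2
                for p, q in zip(aligned_points, matched_points))
    return math.sqrt(total / n)


def run_until_converged(step, rmse_threshold=0.5, max_iterations=50):
    """Call step() until RMSE < threshold or the iteration cap is reached.

    step: hypothetical callable doing one a3-a5 pass and returning its RMSE.
    """
    error = float("inf")
    for iteration in range(1, max_iterations + 1):
        error = step()
        if error < rmse_threshold:
            return {"converged": True, "iterations": iteration, "rmse": error}
    return {"converged": False, "iterations": max_iterations, "rmse": error}
```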

[0088] And / or, Step a6: The ORB algorithm is used to extract the edge feature points corresponding to the infrared image and the optical image of the target, respectively, and generate binary descriptors corresponding to the infrared image and the optical image of the target, respectively.

[0089] Specifically, the electronic device can employ the FAST-9 algorithm (neighborhood radius = 3), selecting nine equally spaced pixels on the circumference of a candidate pixel as the determination neighborhood (covering a sufficient range of grayscale abrupt changes). Then, an octree structure is used to divide both the target infrared image and the target optical image into 8×8 sub-blocks, with a preset maximum of 100 corner points extracted from each sub-block to avoid corner points concentrating in local high grayscale abrupt change areas (such as large object outlines), ensuring that subsequent corner points are evenly distributed.

[0090] The electronic device scans the infrared and optical images of the target pixel by pixel, filters out candidate pixels whose grayscale difference with neighboring pixels is ≥15, and determines them as potential grayscale abrupt change points to prepare for FAST corner point generation.

[0091] Then, the electronic device traverses all pixels of both the target infrared image and the target optical image row by row and column by column, from the top-left corner to the bottom-right corner (skipping the 3-pixel edge region to avoid neighborhood overflow). For each candidate pixel P (grayscale value $I_P$), it obtains the grayscale values $I_{Pn}$ (n = 1, 2, ..., 9) of its 9 neighboring pixels and counts the neighboring pixels that satisfy $|I_P - I_{Pn}| \ge 15$. If the number of neighboring pixels that meet the grayscale difference threshold is ≥ 5 (the FAST-9 judgment standard used here), P is judged to be a candidate point of grayscale abrupt change (a core pixel of an object outline or texture boundary); if fewer than 5 do, P is removed directly to reduce invalid calculations.

[0092] For each candidate pixel, the electronic device calculates its "corner response value." This is the sum of the grayscale differences between the candidate pixel and its neighboring pixels, expressed by the formula: $V_P = \sum_{n=1}^{9} \left| I_P - I_{Pn} \right|$. A higher response value indicates a more significant gray-level abrupt change at that point (such as the vertices of an object's outline). The electronic device selects a 3×3 neighborhood centered on the candidate point, retaining only the candidate point with the highest response value within that neighborhood and discarding other redundant candidate points (to avoid generating multiple corner points from the same gray-level abrupt change region). Then, the electronic device traverses the sub-blocks divided by the octree. If a sub-block has more than 100 candidate points, it sorts them by response value from highest to lowest and retains only the top 100; otherwise, it retains all of them, ensuring that corner points are evenly distributed in the image. The electronic device then aggregates all candidate points that have undergone non-maximum suppression and block equalization to form the initial FAST corner point set.
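
The candidate test, the response value, and the 3×3 non-maximum suppression described in [0091] and [0092] can be sketched as below. The neighbor grayscale values are passed in directly, because the exact positions of the nine circle pixels are not given in the text; all function names are hypothetical.

```python
def is_candidate(center, neighbors, diff_threshold=15, min_count=5):
    """True if at least min_count neighbors differ from center by >= threshold."""
    count = sum(1 for v in neighbors if abs(center - v) >= diff_threshold)
    return count >= min_count


def corner_response(center, neighbors):
    """Sum of absolute grayscale differences to the neighbors."""
    return sum(abs(center - v) for v in neighbors)


def non_max_suppress(responses):
    """Keep points that are a maximum within their 3x3 neighborhood.

    responses: dict mapping (x, y) to response value. Ties are all kept.
    """
    kept = {}
    for (x, y), value in responses.items():
        others = [
            responses.get((x + dx, y + dy), float("-inf"))
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)
        ]
        if all(value >= other for other in others):
            kept[(x, y)] = value
    return kept
```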

[0093] Then, the initial FAST corner points in the initial FAST corner point set are sorted by response value, and invalid points with response values <15 and neighborhood pixel gradient changes <5 are removed to ensure that each image has ≥800 effective feature points. The electronic device calculates the gradient direction histogram of the neighborhood (31×31 pixels) of each effective feature point, and selects the direction with the largest gradient magnitude as the main direction of the feature point to achieve rotation invariance of the descriptor. The electronic device selects a 31×31 pixel neighborhood centered on the feature point and divides it into 16 2×2 sub-regions. Eight directional gradient histograms are calculated for each sub-region, and the four strongest gradient directions are extracted to generate 16-bit binary codes. The codes of the 16 sub-regions are concatenated to obtain a 256-bit binary descriptor.

[0094] Step a7: Calculate the Hamming distance between the binary descriptors corresponding to the infrared image and the optical image of the target, respectively.

[0095] Specifically, for each pair of feature point descriptors in the infrared image and the optical image of the target, the number of different corresponding bits is calculated, which is the Hamming distance (the smaller the value, the higher the similarity of the feature points).

[0096] Step a8: Based on the Hamming distance, determine the valid matching pairs from the binary descriptors corresponding to the target infrared image and the target optical image, respectively.

[0097] Specifically, the electronic device can compare each calculated Hamming distance with a preset Hamming distance threshold and retain the matching pairs whose Hamming distance is less than or equal to the threshold as valid matching pairs.
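
Steps a7 and a8 can be sketched with descriptors stored as 32-byte strings (256 bits). The threshold value of 50 is a placeholder, since the text only says it is preset, and selecting the nearest optical descriptor for each infrared descriptor before thresholding is an assumption about how pairs are formed.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def match_descriptors(ir_descriptors, optical_descriptors, max_distance=50):
    """Return (ir_index, optical_index, distance) for valid matches.

    max_distance is a hypothetical preset threshold.
    """
    matches = []
    for i, d_ir in enumerate(ir_descriptors):
        j, dist = min(
            ((j, hamming_distance(d_ir, d_opt))
             for j, d_opt in enumerate(optical_descriptors)),
            key=lambda pair: pair[1],
        )
        if dist <= max_distance:
            matches.append((i, j, dist))
    return matches
```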

[0098] Step a9: Based on the valid matching pairs, the homography matrix is solved using the RANSAC algorithm.

[0099] Specifically, the electronic device can set a preset number of iterations (e.g., 1000 times) and a preset pixel threshold for inliers (e.g., 2 pixels, where a projection error ≤ 2 pixels is considered an inlier). In each iteration, four non-collinear valid matching pairs are randomly selected, and the initial value of the homography matrix is solved using the Direct Linear Transform (DLT) algorithm. Then, based on the DLT algorithm, the projection constraint is satisfied using the least squares method: x′ = Hx (where x is the homogeneous coordinate of a pixel in the optical image, x′ is the homogeneous coordinate of a pixel in the infrared image, and H is the homography matrix).

[0100] The electronic device can substitute all valid matching pairs into the initial matrix (which can be pre-set), calculate the reprojection error, and count the number of inliers. The matrix with the highest number of inliers is retained as the optimal candidate matrix. Based on the inlier set of the optimal candidate matrix, the electronic device further optimizes the matrix parameters using the least squares method to obtain the final homography matrix.

[0101] Step a10: Based on the homography matrix, perform coordinate transformation on the target infrared image to obtain the transformed infrared image of the target infrared image in the target optical image coordinate system.

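[0102] Specifically, the electronic device can substitute the homogeneous coordinates (u, v, 1) of all pixels in the target infrared image into the homography to calculate their corresponding coordinates (u', v') in the target optical image coordinate system, as shown in the formula: $s\,[u', v', 1]^T = H_{IR \to opt}\,[u, v, 1]^T$, where s is the homogeneous scale factor and $H_{IR \to opt}$ is the homography that maps infrared pixel coordinates to optical pixel coordinates (the inverse of H when H is defined by x′ = Hx from the optical image to the infrared image). This yields the transformed infrared image of the target in the target optical image coordinate system.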
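
Mapping one pixel through a homography amounts to a matrix-vector product followed by division by the third homogeneous component. In this sketch the caller is assumed to pass the matrix that maps from the source image (infrared) to the destination image (optical); the function name is hypothetical.

```python
def apply_homography(H, u, v):
    """Map pixel (u, v) through the 3x3 homography H; return (u', v')."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return x / w, y / w
```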

[0103] Step a11: Map the edge features of the converted infrared image to a triangular mesh model according to the pixel coordinate correspondence to obtain the initial target object model.

[0104] Specifically, the electronic device can perform Canny edge detection on the converted infrared image, setting a low threshold of 50 and a high threshold of 150 to extract clear edge contours. Then, morphological closing operations with 3×3 convolution kernels are used to connect broken edge segments, forming continuous structured edges. All edge pixels are traversed, and their (u,v) coordinates in the target optical image coordinate system are recorded to construct an edge feature coordinate set.

[0105] Then, based on the camera intrinsic (K) and extrinsic (R / t) parameters obtained during the reconstruction of the target optical image, the electronic device back-projects the pixel points (u,v) in the edge feature coordinate set to the world coordinate system, as shown in the formula (with the extrinsic parameters taken to map world coordinates to camera coordinates): $[X, Y, Z]^T = R^{-1}\left(d \cdot K^{-1}[u, v, 1]^T - t\right)$, where d is the depth value corresponding to the pixel (obtained from the depth map of the optical image).

[0106] The electronic device locates the corresponding surface region in the triangular mesh model based on the back-projected spatial coordinates (X, Y, Z). It marks weakly textured regions with blurred edges (judgment criteria: mesh vertex normal vector angle > 15°, surface curvature change < 0.01). The electronic device then marks the edge features of the target infrared image at the corresponding positions in the triangular mesh model according to the pixel-mesh coordinate correspondence, generating an initial target object model.

[0107] Step S203: Identify the initial target object model and determine the target morphological features corresponding to the target object.

[0108] Specifically, step S203 above may include the following steps: Step S2031: Obtain the target domain corresponding to the target object.

[0109] Specifically, the electronic device can receive the target domain corresponding to the target object input by the user, and can also identify the initial target object model corresponding to the target object to determine the target domain corresponding to the target object.

[0110] Step S2032: Based on the target domain, determine the preset feature extraction model corresponding to the target object.

[0111] The preset feature extraction model includes a size-based feature extraction module, a curvature-based feature extraction module, a symmetry-based feature extraction module, a topological structure-based feature extraction module, and a dedicated feature extraction module for the target object.

[0112] Specifically, electronic devices can search for a preset feature extraction model corresponding to the target object based on the target domain.

[0113] Specifically, the feature extraction models for all domains are built on a modular architecture of "general modules + specific modules". The general modules (fixed) include four types of feature extraction modules: size, curvature, symmetry, and topology, covering the basic geometric features of all objects. The specific modules (dynamic matching) automatically match scene-specific extraction modules according to the target domain. For example, in the industrial field (mechanical parts), the specific module "thread detection + thread parameter calculation" is matched, and the general module defaults to prioritizing size and topology extraction; in the cultural heritage field (ceramics), the specific module "pattern segmentation + pattern curvature calculation" is matched, and the general module defaults to prioritizing curvature and symmetry extraction; in the medical field, the specific module "lesion region identification + lesion morphology parameter extraction" is matched, and the general module defaults to prioritizing size and curvature extraction.

[0114] Electronic devices can load deformable Transformer networks as the basic architecture of geometric feature encoders, which include feature input layers (adapting to point cloud / triangular mesh model input), dynamic encoding layers (adapting to different feature weights), and feature aggregation layers (fusing multi-dimensional features), with default configuration of general feature extraction parameters (such as 3-layer network, 3×3 convolution kernel).

[0115] Step S2033: Obtain the feature extraction weights corresponding to the size feature extraction module, curvature feature extraction module, symmetry feature extraction module, topology feature extraction module, and dedicated feature extraction module.

[0116] Specifically, the electronic device can receive the feature extraction weights for the size, curvature, symmetry, topology, and dedicated feature extraction modules as input by the user, or receive them from other devices. The electronic device can also determine these weights according to the target scene requirements corresponding to the target object.

[0117] Step S2034: Allocate computing resources to each feature extraction module based on the feature extraction weights.

[0118] Specifically, the electronic device allocates computing resources to each feature extraction module based on the weights of each feature extraction. For example, high-priority features (weight ≥ 0.8) are allocated 80% of the core computing resources, employing sub-pixel-level computation, high-density sampling (e.g., 10 points / mm²), and sophisticated algorithms (e.g., PointNet++ encoder, 7-10 layer network); medium-priority features (0.5-0.8) are allocated 15% of the computing resources, employing conventional sampling density (5 points / mm²) and standard algorithms (e.g., GNN encoder, 5-layer network); and low-priority features (< 0.5) are allocated 5% of the computing resources, employing efficient extraction algorithms (e.g., voxel network, 3-layer network) and low-density sampling (2 points / mm²). At the hardware level, high-priority feature extraction tasks utilize GPU core computing power, while low-priority tasks utilize CPU computing power. At the algorithm level, complex features are configured with multi-scale convolutional kernels (3×3, 5×5, 7×7), while simple features use only 3×3 convolutional kernels. Finally, the electronic device generates a list of computing resource configurations for each module (such as the thread-specific module: 80% GPU computing power, sampling density of 10 points / square millimeter, and 7-layer network).
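
The three priority tiers can be expressed as a simple lookup. The dictionary keys are hypothetical names; the values come from the paragraph. The device for the medium tier is left as `None` because the text assigns hardware only to the high and low tiers.

```python
def allocate_resources(weight):
    """Map a feature extraction weight to the tier configuration in the text."""
    if weight >= 0.8:
        return {"priority": "high", "compute_share": 0.80,
                "sampling_density_per_mm2": 10, "network_layers": (7, 10),
                "device": "GPU"}
    if weight >= 0.5:
        return {"priority": "medium", "compute_share": 0.15,
                "sampling_density_per_mm2": 5, "network_layers": (5, 5),
                "device": None}  # not specified in the text
    return {"priority": "low", "compute_share": 0.05,
            "sampling_density_per_mm2": 2, "network_layers": (3, 3),
            "device": "CPU"}
```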

[0119] Step S2035: Based on the computing resources corresponding to each feature extraction module, control each feature extraction module to extract features, and obtain size-type features, curvature-type features, symmetry-type features, topological structure-type features, and specific features corresponding to the target object.

[0120] Specifically, for size-related feature extraction, the electronic device inputs the vertex coordinates of the target object's geometric prior data (triangular mesh model), fits key structures (fitting holes as circles, fitting planes as polygons), and calculates quantified dimensions (such as the diameter of holes in mechanical parts, the length and width of artifact shapes, and the dimensions of anatomical structures of organs). For curvature-related feature extraction, the electronic device inputs the surface normal vector of the point cloud corresponding to the target object, calculates Gaussian curvature and average curvature, and captures local concavity and convexity features (such as the curvature of chamfers in parts, the curvature of decorative patterns on artifacts, and the curvature of lesion surfaces). For symmetry-related feature extraction, the electronic device inputs the coordinates of key points on the surface corresponding to the target object, identifies the axis of symmetry based on a mirror point matching algorithm, and calculates the morphological differences in symmetrical regions (such as facial symmetry error and deviation of symmetry planes in parts). For topological structure-related feature extraction, the electronic device inputs the face / edge / point topological relationships of the triangular mesh model corresponding to the target object, identifies structures such as holes, closed loops, and gaps, and extracts parameters (such as the number of holes in parts and the depth of missing edges in artifacts).

[0121] For domain-specific feature extraction, if the target object is an industrial machinery part (thread type), the thread detection submodule identifies the thread contour based on a helix detection algorithm and removes noise interference; the thread parameter calculation submodule extracts core parameters such as pitch, thread angle, and thread depth. If the target object is a cultural heritage ceramic (decorative pattern type), the decorative pattern segmentation submodule separates the decorative pattern from the vessel body based on a semantic segmentation network; the decorative pattern curvature calculation submodule fits the pattern curve and calculates the curvature value point by point. If the target object is a medical organ (lesion type), the lesion region identification submodule segments the lesion from normal tissue based on a U-Net network; the lesion morphology parameter extraction submodule calculates the lesion volume, surface area, and aspect ratio.

[0122] Step S2036: Based on size-related features, curvature-related features, symmetry-related features, topological features, and specific features, determine the target morphological features corresponding to the target object.

[0123] Specifically, the electronic device can fuse the multi-dimensional features using a weighted summation formula, as follows: $F = \sum_{i} w_i F_i$, where $F$ is the overall morphological feature, $w_i$ is the weight of the i-th type of feature, and $F_i$ is the normalized value (0~1) of the i-th type of feature.
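
As a non-limiting illustration, the weighted summation of normalized feature values can be sketched as follows. The function name `fuse_morphological_features` and the use of dictionaries keyed by feature type are hypothetical choices made only for this example.

```python
def fuse_morphological_features(features, weights):
    """Weighted sum of normalized per-type feature values (each in [0, 1])."""
    if features.keys() != weights.keys():
        raise ValueError("features and weights must cover the same feature types")
    for name, value in features.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"feature {name!r} is not normalized to [0, 1]")
    return sum(weights[name] * features[name] for name in features)


if __name__ == "__main__":
    features = {"size": 1.0, "curvature": 0.5}
    weights = {"size": 0.8, "curvature": 0.2}
    print(fuse_morphological_features(features, weights))
```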

[0124] Step S204: Based on the target optical image and target auxiliary data, determine the multimodal visual features corresponding to the target object.

[0125] Specifically, step S204 above may include the following steps: Step S2041: Extract features from each target optical image at different levels to generate features corresponding to each level of the target optical image.

[0126] Specifically, step S2041 above may include the following steps: Step b1: Perform first-level feature extraction on the target optical image to obtain first-level edge features, first-level texture features, and first-level color features.

[0127] Specifically, electronic devices can perform first-level feature extraction on target optical images based on feature extraction networks.

[0128] The electronic device can initialize the feature extraction network. For example, the convolution kernel configuration uses 3×3 small-sized convolution kernels with stride = 1 and padding = 1. The network is configured in layers, with the first three layers focusing on edge, texture, and color features respectively, and each layer outputting a multi-channel feature map (64 channels → 128 channels progressively).

[0129] Specifically, the edge feature extraction layer in the feature extraction network uses a convolutional kernel optimized by the Sobel operator to calculate the gray-level gradients in the x and y directions of the target optical image. A gradient magnitude threshold of 20 is set, and only regions with a gradient magnitude ≥ 20 are retained as first-level edge features (capturing details such as object contours and texture boundaries). The texture feature extraction layer uses a convolutional kernel optimized by Local Binary Pattern (LBP) to statistically analyze the gray-level distribution pattern (uniform, edge, etc.) within a 3×3 pixel neighborhood, quantifying texture roughness and density to generate first-level texture features. The color feature extraction layer directly extracts the RGB three-channel pixel values (0~255), fuses the three-channel information using a 1×1 convolutional kernel, and preserves the color distribution on the object surface to obtain first-level color features.
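
As a non-limiting illustration, the gradient-magnitude thresholding of the edge feature extraction layer can be sketched with the plain (unoptimized) Sobel kernels. This is an assumption: the learned, Sobel-initialized convolution kernels described above are not reproduced here, and the function name `edge_mask` is hypothetical. Border pixels are left unmarked for simplicity.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def edge_mask(gray, threshold=20):
    """Mark interior pixels whose Sobel gradient magnitude is >= threshold."""
    h, w = len(gray), len(gray[0])
    mask = [[False] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = sum(SOBEL_X[i][j] * gray[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * gray[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            mask[r][c] = math.hypot(gx, gy) >= threshold
    return mask


if __name__ == "__main__":
    step_edge = [[0, 0, 100, 100] for _ in range(3)]
    print(edge_mask(step_edge))
```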

[0130] Ultimately, the electronic device obtains independent first-level edge, first-level texture, and first-level color features of the target optical image, covering the core details of the image surface.

[0131] Step b2: The first-level edge features, first-level texture features, and first-level color features are fused to generate a first-level feature map.

[0132] Specifically, the electronic device can stitch together the first-level edge features, first-level texture features, and first-level color features according to the channel dimension to generate a first-level feature map.

[0133] In one optional embodiment of this application, the electronic device can automatically adjust the contribution of the first-level edge features, the first-level texture features, and the first-level color features through convolutional layer weight learning, and generate a first-level feature map with 128 channels and the same size as the input image, thus fully preserving the surface detail features.

[0134] Step b3: Max pooling is used to downsample the first-level feature map to obtain a downsampled feature map.

[0135] Specifically, the electronic device can employ a 2×2 pooling kernel with a stride of 2 for max pooling to downsample the first-level feature map. While halving the height and width of the first-level feature map (output size H / 2×W / 2), the maximum response value within each 2×2 local region is retained, highlighting key details in edges and textures, filtering out secondary information, and reducing parameter redundancy in subsequent calculations. Finally, a downsampled feature map is obtained that keeps the same number of channels, has a compressed spatial size, and has enhanced key features.
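
As a non-limiting illustration, 2×2 max pooling with a stride of 2 on a single channel can be sketched as follows. The function name `max_pool_2x2` is hypothetical, and a trailing odd row or column is simply dropped in this sketch.

```python
def max_pool_2x2(channel):
    """2x2 max pooling with stride 2 on one channel given as a list of rows."""
    h, w = len(channel), len(channel[0])
    return [
        [
            max(channel[r][c], channel[r][c + 1],
                channel[r + 1][c], channel[r + 1][c + 1])
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


if __name__ == "__main__":
    channel = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(max_pool_2x2(channel))
```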

[0136] Step b4: Perform sliding convolution aggregation operation on the downsampled feature map using multi-channel convolution kernels to obtain the aggregated feature map.

[0137] Specifically, electronic devices can perform sliding convolution on the downsampled feature map using multi-channel convolution kernels. Each kernel focuses on a local feature pattern (such as edge combinations or texture arrangements), and 256 channels simultaneously capture multiple local features. The kernel is configured as a 3×3 convolution kernel with 256 channels, set to padding of 1 and stride of 1. During convolution, the relationships between different features within a local region are automatically learned (such as the combination pattern of "vertical straight edges + horizontal textures"), integrating scattered low-level features into more representative local aggregated features. This generates a 256-channel aggregated feature map, significantly improving the correlation of local features and laying the foundation for subsequent shape extraction.

[0138] Step b5: Perform a convolution operation on the aggregated feature map based on dilated convolution to obtain an expanded receptive field feature map.

[0139] Specifically, the electronic device can introduce 3×3 dilated convolutional layers with a dilation rate of 2 to replace ordinary convolutions. Then, convolution operations are performed on the aggregated feature map based on the dilated convolutional layer to obtain a 256-channel expanded receptive field feature map. When the dilation rate is 2, the sampling interval of the convolutional kernel on the feature map is 2, effectively expanding the receptive field from 3×3 to 5×5. Without increasing network parameters or computational cost, the convolutional layer covers a wider range of feature combinations, thereby accurately identifying medium-scale local geometry and avoiding shape misjudgments caused by an overly small receptive field.
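
As a non-limiting illustration, the receptive field of a dilated kernel follows the standard relation k_eff = k + (k - 1)(r - 1), where k is the kernel size and r is the dilation rate. The function names below are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def tap_offsets(kernel_size, dilation):
    """Offsets (relative to the center) sampled by an odd-sized dilated kernel."""
    half = kernel_size // 2
    return [dilation * i for i in range(-half, half + 1)]


if __name__ == "__main__":
    print(effective_kernel_size(3, 2), tap_offsets(3, 2))
```

With a 3×3 kernel and a dilation rate of 2, the nine weights are sampled at offsets -2, 0, and 2 along each axis, which is why the covered area grows to 5×5 while the number of weights stays at nine.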

[0140] Step b6: Compare the expanded receptive field feature map with the preset response threshold to determine the shape candidate region from the expanded receptive field feature map.

[0141] Specifically, the electronic device can receive preset response thresholds input by the user, or preset response thresholds sent by other devices. The electronic device can also set preset response thresholds according to actual conditions.

[0142] Then, the electronic device compares the expanded receptive field feature map with a preset response threshold, and then filters out continuous regions in the expanded receptive field feature map whose response values are greater than or equal to the preset response threshold. The position and range of each region are marked with a bounding box, thereby obtaining the shape candidate region.
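
As a non-limiting illustration, the thresholding and bounding-box marking can be sketched on a single-channel response map as follows. The use of 4-connectivity to define a continuous region, the function name `candidate_regions`, and the box format (top row, left column, bottom row, right column) are assumptions made only for this example.

```python
from collections import deque


def candidate_regions(response, threshold):
    """Return bounding boxes of 4-connected regions with response >= threshold."""
    h, w = len(response), len(response[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r0 in range(h):
        for c0 in range(w):
            if seen[r0][c0] or response[r0][c0] < threshold:
                continue
            # Breadth-first flood fill from an unvisited above-threshold pixel.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            rmin = rmax = r0
            cmin = cmax = c0
            while queue:
                r, c = queue.popleft()
                rmin, rmax = min(rmin, r), max(rmax, r)
                cmin, cmax = min(cmin, c), max(cmax, c)
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < h and 0 <= nc < w and not seen[nr][nc]
                            and response[nr][nc] >= threshold):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            boxes.append((rmin, cmin, rmax, cmax))
    return boxes


if __name__ == "__main__":
    response = [[0.9, 0.8, 0.1], [0.1, 0.1, 0.1], [0.1, 0.1, 0.7]]
    print(candidate_regions(response, 0.5))
```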

[0143] Step b7: Based on the shape candidate region, identify and extract structural features to generate a structured feature map.

[0144] Specifically, the electronic device can apply the Hough circle detection algorithm to the shape candidate region, using an accumulator to count the frequency of different combinations of center coordinates and radius, and filter out the combination with the highest accumulated value to determine the optimal circle parameters (center coordinates, diameter). Then, the electronic device uses local feature blocks within the shape candidate region as templates and employs the Normalized Cross-Correlation (NCC) template matching algorithm to search for similar patterns in the feature map, and counts parameters such as texture repetition frequency and spacing between adjacent texture blocks.

[0145] Next, the electronic device binds the quantized shape parameters (such as circle diameter and texture repetition spacing) to the feature vectors of the corresponding candidate regions, maps them back to the feature map according to their spatial location, and forms a local feature map with shape attributes, thus realizing a structured feature map of "features + parameters".

[0146] Step b8: Perform global average pooling on the structured feature map to obtain the second-level feature vector.

[0147] Specifically, the electronic device can calculate the average value of all pixels for each channel of the structured feature map, transforming the 256-channel, H / 2×W / 2 structured feature map into a 256-dimensional one-dimensional vector, thus obtaining the second-level feature vector. This preserves the global feature distribution pattern, eliminates the influence of spatial location on features, focuses on the overall representation of features, and reflects the global shape and structural features of the image.
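
As a non-limiting illustration, global average pooling reduces each channel of a feature map to a single value. In the sketch below the feature map is a nested list indexed as `[channel][row][column]`; this layout and the function name `global_average_pool` are assumptions made only for this example.

```python
def global_average_pool(feature_map):
    """Average all spatial positions of each channel into one value per channel."""
    pooled = []
    for channel in feature_map:
        count = len(channel) * len(channel[0])
        pooled.append(sum(sum(row) for row in channel) / count)
    return pooled


if __name__ == "__main__":
    feature_map = [[[1, 2], [3, 4]], [[0, 0], [0, 8]]]
    print(global_average_pool(feature_map))
```

A 256-channel map therefore yields a 256-dimensional vector regardless of its spatial size.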

[0148] Step b9: Perform global average pooling on the second-level feature vectors to obtain feature vectors of a preset dimension. Specifically, the electronic device can map the second-level feature vector to a preset dimension through linear transformation and global average pooling to obtain a feature vector of the preset dimension. The preset dimension is set according to the multimodal fusion requirements (usually 2048 dimensions, adapted to the infrared feature vector dimension).

[0149] Step b10: L2 normalization is used to process the feature vectors of the preset dimensions as a whole, mapping the feature values of each dimension to a uniform scale to obtain the third-level feature vectors.

[0150] Specifically, electronic devices can use L2 normalization to process the feature vectors of preset dimensions as a whole, mapping the feature values of each dimension to a unified scale to obtain the third-level feature vectors.

[0151] The L2 normalization formula is as follows: $\hat{F} = F / \sqrt{\sum_{i=1}^{d} F_i^2}$, where $F$ is the feature vector of the preset dimension, $d$ is the preset dimension, and $F_i$ is the i-th feature value. Each feature value is mapped to the interval [0,1] so that the magnitude of the feature vector is 1, generating a third-level feature vector with a uniform scale.
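
As a non-limiting illustration, L2 normalization divides every component by the Euclidean norm of the vector. The function name `l2_normalize` and the small `eps` guard against a zero vector are assumptions made only for this example. The resulting components fall in [0,1] when the input values are non-negative.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


if __name__ == "__main__":
    print(l2_normalize([3.0, 4.0]))
```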

[0152] Step S2042: Determine the image weight corresponding to each target optical image based on the viewpoint type corresponding to each target optical image.

[0153] Specifically, the electronic device can receive the correspondence between the viewpoint type and image weight corresponding to the target optical image input by the user. Then, based on the viewpoint type corresponding to each target optical image, the image weight corresponding to each target optical image is determined.

[0154] Step S2043: Based on image weights, the features at each level are fused to obtain the basic visual feature vector.

[0155] Specifically, the electronic device can convert the first-level feature map into a first-level feature vector, and then preprocess the features at each level to unify them to the same dimension. Next, the electronic device performs an element-wise weighted summation of the first-level, second-level, and third-level feature vectors for all viewpoints, according to the image weights corresponding to the viewpoint type, using the following formula: $F_{base} = \sum_{s} \sum_{k} w_k F_k^{s}$, where $s$ is the feature scale (first level / second level / third level), $w_k$ is the image weight of the k-th target optical image, and $F_k^{s}$ is the s-scale feature vector of the k-th target optical image. Finally, a 2048-dimensional basic visual feature vector is generated, integrating optical image features from multiple views and scales.

[0156] Step S2044: Extract edge features from the target infrared image in the target auxiliary data to obtain the infrared edge features corresponding to the target infrared image.

[0157] Specifically, the electronic device can load a preset convolutional network optimized based on the Canny algorithm and input the infrared grayscale image of the target infrared image. Then, the Canny dual thresholds are dynamically adjusted, where the low threshold is 0.3 to 0.5 times (50 to 80) the grayscale mean, and the high threshold is 2 to 3 times (150 to 200) the low threshold, to adapt to different grayscale differences in infrared images.
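
As a non-limiting illustration, the dynamic selection of the two Canny thresholds from the grayscale mean can be sketched as follows. The function name `canny_thresholds` and the default factors 0.4 and 2.5 are hypothetical; they are merely values inside the 0.3 to 0.5 and 2 to 3 ranges stated above.

```python
def canny_thresholds(gray, low_factor=0.4, high_factor=2.5):
    """Derive (low, high) Canny thresholds from the mean gray level of an image."""
    if not 0.3 <= low_factor <= 0.5:
        raise ValueError("low_factor is expected in [0.3, 0.5]")
    if not 2.0 <= high_factor <= 3.0:
        raise ValueError("high_factor is expected in [2, 3]")
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    low = low_factor * mean
    high = high_factor * low
    return low, high


if __name__ == "__main__":
    print(canny_thresholds([[100, 200], [150, 150]]))
```

For an image with a grayscale mean of 150, these defaults give a low threshold of about 60 and a high threshold of about 150, which fall inside the example ranges quoted in this paragraph.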

[0158] Electronic devices can enhance image contrast based on a pre-defined convolutional network, stretching pixel values to 0-255 to highlight hidden details (such as cracks in parts or faded patterns on artifacts). They can also connect broken edge segments using morphological closing operations (3×3 rectangular convolution kernels) to fill small gaps ≤3 pixels. Then, the electronic device extracts geometric parameters such as the length, curvature, and number of inflection points of the edge contour, mapping these contour features into a 1024-dimensional structured infrared edge feature vector through a fully connected layer.

[0159] Step S2045: The basic visual feature vector and infrared edge features are fused to generate multimodal visual features.

[0160] Specifically, the 1024-dimensional infrared edge feature vector is linearly transformed to 2048 dimensions, matching the dimension of the basic visual feature vector. Then, the feature weights of the basic visual feature vector and the infrared edge features are determined according to the scenario requirements. Finally, based on these feature weights, the basic visual feature vector and the infrared edge features are fused to generate a 4096-dimensional multimodal feature vector.

[0161] Step S205: Based on the target's morphological features and multimodal visual features, generate the final target object model corresponding to the target object.

[0162] Specifically, step S205 above may include the following steps: Step S2051: Perform feature fusion on the target morphological features and multimodal visual features to obtain the target fused features.

[0163] Specifically, the electronic device can perform feature concatenation on the target's morphological features and multimodal visual features to generate a target fusion feature. Optionally, the electronic device can also acquire the weighted features corresponding to the target's morphological features and multimodal visual features respectively. Then, based on the weighted features corresponding to the target's morphological features and multimodal visual features respectively, feature fusion is performed on the target's morphological features and multimodal visual features to obtain the target fusion feature.

[0164] Step S2052: Dimensionality reduction of the target fusion features is performed based on principal component analysis to generate target dimensionality reduction features.

[0165] Specifically, the electronic device can standardize the target fusion features (mean = 0, variance = 1), then calculate the feature covariance matrix and solve for the eigenvalues and eigenvectors. The electronic device selects principal components with a cumulative variance contribution rate ≥ 95% (e.g., compressing from 10240 dimensions to 2000 dimensions), projects the fusion features onto the selected principal component dimensions, and generates the target dimensionality-reduced features.
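
As a non-limiting illustration, the choice of how many principal components to keep can be sketched from the eigenvalues of the covariance matrix, each of which is the variance explained by one component. The function name `num_components` is hypothetical, and the eigenvalues are assumed to have been computed already.

```python
def num_components(eigenvalues, target=0.95):
    """Smallest number of components whose cumulative variance ratio >= target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)


if __name__ == "__main__":
    print(num_components([6.0, 3.0, 0.8, 0.2]))
```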

[0166] Step S2053: Based on the target application scenario corresponding to the target object, determine the rule constraint vector corresponding to the target object.

[0167] Specifically, electronic devices can determine the rule constraint vector corresponding to a target object based on the correspondence between the target application scenario and the rule constraint vector. For example, for industrial quality inspection scenarios, the rule constraint vector includes: dimensional tolerance, structural interference, and surface roughness constraints, with core constraint types including numerical (tolerance ±0.01mm) and logical (whether there is interference); for medical diagnosis scenarios, the rule constraint vector includes: anatomical structure, tissue thickness, and lesion morphology constraints, with core constraint types including numerical (organ thickness range) and logical (whether it conforms to anatomical specifications); for cultural heritage scenarios, the rule constraint vector includes: symmetry of the object shape, arrangement of patterns, and size ratio constraints, with core constraint types including numerical (symmetry ≥90%) and logical (no missing patterns).

[0168] Step S2054: Multiply the rule constraint vector and the target dimensionality reduction feature to obtain the enhanced feature vector.

[0169] Specifically, electronic devices can element-wise multiply the rule constraint vector with the target dimensionality reduction feature (e.g., the "aperture" dimension value in the dimensionality reduction feature multiplied by the "aperture tolerance" dimension value in the constraint vector) to obtain an enhanced feature vector. This allows the enhanced feature vector to carry explicit geometric / semantic rule information, avoiding the generation of models that do not meet the requirements of the scenario.

[0170] Step S2055: Determine the preset 3D latent generative model based on the target application scenario.

[0171] Specifically, electronic devices can determine a preset 3D latent generative model based on the correspondence between the target application scenario and the preset 3D latent generative model. For example, Table 1 below shows the preset 3D latent generative model selection rules table.

[0172] Table 1 Preset 3D Latent Generative Model Selection Rules

[0173] Step S2056: Input the enhanced feature vector into the preset 3D latent generation model to generate the final target object model corresponding to the target object.

[0174] Specifically, step S2056 above may include the following steps: Step c1 involves splitting the enhanced feature vector into a key vector and a value vector.

[0175] Specifically, electronic devices can divide enhanced feature vectors according to preset dimensions to obtain key vectors and value vectors (e.g., the first 1000 dimensions are key vectors, and the last 1000 dimensions are value vectors). Among them, the key vector stores feature index information (such as the identifier of "part aperture" and "texture curvature"); the value vector stores the quantified data of the corresponding feature (e.g., aperture 8mm, sag 15°, curvature 0.85 / mm).

[0176] Step c2: Use the initial latent features in the generation process of the preset 3D latent generative model as the query vector.

[0177] Specifically, after the preset 3D generative model is initialized, random initial latent features (e.g., 512-dimensional) are generated as the query vector for the cross-attention mechanism. The initial latent features cover the basic geometric structure of the 3D model (e.g., overall outline, topological relationships) but do not carry scene constraint information.

[0178] Step c3: In each injection layer, the association weights between latent features and enhanced feature vectors are calculated using a cross-attention mechanism.

[0179] Specifically, in each injection layer, the electronic device can use a cross-attention mechanism to calculate the similarity between the Query (initial latent feature) and the Key (enhanced feature key vector), generating association weights of 0 to 1, as shown in the formula. The electronic device can normalize the weights using the Softmax function, ensuring that the weights of highly relevant features (such as industrial aperture) approach 1, while the weights of secondary features approach 0. This allows the generative model to focus on the key constraints in the enhanced features (such as tolerances and anatomical specifications) and weaken irrelevant details.
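
As a non-limiting illustration, Softmax-normalized association weights can be sketched with the standard scaled dot-product similarity. This is an assumption: the similarity formula of this embodiment is not reproduced in the text. The sketch also assumes that the query and every key have already been projected to the same dimension, and the function name `attention_weights` is hypothetical.

```python
import math


def attention_weights(query, keys):
    """Softmax over scaled dot products between one query and several keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(attention_weights([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]]))
```

The key that aligns with the query receives a weight close to 1 and the other a weight close to 0, which matches the behavior described for highly relevant and secondary features.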

[0180] Step c4: The key vector and value vector are weighted based on the association weight to obtain the weighted feature vector.

[0181] Specifically, the electronic device can multiply the original key vector by the normalized association weights to obtain a weighted key vector, and multiply the original value vector by the normalized association weights to obtain a weighted value vector. Then, the electronic device recombines the weighted key vector and the weighted value vector to form an enhanced weighted feature vector (the values of high-association features are amplified, while secondary features are weakened).

[0182] Step c5: Update the initial latent features based on the weighted feature vector to obtain the updated latent features.

[0183] Specifically, the electronic device can perform a linear transformation on the weighted feature vector to obtain a weighted feature vector with the same dimensions as the initial latent feature vector. Then, the linearly transformed weighted feature vector replaces the initial latent feature vector to obtain the updated latent feature vector.

[0184] Step c6: Based on the updated latent features, the preset 3D latent generation model outputs a virtual target object model.

[0185] Specifically, the preset 3D latent generative model generates a virtual target object model (for example, in triangular mesh format) through a decoding layer based on the updated latent features.

[0186] Step c7: Calculate the target loss value between the virtual target object model and the initial target object model based on the preset loss function.

[0187] The preset loss functions include geometric consistency loss, feature matching loss, and domain constraint loss.

[0188] Among them, geometric consistency loss measures the deviation of the vertex coordinates and topology between the virtual target object model and the initial target model (such as mean square error, MSE); feature matching loss measures the matching degree between the features extracted from the virtual target object model and the enhanced features (such as cosine similarity loss); and domain constraint loss measures whether the virtual target object model conforms to the scene rules (such as industrial dimensional tolerance loss and medical anatomy specification loss).

[0189] Specifically, the electronic device can calculate the geometric consistency loss based on the following formulas: $L_{geo} = \alpha L_{vertex} + \beta L_{topo}$; $L_{vertex} = \frac{1}{N}\sum_{i=1}^{N} \lVert v_i^{gen} - v_i^{init} \rVert_2^2$; $L_{topo} = \lVert A^{gen} - A^{init} \rVert_F$; where $\alpha$ is the weight of $L_{vertex}$, $\beta$ is the weight of $L_{topo}$, $N$ is the total number of vertices in the initial target object model, $v_i^{gen}$ is the three-dimensional coordinates of the i-th vertex of the virtual target object model, $v_i^{init}$ is the three-dimensional coordinates of the i-th vertex of the initial target object model, $\lVert \cdot \rVert_2^2$ is the square of the L2 norm, which gives the squared Euclidean distance between two points, $A^{gen}$ is the vertex adjacency matrix of the virtual target object model (N×N, where $A_{ij} = 1$ if vertices i and j are connected, otherwise 0), $A^{init}$ is the vertex adjacency matrix of the initial target object model (N×N, where $A_{ij} = 1$ if vertices i and j are connected, otherwise 0), and $\lVert \cdot \rVert_F$ is the Frobenius norm of a matrix ($\lVert X \rVert_F = \sqrt{\sum_{i,j} X_{ij}^2}$).
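
As a non-limiting illustration, a geometric consistency loss of this kind can be sketched as a weighted sum of a vertex term and a topology term. The sketch assumes that the vertex term is the mean squared Euclidean distance between corresponding vertices and that the topology term is the Frobenius norm of the difference between the two adjacency matrices. The function name and the default weights of 0.5 are hypothetical.

```python
import math


def geometric_consistency_loss(gen_vertices, init_vertices, gen_adj, init_adj,
                               alpha=0.5, beta=0.5):
    """alpha * mean squared vertex distance + beta * Frobenius adjacency difference."""
    n = len(init_vertices)
    if len(gen_vertices) != n:
        raise ValueError("both models must have the same number of vertices")
    vertex_term = sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(gen_vertices, init_vertices)
    ) / n
    topo_term = math.sqrt(sum(
        (gen_adj[i][j] - init_adj[i][j]) ** 2
        for i in range(n) for j in range(n)
    ))
    return alpha * vertex_term + beta * topo_term


if __name__ == "__main__":
    init_v = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gen_v = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    connected = [[0, 1], [1, 0]]
    print(geometric_consistency_loss(gen_v, init_v, connected, connected))
```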

[0190] Specifically, the electronic device can calculate the feature matching loss based on the following formula: $L_{feat} = 1 - \frac{F_{gen} \cdot F_{enh}}{\lVert F_{gen} \rVert_2 \lVert F_{enh} \rVert_2}$, where $F_{gen}$ is the feature vector extracted from the virtual target object model (with the same dimension as the enhanced feature vector, such as 2000-dimensional), which includes features such as size, curvature, symmetry, and texture, and $F_{enh}$ is the enhanced feature vector generated in step S2054.
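
As a non-limiting illustration, a cosine similarity loss, which is the example named above for the feature matching loss, can be sketched as one minus the cosine similarity of the two vectors. The function name `feature_matching_loss` is hypothetical.

```python
import math


def feature_matching_loss(generated, enhanced):
    """Cosine similarity loss: 0 for parallel vectors, 1 for orthogonal ones."""
    if len(generated) != len(enhanced):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(generated, enhanced))
    norm_g = math.sqrt(sum(a * a for a in generated))
    norm_e = math.sqrt(sum(b * b for b in enhanced))
    if norm_g == 0.0 or norm_e == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return 1.0 - dot / (norm_g * norm_e)


if __name__ == "__main__":
    print(feature_matching_loss([1.0, 0.0], [0.0, 1.0]))
```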

[0191] Specifically, the electronic device can calculate the domain constraint loss based on the following formula: $L_{dom} = \sum_{k=1}^{K} w_k L_k$, where $K$ is the total number of domain constraints (e.g., industrial scenarios include three types of constraints: aperture tolerance, structural interference, and surface roughness), $w_k$ is the weight of the k-th type of constraint (core constraint $w_k = 0.6$, secondary constraint $w_k = 0.2$), and $L_k$ is the loss term for the k-th type of constraint (designed differently for each constraint type).

[0192] For example, for the dimensional tolerance loss in an industrial scenario, taking aperture tolerance as an example (tolerance range ..., target value d0), the loss term is computed from d and d0, where d is the measured aperture value of the virtual target object model and d0 is the target aperture value of the initial target object model (e.g., 8 mm).

[0193] For the proportion loss in cultural relic scenarios (taking the aspect ratio of an artifact as an example, with a target ratio r0 = 2:1 and an allowable deviation of ±5%), the loss term is computed from r and r0, where r is the aspect ratio of the virtual cultural relic model and r0 is the aspect ratio of the initial target object model.

[0194] The target loss value is: $L_{total} = \lambda_1 L_{geo} + \lambda_2 L_{feat} + \lambda_3 L_{dom}$.

[0195] Here, $L_{geo}$, $L_{feat}$, and $L_{dom}$ are the geometric consistency loss, the feature matching loss, and the domain constraint loss, respectively, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are their respective weights. The electronic device can receive these three weights as input by the user, or it can receive them from other devices.

[0196] Step c8: Based on the target loss value, the parameters of the preset 3D latent generative model are corrected until the target loss value is less than the preset loss threshold, and the final target object model is obtained.

[0197] Specifically, the electronic device can use the backpropagation algorithm to adjust the network parameters (such as attention weights and decoding layer weights) of the preset 3D latent generative model based on the target loss value. When the target loss value is less than the preset loss threshold (e.g., 0.05), or the number of iterations reaches the upper limit (e.g., 100 rounds), optimization stops, and the final target object model is generated.
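
As a non-limiting illustration, the stopping rule (loss below a threshold, or an iteration cap reached) can be sketched as follows. The callable `step_fn` is hypothetical: it stands for one backpropagation update of the generative model that returns the current target loss value, and is replaced here by a stub.

```python
def optimize(step_fn, loss_threshold=0.05, max_iterations=100):
    """Run update steps until the loss drops below the threshold or the cap is hit.

    Returns (number of iterations performed, last loss value).
    """
    loss = float("inf")
    for iteration in range(1, max_iterations + 1):
        loss = step_fn()  # one parameter update; returns the target loss value
        if loss < loss_threshold:
            return iteration, loss
    return max_iterations, loss


if __name__ == "__main__":
    # Stub standing in for real training: losses shrink on each call.
    losses = iter([0.5, 0.2, 0.04, 0.01])
    print(optimize(lambda: next(losses)))
```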

[0198] The target object modeling method provided in this embodiment generates a depth map based on SfM+MVS dense matching and a triangular mesh model based on the Poisson reconstruction algorithm, and obtains an initial model through optimization using the target auxiliary data. In this way, multi-view images combined with target auxiliary data are used to quickly generate, through a 3D reconstruction process, an initial target object model with basic geometric morphology, providing a benchmark for subsequent feature extraction and optimization. Then, based on a modular feature extraction model adapted to the target domain, computing resources are dynamically allocated according to weights to accurately extract core morphological features such as the object's size and curvature, characterizing the object's geometric essence. Next, surface features from multi-view optical images and hidden edge features from infrared images are integrated to form a multimodal visual feature set, enhancing the recognizability of key object details. Finally, the geometric and visual features are fused, enhanced through domain rule constraints, and input into an adapted 3D generative model to output a final target object model with high-precision geometric structure, rich surface details, and scene compliance.

[0199] Although embodiments of the invention have been described in conjunction with the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations all fall within the scope defined by the appended claims.

Claims

1. A method for modeling a target object, characterized in that the method includes: acquiring multiple target optical images of the target object from different perspectives, as well as corresponding target auxiliary data, wherein the target auxiliary data includes target point cloud data and / or target infrared images; modeling the target object based on the target optical images and the target auxiliary data to obtain an initial target object model; identifying the initial target object model to determine the target morphological features corresponding to the target object; determining, based on the target optical images and the target auxiliary data, the multimodal visual features corresponding to the target object; and generating, based on the target morphological features and the multimodal visual features, a final target object model corresponding to the target object.

2. The method according to claim 1, characterized in that the acquisition of multiple target optical images of the target object from different perspectives includes: acquiring multiple initial optical images of the target object from different perspectives; calculating the mean value of the image gradient of each initial optical image to determine the sharpness of each initial optical image; calculating the image grayscale histogram distribution of each initial optical image to determine the illumination uniformity of each initial optical image; calculating the camera pose corresponding to each initial optical image based on the SfM algorithm to determine the viewpoint coverage integrity of each initial optical image; and selecting the target optical images from the initial optical images based on their sharpness, illumination uniformity, and viewpoint coverage integrity.

3. The method according to claim 1, characterized in that, Obtaining target auxiliary data corresponding to the target object includes: For the initial point cloud data, statistical filtering and radius filtering are used to denoise the initial point cloud data to obtain the target point cloud data; And / or, A bilateral filtering algorithm is used to filter the initial infrared image to generate the target infrared image.

4. The method according to claim 1, characterized in that, The step of modeling the target object based on the target optical image and the target auxiliary data to obtain an initial target object model includes: The target optical image is identified to determine the target camera parameters corresponding to the target optical image; The SfM+MVS algorithm is used to perform dense matching on the target optical image and output a depth map; Based on the target camera parameters, each pixel in the depth map is back-projected to a unified world coordinate system to generate initial dense point cloud data; The initial dense point cloud data is optimized using a region growing algorithm to generate the target dense point cloud data. The Poisson reconstruction algorithm is used to convert the dense point cloud data of the target into a triangular mesh model; Based on the target auxiliary data, the triangular mesh model is optimized to obtain the initial target object model.

5. The method according to claim 4, characterized in that the optimization of the triangular mesh model based on the target auxiliary data to obtain the initial target object model includes: performing point cloud sampling on the triangular mesh model to obtain sampled point cloud data corresponding to the triangular mesh model; performing initial matching between the target point cloud data and the sampled point cloud data to obtain an initial registration result; using the initial registration result as the initial value, in each iteration, using a kd-tree nearest-point search algorithm to find, for each point in the target point cloud data, the nearest point on the triangular mesh model, and constructing corresponding point pairs; solving the optimal rigid transformation matrix from the corresponding point pairs using the least squares method; updating the coordinates of the target point cloud data based on the optimal rigid transformation matrix and calculating the root mean square error of the current iteration; and, if the root mean square error is less than a preset convergence threshold or the maximum number of iterations is reached, stopping the iteration to obtain the initial target object model;
and/or: using the ORB algorithm to extract edge feature points from the target infrared image and the target optical image respectively, and generating a binary descriptor for each; calculating the Hamming distance between the binary descriptors of the target infrared image and those of the target optical image; based on the Hamming distance, determining valid matching pairs among the binary descriptors of the target infrared image and the target optical image; based on the valid matching pairs, solving the homography matrix using the RANSAC algorithm; based on the homography matrix, transforming the target infrared image to obtain a transformed infrared image in the coordinate system of the target optical image; and mapping the edge features of the transformed infrared image onto the triangular mesh model according to the pixel coordinate correspondence to obtain the initial target object model.
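
The iterative registration branch of claim 5 follows the general shape of the iterative closest point (ICP) procedure. The sketch below is illustrative only and rests on several assumptions: a brute-force nearest-neighbour search stands in for the kd-tree named in the claim, the least-squares rigid transform is solved with the SVD-based Kabsch method, and the names `best_rigid_transform`, `icp`, `cloud`, and `mesh_pts` are hypothetical.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with R @ src_i + t ~ dst_i."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that det(R) = +1.
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(cloud, mesh_pts, max_iter=50, tol=1e-8):
    """Align `cloud` to `mesh_pts`; returns the moved cloud and the final RMSE."""
    moved = cloud.copy()
    rmse = np.inf
    for _ in range(max_iter):
        # Brute-force nearest neighbour (stand-in for a kd-tree query).
        d2 = ((moved[:, None, :] - mesh_pts[None, :, :]) ** 2).sum(axis=-1)
        nearest = mesh_pts[d2.argmin(axis=1)]
        R, t = best_rigid_transform(moved, nearest)
        moved = moved @ R.T + t                  # update point coordinates
        rmse = np.sqrt(((moved - nearest) ** 2).sum(axis=1).mean())
        if rmse < tol:                           # convergence threshold
            break
    return moved, rmse
```

For large clouds, the pairwise distance matrix would be replaced by a kd-tree query as the claim describes; the loop structure and stopping rule stay the same.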

6. The method according to claim 1, characterized in that the step of identifying the initial target object model and determining the target morphological features corresponding to the target object includes: obtaining the target domain corresponding to the target object; based on the target domain, determining a preset feature extraction model corresponding to the target object, the preset feature extraction model including a size feature extraction module, a curvature feature extraction module, a symmetry feature extraction module, a topological structure feature extraction module, and a dedicated feature extraction module corresponding to the target object; obtaining the feature extraction weights corresponding to the size feature extraction module, the curvature feature extraction module, the symmetry feature extraction module, the topological structure feature extraction module, and the dedicated feature extraction module, respectively; based on the feature extraction weights, allocating computing resources to each feature extraction module; based on the computing resources allocated to each feature extraction module, controlling each feature extraction module to extract features, thereby obtaining size features, curvature features, symmetry features, topological structure features, and dedicated features corresponding to the target object; and, based on the size features, curvature features, symmetry features, topological structure features, and dedicated features, determining the target morphological features corresponding to the target object.
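
Claim 6 does not say how computing resources are derived from the feature extraction weights. One simple reading, shown here purely as an assumption, is a proportional split of a fixed budget of integer resource units (for example worker threads) using largest-remainder rounding. The function name `allocate_units` and the weight values in the usage example are hypothetical.

```python
def allocate_units(weights, total_units):
    """Split `total_units` across modules in proportion to their weights.

    Uses largest-remainder rounding so the integer shares add up exactly
    to `total_units`. This is an illustrative policy, not one taken from
    the claim.
    """
    total_w = sum(weights.values())
    if total_w <= 0:
        raise ValueError("weights must have a positive sum")
    exact = {name: total_units * w / total_w for name, w in weights.items()}
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = total_units - sum(alloc.values())
    # Hand the remaining units to the modules with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda n: exact[n] - alloc[n], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc

# Hypothetical weights for the five modules named in the claim.
shares = allocate_units(
    {"size": 6, "curvature": 5, "symmetry": 3, "topology": 2, "dedicated": 4},
    total_units=16,
)
```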

7. The method according to claim 1, characterized in that the target auxiliary data includes a target infrared image, and determining the multimodal visual features corresponding to the target object based on the target optical image and the target auxiliary data includes: performing feature extraction at different levels on each target optical image to generate the features at each level corresponding to that target optical image; based on the viewpoint type corresponding to each target optical image, determining the image weight corresponding to each target optical image; based on the image weights, fusing the features at each level to obtain a basic visual feature vector; extracting edge features from the target infrared image in the target auxiliary data to obtain the infrared edge features corresponding to the target infrared image; and fusing the basic visual feature vector and the infrared edge features to generate the multimodal visual features.
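
The two fusion steps in claim 7 can be sketched as follows. The claim does not fix the fusion operators, so this example makes assumptions: each view's multi-level features are already flattened into one vector per view, view fusion is a normalized weighted sum, and the infrared edge features are appended by concatenation. The name `fuse_multimodal` is hypothetical.

```python
import numpy as np

def fuse_multimodal(view_features, view_weights, ir_edge_feature):
    """Weighted fusion across views, then concatenation with infrared edges.

    view_features:   (n_views, d) array, one feature vector per optical view
    view_weights:    (n_views,) weights chosen per viewpoint type
    ir_edge_feature: (m,) infrared edge feature vector
    """
    w = np.asarray(view_weights, dtype=float)
    w = w / w.sum()                                   # normalize the image weights
    feats = np.asarray(view_features, dtype=float)
    basic = np.tensordot(w, feats, axes=1)            # (d,) basic visual feature vector
    return np.concatenate([basic, np.asarray(ir_edge_feature, dtype=float)])
```

A weighted sum keeps the fused vector at the per-view dimension regardless of how many views are captured, which is one reason it is a common choice for this kind of step.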

8. The method according to claim 7, characterized in that the features at each level include a first-level feature map, a second-level feature vector, and a third-level feature vector, and the step of performing feature extraction at different levels on each target optical image to generate the features at each level corresponding to the target optical image includes: performing first-level feature extraction on the target optical image to obtain first-level edge features, first-level texture features, and first-level color features; fusing the first-level edge features, the first-level texture features, and the first-level color features to generate the first-level feature map; downsampling the first-level feature map using max pooling to obtain a downsampled feature map; performing a sliding convolution aggregation operation on the downsampled feature map using multi-channel convolution kernels to obtain an aggregated feature map; convolving the aggregated feature map using dilated convolution to obtain an expanded receptive field feature map; comparing the expanded receptive field feature map with a preset response threshold to determine shape candidate regions in the expanded receptive field feature map; based on the shape candidate regions, identifying and extracting structural features to generate a structured feature map; performing global average pooling on the structured feature map to obtain the second-level feature vector; performing global average pooling on the second-level feature vector to obtain a feature vector of a preset dimension; and applying L2 normalization to the feature vector of the preset dimension to map the feature values of each dimension to a uniform scale, thus obtaining the third-level feature vector.
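
Three of the operations named in claim 8 (max pooling, global average pooling, and L2 normalization) have standard definitions that a short sketch can make concrete. The example assumes channel-first feature maps of shape (C, H, W), a 2x2 pooling window with stride 2, and even H and W; none of these choices comes from the claim.

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a (C, H, W) map; H and W must be even."""
    c, h, w = fmap.shape
    return fmap.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def global_average_pool(fmap):
    """Collapse each channel of a (C, H, W) map to its mean, giving a (C,) vector."""
    return fmap.mean(axis=(1, 2))

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length; eps avoids division by zero."""
    return vec / max(np.linalg.norm(vec), eps)
```

After L2 normalization every feature vector has unit length, so vectors from different images can be compared on the same scale.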

9. The method according to claim 1, characterized in that the step of generating a final target object model corresponding to the target object based on the target morphological features and the multimodal visual features includes: fusing the target morphological features and the multimodal visual features to obtain target fused features; reducing the dimensionality of the target fused features using principal component analysis to generate target dimensionality-reduced features; based on the target application scenario corresponding to the target object, determining the rule constraint vector corresponding to the target object; multiplying the rule constraint vector and the target dimensionality-reduced features to obtain an enhanced feature vector; based on the target application scenario, determining a preset 3D latent generation model; and inputting the enhanced feature vector into the preset 3D latent generation model to generate the final target object model corresponding to the target object.
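
The dimensionality reduction and constraint steps of claim 9 can be sketched under two assumptions that the claim leaves open: the principal components are fitted on a collection of fused feature vectors (PCA needs more than one sample), and "multiplying" means an element-wise product between the rule constraint vector and the reduced features. The names `pca_reduce` and `apply_rule_constraints` are hypothetical.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X (n_samples, d) onto the top-k principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    return Xc @ components.T, components, mean

def apply_rule_constraints(reduced, rule_vector):
    """Element-wise weighting of reduced features by a (k,) rule constraint vector."""
    return reduced * np.asarray(rule_vector, dtype=float)
```

With this reading, a constraint entry of 0 suppresses a component entirely and an entry above 1 emphasizes it, which is one way scenario-specific rules could shape the input to the generation model.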

10. The method according to claim 9, characterized in that the step of inputting the enhanced feature vector into the preset 3D latent generation model to generate the final target object model corresponding to the target object includes: splitting the enhanced feature vector into a key vector and a value vector; using the initial latent features produced during the generation process of the preset 3D latent generation model as query vectors; in each injection layer, calculating the association weights between the latent features and the enhanced feature vector using a cross-attention mechanism; based on the association weights, weighting the key vector and the value vector to obtain a weighted feature vector; updating the initial latent features based on the weighted feature vector to obtain updated latent features; outputting, by the preset 3D latent generation model, a virtual target object model based on the updated latent features; based on a preset loss function, calculating the target loss value between the virtual target object model and the initial target object model, the preset loss function including a geometric consistency loss, a feature matching loss, and a domain constraint loss; and, based on the target loss value, correcting the parameters of the preset 3D latent generation model until the target loss value is less than a preset loss value, thus obtaining the final target object model.
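
The cross-attention injection in claim 10 can be illustrated with a single-head sketch. It is a simplified reading, not the claimed model: the enhanced feature vector is reshaped into `n_tokens` rows and each row is split in half into a key and a value, the latent features act directly as queries without learned projections, attention uses the usual scaled dot product, and the update is a residual addition. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_update(latent, enhanced, n_tokens):
    """One injection layer: latent (n_latent, d) attends to the enhanced vector.

    `enhanced` must have length n_tokens * 2 * d so that each token row can be
    split into a d-dimensional key and a d-dimensional value.
    """
    tokens = np.asarray(enhanced, dtype=float).reshape(n_tokens, -1)
    K, V = np.split(tokens, 2, axis=1)               # key and value halves
    Q = latent                                       # latent features as queries
    weights = softmax(Q @ K.T / np.sqrt(K.shape[1])) # association weights
    weighted = weights @ V                           # weighted feature vectors
    return latent + weighted, weights                # residual update (assumed)
```

In a trained model, the queries, keys, and values would normally pass through learned projection matrices, and the loss terms listed in the claim would drive the parameter updates; both are omitted here to keep the attention step itself visible.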