Multi-person interaction reconstruction method based on semantic-geometric graph optimization

CN122223232A · Pending · Publication Date: 2026-06-16 · TIANJIN UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: TIANJIN UNIV
Filing Date: 2026-03-20
Publication Date: 2026-06-16

Application Information

Patent Timeline

20 Mar 2026: Application

16 Jun 2026: Publication (CN122223232A)

IPC: G06T17/00; G06V10/774; G06V10/82; G06V40/10; G06V10/80; G06T7/60; G06V10/75; G06V20/70; G06V10/766; G06F40/30; G06N3/09; G06N3/045; G06N5/04; G06N3/0464

AI Tagging

Application Domain

Image analysis; Semantic analysis


Abstract

The application discloses a multi-person interaction reconstruction method based on semantic-geometric graph optimization, and relates to the technical field of computer vision and computer graphics. The application proposes an interaction graph construction and semantic reasoning method based on a multimodal vision-language model. It uses initial pose parameters to construct a geometric-semantic fusion field that estimates fine-grained human contact relations, then iteratively optimizes the pose parameters under multimodal observation constraints through a flow matching framework, establishing a bidirectional collaborative mechanism between contact estimation and pose optimization. The method thereby achieves accurate, physically plausible, and semantically consistent multi-person interaction reconstruction from a single RGB image.


Description

Technical Field

[0001] This invention relates to the fields of computer vision and computer graphics, and in particular to a multi-person interaction reconstruction method based on semantic-geometric graph optimization.

Background Technology

[0002] Human interaction reconstruction is crucial for the development of intelligent systems and is widely used in fields such as social robotics, virtual reality, and video analytics. As artificial intelligence and computer vision technologies advance, their applications in areas such as social behavior modeling and collaborative activity understanding are deepening, and market demand is growing for 3D models that accurately reconstruct multi-person interaction relationships. Existing multi-person interaction reconstruction methods focus primarily on the geometric accuracy of individual poses and neglect fine-grained contact relationships and semantic interaction information between people. Traditional methods typically optimize poses based on geometric constraints or collision avoidance but lack explicit modeling of interaction semantics, so their reconstruction results for complex multi-person scenarios show insufficient physical consistency and semantic plausibility. In recent years, some research has attempted to introduce vision-language models for semantic reasoning, but these approaches still face challenges such as hallucination, high computational cost, and difficulty in scaling to multi-person scenarios.

[0003] Therefore, combining geometric modeling with semantic reasoning from vision-language models, and formulating interaction reconstruction as a semantic-geometric collaborative graph optimization problem, is expected to promote the application of efficient and accurate multi-person interaction reconstruction technology.

Summary of the Invention

[0004] The purpose of this invention is to propose a multi-person interaction reconstruction method based on semantic-geometric graph optimization that solves the problems described in the background art.

[0005] To achieve the above objectives, the present invention adopts the following technical solution:

[0006] A multi-person interaction reconstruction method based on semantic-geometric graph optimization includes the following steps:

S1. Training Data Construction: A dataset containing multi-person interaction scenarios is constructed for supervised training of the subsequent modules. The dataset includes RGB images of multiple people, the corresponding ground-truth 3D human poses, mesh vertex coordinates, and annotations of the contact relationships between the people.

[0007] S2. Reconstruction Network Construction: The reconstruction network consists of three parts: an initial interaction graph construction module, a geometry-aware contact estimation module, and an observation-guided pose refinement module. The initial interaction graph construction module, based on a multimodal large language model, parses the interpersonal relationships in the image, constructs an undirected interaction graph, and assigns initially estimated 3D pose parameters to each node. The geometry-aware contact estimation module predicts the probability and region of interpersonal contact by fusing semantic features from the vision-language model with the 3D geometric features of the human body mesh. The observation-guided pose refinement module, based on a flow matching method, iteratively optimizes the human pose parameters using the predicted contact, 2D keypoints, depth order, and other observations as constraints.

[0008] S3. Network Training: A phased strategy is adopted to train the reconstruction network. The geometry-aware contact estimation module and the observation-guided pose refinement module are trained independently, so that the former can predict human contact relationships from multimodal features and the latter can continuously update human pose parameters without explicit interaction constraints.

[0009] S4. Multi-person Interaction Reconstruction: A single image of multiple people interacting is taken as input. Human detection and pose initialization are performed first, and the result is fed into the trained reconstruction network. The initial interaction graph construction module generates the graph structure. Combined with the contact constraints from the contact estimation module and other observation information, the poses and relative layout of the multiple people are iteratively updated within a semantic-geometric co-optimization framework. Finally, a physically plausible and semantically consistent 3D human mesh sequence is output, completing the reconstruction of the interaction scene.

[0010] Furthermore, S2 specifically includes:

S21. Initial Interaction Graph Construction Module: This module takes a single RGB image as input. Its core function is to extract information at two levels simultaneously: first, it uses a multimodal large language model to parse the relationships and interactions between people in the image, forming semantic cues for interaction; second, it estimates the initial 3D pose of each detected person, establishing a preliminary geometric representation. Specifically, it includes the following steps:

S211. Interaction Relationship and Graph Structure Reasoning: The input image and a predefined interaction reasoning prompt are fed into a pre-trained multimodal large language model (this method uses Qwen2.5-VL 7B). Based on the prompt, the model outputs the node set $V$ and the edge set $E$ of the graph $G = (V, E)$. Each node $v_i \in V$ corresponds to a person in the image. If the model determines that two people are interacting, an undirected edge is created between the corresponding nodes. Each edge is associated with two types of semantic attributes: the interaction type inferred by the model (e.g., "hug") and the initial body-part contact pair described in natural language (e.g., "right hand-face").

[0011] The default interaction reasoning prompt is: "Given an image containing multiple people, identify them as nodes (P1, P2, ...) from left to right; detect any interactions between them and output a JSON graph structure containing nodes and undirected edges; each edge should contain possible interaction types or body parts that are touched."

S212. Individual Geometric Feature Initialization: Global image features are extracted using a ViT backbone network pre-trained with DINOv2. These features are fused with the camera parameters and passed through a regression network to estimate, for each node in the interaction graph constructed in S211, its initial SMPL parameters (pose $\theta$, shape $\beta$, and location $\tau$), which serve as the geometric attributes of the node.
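The JSON graph requested by the prompt could be turned into an undirected interaction graph roughly as follows. This is a minimal sketch only: the source does not specify the JSON schema, so the keys `nodes`, `edges`, `source`, `target`, `interaction`, and `contact_pairs`, as well as the class and function names, are hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class InteractionEdge:
    a: str
    b: str
    interaction: str                      # e.g. "hug"
    contact_pairs: list = field(default_factory=list)  # e.g. ["right hand-face"]


@dataclass
class InteractionGraph:
    nodes: list
    edges: list


def parse_interaction_graph(raw: str) -> InteractionGraph:
    """Parse a model reply (hypothetical schema) into an undirected graph."""
    data = json.loads(raw)
    nodes = list(data["nodes"])
    known = set(nodes)
    seen = set()
    edges = []
    for e in data.get("edges", []):
        a, b = e["source"], e["target"]
        # Skip self-loops and edges that mention unknown people.
        if a == b or a not in known or b not in known:
            continue
        # Undirected: (P1, P2) and (P2, P1) are the same edge.
        key = frozenset((a, b))
        if key in seen:
            continue
        seen.add(key)
        edges.append(InteractionEdge(a, b, e.get("interaction", ""),
                                     list(e.get("contact_pairs", []))))
    return InteractionGraph(nodes, edges)


sample = (
    '{"nodes": ["P1", "P2", "P3"], "edges": ['
    '{"source": "P1", "target": "P2", "interaction": "hug", '
    '"contact_pairs": ["right hand-face"]}, '
    '{"source": "P2", "target": "P1", "interaction": "hug"}]}'
)
graph = parse_interaction_graph(sample)
```

Deduplicating on an unordered pair keeps the graph undirected even if the model reports the same interaction from both sides.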

[0012] The undirected interaction graph $G$ constructed through the above steps encodes both the interpersonal interaction semantics in the scene and the initial 3D geometric state of each individual.

[0013] S22. Geometry-Aware Contact Estimation Module: This module takes the undirected interaction graph constructed in step S21 and, treating each interaction edge as a unit, fuses semantic and 3D geometric information to achieve refined prediction of the contact areas between human bodies. Specifically, it includes the following steps:

S221. Multimodal Feature Fusion: For each pair of people connected by an interaction edge, all 6890 vertices of their SMPL meshes are projected onto the image plane, and the visual features at the corresponding positions are sampled from the ViT feature map. For each vertex, its 3D coordinates, a learnable identity embedding, a learnable body-part embedding, and the sampled visual features are concatenated to obtain a fused feature vector. The fused features are then input into a PointNet++ network to extract high-level geometric features.
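The per-vertex fusion in S221 can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: a pinhole camera with a single focal length, nearest-cell sampling of the feature map, fixed (not learned) embedding tables, and tiny feature sizes. The PointNet++ stage is not shown.

```python
def project(vertex, focal, cx, cy):
    """Pinhole projection of a 3D point in camera coordinates to pixels."""
    x, y, z = vertex
    return focal * x / z + cx, focal * y / z + cy


def sample_feature(feature_map, u, v, stride):
    """Nearest-cell lookup; each cell covers stride x stride pixels."""
    rows, cols = len(feature_map), len(feature_map[0])
    col = min(max(int(u // stride), 0), cols - 1)
    row = min(max(int(v // stride), 0), rows - 1)
    return feature_map[row][col]


def fuse_vertex_features(vertices, part_ids, person_id, feature_map,
                         identity_emb, part_emb, focal, cx, cy, stride):
    """Concatenate [xyz | identity embedding | part embedding | visual feature]."""
    fused = []
    for vertex, part in zip(vertices, part_ids):
        u, v = project(vertex, focal, cx, cy)
        visual = sample_feature(feature_map, u, v, stride)
        fused.append(list(vertex) + identity_emb[person_id]
                     + part_emb[part] + visual)
    return fused


# Toy data: a 2x2 feature map with 2 channels over a 32x32 image.
feature_map = [[[0.1, 0.2], [0.3, 0.4]],
               [[0.5, 0.6], [0.7, 0.8]]]
identity_emb = [[1.0, 0.0], [0.0, 1.0]]   # one row per person (hypothetical)
part_emb = [[0.5], [0.9]]                 # one row per body part (hypothetical)
vertices = [(0.0, 0.0, 2.0), (-1.0, -1.0, 1.0)]
fused = fuse_vertex_features(vertices, part_ids=[0, 1], person_id=0,
                             feature_map=feature_map,
                             identity_emb=identity_emb, part_emb=part_emb,
                             focal=10.0, cx=16.0, cy=16.0, stride=16)
```

In a real pipeline the same concatenation would run over all mesh vertices of both people on an edge, with bilinear sampling and trained embeddings in place of the toy lookups.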

[0014] S222. Contact Signature Encoding: An initial contact signature matrix $S^{0} \in \{0, 1\}^{R \times R}$ is constructed from the initial contact part pairs associated with the interaction edge (e.g., "right hand-face"), where $R$ is the number of body parts. The contact signature matrix is encoded into a feature representation using a lightweight CNN.
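A binary contact signature matrix of this kind could be built from the natural-language part pairs as sketched below. The part list is a hypothetical placeholder (the source does not list the body parts or their number), and the row/column convention (rows for the first person, columns for the second) is likewise an assumption.

```python
# Hypothetical body-part vocabulary; the real method's part list is not given.
PARTS = ["head", "face", "torso", "left hand", "right hand"]


def build_contact_signature(contact_pairs, parts=PARTS):
    """Return an R x R 0/1 matrix; entry [i][j] = 1 if part i of person A
    is expected to touch part j of person B."""
    index = {name: i for i, name in enumerate(parts)}
    matrix = [[0] * len(parts) for _ in parts]
    for pair in contact_pairs:
        left, _, right = pair.partition("-")
        a, b = left.strip(), right.strip()
        # Silently ignore part names outside the vocabulary.
        if a in index and b in index:
            matrix[index[a]][index[b]] = 1
    return matrix


signature = build_contact_signature(["right hand-face", "elbow-knee"])
```

Pairs that mention parts outside the vocabulary are dropped here; mapping free-form model output onto a fixed part list would need a more careful matching step in practice.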

[0015] S223. Text Semantic Extraction: Input the text description "{Person A} and {Person B} are interacting with {interaction type}, and the contact occurs at {contact point}" generated by the multimodal large language model into the CLIP text encoder to extract the corresponding text semantic features.

[0016] S224. Contact Prediction: The geometric features, contact signature features, and text semantic features described above are concatenated and input into a multilayer perceptron (MLP) to predict the following outputs in parallel: (1) the contact label $\hat{c}$; (2) the refined region-level contact signature matrix $\hat{S}$; (3) the region-level contact segmentation $\hat{M}$.

[0017] This module achieves reliable, fine-grained estimation of interpersonal contact by combining semantic guidance and geometric constraints.

[0018] S23. Observation-Guided Pose Refinement Module: Based on the flow matching method, this module performs differentiable iterative optimization of human pose under various observation constraints. Flow matching directly models the evolution from the initial prediction to the refined pose by constructing a vector field over a continuous time interval. Specifically, the pose optimization process is expressed as the following ordinary differential equation: $\frac{d\theta_t}{dt} = v_\phi(\theta_t, t)$, where $v_\phi$ is the learned vector field and $\theta_t$ denotes the SMPL pose parameters at time $t$. The process starts from the initial SMPL parameters $\theta_0$ and drives the state so that it approaches the target ground-truth distribution at $t = 1$.
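Integrating such a pose ODE from time 0 to 1 can be sketched with a fixed-step Euler scheme. The vector field below is a toy stand-in for the learned network (a constant straight-line velocity toward a known target), and all names and values are illustrative assumptions.

```python
def euler_integrate(vector_field, theta0, steps=10):
    """Integrate d(theta)/dt = vector_field(theta, t) from t=0 to t=1."""
    theta = list(theta0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        velocity = vector_field(theta, t)
        theta = [x + dt * v for x, v in zip(theta, velocity)]
    return theta


# Toy pose vectors (a real SMPL pose has many more dimensions).
theta_init = [0.0, 1.0, -0.5]
theta_target = [0.2, 0.8, 0.1]


def toy_field(theta, t):
    """Stand-in for the learned field: constant velocity toward the target."""
    return [b - a for a, b in zip(theta_init, theta_target)]


refined = euler_integrate(toy_field, theta_init, steps=10)
```

With a constant velocity field the Euler scheme lands on the target at $t = 1$ up to floating-point error; a learned field would generally require enough steps to keep the integration error small.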

[0019] Furthermore, S3 specifically includes:

S31. The geometry-aware contact estimation module is trained independently so that it can predict human contact relationships from multimodal features. According to S224, the prediction output of the geometry-aware contact estimation module consists of three parts: (1) the contact label $\hat{c}$ (whether contact has occurred); (2) the refined region-level contact signature matrix $\hat{S}$; (3) the region-level contact segmentation $\hat{M}$.

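[0020] To train this module, each predicted output is supervised using a cross-entropy loss, and the total loss function is $\mathcal{L}_{est} = \mathrm{CE}(\hat{c}, c) + \mathrm{CE}(\hat{S}, S) + \mathrm{CE}(\hat{M}, M)$, where $\hat{c}$ and $c$ are the predicted and ground-truth contact labels, $\hat{S}$ and $S$ are the predicted and ground-truth region-level contact signature matrices, $\hat{M}$ and $M$ are the predicted and ground-truth region-level contact segmentations, and $\mathrm{CE}$ denotes the cross-entropy loss. The updated contact signature matrix is used for subsequent 3D pose optimization.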
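A three-term cross-entropy supervision of this kind might look like the sketch below. It assumes, for illustration only, that all three outputs are probabilities with binary targets, that the terms are equally weighted, and that the segmentation is a flat list of per-region probabilities; none of these details are stated in the source.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(pred)


def flatten(matrix):
    return [x for row in matrix for x in row]


def contact_estimation_loss(label_pred, label_gt, sig_pred, sig_gt,
                            seg_pred, seg_gt):
    """Sum of cross-entropy terms for label, signature matrix, segmentation."""
    return (binary_cross_entropy([label_pred], [label_gt])
            + binary_cross_entropy(flatten(sig_pred), flatten(sig_gt))
            + binary_cross_entropy(seg_pred, seg_gt))


sig_gt = [[0, 1], [0, 0]]
seg_gt = [1, 0, 0]
loss_good = contact_estimation_loss(1.0, 1, [[0.0, 1.0], [0.0, 0.0]], sig_gt,
                                    [1.0, 0.0, 0.0], seg_gt)
loss_bad = contact_estimation_loss(0.2, 1, [[0.9, 0.1], [0.5, 0.5]], sig_gt,
                                   [0.3, 0.7, 0.5], seg_gt)
```

A training framework would normally compute these terms on logits with its own numerically stable cross-entropy routine; the clamping here only keeps the toy example finite.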

[0021] S32. The observation-guided pose refinement module is trained independently so that it can continuously update human pose parameters without explicit interaction constraints. To train the flow matching model, a straight-line path is defined from the initial pose $\theta_0$ to the target pose $\theta_1$: $\theta_t = (1 - t)\,\theta_0 + t\,\theta_1$, $t \in [0, 1]$. The model learns a vector field $v_\phi$ that matches the derivative of this path, $\frac{d\theta_t}{dt} = \theta_1 - \theta_0$. The model is trained by minimizing the objective $\mathcal{L}_{FM} = \mathbb{E}_{t, \theta_0, \theta_1} \left\| v_\phi(\theta_t, t) - (\theta_1 - \theta_0) \right\|^2$. This training enables the model to learn to predict reasonable directions of pose evolution.
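The straight-line flow matching objective can be evaluated for any candidate vector field with a short Monte Carlo sketch. The candidate fields below are hand-written stand-ins for a neural network, and the pose vectors are toy two-dimensional values chosen for illustration.

```python
import random


def interpolate(theta0, theta1, t):
    """Point on the straight-line path from theta0 (t=0) to theta1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(theta0, theta1)]


def flow_matching_loss(vector_field, pairs, num_samples=8, seed=0):
    """Mean squared error between the field and the path velocity."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for theta0, theta1 in pairs:
        # Derivative of the linear path is constant: theta1 - theta0.
        target_velocity = [b - a for a, b in zip(theta0, theta1)]
        for _ in range(num_samples):
            t = rng.random()
            theta_t = interpolate(theta0, theta1, t)
            pred = vector_field(theta_t, t)
            total += sum((p - v) ** 2 for p, v in zip(pred, target_velocity))
            count += 1
    return total / count


pairs = [([0.0, 0.0], [1.0, 2.0])]        # (initial pose, target pose)
oracle_loss = flow_matching_loss(lambda theta, t: [1.0, 2.0], pairs)
zero_loss = flow_matching_loss(lambda theta, t: [0.0, 0.0], pairs)
```

A field that outputs exactly the path velocity reaches zero loss, while a field that predicts no motion pays the squared length of the displacement, which is the behavior the training objective is meant to reward and penalize.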

[0022] Furthermore, S4 specifically includes:

S41. Semantic-Geometric Co-optimization Framework: Given an initial interaction pair, the geometry-aware contact estimation module predicts a contact signature from the geometric-semantic field. This signature is used to generate contact constraints and to compute the contact loss $\mathcal{L}_{contact}$, which is defined over the set of body parts that are expected to come into contact. To further ensure the pose accuracy of each individual, a reprojection error $\mathcal{L}_{reproj}$, computed against the predicted 2D keypoints, and a pose regularization term $\mathcal{L}_{reg}$, computed relative to the initial body pose, are introduced as guidance. A depth ranking loss $\mathcal{L}_{depth}$ is also proposed; it uses the depth estimation model Depth Anything V2 to recover a reasonable spatial layout and is defined on the 3D root locations and the estimated depth map. In addition, a penetration loss $\mathcal{L}_{pen}$ is introduced. It is computed over the set of all colliding triangle pairs detected by Bounding Volume Hierarchies (BVH) and, for each pair, uses the vertices of the two triangles, the surface normals at those vertices, and the value of the 3D distance field of the local body region around the other triangle evaluated at each vertex. The penetration loss further penalizes physically implausible overlap. The total loss $\mathcal{L}_{total}$ combines the contact, reprojection, pose regularization, depth ranking, and penetration terms.
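Two of the simpler guidance terms, reprojection error and pose regularization, and their combination into a total loss can be sketched as follows. The exact formulas and weights are not recoverable from the text above, so the squared-error forms, the pinhole camera, and the weight values are all assumptions; the contact, depth ranking, and penetration terms are left out.

```python
def project_point(point, focal, cx, cy):
    """Pinhole projection of a 3D joint in camera coordinates to pixels."""
    x, y, z = point
    return focal * x / z + cx, focal * y / z + cy


def reprojection_loss(joints_3d, keypoints_2d, focal, cx, cy):
    """Mean squared pixel distance between projected joints and 2D keypoints."""
    total = 0.0
    for joint, (ku, kv) in zip(joints_3d, keypoints_2d):
        u, v = project_point(joint, focal, cx, cy)
        total += (u - ku) ** 2 + (v - kv) ** 2
    return total / len(joints_3d)


def pose_regularization(theta, theta_init):
    """Squared deviation of the current pose from the initial pose."""
    return sum((a - b) ** 2 for a, b in zip(theta, theta_init))


def total_guidance_loss(terms, weights):
    """Weighted sum of named loss terms (weights are hypothetical)."""
    return sum(weights[name] * value for name, value in terms.items())


joints = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
exact_kps = [(50.0, 50.0), (100.0, 50.0)]
shifted_kps = [(53.0, 50.0), (100.0, 50.0)]
reproj_exact = reprojection_loss(joints, exact_kps, 100.0, 50.0, 50.0)
reproj_shifted = reprojection_loss(joints, shifted_kps, 100.0, 50.0, 50.0)
reg = pose_regularization([0.1, 0.2], [0.1, 0.0])
total = total_guidance_loss({"reproj": reproj_shifted, "reg": reg},
                            {"reproj": 1.0, "reg": 10.0})
```

Keeping each term as a named entry makes it easy to add the remaining losses to the same weighted sum once their definitions are fixed.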

[0023] S42. To incorporate the observation constraints, the flow matching model employs conditional sampling to guide pose optimization. Specifically, the trained vector field $v_\phi$ is modified by a guidance term: a gradient term, computed from the observation losses (such as the contact loss and the reprojection loss) with respect to the current pose and conditioned on the observed variables $y$, is scaled by a factor $\lambda$ and combined with $v_\phi$ to form the guided vector field. This field iteratively updates the pose parameters through numerical integration (such as the Euler method), generating a new pose and mesh that are physically more plausible. The new mesh is then fed back into the contact estimation branch, starting the next iteration. This process is repeated (e.g., 10 times) until the pose and contact predictions stabilize, forming a closed loop of "estimation-constraint-optimization-re-estimation" that ultimately outputs a high-quality interaction reconstruction result that is geometrically and semantically consistent.
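Loss-guided Euler integration of a vector field can be sketched as below. The guided field is taken here to be the learned field minus the scaled loss gradient, which is one common form of guidance and an assumption about the exact formula; the gradient is approximated by central differences, and the learned field and observation loss are toy stand-ins.

```python
def numerical_gradient(loss_fn, theta, h=1e-5):
    """Central-difference gradient of a scalar loss w.r.t. the pose vector."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += h
        down[i] -= h
        grad.append((loss_fn(up) - loss_fn(down)) / (2.0 * h))
    return grad


def guided_euler(vector_field, loss_fn, theta0, scale, steps=10):
    """Euler integration of v(theta, t) - scale * grad(loss)(theta)."""
    theta = list(theta0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        velocity = vector_field(theta, t)
        grad = numerical_gradient(loss_fn, theta)
        theta = [x + dt * (v - scale * g)
                 for x, v, g in zip(theta, velocity, grad)]
    return theta


observed = [0.0, 0.0]                      # toy observation target


def observation_loss(theta):
    return sum((x - o) ** 2 for x, o in zip(theta, observed))


def still_field(theta, t):
    """Toy learned field that proposes no motion at all."""
    return [0.0 for _ in theta]


start = [1.0, 1.0]
unguided = guided_euler(still_field, observation_loss, start, scale=0.0)
guided = guided_euler(still_field, observation_loss, start, scale=1.0)
```

In an actual implementation the gradient would come from automatic differentiation through the SMPL model and the loss terms rather than from finite differences.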

[0024] Compared with the prior art, the present invention has the following beneficial effects: (1) This invention breaks through the limitation of existing methods that rely on fixed two-person interaction templates and proposes a general representation method based on undirected interaction graphs, which can flexibly model complex interaction relationships between any number of individuals, and significantly improves the scalability and scenario applicability of the method.

[0025] (2) This invention combines high-level semantic reasoning from visual-language models with traditional geometric modeling, and explicitly introduces semantic information such as interaction type and contact intention during the reconstruction process, thereby improving the interpretability and semantic rationality of the reconstruction results.

[0026] (3) This invention proposes a bidirectional iterative semantic-geometric co-optimization mechanism. Through the closed-loop interaction between the contact estimation and pose optimization modules, semantic constraints are used to alleviate the geometric ambiguity of single-view reconstruction, improving the physical accuracy and the semantic consistency of the reconstruction at the same time.

[0027] (4) The invention relies on only a single RGB camera, which is inexpensive and easy to use.

Attached Figure Description

[0028] Figure 1 is an overall flowchart of the multi-person interaction reconstruction method based on semantic-geometric graph optimization proposed in this invention; Figure 2 is a diagram of the semantic-geometric co-optimization framework proposed in Embodiment 1 of the present invention; Figure 3 shows a reconstruction result for a crowded outdoor scene obtained in Embodiment 1 of the present invention.

Detailed Implementation

[0029] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments.

[0030] Embodiment 1: Referring to Figure 1, this invention proposes a multi-person interaction reconstruction method based on semantic-geometric graph optimization, comprising the following steps:

S1, Training Data Construction: As shown in Figure 1, the training data required for this method is constructed from existing publicly available datasets. Datasets containing 3D human body annotations are selected. For data that provides only 3D joint annotations, a skeleton-skinned 3D human mesh model is fitted frame by frame to the 3D joints of each person to obtain the corresponding mesh model parameters, which serve as the complete 3D human annotation. The 2D joints, 3D joints, and human mesh model parameters are organized and stored in a unified format as the ground truth of the training data. This ground truth is used as the supervisory signal for parameter optimization during subsequent network training and is not used during the inference phase. Furthermore, to train the subsequent geometry-aware contact estimation module, a dataset containing fine-grained human-to-human contact annotations is also required to provide supervision on contact relationships.

[0031] S2, Network Construction

S2.1. As shown in Figure 1, a pre-trained multimodal large language model (Qwen2.5-VL) takes the input image and a predefined interaction reasoning prompt and outputs an undirected interaction graph. Each node in the output graph corresponds to a person in the image; if the model infers that two people interact, an undirected edge is established between the corresponding nodes. Each edge carries two semantic attributes: the interaction type inferred by the model and the initial body-part contact pair described in natural language. A ViT backbone network pre-trained with DINOv2 is constructed to extract image features, which are then fused with the camera parameters. The SMPL parameters are estimated through a regression head and serve as the geometric attributes of the nodes.

[0032] S2.2. A PointNet++ network takes as input the fusion of the 3D coordinates of the human mesh vertices, their identity and body-part embeddings, and the visual features sampled at their projections, and extracts geometric features representing interpersonal spatial relationships. A lightweight CNN encodes the binary contact signature matrix, constructed from the initial contact part pairs, into a dense feature representation. The structured text description of the interaction type and contact location is input into the CLIP text encoder to extract high-level semantic features. A multilayer perceptron (MLP) is constructed; the geometric features, contact signature features, and text semantic features above are concatenated as its input to predict, in parallel, the contact label, the refined region-level contact signature matrix, and the region-level contact segmentation. Based on the above structure, the geometry-aware contact estimation module in Figure 1 is implemented.

[0033] S2.3. Based on the flow matching model, the evolution from the initial predicted pose to the refined pose is modeled by constructing a vector field over continuous time. The pose optimization is described by the ordinary differential equation $\frac{d\theta_t}{dt} = v_\phi(\theta_t, t)$, where $v_\phi$ is the learned vector field and $\theta_t$ denotes the SMPL pose parameters at time $t$.

[0034] The process starts from the initial SMPL parameters $\theta_0$ and drives the state so that it approaches the target ground-truth distribution at $t = 1$. Based on the above structure, the observation-guided pose refinement module in Figure 1 is implemented.

[0035] S3, Network Training

S3.1. As shown in Figure 1, the initial single-frame human pose regression network is trained as follows: using the dataset constructed in S1, a human mesh regression model based on the ViT backbone network is trained. The training loss typically includes a 2D reprojection loss, 3D joint and vertex losses, and a parameter regularization loss. The trained model is then frozen and used to provide the initial geometric parameters for each node in the interaction graph during subsequent inference.

[0036] S3.2. As shown in Figure 1, the geometry-aware contact estimation module is trained as follows: the module is trained on the dataset containing fine-grained human-to-human contact annotations constructed in S1. The geometric, signature, and text features are fused and input into a multilayer perceptron (MLP) that predicts, in parallel, the contact label $\hat{c}$, the refined region-level contact signature matrix $\hat{S}$, and the region-level contact segmentation $\hat{M}$. A cross-entropy loss supervises the prediction of the contact label, the contact signature, and the contact segmentation. The loss function is $\mathcal{L}_{est} = \mathrm{CE}(\hat{c}, c) + \mathrm{CE}(\hat{S}, S) + \mathrm{CE}(\hat{M}, M)$, where $c$, $S$, and $M$ are the corresponding ground-truth annotations.

[0037] S3.3. As shown in Figure 1, the observation-guided pose refinement module is trained as follows: using the dataset constructed in S1, a flow-matching-based pose optimizer is trained. For each training sample, the initial pose is denoted $\theta_0$ and the target pose is denoted $\theta_1$. Over normalized time $t \in [0, 1]$, a linear interpolation path is constructed: $\theta_t = (1 - t)\,\theta_0 + t\,\theta_1$. A vector field $v_\phi$ parameterized by a neural network is then trained to approximate the instantaneous velocity of this path at any time $t$, which is $\theta_1 - \theta_0$. The corresponding training loss function is $\mathcal{L}_{FM} = \mathbb{E}_{t, \theta_0, \theta_1} \left\| v_\phi(\theta_t, t) - (\theta_1 - \theta_0) \right\|^2$.

[0038] S4, Multi-person Interaction Reconstruction (Model Inference)

S4.1. As shown in Figure 1, the input is a single RGB image. The multimodal large language model analyzes the image according to the prompt, identifies all people as nodes, establishes edges for node pairs that interact, and provides the initial interaction semantics. In parallel, a single-person pose estimator regresses the initial SMPL parameters for each node as its geometric attributes. The core semantic-geometric co-optimization framework is then executed. It is an iterative closed-loop process among the trained modules that jointly optimizes geometry and semantics. The semantic-geometric co-optimization framework is shown in Figure 2.

[0039] S4.2. As shown in Figure 2, geometric, contact, and semantic features are first obtained from the interaction graph produced in S4.1 and the textual description from the MLLM; they are then fused and input into the MLP, which predicts a refined contact signature matrix. Next, using observations such as the 2D keypoints estimated from the image and the monocular depth map, the contact loss, reprojection loss, pose regularization, depth ranking loss, and penetration loss are calculated. Together these constitute the total guidance loss.

[0040] S4.3. The gradient of the loss with respect to the current SMPL parameters is used as a strong guidance signal and input into the pre-trained flow matching model. The flow matching model modifies the vector field using conditional sampling. Guided by the current pose and the observation gradient, new SMPL pose parameters that are more physically plausible and more consistent with the observations are generated through numerical integration.

[0041] S4.4. As shown in Figure 2, the updated mesh is fed back to the contact estimation network of step S4.2 to re-predict more accurate contacts. This closed loop of "contact estimation → loss calculation → pose optimization → mesh update" is executed iteratively several times (e.g., 10 times), allowing geometric reconstruction and semantic understanding to continuously correct and enhance each other until the results converge. Finally, the framework outputs geometrically accurate, physically plausible, and semantically consistent multi-person 3D interaction reconstruction results. An example of the reconstruction results is shown in Figure 3.
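The outer closed loop of S4.2 to S4.4 can be outlined as a small driver function. This is a structural sketch only: `estimate_contacts` and `refine_pose` are hypothetical callables standing in for the contact estimation network and the guided flow matching step, the stopping rule (small pose change plus unchanged contacts) is an assumed convergence test, and the one-dimensional "pose" is a toy.

```python
def co_optimize(pose, estimate_contacts, refine_pose, max_iters=10, tol=1e-3):
    """Alternate contact estimation and pose refinement until stable."""
    contacts = estimate_contacts(pose)
    iterations = 0
    for iterations in range(1, max_iters + 1):
        new_pose = refine_pose(pose, contacts)        # loss + guided update
        new_contacts = estimate_contacts(new_pose)    # re-estimate on new mesh
        change = max(abs(a - b) for a, b in zip(new_pose, pose))
        stable = change < tol and new_contacts == contacts
        pose, contacts = new_pose, new_contacts
        if stable:
            break
    return pose, contacts, iterations


def toy_estimate_contacts(pose):
    """Toy rule: predict a contact when the hand-face gap is below 0.5."""
    return {"right hand-face"} if pose[0] < 0.5 else set()


def toy_refine_pose(pose, contacts):
    """Toy rule: halve the gap whenever a contact is predicted."""
    return [pose[0] * 0.5] if contacts else list(pose)


final_pose, final_contacts, used = co_optimize(
    [0.4], toy_estimate_contacts, toy_refine_pose)
```

Passing the two stages in as callables keeps the loop independent of how contacts are predicted or how the pose update is computed, which mirrors the modular training described earlier.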

[0042] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the specific embodiments described above. The specific embodiments and descriptions in the specification are merely for further illustrating the principles of the invention. Various changes and modifications can be made to the present invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the present invention as claimed. The scope of protection of the present invention is defined by the claims and their equivalents.

Claims

1. A multi-person interaction reconstruction method based on semantic-geometric graph optimization, characterized in that it comprises the following steps:

S1, Training Data Construction: a dataset containing multi-person interaction scenarios is constructed for supervised training of subsequent modules; the dataset includes multi-person RGB images, the corresponding ground-truth 3D human poses, mesh vertex coordinates, and annotations of the contact relationships between human bodies;

S2, Reconstruction Network Construction: the reconstruction network consists of three parts: an initial interaction graph construction module, a geometry-aware contact estimation module, and an observation-guided pose refinement module; the initial interaction graph construction module, based on a multimodal large language model, parses the interpersonal relationships in the image, constructs an undirected interaction graph, and assigns initially estimated 3D pose parameters to each node; the geometry-aware contact estimation module predicts the probability and region of interpersonal contact by fusing semantic features from the vision-language model with the 3D geometric features of the human body mesh; the observation-guided pose refinement module, based on a flow matching method, iteratively optimizes the human pose parameters using the predicted contact, 2D keypoints, and depth order as constraints;

S3, Network Training: a phased strategy is adopted to train the reconstruction network constructed in S2, and the geometry-aware contact estimation module and the observation-guided pose refinement module are trained independently, so that they can, respectively, predict human contact relationships from multimodal features and continuously update human pose parameters without explicit interaction constraints;

S4, Multi-person Interaction Reconstruction: a single multi-person interaction image is input; human detection and pose initialization are performed first, and the result is input into the reconstruction network trained in S3; the graph structure is obtained through the initial interaction graph construction module and combined with the contact constraints output by the geometry-aware contact estimation module and other observation information; the multi-person poses and relative layout are iteratively updated under the semantic-geometric co-optimization framework; finally, a physically plausible and semantically consistent 3D human mesh sequence is output, completing the reconstruction of the interaction scene.

2. The method according to claim 1, characterized in that the initial interaction graph construction module described in S2 takes a single RGB image as input and extracts information at two levels: first, it uses a multimodal large language model to parse the relationships and interaction semantics between people in the image, forming semantic cues for interaction; second, it estimates the initial 3D pose of each detected person and establishes a preliminary geometric representation; specifically, it includes the following steps:

S211, Interaction Relationship and Graph Structure Reasoning: the input image and a predefined interaction reasoning prompt are input into a trained multimodal large language model; based on the prompt, the model outputs the node set $V$ and the edge set $E$ of the graph $G = (V, E)$; each node $v_i \in V$ corresponds to a person in the image; if the model determines that two people are interacting, an undirected edge is established between the corresponding nodes; each edge is associated with two types of semantic attributes: the interaction type inferred by the model and the initial body-part contact pair described in natural language;

S212, Individual Geometric Feature Initialization: global features of the image are extracted using a ViT backbone network pre-trained with DINOv2; these global features are fused with the camera parameters and passed through a regression network to estimate, for each node in the interaction graph constructed in S211, its initial SMPL parameters as the geometric attributes of the node.

3. The method according to claim 2, characterized in that the geometry-aware contact estimation module described in S2 takes the undirected interaction graph constructed by the initial interaction graph construction module and, treating each interaction edge as a unit, fuses semantic and 3D geometric information to achieve refined prediction of the contact areas between human bodies; specifically, it includes the following steps:

S221, Multimodal Feature Fusion: for each pair of people connected by an interaction edge, all vertices of their SMPL meshes are projected onto the image plane, and the visual features at the corresponding positions are sampled from the ViT feature map; for each vertex, its 3D coordinates, a learnable identity embedding, a learnable body-part embedding, and the sampled visual features are concatenated to obtain a fused feature vector; the fused features are then input into a PointNet++ network to extract high-level geometric features;

S222, Contact Signature Encoding: an initial contact signature matrix $S^{0} \in \{0, 1\}^{R \times R}$ is constructed from the initial contact part pairs associated with the interaction edge, where $R$ is the number of body parts; the contact signature matrix is encoded into a feature representation using a lightweight CNN;

S223, Text Semantic Extraction: the text description generated by the multimodal large language model is input into the CLIP text encoder to extract the corresponding text semantic features;

S224, Contact Prediction: the geometric features, contact signature features, and text semantic features described above are concatenated and input into a multilayer perceptron to predict the following outputs in parallel: (1) the contact label $\hat{c}$, which indicates whether the two people are in contact; (2) the refined region-level contact signature matrix $\hat{S}$; (3) the region-level contact segmentation $\hat{M}$.

4. The method according to claim 3, characterized in that the observation-guided pose optimization module described in S2 is based on the flow matching method and performs differentiable iterative optimization of human pose under multiple observation constraints. Flow matching directly models the evolution from the initial prediction to the refined pose by constructing a vector field over continuous time. The pose optimization process can be represented by the following ordinary differential equation: $\frac{d\theta_t}{dt} = v(\theta_t, t)$, where $v$ is the learned vector field and $\theta_t$ denotes the SMPL pose parameters at time $t$; the process starts from the initial SMPL parameters $\theta_0$ and drives the state so that, at $t = 1$, it approaches the true target distribution.
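Numerically, an ordinary differential equation of this kind is solved by stepping the pose along the vector field from $t = 0$ to $t = 1$. The following is a minimal sketch using forward Euler integration. The choice of integrator, the function name `integrate_pose`, and the toy constant vector field are assumptions made for illustration, and the learned network is replaced by a plain Python callable.

```python
def integrate_pose(theta0, vector_field, num_steps=10):
    """Forward-Euler integration of d(theta)/dt = v(theta, t) over t in [0, 1]."""
    theta = list(theta0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        velocity = vector_field(theta, t)
        theta = [x + dt * v for x, v in zip(theta, velocity)]
    return theta


# Toy stand-in for the learned field: a constant velocity pointing from the
# initial pose to a known target pose.
initial_pose = [0.0, 0.5, -1.0]
target_pose = [1.0, 0.0, 2.0]


def toy_field(theta, t):
    return [b - a for a, b in zip(initial_pose, target_pose)]


refined_pose = integrate_pose(initial_pose, toy_field)
```

With a constant velocity field the Euler steps land on the target up to floating-point error; a learned field is generally curved, so the number of steps trades accuracy against cost.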

5. The method according to claim 4, characterized in that S3 specifically includes:
S31. Independently training the geometry-aware contact estimation module, enabling it to predict human contact relationships from multimodal features. During training, each predicted output is supervised using a cross-entropy loss, and the total loss function is: $L = \mathrm{CE}(\hat{c}, c) + \mathrm{CE}(\hat{S}, S) + \mathrm{CE}(\hat{M}, M)$, where $\hat{c}$ and $c$ denote the predicted and ground-truth contact labels, respectively; $\hat{S}$ and $S$ denote the predicted and ground-truth region-level contact signature matrices, respectively; $\hat{M}$ and $M$ denote the predicted and ground-truth region-level contact segmentations, respectively; and $\mathrm{CE}$ denotes the cross-entropy loss function. The updated contact signature matrix is used for subsequent 3D pose optimization.
S32. Independently training the observation-guided pose optimization module, enabling it to continuously update human pose parameters without explicit interaction constraints. A straight path from the initial pose $\theta_0$ to the target pose $\theta_1$ is defined: $\theta_t = (1 - t)\,\theta_0 + t\,\theta_1$. The flow matching model is trained so that it learns a vector field $v$ matching the derivative of this path: $\frac{d\theta_t}{dt} = \theta_1 - \theta_0$. The model is trained by minimizing the following objective function: $\mathbb{E}_{t,\theta_0,\theta_1} \left\| v(\theta_t, t) - (\theta_1 - \theta_0) \right\|^2$. This training enables the model to learn to predict reasonable directions of pose evolution.
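The flow matching objective of S32 can be estimated by Monte Carlo sampling: draw a pose pair and a time, move to the corresponding point on the straight path, and compare the predicted velocity with the path derivative. The sketch below assumes that $t$ is sampled uniformly from $[0, 1)$ and that the vector field is an ordinary Python callable; both are illustrative choices, as are the function names.

```python
import random


def straight_path(theta0, theta1, t):
    """Point on the straight path between the initial and target pose."""
    return [(1.0 - t) * a + t * b for a, b in zip(theta0, theta1)]


def flow_matching_loss(vector_field, pose_pairs, num_samples=64, seed=0):
    """Monte Carlo estimate of E || v(theta_t, t) - (theta1 - theta0) ||^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta0, theta1 = rng.choice(pose_pairs)
        t = rng.random()                       # assumed: t ~ Uniform[0, 1)
        theta_t = straight_path(theta0, theta1, t)
        target_velocity = [b - a for a, b in zip(theta0, theta1)]
        predicted = vector_field(theta_t, t)
        total += sum((p - q) ** 2 for p, q in zip(predicted, target_velocity))
    return total / num_samples


pairs = [([0.0, 0.0], [1.0, -2.0])]
oracle_loss = flow_matching_loss(lambda theta, t: [1.0, -2.0], pairs)
zero_field_loss = flow_matching_loss(lambda theta, t: [0.0, 0.0], pairs)
```

A field that returns the exact path derivative gives zero loss, while a field that always returns zero is penalised by the squared length of the displacement.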

6. The method according to claim 5, characterized in that S4 specifically includes:
S41. Semantic-geometric co-optimization framework: given an initial interaction pair, a contact signature is predicted from the geometric-semantic field using the geometry-aware contact estimation module; contact constraints are then generated from the contact signature, and a contact loss $L_{contact}$ is calculated over $C$, the set of body parts that are expected to come into contact. A reprojection error $L_{reproj}$ and a pose regularization term $L_{reg}$ are introduced as guidance to further ensure the pose accuracy of each individual, where $\hat{J}_{2D}$ denotes the predicted 2D keypoints and $\theta_0$ denotes the initial body pose. A depth ordering loss $L_{depth}$ is proposed, which uses the depth estimation model Depth Anything V2 to recover a reasonable spatial layout and is computed from the 3D root positions and the depth map. A penetration loss $L_{pen}$ is introduced, computed over the set of all colliding triangle pairs detected by Bounding Volume Hierarchies; for each colliding pair of triangles $f_s$ and $f_t$, it uses the vertices on the two triangles, the surface normal vectors at those vertices, and the value of the local 3D distance field of the human body from each vertex to the other triangle; the penetration loss further penalizes physically unreasonable overlap. The total loss function therefore combines the contact loss $L_{contact}$, the reprojection loss $L_{reproj}$, the pose regularization $L_{reg}$, the depth ordering loss $L_{depth}$, and the penetration loss $L_{pen}$.
S42. Using conditional sampling techniques to guide pose optimization, the trained vector field is modified by a guidance term consisting of a gradient weighted by a scaling factor $\lambda$, where $y$ denotes the observed variable and the gradient term is calculated from the observation losses. The guided vector field updates the pose parameters iteratively through numerical integration, generating a physically more plausible new pose and new mesh; the new mesh is then fed back into the contact estimation branch to start the next iteration. This process is repeated until the pose and contact predictions stabilize, forming a closed loop of "estimation-constraint-optimization-re-estimation", and finally outputting a high-quality interaction reconstruction result that is geometrically and semantically consistent.
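The closed loop of S42 can be sketched on a toy problem. Everything in this example is an assumption made for illustration: the guided field is taken to be the trained field minus the scaled loss gradient, the integrator is forward Euler, each round restarts integration from the current pose, and the "contact estimator" and loss are one-dimensional stand-ins rather than the modules described in the claims.

```python
def guided_field(base_field, loss_grad, scale):
    """Assumed guidance form: v'(theta, t) = v(theta, t) - scale * grad L(theta)."""
    def field(theta, t):
        velocity = base_field(theta, t)
        gradient = loss_grad(theta)
        return [v - scale * g for v, g in zip(velocity, gradient)]
    return field


def integrate(theta0, field, num_steps=20):
    """Forward-Euler integration over t in [0, 1]."""
    theta = list(theta0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        velocity = field(theta, step * dt)
        theta = [x + dt * v for x, v in zip(theta, velocity)]
    return theta


def refine(theta0, base_field, estimate_contacts, make_loss_grad,
           scale=2.0, max_rounds=10, tol=1e-4):
    """Estimation -> constraint -> optimization -> re-estimation loop."""
    theta = list(theta0)
    for _ in range(max_rounds):
        contacts = estimate_contacts(theta)       # (re-)estimation
        loss_grad = make_loss_grad(contacts)      # constraint from contacts
        new_theta = integrate(theta, guided_field(base_field, loss_grad, scale))
        change = max(abs(a - b) for a, b in zip(new_theta, theta))
        theta = new_theta
        if change < tol:                          # pose has stabilised
            break
    return theta


# Toy stand-ins: the "contact estimate" is a fixed target value y, and the
# loss is 0.5 * (theta - y)^2, whose gradient is (theta - y).
def toy_estimate_contacts(theta):
    return [2.0]


def toy_make_loss_grad(target):
    return lambda theta: [x - y for x, y in zip(theta, target)]


def zero_field(theta, t):
    return [0.0 for _ in theta]


result = refine([0.0], zero_field, toy_estimate_contacts, toy_make_loss_grad)
```

In this toy setting the base field is zero, so all motion comes from the guidance term and the pose converges towards the value implied by the contact estimate; with a trained field, the scaling factor balances the learned prior against the observation losses.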