A method for generating 4D human-object interaction with unseen objects based on text description
By employing a two-stage approach combining an object position anchoring network and a contact perception diffusion model, a natural and realistic 4D human-object interaction sequence is generated, solving the problem of insufficient unknown object generation capability in existing technologies and achieving robust generalization for diverse objects.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
- Filing Date
- 2025-04-30
- Publication Date
- 2026-06-23
Figure CN120491813B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision, and specifically to a text-based method for generating 4D human-object interaction with unseen objects.
Background Technology
[0002] Human-environment interaction generation: Current research on human-environment interaction synthesis can be divided into two main directions: static object interaction and dynamic object interaction.
[0003] Static object interaction: Based on regression models, diffusion models, and reinforcement learning, existing technologies can generate static-scene actions such as sitting, lying down, and navigating in confined spaces, but they struggle to handle dynamic environments in which objects move, deform, or change state (such as pushing open doors or rearranging furniture). Although recent research has begun to integrate dynamic interaction, limitations remain in multi-step interaction, object state transitions, and real-time adaptability, restricting generalization to complex scenes.
[0004] Dynamic object interaction: Early research predicted interactions through historical motion; however, high-quality 4D human-object interaction synthesis remains challenging due to the scarcity of 4D datasets and physical constraints. Recent studies have integrated physical and kinematic methods to improve the realism of full-body dynamic interactions, but the diversity of movements and the range of object interactions are still limited.
[0005] Zero-shot interaction generation: A major challenge in human-object interaction (HOI) synthesis lies in the scarcity of labeled datasets. While existing datasets provide a foundation, they are far smaller than general text-to-motion datasets.
[0006] In summary, the main drawback of existing technologies is as follows: because existing 4D human-object interaction datasets cover only a narrow range of object categories and interaction modes, supervised methods trained on these datasets generalize poorly to unknown objects and cannot generate natural and realistic 4D human-object interaction sequences for such objects.
Summary of the Invention
[0007] In view of this, the present invention provides a text-based method for generating 4D human-object interaction with unseen objects, so as to at least solve the above-mentioned technical problems.
[0008] According to a first aspect of the present invention, a text-based method for generating 4D human-object interaction with unseen objects is provided, comprising: a first stage, 3D human-object interaction keyframe recovery: obtaining a human motion sequence through a human motion model and uniformly downsampling the human motion sequence to extract keyframes of the human motion sequence; for each keyframe, reconstructing a human mesh through an SMPL-X model and extracting the vertex positions of the human mesh to form a human point cloud; predicting the object position through an object position anchoring network that takes the human point cloud, an object template point cloud and a text prompt as input, and generating sparse 3D human-object interaction keyframes; a second stage, 4D human-object interaction sequence generation: constructing a contact perception diffusion model, using the sparse 3D human-object interaction keyframes as input, and extracting conditional signals containing human posture and contact information from the sparse 3D human-object interaction keyframes through the contact perception encoder of the contact perception diffusion model; based on the conditional signals, performing temporal interpolation on the sparse 3D human-object interaction keyframes through the contact perception diffusion model to generate a temporally coherent dense 4D human-object interaction sequence.
[0009] Optionally, the object position anchoring network recovers the object position by inferring the spatial relationship between the human body and the object template, and is trained on a hybrid dataset including the GRAB and BEHAVE datasets.
[0010] Optionally, the contact perception encoder adopts the PointNet++ architecture to encode the 3D human-object interaction keyframes and extract contact perception features as the conditional signal.
[0011] Optionally, the contact perception diffusion model further includes a contact perception human-object interaction attention module, which dynamically aligns the contact perception features with the latent variables of the contact perception diffusion model through a cross-attention mechanism to ensure the accurate integration of fine-grained space and contact information.
[0012] Optionally, the contact perception diffusion model is pre-trained on the OMOMO dataset to learn basic action paradigms and object type priors, thereby obtaining robust human-object interaction spatial and temporal priors.
[0013] Optionally, the human motion model is an MDM model, and the keyframes are selected at uniform time intervals, wherein the number of keyframes is chosen to balance computational efficiency and motion fidelity.
[0014] According to a second aspect of the present invention, a text-based system for generating 4D human-object interaction with unseen objects is provided, comprising: a keyframe recovery module, configured to: acquire a human motion sequence through a human motion model, and uniformly downsample the human motion sequence to extract keyframes of the human motion sequence; for each keyframe, reconstruct a human mesh through an SMPL-X model and extract the vertex positions of the human mesh to form a human point cloud; and predict the object position through an object position anchoring network that takes the human point cloud, an object template point cloud and a text prompt as input, and generate sparse 3D human-object interaction keyframes; and a sequence generation module, configured to: construct a contact perception diffusion model, taking the sparse 3D human-object interaction keyframes as input, and extract conditional signals containing human posture and contact information from the sparse 3D human-object interaction keyframes through a contact perception encoder of the contact perception diffusion model; and based on the conditional signals, perform temporal interpolation on the sparse 3D human-object interaction keyframes through the contact perception diffusion model to generate a temporally coherent dense 4D human-object interaction sequence.
[0015] According to a third aspect of the present invention, an electronic device is provided, including a processor and a memory storing a program. The program includes instructions that, when executed by the processor, cause the processor to perform the steps performed by the method of the first aspect described above.
[0016] According to a fourth aspect of the present invention, a computer storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the method of the first aspect described above.
[0017] In summary, this invention proposes a novel, general framework for 4D human-object interaction synthesis. By decoupling spatial and temporal modeling, it achieves natural and realistic 4D human-object interaction synthesis for unseen objects, effectively reducing reliance on large-scale 4D human-object interaction datasets. In the temporal modeling stage, the proposed contact-aware diffusion model explicitly utilizes prior interaction knowledge during 4D sequence generation, ensuring robust generalization ability for diverse object geometries.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings.
[0019] Figure 1 is a flowchart of the steps of the text-based method for generating 4D human-object interaction with unseen objects according to the present invention.
[0020] Figure 2 is the architecture diagram of the text-based method for generating 4D human-object interaction with unseen objects, corresponding to Figure 1.
[0021] Figure 3 shows results of the generated 4D human-object interaction content.
[0022] Figure 4 shows generation results of the object position anchoring network.
Detailed Implementation
[0023] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0024] Referring to Figure 1 and Figure 2, the present invention provides a text-based method for generating 4D human-object interaction with unseen objects, comprising:
[0025] Phase 1, 3D Human-Object Interaction Keyframe Reconstruction:
[0026] S11. Obtain the human motion sequence through the human motion model, and perform uniform downsampling on the human motion sequence to extract the key frames of the human motion sequence.
[0027] S12. For each keyframe, reconstruct the human body mesh using the SMPL-X model and extract the vertex positions of the human body mesh to form a human body point cloud.
[0028] S13. The object position anchoring network takes human point cloud, object template point cloud and text prompt as input, predicts the object position and generates sparse 3D human-object interaction keyframes.
[0029] Phase Two, Generation of 4D Human-Object Interaction Sequences:
[0030] S21. Construct a contact perception diffusion model, using sparse 3D human-object interaction keyframes as input, and extract conditional signals containing human posture and contact information from the sparse 3D human-object interaction keyframes through the contact perception encoder of the contact perception diffusion model.
[0031] S22. Based on the conditional signal, the sparse 3D human-object interaction keyframes are temporally interpolated using a contact perception diffusion model to generate a temporally coherent dense 4D human-object interaction sequence.
[0032] Optionally, the object position anchoring network recovers the object position by inferring the spatial relationship between the human body and the object template, and is trained on a hybrid dataset including the GRAB and BEHAVE datasets.
[0033] Optionally, the contact perception encoder adopts the PointNet++ architecture to encode the 3D human-object interaction keyframes and extract contact perception features as the conditional signal.
[0034] Optionally, the contact perception diffusion model further includes a contact perception human-object interaction attention module, which dynamically aligns the contact perception features with the latent variables of the contact perception diffusion model through a cross-attention mechanism to ensure the accurate integration of fine-grained space and contact information.
[0035] Optionally, the contact perception diffusion model is pre-trained on the OMOMO dataset to learn basic action paradigms and object type priors, thereby obtaining robust human-object interaction spatial and temporal priors.
[0036] Optionally, the human motion model is an MDM model, and the keyframes are selected at uniform time intervals, wherein the number of keyframes is chosen to balance computational efficiency and motion fidelity.
[0037] In summary, this invention proposes a novel 4D human-object generation framework that utilizes a spatiotemporally decoupled two-stage modeling method to achieve natural and realistic 4D human-object interaction sequences for unknown objects. Specifically, in the first stage, 3D human-object interaction keyframes are reconstructed. For this purpose, an object position anchoring network was developed, which only requires human point clouds and object geometric templates to reconstruct 3D interaction keyframes, reducing dependence on 4D datasets. In the second stage, 4D human-object interaction sequences are generated. For this purpose, a contact-aware diffusion model was designed. A contact-aware encoder extracts contact condition signals from keyframes, achieving interpolation generation from sparse keyframes to dense temporal sequences.
[0038] Specifically, the solution of the present invention is further described with reference to the following examples:
[0039] To bridge the gap between datasets and real-world human-object interaction scenarios, this invention proposes a novel 4D human-object interaction sequence generation framework for unknown objects. This framework decomposes 4D human-object interaction sequence generation into two operable tasks:
[0040] (1) Reconstruct 3D human-object interaction keyframes for unknown objects;
[0041] (2) Interpolate sparse 3D human-object interaction keyframes into a temporally coherent dense 4D human-object interaction sequence.
[0042] For these two sub-tasks, this invention develops a two-stage processing flow:
[0043] The first stage learns the human-object interaction pattern through an object position anchoring network. It only requires input of human point cloud and object geometric template to reconstruct 3D human-object interaction keyframes. The network is trained based on a 3D human-object interaction dataset, avoiding the need for large-scale 4D human-object interaction data.
[0044] The second stage employs a contact-aware diffusion model, using a contact-aware encoder to extract conditional signals containing human posture and contact information from keyframes, achieving temporal interpolation from keyframes to 4D sequences. Through a spatiotemporal decoupling modeling strategy, this invention significantly reduces its dependence on 4D human-object interaction datasets, enabling the generation of 4D sequences of human-unknown object interactions.
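For intuition about what "temporal interpolation from keyframes to 4D sequences" has to produce, the following is a naive, non-learned baseline: per-coordinate linear interpolation of sparse keyframe positions onto a dense frame grid. This is not the contact-aware diffusion model described in this invention; it is only a reference point that the learned model is meant to improve on. The function name and the array layout are assumptions made for illustration, and the sketch applies to positions only (rotations should not be interpolated linearly).

```python
import numpy as np

def linear_interpolate_keyframes(key_values, key_frames, num_frames):
    """Linearly interpolate sparse keyframe values onto a dense frame grid.

    key_values: (K, D) array, one D-dimensional position vector per keyframe.
    key_frames: (K,) strictly increasing frame indices of the keyframes.
    num_frames: length N of the dense output sequence.
    Returns an (N, D) array.
    """
    key_values = np.asarray(key_values, dtype=float)
    key_frames = np.asarray(key_frames, dtype=float)
    t = np.arange(num_frames, dtype=float)
    # np.interp works on 1-D data, so interpolate each coordinate separately.
    columns = [np.interp(t, key_frames, key_values[:, d])
               for d in range(key_values.shape[1])]
    return np.stack(columns, axis=1)

# Hypothetical example: three object-centroid keyframes at frames 0, 4 and 8.
key_frames = np.array([0, 4, 8])
key_values = np.array([[0.0, 0.0, 0.0],
                       [4.0, 0.0, 2.0],
                       [8.0, 8.0, 2.0]])
dense = linear_interpolate_keyframes(key_values, key_frames, num_frames=9)
```

A baseline of this kind passes exactly through the keyframes but knows nothing about contact or plausible motion between them, which is the gap the second stage is designed to close.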
[0045] First, given a text and an object geometry template, the goal is to generate a natural and realistic sequence of human-object interactions that conforms to the text description. The model architecture of the method of this invention is shown in Figure 2. In the first stage, this invention utilizes prior knowledge of object geometry and human pose to reconstruct keyframes for human-object interaction. In the second stage, the contact-aware diffusion model uses the human-object interaction keyframes and encoded contact codes to generate a 4D human-object interaction sequence. After training, this invention can generalize to unseen objects based on their geometry and relevant textual cues.
[0046] The goal of this invention is to synthesize 4D human-object interaction sequences based on text descriptions and unseen objects. This faces two main challenges:
[0047] (1) Generalize to unseen objects while maintaining spatial accuracy;
[0048] (2) Ensure temporal continuity and realistic contact dynamics.
[0049] To address this, this invention proposes a two-stage process: 3D human-object interaction keyframe recovery, followed by 4D interpolation that maintains temporal coherence. In the first stage, this invention proposes an object position anchoring network to learn human-object interaction patterns, thereby enabling the recovery of 3D human-object interaction keyframes. In the second stage, this invention proposes a contact-aware diffusion model to interpolate sparse 3D human-object interaction keyframes into a temporally coherent 4D human-object interaction sequence, and uses a contact-aware encoder to encode the 3D human-object interaction keyframes into conditional signals.
[0050] This invention represents human motion as $x_h \in \mathbb{R}^{N \times D}$, where N is the number of frames and D is the dimension of the human pose. In each frame n, the human pose $x_h^n$ includes global joint positions and local 6D continuous rotations. The human body mesh is reconstructed from pose and shape parameters using the SMPL-X model. Object motion is represented by its global 3D position (centroid) and rotation. Specifically, human and object motion is defined as follows:
[0051] $x_h = [j, q], \quad x_o = [o, r]$
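The text states only that the pose uses "local 6D continuous rotations" and does not spell out the conversion. A minimal sketch of the commonly used mapping from a 6D vector to a rotation matrix via Gram-Schmidt orthogonalization is given below. The function name, and the convention that the six numbers are the first two columns of the matrix, are assumptions rather than details taken from this document.

```python
import numpy as np

def rot6d_to_matrix(r6):
    """Convert 6D rotation vectors of shape (..., 6) to matrices (..., 3, 3).

    Assumed convention: r6[..., :3] and r6[..., 3:] are (unnormalized)
    versions of the first and second columns of the rotation matrix.
    """
    r6 = np.asarray(r6, dtype=float)
    a1, a2 = r6[..., :3], r6[..., 3:]
    # First basis vector: normalize a1.
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Second basis vector: remove the component of a2 along b1, then normalize.
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=-1, keepdims=True)
    # Third basis vector: the cross product completes a right-handed frame.
    b3 = np.cross(b1, b2)
    # Stack so that b1, b2, b3 become the columns of each matrix.
    return np.stack([b1, b2, b3], axis=-1)

R = rot6d_to_matrix(np.array([[1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
                              [1.0, 2.0, 3.0, -1.0, 0.5, 2.0]]))
```

Because any pair of non-parallel 3D vectors maps to a valid rotation, a network can regress the six numbers freely without an explicit orthogonality constraint.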
[0052] Phase 1, 3D human-object interaction keyframe recovery, includes two parts: human body keyframe sampling and object position anchoring network design.
[0053] Starting with the text description p, human motion $x_h$ is obtained using the existing Human Motion Diffusion Model (MDM). First, keyframes are extracted by uniformly downsampling the input human motion sequence $x_h$.
[0054] Specifically, given a motion sequence containing N frames, K=5 keyframes are selected at uniform time intervals to preserve key motion dynamics while minimizing redundancy. The choice of K balances computational efficiency and motion fidelity, which has been verified in experiments. For each keyframe, a human body mesh is reconstructed using the SMPL-X model (SMPL eXpressive), and vertex positions $V \in \mathbb{R}^{K \times M \times 3}$ are extracted, where M represents the number of mesh vertices. These vertices are treated as a human point cloud. By operating on sparse keyframes rather than dense sequences, error propagation and computational overhead are reduced while diverse interaction states are still captured.
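The keyframe selection described above can be sketched as follows. The function name is hypothetical, and including both the first and the last frame among the K keyframes is an assumption; the text specifies only that the sampling is uniform in time with K=5.

```python
import numpy as np

def uniform_keyframe_indices(num_frames, num_keyframes=5):
    """Return num_keyframes frame indices spread evenly over [0, num_frames - 1]."""
    if not 1 <= num_keyframes <= num_frames:
        raise ValueError("num_keyframes must be between 1 and num_frames")
    # Evenly spaced positions on the time axis, rounded to valid frame indices.
    positions = np.linspace(0, num_frames - 1, num_keyframes)
    return positions.round().astype(int)

# Hypothetical motion sequence with N=101 frames and an arbitrary pose dimension.
motion = np.zeros((101, 16))
idx = uniform_keyframe_indices(len(motion), num_keyframes=5)
keyframes = motion[idx]  # shape (K, D)
```

A larger K gives the second stage denser anchors at higher cost, which is the trade-off between computational efficiency and motion fidelity mentioned in the text.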
[0055] The object position anchoring network recovers object positions by inferring the spatial relationship between a human body and an object template. It adopts the Object pop-up architecture. The network is trained on a hybrid dataset that combines the existing GRAB and BEHAVE datasets and is augmented with single-frame human-object interaction poses extracted from the existing 3DIR image dataset using the existing method CONTHO (Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer). This multi-source training strategy enhances the network's ability to generalize to unseen object shapes and interaction dynamics. By training on point clouds that capture key topological information, rather than relying on coarse SMPL parameters and object poses, the network effectively captures fine-grained contact dynamics, which is crucial for accurate and realistic 3D object position recovery. For each human keyframe $V_k \in \mathbb{R}^{M \times 3}$, the network takes the human point cloud, the object template point cloud, and the text prompt p as input to predict the position of the object, thereby forming a complete HOI frame.
[0056] Phase Two: 4D Human-Object Interaction Interpolation Based on Contact Perception. Having established 3D human-object interaction keyframe reconstruction, these sparse keyframes are now interpolated into a temporally coherent motion sequence.
[0057] To address this, this invention designs a contact-aware diffusion model (ContactDM). This model generates interaction sequences based on text descriptions, object geometry, and point cloud contact information, ensuring temporal consistency and geometric plausibility. ContactDM follows a noising-and-denoising diffusion framework to generate temporally coherent motion. The complete data representation in this invention's model is as follows:
[0058] $\tau = (x_h, x_o)$,
[0059] which encapsulates human and object motion. This pose-based representation captures the 3D arrangement of key body joints (such as shoulders, elbows, and knees) as coordinates, providing a lightweight yet expressive representation for efficient training and inference. The model conditions its generation process on a set of signals c, including object geometry and textual descriptions.
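The noise-addition half of such a framework can be illustrated with the standard DDPM forward process, in which a clean sample is mixed with Gaussian noise according to a cumulative schedule. The sketch below is written under that assumption: the linear beta schedule, the number of steps, and the function names are illustrative choices and are not specified in this document. The array `tau0` stands in for the clean motion representation of human and object motion.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(tau0, t, alpha_bar, rng):
    """Sample a noisy tau_t from the clean sequence tau0 at diffusion step t.

    Returns the noisy sample and the noise that was added; a denoising
    network is typically trained to recover one of these from tau_t.
    """
    eps = rng.standard_normal(tau0.shape)
    tau_t = np.sqrt(alpha_bar[t]) * tau0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return tau_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
tau0 = rng.standard_normal((60, 12))  # hypothetical (frames, features) motion
tau_t, eps = add_noise(tau0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Generation runs the process in reverse: starting from noise, the model removes noise step by step while being conditioned on the signals c.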
[0060] To further enhance the model's ability to capture fine-grained human-object interactions, this invention introduces a contact-aware encoder and a contact-aware human-object interaction attention mechanism, which are key components of the contact-aware diffusion model. The contact-aware encoder efficiently processes sparse 3D human-object interaction keyframes to extract contact-aware features, while the contact-aware human-object interaction attention module dynamically aligns these features with the latent variables of the diffusion model through a cross-attention mechanism. This ensures the accurate integration of fine-grained spatial and contact information, enabling the model to generate realistic and temporally coherent 4D human-object interaction sequences.
[0061] Contact-aware encoder:
[0062] While pose-based diffusion models are lightweight and computationally efficient, they struggle to capture fine-grained details of 3D human-object contact areas. To address this limitation, this invention proposes a contact-aware encoder to encode HOI point clouds and extract contact-aware features to enrich the representation with accurate spatial and interaction information, crucial for realistic synthesis.
[0063] Specifically, given the 3D human-object interaction keyframes, this invention uses the PointNet++ architecture to encode the 3D human-object interaction point clouds. However, directly encoding the full human and object point clouds with PointNet++ can lead to significant memory overhead.
[0064] To alleviate this problem, this invention employs an efficient sampling strategy. First, the object point cloud is downsampled to $M_o$ points by farthest point sampling, yielding a sampled point cloud that preserves the geometry of the object. Second, in order to accurately infer contact relationships, $M_h$ points are sampled from the human point cloud by nearest-point selection, so as to focus on the body parts closest to the object, denoted as [missing information]. In the experiments, $M_o = 500$ and $M_h = 1000$. To distinguish between human and object vertices, one-hot encoding is introduced:
[0065]
[0066] where the two label vectors are the all-ones vector and the all-zeros vector, assigned to the human body points and the object points, respectively. The final input point cloud is written as:
[0067]
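The sampling strategy above can be sketched in NumPy as follows. Several details are assumptions, because the corresponding formulas are missing from the text: the human points are chosen as those with the smallest distance to the sampled object points, and the human/object label is appended as a single extra channel (1 for human, 0 for object). All function names are hypothetical.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start=0):
    """Greedy farthest point sampling; returns indices into points (P, 3)."""
    if num_samples > len(points):
        raise ValueError("num_samples exceeds the number of points")
    selected = np.empty(num_samples, dtype=int)
    selected[0] = start
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, num_samples):
        selected[i] = int(np.argmax(min_dist))
        dist = np.linalg.norm(points - points[selected[i]], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return selected

def nearest_human_points(human, obj, num_samples):
    """Indices of the human vertices closest to the object point cloud."""
    # (H, O) pairwise distances; memory grows with H * O.
    dist = np.linalg.norm(human[:, None, :] - obj[None, :, :], axis=-1)
    return np.argsort(dist.min(axis=1))[:num_samples]

def build_hoi_point_cloud(human, obj, m_h=1000, m_o=500):
    """Sample both clouds and append a label channel (1=human, 0=object)."""
    obj_s = obj[farthest_point_sampling(obj, m_o)]
    hum_s = human[nearest_human_points(human, obj_s, m_h)]
    hum_l = np.concatenate([hum_s, np.ones((m_h, 1))], axis=1)
    obj_l = np.concatenate([obj_s, np.zeros((m_o, 1))], axis=1)
    return np.concatenate([hum_l, obj_l], axis=0)  # (m_h + m_o, 4)

rng = np.random.default_rng(0)
human = rng.standard_normal((200, 3))      # small stand-in for mesh vertices
obj = rng.standard_normal((100, 3)) + 1.0  # small stand-in for an object
cloud = build_hoi_point_cloud(human, obj, m_h=50, m_o=20)
```

Farthest point sampling keeps the object's overall shape with few points, while the nearest-point selection concentrates the human budget where contact is likely.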
[0068] The obtained point cloud is then processed by a point cloud encoder, which uses a multi-scale grouping strategy to extract hierarchical spatial features:
[0069] $F_i = \mathrm{PointEncoder}(V_k) \in \mathbb{R}^{d}$,
[0070] where $F_i$ denotes the encoded features for each frame, and d is the output feature dimension. The point cloud encoder aggregates features at multiple scales using local neighborhoods, ensuring geometry-aware and contact-aware representation learning.
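To make "multi-scale grouping" concrete, the sketch below gathers, for each centroid, the neighbors inside several radii and max-pools their relative offsets. It is a deliberately simplified illustration: an actual PointNet++ layer applies a shared learned MLP to the grouped points before pooling, which is omitted here, and the function name and radii are assumptions.

```python
import numpy as np

def multi_scale_group(points, centroids, radii):
    """Max-pool relative neighbor offsets around each centroid at several radii.

    points: (P, 3), centroids: (C, 3), radii: iterable of floats.
    Returns (C, 3 * len(radii)); one pooled 3-vector per radius.
    """
    offsets = points[None, :, :] - centroids[:, None, :]  # (C, P, 3)
    dist = np.linalg.norm(offsets, axis=-1)                # (C, P)
    features = []
    for radius in radii:
        inside = dist <= radius
        # Points outside the ball are set to -inf so they never win the max.
        masked = np.where(inside[..., None], offsets, -np.inf)
        pooled = masked.max(axis=1)                        # (C, 3)
        # An empty neighborhood yields -inf; replace it with zeros.
        pooled = np.where(np.isfinite(pooled), pooled, 0.0)
        features.append(pooled)
    return np.concatenate(features, axis=1)

rng = np.random.default_rng(0)
pts = rng.standard_normal((100, 3))
feats = multi_scale_group(pts, centroids=pts[:8], radii=(0.3, 1.0))
```

Small radii capture fine local geometry such as a contact patch, while large radii summarize the wider context around the same centroid.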
[0071] Contact-aware human-object interaction attention mechanism:
[0072] The contact-aware human-object interaction attention module uses a cross-attention mechanism to connect the contact-aware features $F_i$ to the conditional diffusion model. Unlike static concatenation, which appends contact embeddings to the input, cross-attention dynamically aligns $F_i$ with the latent variables of the diffusion model, enabling accurate and efficient feature integration.
[0073] Specifically, $F_i$ is projected to the key K and the value V, while the pose embedding $E_{pose}$ serves as the query Q. This allows the diffusion model to selectively focus on key contact regions during generation, ensuring that fine-grained spatial information guides the synthesis process. The fused feature $F_{fused}$ is calculated as follows:
[0074]
[0075] where Q, K, and V are linear projections of $E_{pose}$, $F_i$, and $F_i$, respectively.
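The fusion formula itself is not reproduced in the text. A minimal sketch, assuming the standard scaled dot-product form of cross-attention, is shown below: queries come from the pose embedding and keys and values come from the contact features. The weight matrices, dimensions, and function names are illustrative and are not taken from this document.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def contact_cross_attention(e_pose, f_contact, w_q, w_k, w_v):
    """Fuse contact features into pose tokens with cross-attention.

    e_pose: (T, d_pose) pose embeddings, used as queries.
    f_contact: (K, d_contact) per-keyframe contact features, used as keys/values.
    Returns the fused features (T, d_v) and the attention weights (T, K).
    """
    q = e_pose @ w_q      # (T, d_k)
    k = f_contact @ w_k   # (K, d_k)
    v = f_contact @ w_v   # (K, d_v)
    # Assumed scaling by sqrt(d_k), as in standard Transformer attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn

rng = np.random.default_rng(0)
e_pose = rng.standard_normal((16, 32))    # 16 pose tokens
f_contact = rng.standard_normal((5, 64))  # 5 keyframe contact features
w_q = rng.standard_normal((32, 24))
w_k = rng.standard_normal((64, 24))
w_v = rng.standard_normal((64, 48))
f_fused, attn = contact_cross_attention(e_pose, f_contact, w_q, w_k, w_v)
```

Each row of the attention matrix shows how strongly one pose token draws on each keyframe's contact features, which is what lets the model weight contact information differently across the sequence.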
[0076] Human-object interaction pre-training:
[0077] To enable the model to learn basic action paradigms (such as lifting, picking up, and putting down) and object type priors, this invention pre-trains a conditional diffusion model and a contact-aware encoder on the OMOMO dataset. The pre-training process utilizes sampled real-world human-object meshes, object geometry, and textual descriptions as conditional inputs. This ensures that the model acquires robust spatial and temporal priors for human-object interactions.
[0078] Furthermore, this invention has been validated on multiple datasets, achieving state-of-the-art performance in the richness and realism of the generated human motion and object trajectories, demonstrating the effectiveness of the invention. Quantitative results for objects present in the training set are shown in Table 1, while quantitative results for objects unseen in the training set are shown in Table 2. Qualitative results are shown in Figure 3. Figure 4 shows the generation results of the object position anchoring network in the first stage. In addition, as a design variation, a stronger multi-view image diffusion model can be used as the base model.
[0079] Table 1. Quantitative results of generation on objects seen in the training set.
[0080]
[0081] Table 2. Quantitative results of generation on objects unseen in the training set.
[0082]
[0083] In summary, this invention proposes a novel, general framework for 4D human-object interaction synthesis. By decoupling spatial and temporal modeling, it achieves natural and realistic 4D human-object interaction synthesis for unseen objects and effectively reduces reliance on large-scale 4D human-object interaction datasets. In the temporal modeling stage, the proposed contact-aware diffusion model explicitly utilizes prior interaction knowledge during 4D sequence generation, ensuring robust generalization ability for diverse object geometries.
[0084] As another example, embodiments of the present invention also provide a text-based system for generating 4D human-object interaction with unseen objects, comprising:
[0085] The keyframe recovery module is used for:
[0086] Human motion sequences are obtained through a human motion model, and the human motion sequences are uniformly downsampled to extract keyframes from the human motion sequences.
[0087] For each keyframe, the human body mesh is reconstructed using the SMPL-X model, and the vertex positions of the human body mesh are extracted to form a human body point cloud;
[0088] The object position anchoring network takes human point cloud, object template point cloud and text prompt as input, predicts object position and generates sparse 3D human-object interaction keyframes.
[0089] The sequence generation module is used for:
[0090] A contact perception diffusion model is constructed, and sparse 3D human-object interaction keyframes are used as input. The contact perception encoder of the contact perception diffusion model extracts conditional signals containing human posture and contact information from the sparse 3D human-object interaction keyframes.
[0091] Based on the conditional signal, a temporally coherent dense 4D human-object interaction sequence is generated by temporally interpolating sparse 3D human-object interaction keyframes using a contact perception diffusion model.
[0092] It should be understood that the text-based system for generating 4D human-object interaction with unseen objects in this embodiment is used to implement the corresponding methods in the aforementioned method embodiments and has the beneficial effects of the corresponding method embodiments.
[0093] In summary, the novel, general framework for 4D human-object interaction synthesis proposed in this invention achieves natural and realistic 4D human-object interaction synthesis for unseen objects by decoupling spatial and temporal modeling, effectively reducing the dependence on large-scale 4D human-object interaction datasets. In the temporal modeling stage, the contact-aware diffusion model proposed in this invention explicitly utilizes prior interaction knowledge during 4D sequence generation, ensuring robust generalization ability for diverse object geometries.
[0094] As another example, the present invention also provides an electronic device, which will now be described as an example of a hardware device that can be applied to various aspects of the present invention, serving as a server or client of the invention. The term "electronic device" is intended to represent various forms of digital electronic computer devices, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the invention described and / or claimed herein.
[0095] The electronic device may include a processor, a communications interface, memory, and a communication bus.
[0096] The processor, communication interface, and memory communicate with each other via a communication bus. The communication interface is used to communicate with other electronic devices or servers.
[0097] The processor is used to execute programs, specifically the relevant steps in the above method embodiments.
[0098] Specifically, the program may include program code, which includes computer operation instructions.
[0099] The processor may be a CPU, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the electronic device may be of the same type, such as one or more CPUs; or they may be of different types, such as one or more CPUs and one or more ASICs.
[0100] The memory is used to store programs. The memory may include high-speed RAM, and may also include non-volatile memory, such as at least one disk drive.
[0101] When executed by a processor, the program enables the electronic device to perform the text-based method for generating 4D human-object interaction with unseen objects.
[0102] Furthermore, the specific implementation of each step in the program can be found in the corresponding descriptions of the steps and units in the above method embodiments, and will not be repeated here. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the devices and modules described above can be referred to the corresponding process descriptions in the foregoing method embodiments, and will not be repeated here.
[0103] An exemplary embodiment of the present invention also provides a computer storage medium storing a computer program, wherein when the computer program is executed by a processor, it implements the methods of the various embodiments of the present invention. The corresponding process descriptions in the foregoing method embodiments can be referred to, and will not be repeated here.
[0104] The methods described above according to embodiments of the present invention can be implemented in hardware, firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code originally stored on a remote recording medium or a non-transitory machine-readable medium and subsequently stored on a local recording medium, downloaded via a network. Thus, the methods described herein can be processed by software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It is understood that the computer, processor, microprocessor controller, or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) capable of storing or receiving software or computer code, which, when accessed and executed by the computer, processor, or hardware, implements the methods described herein. Furthermore, when a general-purpose computer accesses code used to implement the methods shown herein, the execution of the code transforms the general-purpose computer into a dedicated computer for executing the methods shown herein.
[0105] Specific embodiments of the invention have now been described. Other embodiments are within the scope of the appended claims. In some cases, the actions described in the claims can be performed in a different order and still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired result. In some embodiments, multitasking and parallel processing can be advantageous.
[0106] It should be understood that although this specification is described according to various embodiments, not every embodiment contains only one independent technical solution. This way of describing the specification is only for clarity. Those skilled in the art should regard the specification as a whole. The technical solutions in each embodiment can also be appropriately combined to form other implementation methods that can be understood by those skilled in the art.
[0107] Finally, it should be noted that the above embodiments are only used to illustrate the embodiments of the present invention, and are not intended to limit the embodiments of the present invention. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention. Therefore, all equivalent technical solutions also fall within the scope of the embodiments of the present invention, and the patent protection scope of the embodiments of the present invention should be defined by the claims.
Claims
1. A method for generating 4D character interaction with invisible objects based on a text description, characterized in that it comprises:
Phase One, 3D human-object interaction keyframe reconstruction: human motion sequences are obtained through a human motion model, and the human motion sequences are uniformly downsampled to extract keyframes from the human motion sequences; for each keyframe, a human body mesh is reconstructed using the SMPL-X model, and the vertex positions of the human body mesh are extracted to form a human body point cloud; an object position anchoring network takes the human body point cloud, an object geometric template, and a text prompt as input, predicts the object position, and generates sparse 3D human-object interaction keyframes;
Phase Two, generation of 4D human-object interaction sequences: a contact perception diffusion model is constructed, and the sparse 3D human-object interaction keyframes are used as its input; a contact perception encoder of the contact perception diffusion model extracts, from the sparse 3D human-object interaction keyframes, a conditional signal containing human posture and contact information; based on the conditional signal, the contact perception diffusion model temporally interpolates the sparse 3D human-object interaction keyframes to generate a temporally coherent dense 4D human-object interaction sequence.
2. The method according to claim 1, characterized in that the object position anchoring network recovers the object position by inferring the spatial relationship between the human body and the object geometric template, and is trained on a hybrid dataset including the GRAB and BEHAVE datasets.
3. The method according to claim 1, characterized in that the contact perception encoder adopts the PointNet++ architecture to encode the sparse 3D human-object interaction keyframes and extract contact perception features as the conditional signal.
4. The method according to claim 3, characterized in that the contact perception diffusion model further includes a contact perception human-object interaction attention module, which dynamically aligns the contact perception features with the latent variables of the contact perception diffusion model through a cross-attention mechanism, so that fine-grained spatial and contact information is accurately integrated.
5. The method according to claim 4, characterized in that the contact perception diffusion model is pre-trained on the OMOMO dataset to learn basic action paradigms and object type priors, thereby obtaining robust spatial and temporal priors for human-object interaction.
6. The method according to claim 1, characterized in that the human motion model is an MDM model, the keyframes are selected at uniform time intervals when extracting keyframes, and the number of keyframes is chosen to balance computational efficiency and motion fidelity.
7. A 4D character interaction generation system for invisible objects based on a text description, characterized in that it comprises:
a keyframe recovery module, configured such that: human motion sequences are obtained through a human motion model, and the human motion sequences are uniformly downsampled to extract keyframes from the human motion sequences; for each keyframe, a human body mesh is reconstructed using the SMPL-X model, and the vertex positions of the human body mesh are extracted to form a human body point cloud; an object position anchoring network takes the human body point cloud, an object geometric template, and a text prompt as input, predicts the object position, and generates sparse 3D human-object interaction keyframes;
a sequence generation module, configured such that: a contact perception diffusion model is constructed, and the sparse 3D human-object interaction keyframes are used as its input; a contact perception encoder of the contact perception diffusion model extracts, from the sparse 3D human-object interaction keyframes, a conditional signal containing human posture and contact information; based on the conditional signal, the contact perception diffusion model temporally interpolates the sparse 3D human-object interaction keyframes to generate a temporally coherent dense 4D human-object interaction sequence.
8. An electronic device, characterized in that it comprises: a processor; and a memory storing a program, wherein the program includes instructions that, when executed by the processor, cause the processor to perform the steps of the method as described in any one of claims 1-6.
9. A computer storage medium, characterized in that it stores a computer program that, when executed by a processor, implements the method as described in any one of claims 1-6.
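Illustrative sketch (non-limiting, not part of the claims). Two of the operations recited above, the uniform keyframe downsampling of claims 1 and 6 and the cross-attention alignment of claim 4, can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions rather than the implementation of the invention: the function names `uniform_keyframe_indices` and `cross_attention`, the sequence length of 120 frames, the choice of 8 keyframes, the feature dimensions, and the random projection matrices are all hypothetical, and the random arrays merely stand in for the diffusion latents and for the contact perception features that a trained encoder would produce.

```python
import numpy as np


def uniform_keyframe_indices(num_frames: int, num_keyframes: int) -> np.ndarray:
    """Pick `num_keyframes` frame indices spread evenly over the sequence."""
    if not 1 <= num_keyframes <= num_frames:
        raise ValueError("need 1 <= num_keyframes <= num_frames")
    # linspace includes both endpoints, so the first and last frames are kept
    # whenever more than one keyframe is requested.
    return np.round(np.linspace(0, num_frames - 1, num_keyframes)).astype(int)


def cross_attention(latents, cond, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    latents: (T, d_lat) queries, one row per frame of the dense sequence.
    cond:    (K, d_cond) keys/values, one row per sparse keyframe.
    Returns the attended features (T, d) and the attention weights (T, K).
    """
    q = latents @ w_q                      # (T, d)
    k = cond @ w_k                         # (K, d)
    v = cond @ w_v                         # (K, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row-wise maximum for a numerically stable softmax.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Hypothetical sizes, chosen only for illustration.
NUM_FRAMES, NUM_KEYFRAMES = 120, 8
D_LAT, D_COND, D_ATT = 16, 32, 16

rng = np.random.default_rng(0)
idx = uniform_keyframe_indices(NUM_FRAMES, NUM_KEYFRAMES)

# Random stand-ins: per-frame diffusion latents and per-keyframe contact features.
latents = rng.normal(size=(NUM_FRAMES, D_LAT))
contact_features = rng.normal(size=(len(idx), D_COND))

# Random projections stand in for learned query/key/value weights.
w_q = rng.normal(size=(D_LAT, D_ATT))
w_k = rng.normal(size=(D_COND, D_ATT))
w_v = rng.normal(size=(D_COND, D_ATT))

fused, weights = cross_attention(latents, contact_features, w_q, w_k, w_v)
print("keyframe indices:", idx.tolist())
print("fused features shape:", fused.shape)
```

In this sketch every frame of the dense sequence attends over all sparse keyframes, so each row of `weights` is a distribution over keyframes. A practical model would typically use multiple heads, learned projections, and a residual connection, none of which is specified by the claims.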