An action sequence generation method, a model training method, and an electronic device

By combining a text-free training strategy (training without descriptive text) with an open-vocabulary action sequence generation method, built on a point cloud encoder and a conditional variational autoencoder, the application addresses the limited training sample data of existing technologies. It enables high-quality action sequence generation and high-quality 3D designs of human-scene interaction, improving the user experience.

CN122312840A · Pending · Publication Date: 2026-06-30 · HONOR DEVICE CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HONOR DEVICE CO LTD
Filing Date
2024-12-30
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies for generating 3D designs that allow human interaction with scenes suffer from limited training sample datasets, resulting in poor model training performance, insufficient generalization and applicability. Furthermore, the fixed format of descriptive text restricts the flexibility of application scenarios and user input.

Method used

We employ a text-free training strategy and an open-vocabulary action sequence generation method. A point cloud encoder, an MLP model, and an action generator are used to generate action sequences that match the scene. A conditional variational autoencoder network model then supplements the actions to generate a continuous target action sequence, thus avoiding reliance on a fixed descriptive text format.

Benefits of technology

It achieves high-quality, coherent, and natural action sequence generation, improves the model's generalization and applicability, meets diverse user needs, reduces the difficulty of obtaining training samples, and enhances the user experience.



Abstract

This application provides a method for generating action sequences, a model training method, and an electronic device. The electronic device can train a first model using training samples to obtain a target model. The training samples include action sequence samples and scene information. The action sequence samples represent action sequences, eliminating the need for descriptive text corresponding to the action sequence samples to train the first model, thus reducing the difficulty of obtaining training samples, ensuring the training effect of the first model, and improving the generalization ability of the first model. The training samples indicate the interaction process between the human body and the scene. After obtaining the target model, the electronic device can determine a single human action that matches the description based on the descriptive text and its corresponding scene information. Then, the electronic device uses the target model to supplement the single human action with additional actions, obtaining a complete target human action sequence, making the human actions coherent and natural, meeting the user's needs.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and more particularly to an action sequence generation method, a model training method, and an electronic device.

Background Technology

[0002] In the field of 3D creation, such as virtual reality (VR), augmented reality (AR), and animation, it is often necessary to generate design content that involves human interaction with the scene. For example, the sequence of human motions in the 3D design of a person walking to an office desk indicates the human walking process.

[0003] Therefore, there is an urgent need for a method to automatically generate human motion sequences, so that after combining the human motion sequences with the scene, a 3D design that meets the user's needs can be obtained, thereby improving user satisfaction.

Summary of the Invention

[0004] This application provides an action sequence generation method, a model training method, and an electronic device for generating action sequences that meet user needs, thereby improving the user experience.

[0005] To achieve the above objectives, the embodiments of this application adopt the following technical solutions:

[0006] In a first aspect, an action sequence generation method is provided, which is applied to a first device. The first device acquires action interaction information to be processed, which includes first text and scene information of the corresponding first scene. The first text indicates the interaction process between the object and the first scene.

[0007] Then, the first device can generate a single action of the object that matches the first text based on the action interaction information.

[0008] Subsequently, the first device can supplement actions based on the single action and the first scene to generate a target action sequence for the object; the target action sequence is composed of multiple consecutive actions that match the first text.

[0009] In this application, the receipt of action interaction information by the first device indicates a need to generate an action sequence for object-scene interaction. The first device can then use the first text and the scene information of the first scene in which the object is located to generate a single action that matches the first text. Subsequently, the first device can supplement the single action with further actions based on the first scene to obtain a continuous target action sequence. When this target action sequence is applied to the first scene, a 3D design matching the first text can be obtained, meeting the user's needs. Furthermore, since the target action sequence is determined based on the scene information of the first scene, the scene information constrains the generated actions, ensuring that the target action sequence is adapted to the first scene and thereby supporting a high-quality 3D design. Additionally, supplementing actions around a single action to obtain continuous actions keeps the actions continuous and natural, which supports a high-quality target action sequence and improves the user experience.

[0010] Optionally, the format of the first text is not fixed, thereby improving its applicability.

[0011] In one possible design approach, the process of supplementing actions based on a single action and scene information from the first scene to generate the target action sequence of the object may include:

[0012] The first device can use a single action and scene information of the first scene as input parameters for the target model, run the target model, and supplement the single action based on the scene information of the first scene that matches the first text to generate a target action sequence.

[0013] The target model is obtained by training the first model using training samples; the training samples include action sequence samples and scene information of the sample scene.

[0014] Based on this, the first device can generate the aforementioned target action sequence using the target model, achieving rapid generation of the target action sequence. Furthermore, the training samples used to train the target model include action sequence samples and the scene information of the corresponding sample scenes, but do not include descriptive text matching the action sequence samples. Therefore, it is unnecessary to train the first model using descriptive text matching the action sequence samples, which reduces the difficulty of obtaining training samples and enables the acquisition of a large number of training samples. This ensures the training effect of the target model and improves its generalization and applicability.

[0015] In one possible design approach, the target model includes a point cloud encoder, an MLP model, and an action generator. Correspondingly, the process of generating a target action sequence using the target model can be:

[0016] First, the first device can extract features from the scene information of the first scene using a point cloud encoder to obtain the spatial features of the first scene. Then, the first device can use an MLP model to predict the distribution parameters of the implicit space of the first action that matches the spatial features of the first scene; that is, to predict the distribution parameters of the implicit space of the action that matches the first scene.

[0017] Subsequently, the first device, through an action generator, supplements a single action based on the target implicit encoding and the spatial features of the first scene, generating a target action sequence. The target implicit encoding is obtained by sampling the distribution parameters of the implicit space of the first action. The spatial features of the first scene are used to guide (or direct) the direction of action generation.

[0018] Based on this, by using the point cloud encoder, MLP model and action generator in the target model, action supplementation is performed on a single action to generate a target action sequence. Even if the target model does not use the first text, it can still ensure that the target action sequence matches both the first text and the first scene, avoiding unnecessary collisions between the target action sequence and objects in the scene, ensuring the quality of the target action sequence, and thus meeting the user's needs.
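The three-stage flow described in paragraphs [0015] to [0018] can be sketched roughly as follows. This is a minimal illustration only, not the implementation of this application: the functions `encode_point_cloud`, `mlp_prior`, `sample_latent`, and `generate_sequence` are hypothetical toy stand-ins for the point cloud encoder, the MLP model, the sampling of the target implicit encoding, and the action generator, and an "action" is simplified to a 3D root position rather than a full body pose.

```python
import math
import random

def encode_point_cloud(points):
    """Toy stand-in for a point cloud encoder: centroid plus per-axis extent."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    extent = [max(p[i] for p in points) - min(p[i] for p in points) for i in range(3)]
    return centroid + extent  # 6-dimensional "spatial feature" (assumed size)

def mlp_prior(feature, latent_dim=4):
    """Toy stand-in for the MLP: maps scene features to (mean, log-variance)."""
    mean = [math.tanh(sum(feature) * 0.1 * (k + 1)) for k in range(latent_dim)]
    log_var = [-1.0] * latent_dim  # fixed log-variance, purely illustrative
    return mean, log_var

def sample_latent(mean, log_var, rng):
    """Sample z = mean + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]

def generate_sequence(single_action, latent, feature, num_frames=8):
    """Toy stand-in for the action generator.

    Interpolates from the given single action toward the scene centroid;
    the latent code adds a small perturbation that vanishes at the start
    and (up to rounding) at the end of the sequence.
    """
    target = feature[:3]
    frames = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)
        wobble = 0.01 * latent[t % len(latent)] * math.sin(math.pi * alpha)
        frames.append([(1 - alpha) * s + alpha * g + wobble
                       for s, g in zip(single_action, target)])
    return frames

rng = random.Random(0)
scene = [[0.0, 0.0, 0.0], [2.0, 0.0, 1.0], [2.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
feature = encode_point_cloud(scene)          # spatial features of the scene
mean, log_var = mlp_prior(feature)           # distribution parameters
z = sample_latent(mean, log_var, rng)        # target implicit encoding
sequence = generate_sequence([5.0, 5.0, 0.0], z, feature)
```

Note that no text enters this stage: in the sketch, as in the description above, the text only influences the result through the single action that is passed in.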

[0019] In one possible design approach, the process of generating the single action that matches the first text may include:

[0020] First, the first device can use the action interaction information as input parameters for a generative model, run the generative model, render a scene image based on the scene information and interactive object information of the first scene, and then perform 3D reconstruction of the object based on the first text and the scene image to obtain a single action. Here, the interactive object information represents the information of the object interacting with the first scene, and the interactive object information is indicated by the first text or included in the action interaction information.

[0021] Based on this, a generative model is used to generate a scene image that matches the first text by utilizing action interaction information and scene information of the first scene. Then, the object is reconstructed in three dimensions using the first text and the scene image to obtain a single action that matches the first text. The scene image also constrains the generation of the single action, thereby ensuring the adaptability of the single action to the first scene, and thus ensuring to a certain extent the adaptability of other actions supplemented based on the single action to the first scene.

[0022] In one possible design approach, the aforementioned object includes the human body. The process of determining a single action may include:

[0023] The first device renders N scene images from N perspectives based on the interactive object information and the scene information of the first scene, where N is greater than or equal to 1.

[0024] Then, based on the first text, the first device can add a two-dimensional human motion image to each of the N scene images to obtain N initial human images. Next, the first device can extract human information from the N initial human images, perform three-dimensional reconstruction of the human body, and obtain a single motion.

[0025] Based on this, the first device renders interactive objects in the first scene from different perspectives, obtaining N scene images. Then, the first device can add 2D human motions that match the first text to each scene image, thus obtaining initial human images from different perspectives. Next, the first device can use these initial human images from different perspectives to perform 3D reconstruction of the 2D human motions, obtaining individual human actions while preserving the details of the actions. Therefore, after applying the target action sequence to the first scene, the user can see a high-quality human body from different perspectives.
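As an illustration of rendering from N perspectives, the sketch below computes N camera positions around the interactive object. The circular layout, the function name `camera_positions`, and the parameters `radius` and `height` are assumptions made for this example; the application does not specify how the N perspectives are chosen.

```python
import math

def camera_positions(center, radius, height, n_views):
    """Return n_views camera positions evenly spaced on a circle around center."""
    if n_views < 1:
        raise ValueError("n_views must be at least 1")
    positions = []
    for k in range(n_views):
        angle = 2.0 * math.pi * k / n_views
        positions.append((
            center[0] + radius * math.cos(angle),
            center[1] + radius * math.sin(angle),
            center[2] + height,
        ))
    return positions

# Four viewpoints around an object located at the origin (illustrative values).
views = camera_positions(center=(0.0, 0.0, 0.0), radius=3.0, height=1.5, n_views=4)
```

Each position would then be handed to a renderer to produce one of the N scene images; the rendering step itself is outside the scope of this sketch.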

[0026] In a second aspect, this application provides a model training method applied to a second device. The second device can acquire training samples, which include action sequence samples and scene information of the sample scene; each action sequence sample consists of multiple consecutive actions.

[0027] Then, the second device can input the training samples into the first model to train the first model and obtain the predicted action sequence corresponding to the action sequence sample;

[0028] Then, the second device can determine the value of the loss function based on the action sequence samples and their corresponding predicted action sequences.

[0029] Then, the second device can determine whether the value of the loss function is less than a preset loss value.

[0030] If the value of the loss function is less than the preset loss value, the trained first model is used as the target model; wherein, the target model is used to generate a target action sequence that matches the first text, and the first text indicates the interaction action between the object and the first scene.
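The stopping rule in paragraphs [0027] to [0030] can be illustrated with a deliberately small example. The one-parameter model, the mean squared error loss, the learning rate, and the `max_steps` safeguard are all assumptions for illustration; the application does not specify the form of the loss function or the optimizer.

```python
def mse(predicted, target):
    """Mean squared error between two equal-length lists of numbers."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def train_until_threshold(inputs, targets, loss_threshold=1e-4, lr=0.05, max_steps=10_000):
    """Train a toy model (prediction = weight * input) until loss < threshold."""
    weight = 0.0
    loss = mse([weight * x for x in inputs], targets)
    steps = 0
    while loss >= loss_threshold and steps < max_steps:
        # Gradient of the mean squared error with respect to the weight.
        grad = sum(2 * (weight * x - t) * x for x, t in zip(inputs, targets)) / len(inputs)
        weight -= lr * grad
        loss = mse([weight * x for x in inputs], targets)
        steps += 1
    return weight, loss, steps

# The trained model is accepted once the loss falls below the preset value.
weight, loss, steps = train_until_threshold([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In practice a cap such as `max_steps` is a common safeguard, since a fixed loss threshold alone does not guarantee that training terminates.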

[0031] In this application, the training samples used to train the first model do not include descriptive text that matches the action sequence samples, which reduces the difficulty of obtaining training samples and thus enables the acquisition of a large number of training samples, thereby ensuring that the training effect of the target model is good and improving the generalization and applicability of the target model.

[0032] In one possible design approach, the first model described above includes a point cloud encoder, an encoder, and an action generator. Accordingly, the process of determining the predicted action sequence corresponding to the action sequence sample may include:

[0033] The point cloud encoder is used to extract features from the scene information of the sample scene to obtain the spatial features of the sample scene.

[0034] The encoder determines the distribution parameters of the implicit space of the action sequence sample based on the action sequence sample and the mask vector corresponding to the action sequence sample; wherein, the mask vector is the vector corresponding to the masked actions in the action sequence sample; and the distribution parameters of the implicit space of the action sequence sample represent the range of vectors corresponding to actions similar to the action sequence sample.

[0035] The action generator, based on implicit encoding and scene information of the sample scene, restores the masked actions in the masked action sequence corresponding to the action sequence sample to generate the predicted action sequence; wherein, the implicit encoding is obtained by sampling the distribution parameters of the action implicit space corresponding to the action sequence sample.

[0036] The aforementioned implicit space of actions represents action sequences similar to the action sequence sample, wherein some or all of the actions in the similar action sequence have a similarity to the corresponding actions in the action sequence sample that is greater than a certain threshold.
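The masking step in paragraphs [0034] and [0035] can be sketched as follows. This is an illustrative assumption about how a mask vector might be built and used: the random selection of masked frames, the zero fill value, and the helper names `make_mask`, `apply_mask`, and `masked_error` are not taken from the application.

```python
import random

def make_mask(num_frames, mask_ratio, rng):
    """Return a 0/1 mask vector; 1 marks a frame hidden from the model."""
    num_masked = max(1, int(round(num_frames * mask_ratio)))
    masked = set(rng.sample(range(num_frames), num_masked))
    return [1 if t in masked else 0 for t in range(num_frames)]

def apply_mask(sequence, mask, fill=0.0):
    """Replace masked frames with a fill value, leaving other frames intact."""
    return [[fill] * len(frame) if m else list(frame)
            for frame, m in zip(sequence, mask)]

def masked_error(predicted, target, mask):
    """Mean squared error computed over the masked frames only."""
    total, count = 0.0, 0
    for p, t, m in zip(predicted, target, mask):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p, t))
            count += len(t)
    return total / count

rng = random.Random(0)
sample = [[float(t), float(t) + 1.0] for t in range(10)]  # 10 frames, 2 values each
mask = make_mask(len(sample), mask_ratio=0.5, rng=rng)
masked_sample = apply_mask(sample, mask)
```

A generator that restores the masked frames perfectly would give `masked_error(prediction, sample, mask) == 0.0`; the unrestored `masked_sample` gives a positive error.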

[0037] In one possible design approach, the first model also includes a multilayer perceptron (MLP) model. The process of determining the distribution parameters of the implicit action space described above can include:

[0038] The second device can determine the distribution parameters of the sample scene based on the spatial characteristics of the sample scene using an MLP model;

[0039] Then, using the encoder, based on the action sequence samples and the mask vectors corresponding to the action sequence samples, and combined with the distribution parameters of the sample scene, the distribution parameters of the action implicit space are determined; wherein, the distribution parameters of the sample scene serve as constraints for determining the distribution parameters of the action implicit space.

[0040] Based on this, the distribution parameters of the sample scene are used to constrain the determination of the distribution parameters of the action implicit space, thereby ensuring the fit between the action predicted by the trained target model and the scene, and reducing the sense of disconnect between the object's action and the scene.
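One common way to make scene-derived distribution parameters act as a constraint in a conditional variational autoencoder is to penalize the Kullback-Leibler divergence between the encoder's distribution and the scene-conditioned distribution. The sketch below assumes diagonal Gaussian distributions and this KL-based penalty; the application does not state which form of constraint is used, so this is an illustration of the general technique rather than the claimed method.

```python
import math

def kl_diag_gaussians(mean_q, log_var_q, mean_p, log_var_p):
    """KL(q || p) for two diagonal Gaussians given means and log-variances."""
    total = 0.0
    for mq, lq, mp, lp in zip(mean_q, log_var_q, mean_p, log_var_p):
        total += lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0
    return 0.5 * total

# q: encoder output for an action sequence sample (hypothetical values).
# p: MLP output for the sample scene (hypothetical values).
kl_same = kl_diag_gaussians([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
```

Adding such a term to the reconstruction loss pulls the action distribution toward what the scene allows, which is also what makes it reasonable to sample from the scene-conditioned distribution alone at inference time.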

[0041] The second device mentioned above may be the same device as the first device mentioned above, or it may be a different device.

[0042] In a third aspect, this application provides a chip, which includes a communication interface and at least one processor:

[0043] A communication interface used for inputting and / or outputting signaling or data.

[0044] At least one processor is configured to execute a computer program that implements the action sequence generation method as described in any of the first aspects above.

[0045] In a fourth aspect, this application provides a chip, which includes a communication interface and at least one processor:

[0046] A communication interface used for inputting and / or outputting signaling or data.

[0047] At least one processor is used to execute a computer program to implement a model training method as described in any of the second aspects above.

[0048] In a fifth aspect, this application provides an electronic device including a memory and one or more processors. A display screen, the memory, and the processors are coupled. The memory stores computer program code, including computer instructions. When the processor executes the computer instructions, it causes the electronic device to perform the action sequence generation method as described in any of the first aspects above.

[0049] In a sixth aspect, this application provides an electronic device including a memory and one or more processors. A display screen, the memory, and the processors are coupled. The memory stores computer program code, including computer instructions. When the processor executes the computer instructions, it causes the electronic device to perform a model training method as described in any of the second aspects above.

[0050] In a seventh aspect, this application provides a computer-readable storage medium including computer instructions that, when executed on an electronic device, cause the electronic device to perform an action sequence generation method as described in any of the first aspects above.

[0051] In an eighth aspect, this application provides a computer-readable storage medium including computer instructions that, when executed on an electronic device, cause the electronic device to perform a model training method as described in any of the second aspects above.

[0052] In a ninth aspect, this application provides a computer program product, including a computer program that, when executed by a processor, implements the action sequence generation method as described in any of the first aspects above.

[0053] In a tenth aspect, this application provides a computer program product, including a computer program that, when executed by a processor, implements a model training method as described in any of the second aspects above.

[0054] It is understood that, for the beneficial effects achievable by the model training method described in the second aspect, the chip described in the third and fourth aspects, the electronic device described in the fifth and sixth aspects, the computer-readable storage medium described in the seventh and eighth aspects, and the computer program product described in the ninth and tenth aspects, reference may be made to the beneficial effects of the first aspect and any possible implementation thereof, which are not repeated here.

Attached Figure Description

[0055] Figure 1 is a first schematic diagram of human-scene interaction provided in an embodiment of this application;

[0056] Figure 2A is a first schematic flowchart of an action sequence generation method provided in an embodiment of this application;

[0057] Figure 2B is a second schematic flowchart of an action sequence generation method provided in an embodiment of this application;

[0058] Figure 3 is a first schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application;

[0059] Figure 4 is a third schematic flowchart of an action sequence generation method provided in an embodiment of this application;

[0060] Figure 5 is a fourth schematic flowchart of an action sequence generation method provided in an embodiment of this application;

[0061] Figure 6 is a schematic flowchart of a model training method provided in an embodiment of this application;

[0062] Figure 7A is a first schematic diagram of a scenario provided in an embodiment of this application;

[0063] Figure 7B is a second schematic diagram of a scenario provided in an embodiment of this application;

[0064] Figure 8A is a second schematic diagram of human-scene interaction provided in an embodiment of this application;

[0065] Figure 8B is a third schematic diagram of human-scene interaction provided in an embodiment of this application;

[0066] Figure 9 is a fourth schematic diagram of human-scene interaction provided in an embodiment of this application;

[0067] Figure 10 is a second schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application;

[0068] Figure 11 is a schematic diagram of a chip system provided in an embodiment of this application.

Detailed Implementation

[0069] To facilitate a clear description of the technical solutions in the embodiments of this application, the terms "exemplary" or "for example" are used in the embodiments of this application to indicate examples, illustrations, or explanations. Any embodiment or design scheme described as "exemplary" or "for example" in this application should not be construed as being more preferred or advantageous than other embodiments or design schemes. Specifically, the use of terms such as "exemplary" or "for example" is intended to present related concepts in a specific manner. In the embodiments of this application, "at least one" refers to one or more, and "more" refers to two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c can represent: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple. In the embodiments of this application, "first," "second," "1," and "2" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Therefore, features defined with "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this embodiment, unless otherwise stated, "multiple" means two or more.

[0070] In 3D creation fields such as virtual reality, augmented reality, and animation, there is a frequent need to generate content involving human interaction with a scene. Electronic devices can generate a sequence of human actions that matches scene information and descriptive text (or simply text). This sequence of human actions consists of continuous human movements, which can be considered as the actions of a human interacting with the scene when the human is within it. The descriptive text describes the human actions to be generated; specifically, it describes each action of the human in the scene. In other words, the descriptive text indicates how the human interacts with the scene.

[0071] For example, if the descriptive text is "a person walks to a desk," the sequence of actions matching this descriptive text could include the action of human body 11 shown in Figure 1(a) and the action of human body 11 shown in Figure 1(b). Together, the actions of human body 11 shown in Figure 1(a) and Figure 1(b) present the process of human body 11 moving towards desk 10, thus yielding a human action sequence that matches the descriptive text. It should be understood that desk 10 is part of the scene in which human body 11 is located; both desk 10 and human body 11 are three-dimensional.

[0072] In some embodiments, as shown in Figure 2A, the electronic device inputs training sample pairs and the scene information of the corresponding sample scenes into a model to train the model and obtain the target model. The training sample pairs include action sequence samples and their corresponding descriptive text samples. Action sequence samples include continuous actions. The descriptive text samples describe each interaction action of the human body with the sample scene in the action sequence samples; simply put, the descriptive text samples indicate the interaction process between the human body and the sample scene.

[0073] The aforementioned target model can generate a sequence of human actions that matches the user-input descriptive text and scene information. The scene information acts as a constraint and guide in the generation of the human action sequence, enabling the human to interact with the scene through the generated action sequence within that scene.

[0074] However, because each action in a scene requires a corresponding description, the available action sequences and their corresponding descriptive texts are limited. In other words, the size of the open-source training sample dataset is finite, which restricts model training and results in poor training performance. This significantly limits the generalization ability of the trained target model, potentially preventing it from generating the human action sequences required by the user. For example, the types of action sequences generated by the target model are limited. If the training sample pairs only include walking action sequences and their corresponding descriptive texts, the target model cannot effectively generate running action sequences, because it has never been given running action sequences and their corresponding descriptive texts.

[0075] Furthermore, the limited number of available training samples restricts the types of scenes covered in training. For example, without training sample pairs that involve interaction with office scenes, the target model cannot effectively generate action sequences for human interaction with office environments.

[0076] Furthermore, the descriptive text samples in the training samples are generally in the format of multiple action combinations, meaning they need to conform to certain rules, such as verb + noun + verb... + noun + verb. This fixed format restricts the descriptive text. Correspondingly, in the stage of inferring action sequences using the target model, the user-inputted descriptive text also needs to conform to certain rules, making it inconvenient for users. Moreover, the descriptive text usually describes the scene to some extent; due to its fixed format, the descriptive method also limits the application scenarios.

[0077] It is understandable that the input parameters of the target model mentioned above can include not only scene information and descriptive text, but also the initial position of the human body within the scene. Correspondingly, during the training phase, the model's input parameters can also include the initial position of the human body within the scene.

[0078] Therefore, to address the aforementioned problems, the following considerations apply. Because the human body is a highly nonlinear, articulated structure, motion generation is performed based on forward / inverse dynamics so that the generated human motion conforms to physical and biological constraints and is sufficiently natural and realistic. Furthermore, motion generation requires text constraints to ensure a high degree of fit between the generated motion and the text. Additionally, the generated human motion needs to interact with the scene. Accordingly, this application provides a scheme for generating human motion sequences based on a text-free training strategy (training without descriptive text), open vocabulary, and the scene. The implementation of this scheme can be divided into two parts. As shown in Figure 2B, the first part involves an electronic device generating a single human action that matches the descriptive text input by the user, based on scene information and a generative model. This generative model requires no pre-training and can be used directly for inference. Furthermore, the generative model can also be trained during inference tasks, allowing for model adjustment and optimization.

[0079] The second part involves the electronic device using a target conditional variational autoencoder (CVAE) network model to supplement the single human action, resulting in a continuous sequence of target human actions that matches the descriptive text. The CVAE network model is trained using action sequence samples and the scene information of the sample scene, rather than using training samples that include both action sequence samples and their corresponding descriptive text samples. Action sequence samples alone are easier to obtain, allowing for a larger pool of samples to ensure effective training and improve the generalization and applicability of the CVAE network model. Furthermore, the descriptive text in this approach can use an open vocabulary, without restrictions on format, making it convenient for users and further enhancing the generalization of the solution.

[0080] For example, the electronic device in the embodiments of this application may be a mobile phone, tablet computer, desktop computer, laptop computer, handheld computer, notebook computer, ultra-mobile personal computer (UMPC), netbook, or another electronic device with computing capabilities, such as a personal digital assistant (PDA) or an augmented reality (AR) / virtual reality (VR) device. The embodiments of this application do not impose special restrictions on the specific form of the electronic device.

[0081] Figure 3 shows a schematic diagram of the structure of the electronic device 100.

[0082] Electronic device 100 may include processor 110, external memory interface 120, internal memory 121, universal serial bus (USB) interface 130, charging management module 140, power management module 141, battery 142, antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, headphone jack 170D, sensor module 180, button 190, motor 191, indicator 192, camera 193, display screen 194, and subscriber identification module (SIM) card interface 195, etc.

[0083] It is understood that the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the electronic device 100. In other embodiments of this application, the electronic device 100 may include more or fewer components than illustrated, or combine some components, or split some components, or have different component arrangements. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

[0084] Processor 110 may include one or more processing units, such as: application processor (AP), modem processor, graphics processing unit (GPU), image signal processor (ISP), controller, memory, video codec, digital signal processor (DSP), baseband processor, and / or neural network processing unit (NPU), etc. Different processing units may be independent devices or integrated into one or more processors.

[0085] The controller can be the nerve center and command center of the electronic device 100. The controller can generate operation control signals according to the instruction opcode and timing signals to complete the control of fetching and executing instructions.

[0086] The processor 110 may also include a memory for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. This memory can store instructions or data that the processor 110 has just used or that are used repeatedly. If the processor 110 needs to use the instruction or data again, it can retrieve it directly from the memory. This avoids repeated accesses, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.

[0087] The charging management module 140 receives charging input from the charger. While charging the battery 142, the charging management module 140 can also supply power to the electronic device through the power management module 141.

[0088] The wireless communication function of electronic device 100 can be realized through antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, modem processor and baseband processor, etc.

[0089] Electronic device 100 implements display functions through a GPU, a display screen 194, and an application processor. The GPU is a microprocessor for image processing, connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations and for graphics rendering. Processor 110 may include one or more GPUs, which execute program instructions to generate or modify display information.

[0090] Display screen 194 is used to display images, videos, etc. Display screen 194 includes a display panel. In some embodiments, electronic device 100 may include one or N display screens 194, where N is a positive integer greater than 1.

[0091] Electronic device 100 can perform shooting functions through ISP, camera 193, video codec, GPU, display 194 and application processor.

[0092] In some embodiments, the electronic device 100 may include one or N cameras 193, where N is a positive integer greater than 1.

[0093] The external storage interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external storage interface 120 to perform data storage functions. For example, music, video, and other files can be saved on the external memory card.

[0094] Internal memory 121 can be used to store computer executable program code, which includes instructions. Processor 110 executes various functional applications and data processing of electronic device 100 by running the instructions stored in internal memory 121. Internal memory 121 may include a program storage area and a data storage area. The program storage area may store the operating system, at least one application program required for a function (such as sound playback, image playback, etc.), etc. The data storage area may store data created during the use of electronic device 100 (such as audio data, phonebook, etc.). Furthermore, internal memory 121 may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device, flash memory device, universal flash storage (UFS), etc.

[0095] Electronic device 100 can implement audio functions, such as music playback and recording, through audio module 170, speaker 170A, receiver 170B, microphone 170C, headphone jack 170D, and application processor.

[0096] The headphone jack 170D is used to connect wired headphones. The headphone jack 170D can also be the USB interface 130.

[0097] The aforementioned sensor module 180 may include pressure sensors, gyroscope sensors, barometric pressure sensors, magnetic sensors, accelerometers, distance sensors, proximity sensors, fingerprint sensors, temperature sensors, touch sensors, ambient light sensors, bone conduction sensors, etc.

[0098] Buttons 190 include a power button, volume buttons, etc. Motor 191 can generate vibration feedback. Indicator 192 can be an indicator light, used to indicate charging status, battery level changes, and also to indicate messages, missed calls, notifications, etc.

[0099] The SIM card interface 195 is used to connect a SIM card. The electronic device 100 can support one or N SIM card interfaces, where N is a positive integer greater than 1.

[0100] This application provides a method for generating action sequences. An electronic device receives descriptive text input by a user and scene information of the corresponding scene. The descriptive text describes human actions; specifically, it indicates how the human interacts with the scene. The scene information represents the scene in which the human is situated. Then, based on the descriptive text and scene information, the electronic device automatically generates a single human action, achieving automatic generation of basic actions. Next, the electronic device uses a pre-trained conditional variational autoencoder network model (i.e., a target conditional variational autoencoder network model) to predict actions based on the single human action, obtaining a complete action sequence that ensures the coherence and naturalness of the actions, thereby guaranteeing the quality of the action sequence. Furthermore, this application does not restrict the format of the descriptive text, avoiding limitations on the action sequence and scene caused by the descriptive text, thus improving the generalization ability of the action sequence generation method provided in this application to a certain extent.

[0101] The implementation process of the above action sequence generation method is described below with reference to Figure 4. For example, as shown in Figure 4, the method may include:

[0102] S301. The electronic device acquires descriptive text and its corresponding interactive object information and scene information. The descriptive text indicates information about human interaction with the scene. The information about human interaction with the scene includes interaction action information.

[0103] The descriptive text describes the interaction process between the human body and the scene.

[0104] The above scene information represents the scene in which the human is located (or the first scene), that is, the scene in which the interactive object is located. The scene information may be three-dimensional (3D) information of the scene, for example 3D mesh information or 3D point cloud information.

[0105] The aforementioned interactive object information represents information about the specific object that the human body interacts with in the scene. For example, the desk 10 shown in Figure 1(a) is an interactive object. Optionally, the information of this object can be its identifier or its position information. Specifically, the position information of the object can be 3D position information, such as the object's 3D coordinates (x, y, z).

[0106] In some embodiments, the source of the aforementioned descriptive text (or first text) and one or more of its corresponding interactive object information and scene information can be one or more of the following: user input, user selection, or transmission from other devices. Taking a mobile phone as an example, where the descriptive text and interactive object information are user input, and the scene information is user selection, the mobile phone can provide at least one preset scene. The user can select a preset scene according to their needs and input the descriptive text and interactive object information. Correspondingly, the mobile phone stores scene information for each of the at least one preset scene, enabling the user to directly utilize the preset scenes provided by the mobile phone.

[0107] In some embodiments, the aforementioned interactive object information can be independent information, or it can be carried within the descriptive text, in which case it does not need to be obtained separately. In the latter case, the descriptive text and its corresponding scene information together constitute the action interaction information, and the descriptive text indicates the interactive object information. In the former case, the descriptive text, its corresponding interactive object information, and the scene information together constitute the action interaction information.

[0108] In some embodiments, the descriptive text described above may be in text form or speech form. Specifically, when the descriptive text is in speech form, the electronic device may convert it into text form.
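The inputs acquired in S301 can be pictured as a small data structure. The following Python sketch is illustrative only: the class names `InteractiveObject` and `ActionRequest` and all field names are hypothetical and do not appear in this application, and the scene is assumed to be given as a 3D point cloud.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class InteractiveObject:
    """Interactive object information: an identifier, a 3D position, or both."""
    identifier: Optional[str] = None
    position: Optional[Point3D] = None

    def __post_init__(self) -> None:
        if self.identifier is None and self.position is None:
            raise ValueError("need an identifier or a 3D position")


@dataclass
class ActionRequest:
    """Hypothetical container for the S301 inputs."""
    descriptive_text: str            # free-form text, no fixed format
    scene_points: List[Point3D]      # assumed: scene given as a 3D point cloud
    # None models the case where the object is carried in the descriptive text.
    interactive_object: Optional[InteractiveObject] = None

    def __post_init__(self) -> None:
        if not self.descriptive_text.strip():
            raise ValueError("descriptive text must not be empty")
        if not self.scene_points:
            raise ValueError("scene information must not be empty")


request = ActionRequest(
    descriptive_text="sit on the bench",
    scene_points=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    interactive_object=InteractiveObject(position=(1.0, 2.0, 0.5)),
)
```

Leaving `interactive_object` unset corresponds to the case in which the interactive object information is carried within the descriptive text.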

[0109] S302. The electronic device inputs the interactive object information and scene information into the renderer, so that the renderer renders from different viewpoints based on the interactive object information and scene information, resulting in N viewpoint images, where N is greater than or equal to 1.

[0110] Each of the N viewpoint images mentioned above includes an interactive object. Optionally, the viewpoint image format can be RGB format, or other formats such as YUV format; this application does not limit this.

[0111] For example, the renderer constructs a sphere centered on the interactive object indicated by the interactive object information, with a preset distance as the radius, and renders the scene from N viewpoints using the scene information, thus obtaining N viewpoint images of the interactive object. In other words, each viewpoint image shows the part of the scene within a range centered on the interactive object with the preset distance as the radius.
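One way to pick the N viewpoints is to place virtual cameras on the sphere itself, all looking at the interactive object. The Python sketch below is a minimal illustration under assumptions that this application does not state: the cameras are spread evenly in azimuth at a single fixed elevation, the z axis points up, and the function name `viewpoints_on_sphere` and the `elevation_deg` parameter are hypothetical.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def viewpoints_on_sphere(center: Point3D, radius: float, n: int,
                         elevation_deg: float = 20.0) -> List[Point3D]:
    """Return n camera positions on a sphere around `center`.

    Assumption: cameras share one elevation angle and are evenly spaced
    in azimuth; each camera would look toward `center`.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if radius <= 0:
        raise ValueError("radius must be positive")
    cx, cy, cz = center
    elevation = math.radians(elevation_deg)
    ring_radius = radius * math.cos(elevation)   # horizontal distance
    height = radius * math.sin(elevation)        # vertical offset (z is up)
    cameras = []
    for k in range(n):
        azimuth = 2.0 * math.pi * k / n
        cameras.append((cx + ring_radius * math.cos(azimuth),
                        cy + ring_radius * math.sin(azimuth),
                        cz + height))
    return cameras


# Two viewpoints around an object at (1, 2, 0.5) with a 3 m preset distance.
cams = viewpoints_on_sphere((1.0, 2.0, 0.5), radius=3.0, n=2)
```

Every returned position lies exactly at the preset distance from the object, so each rendered image covers the neighborhood of the interactive object described above.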

[0112] For example, the interactive object is a bench, and the number of rendered viewpoints is 2. The electronic device therefore obtains the viewpoint image shown in Figure 7A and the viewpoint image shown in Figure 7B.

[0113] In some embodiments, the above-described method of rendering N viewpoint images using a renderer is only one possible implementation process for determining N viewpoint images. Electronic devices can also determine N viewpoint images using other methods. For example, electronic devices can directly render the aforementioned N viewpoint images using a rendering algorithm.

[0114] S303. The electronic device inputs N viewpoint images and descriptive text into the image generation model, so that the image generation model adds a 2D human body image to each of the N viewpoint images based on the descriptive text, thereby obtaining N initial human body images.

[0115] The aforementioned image generation model possesses text-to-image generation capabilities, enabling it to generate human figures that match the descriptive text, i.e., to generate individual human actions. Optionally, the image generation model can be a Latent Diffusion Inpainting network model.

[0116] For example, the electronic device takes the N viewpoint images (or scene images) and the descriptive text as input parameters to the image generation model (also referred to as the human body generation model). For each viewpoint image, the model adds a human body matching the descriptive text to that image, consistent with the viewpoint of the image. The action of the human body matches the descriptive text, thereby yielding a single human action seen from different viewpoints. Here, the human body is 2D.

[0117] Continuing the example above, the electronic device adds a 2D human body to the viewpoint image shown in Figure 7A to obtain the initial human body image shown in Figure 8A. Likewise, the electronic device adds a 2D human body to the viewpoint image shown in Figure 7B to obtain the initial human body image shown in Figure 8B.

[0118] In some embodiments, the format of the initial human body image described above may be the same as the format of the viewpoint image described above, for example, both being RGB format. Of course, they may also be different, and this application does not limit them.

[0119] S304. The electronic device inputs N initial human body images into the three-dimensional reconstruction network model so that the three-dimensional reconstruction network model can perform three-dimensional reconstruction based on the N initial human body images to obtain a single human body action.

[0120] The aforementioned 3D reconstruction network model is used to convert a 2D human body into a 3D form. Optionally, the 3D reconstruction network model can be a human body model registration network model. Additionally, the weights in the 3D reconstruction network model can be frozen.

[0121] In this embodiment, the 3D reconstruction network model extracts 2D human figures with different orientations from each initial human image and performs 3D registration to obtain 3D human information (i.e., a 3D human model), enabling the automatic generation of individual human actions (or examples of individual actions that can serve as objects). Furthermore, this individual human action is generated based on descriptive text, ensuring the matching between the human action and the descriptive text. The use of 3D reconstruction from human images with different orientations ensures that the human figure in the individual action is more realistic and detailed, thereby guaranteeing the robustness of the individual human action. Additionally, the descriptive text is not limited by a fixed format, reducing the difficulty of action generation, facilitating 3D human creation for users, and effectively improving the user experience.

[0122] The aforementioned human actions can be represented using a digital human model (or parametric human model). Specifically, this digital human model can be SMPL (skinned multi-person linear model). SMPL includes 23 keypoints and a global spatial relationship, and represents a human action as a vector in $\mathbb{R}^{24 \times 6}$. Of course, this digital human model can also be another model, such as SCAPE (shape completion and animation of people). This application does not impose any restrictions on the specific digital human model.
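The text does not say what the six numbers per joint encode. One common reading, which is an assumption here and is not confirmed by this application, is the continuous 6D rotation representation, in which the six values are the first two columns of a rotation matrix and the third column is recovered by Gram-Schmidt orthogonalization. The Python sketch below illustrates that reading; the function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix3 = List[List[float]]


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def rotation_from_6d(six: Sequence[float]) -> Matrix3:
    """Map 6 numbers to a 3x3 rotation matrix (assumed 6D representation)."""
    if len(six) != 6:
        raise ValueError("expected 6 values")
    a1, a2 = list(six[:3]), list(six[3:])
    b1 = _normalize(a1)
    # Remove the component of a2 along b1, then normalize (Gram-Schmidt).
    proj = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([a2[i] - proj * b1[i] for i in range(3)])
    # The third axis is the cross product, giving a right-handed frame.
    b3 = [b1[1] * b2[2] - b1[2] * b2[1],
          b1[2] * b2[0] - b1[0] * b2[2],
          b1[0] * b2[1] - b1[1] * b2[0]]
    # b1, b2, b3 are the columns of the rotation matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def pose_to_rotations(pose: Sequence[Sequence[float]]) -> List[Matrix3]:
    """Convert a 24x6 pose (global orientation + 23 keypoints) to matrices."""
    if len(pose) != 24:
        raise ValueError("expected 24 rows: 1 global + 23 keypoints")
    return [rotation_from_6d(row) for row in pose]


rest_pose = [[1.0, 0.0, 0.0, 0.0, 1.0, 0.0] for _ in range(24)]
rotations = pose_to_rotations(rest_pose)
```

Under this reading, a row of `[1, 0, 0, 0, 1, 0]` corresponds to the identity rotation, so `rest_pose` leaves every keypoint unrotated.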

[0123] It should be noted that, similar to generating N viewpoint images via a renderer, generating an initial human image using a human generation model is only one possible method for creating an initial human image. Electronic devices can also generate initial human images in other ways, such as based on relevant algorithms. Similarly, the above method of generating a single human action using a 3D reconstruction network model is one possible method.

[0124] In some embodiments, the renderer, human body generation model, and 3D reconstruction network model described above can be independent or belong to the same model. For example, a generative model may include the renderer, the human body generation model, and the 3D reconstruction network model. Correspondingly, the electronic device can input the descriptive text, interactive object information, and scene information into the generative model, causing the generative model to generate N viewpoint images based on the interactive object information and scene information. A 2D human body is then added to each of the N viewpoint images to obtain N initial human body images. Finally, 3D reconstruction is performed using these N initial human body images to obtain a single human action. In short, the electronic device can input the descriptive text, interactive object information, and scene information into the generative model to trigger the generative model to generate the corresponding single human action.
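The chaining of S302 to S304 inside such a generative model can be sketched as a simple composition of three stages. The Python sketch below is illustrative only: `generate_single_action` and its parameter names are hypothetical, and the three stages are passed in as plain callables standing in for the renderer, the human body generation model, and the 3D reconstruction network model, none of which is implemented here.

```python
from typing import Any, Callable, List


def generate_single_action(
    descriptive_text: str,
    object_info: Any,
    scene_info: Any,
    render_views: Callable[[Any, Any, int], List[Any]],
    add_human: Callable[[Any, str], Any],
    reconstruct_3d: Callable[[List[Any]], Any],
    num_views: int = 2,
) -> Any:
    """Compose S302 -> S303 -> S304 to obtain a single human action."""
    if num_views < 1:
        raise ValueError("N must be greater than or equal to 1")
    # S302: render N viewpoint images around the interactive object.
    views = render_views(object_info, scene_info, num_views)
    if len(views) != num_views:
        raise ValueError("renderer returned an unexpected number of views")
    # S303: add a 2D human body matching the text to every viewpoint image.
    initial_human_images = [add_human(view, descriptive_text) for view in views]
    # S304: lift the N 2D human bodies to one 3D human action.
    return reconstruct_3d(initial_human_images)


# Stand-in stages that only trace the data flow.
action = generate_single_action(
    "sit on the bench", "bench", "room",
    render_views=lambda obj, scene, n: [f"view{i}" for i in range(n)],
    add_human=lambda view, text: f"{view}+human({text})",
    reconstruct_3d=lambda images: {"from": images},
)
```

Passing the stages as arguments mirrors the statement that they can be independent models or parts of one model: either way, the caller supplies only the descriptive text, interactive object information, and scene information.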

[0125] Optionally, the generative model can be configured as needed, such as a GenZI model. Alternatively, the electronic device can directly add a 3D human model to the viewpoint image, i.e., add 3D human motion to obtain the aforementioned single human motion. Optionally, the number of viewpoint images can be one. In summary, the electronic device can generate a single human motion matching the descriptive text based on the descriptive text and its corresponding scene information.

[0126] The section above introduced the process of automatically generating a single human action that fits the descriptive text. The following section introduces the process of using a target conditional variational autoencoder (CVAE) network model to supplement the single human action and generate a complete human action sequence, that is, the inference process of the target CVAE network model.

[0127] S305. The electronic device inputs the single human action and the scene information into the target conditional variational autoencoder (CVAE) network model. Based on the scene information corresponding to the descriptive text, the target CVAE network model supplements the single human action with additional actions to obtain a target human action sequence. The scene information guides the direction of the generated target human action sequence.

[0128] The target human motion sequence represents a series of continuous human actions that align with the descriptive text. Applying this target human motion sequence to a scene enables interaction between the human body and the scene. In essence, after an electronic device applies the target human motion sequence to a scene, the user can trigger the device to display the interaction between the human body and the scene from different perspectives, thereby generating videos from various viewpoints.

[0129] The target conditional variational autoencoder (CVAE) network model (or target model) can supplement a single action to generate multiple consecutive actions, thus obtaining a complete action sequence. The target CVAE network model is obtained by training a conditional variational autoencoder network model (or first model). For the training process of this model, see the training phase described below; it is not detailed here.

[0130] In this embodiment, the target conditional variational autoencoder (CVAE) network model treats the single human action as one action in the desired target human action sequence, such as the first or last human action. Then, starting from the single human action, it uses the pre-trained implicit space of target actions matching the descriptive text to supplement the actions, obtaining an action sequence that matches the descriptive text (i.e., the target human action sequence). This enables more accurate interaction between the human body and the scene and ensures the continuity of the interaction process, making the actions more natural and realistic. For example, the single human action is determined based on the initial human body images shown in Figure 8A and Figure 8B. The target CVAE network model supplements actions based on this single human action to obtain the human actions shown in Figure 9. It should be understood that the human actions shown in Figure 9 are three-dimensional, whereas the human bodies shown in Figure 8A and Figure 8B are two-dimensional.

[0131] In some embodiments, the electronic device can also generate the target human action sequence using the initial human body position information, where the initial position of the human body in the scene within the target human action sequence is the position indicated by the initial human body position information. This initial human body position information can be a default value, or its source can be the same as the source of the descriptive text, interactive object information, or scene information described above.

[0132] In this embodiment, the electronic device uses the descriptive text and its corresponding scene information and interactive object information as input parameters to the target conditional variational autoencoder (CVAE) network model. Running the target CVAE network model yields a complete human action sequence that matches the descriptive text, thus achieving automatic generation of the human action sequence. Furthermore, the descriptive text can consist of open-ended vocabulary, without restrictions on its format, thereby reducing the limitations on generating human action sequences and improving the generalization and applicability of the target CVAE network model.

[0133] In some embodiments, the structure of the above-described conditional variational autoencoder network model may include a point cloud encoder, a multi-layer perceptron (MLP) model, a transformer encoder, and an action generator. Furthermore, during the inference phase, the modules utilized by the target conditional variational autoencoder network model may include the point cloud encoder, the MLP model, and the action generator. It should be understood that the modules utilized by the target conditional variational autoencoder network model in the inference phase described in this application are these three modules; this does not mean that the unused transformer encoder is inactive. This application merely uses these three modules as an example to illustrate the process of obtaining the target human action sequence through inference.

[0134] Among them, the point cloud encoder is used to extract the spatial features of the scene.

[0135] MLP models are used to determine the distribution parameters of the scene. Additionally, during the inference phase, MLP models can predict the implicit space of actions that match the scene.

[0136] The encoder is used to determine the distribution parameters of the implicit space of actions similar to human action sequence 1 in the action sequence sample, that is, to determine the distribution parameters of the implicit space of actions that match the descriptive text corresponding to human action sequence 1.

[0137] An action generator is used to generate sequences of human actions.

[0138] Accordingly, the implementation process of S305 described above may include the steps shown in Figure 5, as follows:

[0139] S305a. The target conditional variational autoencoder network model extracts features from the scene information through the point cloud encoder to obtain the spatial features of the scene.

[0140] The scene information in S305a may include the 3D point cloud information of the scene.

[0141] S305b. The target conditional variational autoencoder network model uses the spatial features of the scene as input parameters of the MLP model, so that the MLP model can determine the distribution parameters of the action space in the scene based on the spatial features of the scene.

[0142] The distribution parameters of the action space (or first action implicit space) in the current scene are predicted by the MLP model and describe an action implicit space that can be used for the current scene, that is, a predicted action implicit space that matches the current scene.

[0143] The distribution parameters of the action space represent the distribution of actions matched with the current scene. Optionally, the distribution parameters may include the mean and / or variance. Of course, the distribution parameters can also be other parameters, such as the standard deviation.

[0144] S305c. The target conditional variational autoencoder network model samples from the distribution defined by the distribution parameters of the action space in the above scene to obtain the target implicit encoding.

[0145] Here, the target implicit encoding represents the specific action implicit space (i.e., action space) that matches the scene. Based on this, since the determined individual human action matches the descriptive text, and the actions in the action implicit space indicated by the determined target implicit encoding match the scene, the action sequence obtained based on the individual human action and the target implicit encoding not only fits the descriptive text but also matches the scene, ensuring the quality of the action sequence.

[0146] Optionally, the above sampling can be Gaussian sampling. Accordingly, Gaussian sampling of the distribution parameters of the action space actually refers to Gaussian sampling of the Gaussian distribution (i.e., normal distribution) corresponding to the distribution parameters.
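Gaussian sampling from a mean and a variance can be sketched in a few lines. The following Python example is a minimal illustration under stated assumptions: the distribution is taken to be a diagonal Gaussian (one mean and one variance per dimension), and the function name `sample_latent` and its parameters are hypothetical rather than part of this application.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_latent(mean: Sequence[float], variance: Sequence[float],
                  rng: Optional[random.Random] = None) -> List[float]:
    """Draw one target implicit encoding z ~ N(mean, diag(variance)).

    Each component is mean + sqrt(variance) * eps with eps ~ N(0, 1).
    """
    if len(mean) != len(variance):
        raise ValueError("mean and variance must have the same length")
    if any(v < 0 for v in variance):
        raise ValueError("variance must be non-negative")
    rng = rng or random.Random()
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0)
            for m, v in zip(mean, variance)]


# Hypothetical distribution parameters predicted by the MLP model for a scene.
z = sample_latent(mean=[0.2, -1.0, 0.5], variance=[0.1, 0.4, 0.05],
                  rng=random.Random(0))
```

Passing a seeded `random.Random` makes the draw repeatable; drawing several encodings for the same scene would yield different candidate action sequences for that scene.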

[0147] S305d. The target conditional variational autoencoder network model uses the target implicit encoding, the aforementioned single human action, and the spatial features of the scene as input parameters for the action generator.

[0148] S305e. The action generator, based on the target implicit encoding and combined with the spatial features of the scene, supplements the single human action with additional actions to obtain the target human action sequence.

[0149] In this embodiment, the action generator, based on a single human action, reconstructs a target mask vector by concatenating the vector corresponding to that single human action with the target implicit encoding. This reconstructs the action to be supplemented corresponding to the target mask, thereby obtaining a target human action sequence that matches the descriptive text. The target mask vector can represent the initial vector corresponding to the human action to be generated. Specifically, the target mask vector refers to a pre-trained mask vector (such as embedding e).
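How the known action, the mask vectors, and the target implicit encoding might be laid out as generator input can be sketched as follows. This Python example is illustrative only: the application does not specify the layout, so it is assumed here that every position to be generated holds the mask vector (embedding e), that the known position holds the single human action, and that the target implicit encoding is appended to every position. The function name `build_generator_input` and all parameter names are hypothetical.

```python
from typing import List, Sequence


def build_generator_input(single_action: Sequence[float],
                          latent_code: Sequence[float],
                          mask_embedding: Sequence[float],
                          sequence_length: int,
                          known_index: int = 0) -> List[List[float]]:
    """Lay out one row per action slot for a hypothetical action generator."""
    if len(mask_embedding) != len(single_action):
        raise ValueError("mask embedding must match the action vector size")
    if not 0 <= known_index < sequence_length:
        raise ValueError("known_index must lie inside the sequence")
    rows = []
    for t in range(sequence_length):
        # The known slot keeps the real action; all others start as the mask.
        base = single_action if t == known_index else mask_embedding
        # Assumed layout: the implicit encoding is appended to every slot.
        rows.append(list(base) + list(latent_code))
    return rows


rows = build_generator_input(
    single_action=[1.0, 2.0],
    latent_code=[0.5],
    mask_embedding=[9.0, 9.0],
    sequence_length=3,
    known_index=0,   # use sequence_length - 1 to treat it as the last action
)
```

The generator would then replace each masked row with a predicted action, which is the reconstruction step described above.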

[0150] Optionally, the spatial features of the scene described above are used to guide the direction of the actions generated by the action generator. Of course, the target conditional variational autoencoder structure network model may also not input the spatial features of the scene into the action generator.

[0151] In addition, when it is necessary to demonstrate the interaction between the human body and the scene, the electronic device restores the sample scene through the spatial features of the scene, combines the target human body's action sequence with the scene, and realizes the display of the interaction process between the human body and the scene.

[0152] In some embodiments, for the specific processes of S305a-S305e described above, refer to the relevant description of the training phase below; they are not repeated here.

[0153] Furthermore, the process described in S305a-S305e is only one possible implementation by which the target conditional variational autoencoder (CVAE) network model generates target human action sequences. The target CVAE network model can also generate target human action sequences in other ways. For example, the target CVAE network model may not use the MLP model to predict the action implicit space matched to the scene. Accordingly, the aforementioned MLP model is optional.

[0154] In some embodiments, the above-described generation of target human motion sequences through a target conditional variational autoencoder structure network model is only one example. Electronic devices can also generate target human motion sequences through the relevant algorithms corresponding to the target conditional variational autoencoder structure network model, as long as the algorithm can supplement the motion of a single human motion to generate the target human motion sequence.

[0155] In some embodiments, the target conditional variational autoencoder (CVAE) network model described above is trained using human action sequences, without requiring paired descriptive text and human action sequences. This reduces the difficulty of obtaining training samples and ensures the quantity of training samples, thereby guaranteeing the generalization and applicability of the target CVAE network model. For the process of training the model, refer to the training phase described below. For example, the training phase may include the steps shown in Figure 6.

[0156] Training phase:

[0157] S401. The electronic device acquires multiple training samples. Each training sample includes an action sequence sample and its corresponding sample scene information. The action sequence sample includes human action sequence 1.

[0158] The aforementioned sample scene information represents information that can reconstruct the sample scene. For example, as mentioned above, the sample scene information can be the 3D point cloud information of the sample scene.

[0159] The above-mentioned human action sequence 1 consists of continuous human actions.

[0160] S402. The electronic device inputs the multiple training samples into the conditional variational autoencoder (CVAE) network model to train the CVAE network model. The CVAE network model includes a point cloud encoder, an MLP model, an encoder, and an action generator.

[0161] For the roles of the point cloud encoder, MLP model, encoder, and action generator, refer to the descriptions above.

[0162] It is understandable that the scene information in different training samples can be the same or different. Assuming that the scene information in multiple training samples is the same, then when training a conditional variational autoencoder network model using these multiple training samples, the scene information only needs to be input once. Of course, it can also be input multiple times; this application does not restrict this.

[0163] S403. For each sample scene information, the conditional variational autoencoder structure network model extracts features from the sample scene information through a point cloud encoder to obtain the spatial features of the sample scene.

[0164] In this embodiment of the application, during training, for the sample scene information in each training sample, the point cloud encoder in the conditional variational autoencoder structure network model extracts spatial features from the sample scene information to obtain the spatial features corresponding to the sample scene information, that is, to obtain the spatial features of the sample scene indicated by the sample scene information.

[0165] S404. The conditional variational autoencoder structure network model uses the spatial features of the sample scene as input parameters of the MLP model, so that the MLP model can determine the distribution parameters of the sample scene based on the spatial features of the sample scene.

[0166] As before, the distribution parameters of the sample scene represent the distribution of the sample objects within the sample scene, for example a room together with the positions and sizes of all objects within that room. In other words, they refer to the range of specific values for the elements in the vectors corresponding to the sample objects in the scene. Optionally, the distribution parameters of the sample scene may include mean 1 ($\mu_{token}$) and/or variance 1 ($\Sigma_{token}$). Of course, the distribution parameters can also be other parameters, such as the standard deviation.

[0167] S405. The conditional variational autoencoder structure network model uses the distribution parameters of the sample scene as the input parameters of the encoder.

[0168] For example, the electronic device inputs the distribution parameters of the sample scene into the encoder in the conditional variational autoencoder structure network model as a constraint condition for determining the action implicit space, thereby constraining and supervising the generation of the action implicit space.

[0169] In some embodiments, the above-described method of training a conditional variational autoencoder network model using the distribution parameters of the sample scene, i.e., determining the distribution parameters of the sample scene and using the distribution parameters of the sample scene as input parameters of the encoder, is merely an example; that is, steps S404-S405 described above are optional. Electronic devices may also not determine the distribution parameters of the sample scene, i.e., they may not use the distribution parameters as constraints to determine the implicit space of actions.

[0170] S406. For each training sample, the conditional variational autoencoder network model uses the human action sequence 1 in the training sample and the corresponding mask vector as input parameters of the encoder. The mask vector is the vector corresponding to a masked action in the human action sequence 1.

[0171] For example, the conditional variational autoencoder network model (i.e., the electronic device) masks a portion of the actions in human action sequence 1 (also called the first action sequence), that is, it replaces the vectors corresponding to those actions with mask vectors. Simply put, a mask vector is the vector corresponding to a randomly occluded action in human action sequence 1. The mask vector actually refers to the specific range of values for the elements in the vector corresponding to that action.

[0172] The mask vectors mentioned above are learnable vectors. By training a conditional variational autoencoder network model, the accuracy of the mask vectors in representing the corresponding actions can be improved.

[0173] Specifically, the mask vector can be represented by the embedding e. As mentioned earlier, human actions are actually represented by vectors. For each human action sequence 1, the electronic device randomly masks actions in the sequence, that is, it replaces the actual vectors corresponding to some of the actions with the learnable embedding e.

[0174] For example, a human action sequence 1 includes 10 actions. Masking 9 of the 10 actions involves replacing the vector corresponding to each of the 9 actions with the embedding e.
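The masking in this example can be sketched as follows. The function name `mask_sequence`, the three-element action vectors, and the zero initialisation of the embedding e are hypothetical; in the described method e is a learnable vector whose values are updated during training.

```python
import random

def mask_sequence(actions, mask_embedding, num_masked, rng):
    """Replace `num_masked` randomly chosen action vectors with the mask embedding.

    Returns the masked sequence and the sorted indices that were masked.
    """
    masked_idx = sorted(rng.sample(range(len(actions)), num_masked))
    masked = [
        list(mask_embedding) if i in masked_idx else list(action)
        for i, action in enumerate(actions)
    ]
    return masked, masked_idx

rng = random.Random(0)
ACTION_DIM = 3  # hypothetical size of one action vector
# Ten toy action vectors standing in for human action sequence 1.
actions = [[float(t), float(t) + 0.5, float(t) + 1.0] for t in range(10)]
e = [0.0] * ACTION_DIM  # embedding e, shown here only at its initial value
masked_actions, masked_idx = mask_sequence(actions, e, num_masked=9, rng=rng)
```

After this call, nine positions of `masked_actions` hold e and one position still holds its original action vector.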

[0175] S407. For each training sample, the conditional variational autoencoder structure network model determines the distribution parameters of the implicit space of the human action sequence 1 through the encoder, based on the human action sequence 1 in the training sample and the mask vector corresponding to the human action sequence 1, combined with the distribution parameters of the sample scene in the training sample.

[0176] Here, the implicit space of human action sequence 1 represents multiple action sequences similar to human action sequence 1, that is, different actions that fit the descriptive text corresponding to human action sequence 1.

[0177] Correspondingly, the distribution parameters of the implicit space of the human action sequence 1 represent the range of vectors corresponding to different actions that fit the descriptive text corresponding to the human action sequence 1.

[0178] Optionally, as mentioned above, the distribution parameters of the action implicit space may include mean 2 (i.e., μ) and/or variance 2 (i.e., Σ). Of course, the distribution parameters of the action implicit space can also be other parameters, such as the standard deviation.

[0179] In this embodiment, for each human action sequence 1, after inputting the human action sequence 1, the mask vector corresponding to the human action sequence 1, and the distribution parameters of the sample scene in which the human action sequence 1 is located (i.e., the sample scene corresponding to the human action sequence 1) into the encoder, the encoder maps the discrete human action sequence 1 into a continuous action sequence based on the human action sequence 1 and the mask vector corresponding to the human action sequence 1. That is, it maps the discrete space into a continuous action implicit space, thereby obtaining the initial distribution parameters of the action implicit space corresponding to the human action sequence 1. Optionally, similar to the sample scene described above, the initial distribution parameters may also include variance 3 and / or mean 3.

[0180] For example, human action sequence 1 includes human action sequence 1A. The encoder learns multiple action sequences that match the description text corresponding to human action sequence 1A based on human action sequence 1A and the mask vector corresponding to human action sequence 1A, thereby obtaining the distribution parameters of the action implicit space corresponding to human action sequence 1A.

[0181] Considering that actions are usually subject to scene constraints, the encoder determines the distribution parameters of the action implicit space of human action sequence 1 based on the initial distribution parameters of that implicit space and the distribution parameters of the sample scene in which human action sequence 1 is located.

[0182] For example, the encoder can use the sum of the initial distribution parameters of the action implicit space corresponding to human action sequence 1 and the distribution parameters of the sample scene in which human action sequence 1 is located as the distribution parameters of the action implicit space corresponding to human action sequence 1.

[0183] Taking distribution parameters that include a mean and a variance as an example, mean 2 in the distribution parameters corresponding to human action sequence 1 is equal to the sum of mean 3 in the initial distribution parameters corresponding to human action sequence 1 and mean 1 in the distribution parameters of the sample scene in which human action sequence 1 is located. Similarly, variance 2 in the distribution parameters corresponding to human action sequence 1 is equal to the sum of variance 3 in the initial distribution parameters corresponding to human action sequence 1 and variance 1 in the distribution parameters of the sample scene in which human action sequence 1 is located.
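The element-wise sum described in this example can be written as a short sketch. The function name `combine_distribution` and the two-element toy values are assumptions for illustration; means and variances are treated as plain lists of equal length.

```python
def combine_distribution(initial, scene):
    """Combine encoder and scene distribution parameters by element-wise addition.

    `initial` is (mean 3, variance 3) from the encoder;
    `scene` is (mean 1, variance 1) from the MLP model.
    Returns (mean 2, variance 2).
    """
    mean_3, var_3 = initial
    mean_1, var_1 = scene
    mean_2 = [a + b for a, b in zip(mean_3, mean_1)]
    var_2 = [a + b for a, b in zip(var_3, var_1)]
    return mean_2, var_2

mean_2, var_2 = combine_distribution(
    initial=([0.5, -1.0], [0.25, 0.5]),
    scene=([1.0, 2.0], [0.75, 0.5]),
)
# mean_2 is [1.5, 1.0] and var_2 is [1.0, 1.0]
```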

[0184] S408. For each training sample, the conditional variational autoencoder structure network model samples the distribution parameters of the implicit space of the human action sequence 1 in the training sample to obtain the implicit code corresponding to the human action sequence 1.

[0185] Here, the implicit code z corresponding to human action sequence 1 represents an action sequence whose degree of fit with the description text corresponding to human action sequence 1 is greater than a preset degree of fit.

[0186] Optionally, the distribution parameters of the implicit space of human action sequence 1 actually describe a continuous distribution, such as a Gaussian distribution. The electronic device can perform Gaussian sampling according to these distribution parameters to learn more action sequences with higher probabilities, that is, action sequences that fit the descriptive text better.
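A minimal sketch of the Gaussian sampling in S408 follows, assuming a diagonal Gaussian parameterised by a mean and a variance per element. Writing the sample as the mean plus scaled standard-normal noise is a common way to sample in variational autoencoders and is used here as an assumption; the source only states that Gaussian sampling is performed.

```python
import math
import random

def sample_latent(mean, variance, rng):
    """Draw z from N(mean, diag(variance)) as z = mean + sqrt(variance) * eps."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, variance)]

rng = random.Random(0)
# Hypothetical mean 2 and variance 2 of the action implicit space.
z = sample_latent(mean=[1.5, 1.0], variance=[1.0, 1.0], rng=rng)
```

Values of z close to the mean are drawn more often than values far from it, which matches the statement that sampling favours action sequences with higher probabilities.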

[0187] S409. For each training sample, the conditional variational autoencoder structure network model uses the implicit encoding, masked action sequence, and spatial features of the sample scene corresponding to the human action sequence 1 in the training sample as input parameters of the action generator to obtain the predicted action sequence corresponding to the training sample.

[0188] Here, the masked action sequence corresponding to the above-mentioned human action sequence 1 represents the masked human action sequence 1, that is, the human action sequence 1 including the masked action corresponding to the mask vector. Continuing with the example above, a human action sequence 1 includes 10 actions. If 9 of the 10 actions are masked, then the masked action sequence corresponding to the human action sequence 1 includes the 9 masked actions (i.e., the masked actions) and 1 unmasked action.

[0189] In this embodiment, the electronic device inputs the implicit code corresponding to human action sequence 1, the masked action sequence, and the spatial features of the sample scene into the action generator to train the action generator. Based on the implicit code corresponding to human action sequence 1, the action generator reconstructs the masked action in the masked action sequence to obtain the predicted action sequence.

[0190] Continuing with the example above, the action generator reconstructs the nine masked actions in the masked action sequence corresponding to human action sequence 1, thereby obtaining the predicted action sequence corresponding to human action sequence 1. Here, one unmasked action in this masked action sequence corresponds to a single human action in the aforementioned inference stage.

[0191] The above describes the specific training process of the conditional variational autoencoder (CVAE) network model. After one training iteration, the electronic device can determine the training effectiveness of the CVAE network model by comparing its predictions with the corresponding ground truth values, thereby determining whether further training is needed. The process of determining whether further training is required is described below.

[0192] S410. The electronic device determines the loss function based on the human action sequence 1 in the action sequence sample and the predicted action sequence corresponding to the human action sequence 1.

[0193] The loss function represents the degree of difference between the human action sequence 1 (the true value) and the predicted action sequence (the prediction result) corresponding to the human action sequence 1. A larger loss value indicates a greater degree of difference, and a smaller loss value indicates a smaller degree of difference.
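One possible form of such a loss is sketched below. The source does not name a specific loss function, so the mean squared error used here is purely an assumption chosen because it grows with the difference between the true and predicted sequences; conditional variational autoencoders are also commonly trained with an additional KL-divergence term, which the source does not mention and which is omitted here.

```python
def reconstruction_loss(true_sequence, predicted_sequence):
    """Mean squared error over every element of every action vector (assumed loss)."""
    total, count = 0.0, 0
    for true_action, predicted_action in zip(true_sequence, predicted_sequence):
        for t, p in zip(true_action, predicted_action):
            total += (t - p) ** 2
            count += 1
    return total / count

true_sequence = [[0.0, 0.0], [1.0, 1.0]]
close_prediction = [[0.0, 0.0], [1.0, 1.0]]
far_prediction = [[1.0, 3.0], [1.0, 1.0]]
loss_close = reconstruction_loss(true_sequence, close_prediction)  # 0.0
loss_far = reconstruction_loss(true_sequence, far_prediction)      # (1 + 9) / 4 = 2.5
```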

[0194] S411. The electronic device determines whether the loss function is less than the preset loss value.

[0195] In this embodiment of the application, if the loss function is less than the preset loss value, it indicates that the difference between the true value and the prediction result is small, and the accuracy of the predicted action sequence by the trained conditional variational autoencoder network model meets the requirements. Training can be stopped, and the electronic device can execute S412.

[0196] If the loss function is greater than or equal to the preset loss value, it indicates that the difference between the true value and the prediction result is large. The accuracy of the trained conditional variational autoencoder network model in predicting action sequences does not meet the requirements and further training is needed. In this case, the electronic device can execute S413.

[0197] S412. The electronic device uses the trained conditional variational autoencoder structure network model as the target conditional variational autoencoder structure network model.

[0198] In this embodiment, after obtaining the target conditional variational autoencoder network model, the electronic device can use this model to infer and supplement actions based on a single action, thus obtaining a complete action sequence. Simply put, the action to be supplemented is equivalent to the masked action described above, and the target conditional variational autoencoder network model reconstructs the action to be supplemented to obtain the target action sequence required by the user. For the specific inference process, refer to the process described above for determining the target human action sequence using the target conditional variational autoencoder network model.
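The correspondence between actions to be supplemented and masked actions can be sketched as the construction of the inference input. The function name `build_inference_input`, the sequence length of 10, and placing the single action at position 0 are hypothetical choices; the source does not state where in the sequence the single action sits.

```python
def build_inference_input(single_action, mask_embedding, sequence_length, position=0):
    """Build a sequence in which only `position` holds a known action.

    Every other position holds the mask embedding, marking an action
    that the target model is expected to supplement.
    """
    sequence = [list(mask_embedding) for _ in range(sequence_length)]
    sequence[position] = list(single_action)
    return sequence

e = [0.0, 0.0, 0.0]               # trained mask embedding (toy values)
single_action = [0.2, 0.4, 0.6]   # single action obtained from the first text (toy values)
inference_input = build_inference_input(single_action, e, sequence_length=10)
```

The resulting sequence has the same shape as a masked action sequence seen during training, with one unmasked action and nine positions to reconstruct.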

[0199] S413. The electronic device updates the conditional variational autoencoder structure network model and returns to S402.

[0200] In this embodiment, when it is necessary to continue training the conditional variational autoencoder (CVAE) network model, the electronic device can update the relevant parameters of the CVAE network model, such as the parameters in the encoder and action generator, and the specific values of the elements in the mask vector (e.g., embedding e). The electronic device then continues to train the CVAE network model to obtain a mask vector that accurately represents the masked action, as well as an action implicit space and an action implicit code that closely match the descriptive text corresponding to the human action sequence 1, thereby obtaining a target CVAE network model capable of accurately predicting actions.
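The control flow of S410 to S413 can be sketched as a loop that stops once the loss falls below the preset loss value. The helper name `train_until_threshold`, the iteration cap, and the one-parameter toy model fitted by gradient descent are all assumptions used only to make the loop runnable; they stand in for the actual network and its parameter update.

```python
def train_until_threshold(train_step, update, preset_loss, max_iterations=1000):
    """Repeat training iterations until the loss drops below `preset_loss`.

    `train_step()` runs one iteration and returns the loss (S402 to S410);
    `update()` adjusts the model parameters (S413).
    Returns (converged, iterations, last_loss).
    """
    loss = float("inf")
    for iteration in range(1, max_iterations + 1):
        loss = train_step()
        if loss < preset_loss:  # S411 then S412: stop, keep the trained model
            return True, iteration, loss
        update()                # S411 then S413: update and train again
    return False, max_iterations, loss

# Toy stand-in for the model: one scalar weight fitted to a target value.
state = {"w": 0.0}
TARGET, LEARNING_RATE = 2.0, 0.25

def toy_step():
    """Return the squared error of the current weight."""
    return (state["w"] - TARGET) ** 2

def toy_update():
    """One gradient descent step on (w - TARGET)^2."""
    state["w"] -= LEARNING_RATE * 2.0 * (state["w"] - TARGET)

converged, iterations, final_loss = train_until_threshold(
    toy_step, toy_update, preset_loss=1e-3
)
```

The iteration cap is not part of the described method; it is added here only so the sketch always terminates.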

[0201] In this embodiment, the conditional variational autoencoder (CVAE) network model (hereinafter referred to as the model) is trained using individual training samples. Compared with training sample pairs (i.e., action sequences and their corresponding descriptive text), individual training samples are easier to obtain, which facilitates the acquisition of a large number of training samples and ensures the prediction accuracy of the model. Furthermore, since no descriptive text is required, the scenarios covered by the individual training samples are broader, thereby improving the model's generalization ability. Additionally, because the model is not trained using descriptive text, it does not need to learn from descriptive text, thus avoiding the limitations of descriptive text and further improving the model's generalization ability.

[0202] It should be noted that the step numbers above do not represent the actual execution order of the steps. For example, in Figure 5 above, using the scene distribution parameters as encoder input parameters as described in S305d and using the single human action as an encoder input parameter as described in S305d can be performed sequentially or simultaneously. This application does not impose any restrictions on the execution order of the steps.

[0203] It is understandable that the operations performed by the above models (such as the conditional variational autoencoder structure network model, the target conditional variational autoencoder structure network model, etc.) are actually the operations performed by the electronic device on which the model resides.

[0204] In some embodiments, the above describes the model training and inference processes using the same device as an example. Of course, the device for training the model and the device for inference can also be different devices, and this application does not limit this. In general, the first device uses the model for inference, and the second device trains the model; the first device and the second device can be the same or different.

[0205] Furthermore, the above describes the process of training the model and using the model for inference to determine the human action sequence, taking the human body as an example. Of course, the model described in this application can also determine the action sequences of other objects (such as actions). Accordingly, the sample data used to train the model will no longer relate to the human body, but to other objects. In summary, the action sequence generation method described in this application can be applied to action sequence generation scenarios for different objects.

[0206] The above mainly describes the solutions provided by the embodiments of this application from a methodological perspective. It is understood that, in order to achieve the above functions, the electronic device includes hardware structures and / or software modules corresponding to the execution of each function. Based on the units and algorithm steps of the various examples described in the embodiments disclosed in this application, the embodiments of this application can be implemented in hardware or a combination of hardware and computer software.

[0207] Whether a function is implemented by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described function for each specific application, but such implementations should not be considered beyond the scope of the technical solutions in this application.

[0208] This application provides embodiments for dividing an electronic device into functional modules based on the above method examples. For example, each function can be divided into its own functional modules, or two or more functions can be integrated into a single processing unit. The integrated unit can be implemented in hardware or as a software functional module. It should be noted that the unit division in this application embodiment is illustrative and represents only one logical functional division; in actual implementation, other division methods may be used.

[0209] Figure 10 is a structural schematic of an electronic device provided in an embodiment of this application. The electronic device 1000 can be used to implement the methods executed by the electronic device in the above method embodiments. For example, the electronic device 1000 may include a processing unit 1001, a communication unit 1002, and a display unit 1003. The processing unit 1001 is used to support the electronic device 1000 in executing the steps performed by the electronic device in any one of the embodiments shown in Figures 1 to 9; the communication unit 1002 is used to support the communication function of the electronic device 1000; and the display unit 1003 is used to support the display function of the electronic device 1000.

[0210] Optionally, the electronic device 1000 shown in Figure 10 may also include a storage unit (not shown in Figure 10) that stores a program or instructions. When the processing unit 1001 executes the program or instructions, the electronic device 1000 shown in Figure 10 can perform the methods described in the above method embodiments.

[0211] For the technical effects of the electronic device 1000 shown in Figure 10, refer to the technical effects described in the above method embodiments; they are not repeated here. The processing unit 1001 in the electronic device 1000 shown in Figure 10 can be implemented by a processor or processor-related circuit components, and can be a processor or processing module. The communication unit 1002 can be implemented by a transceiver or transceiver-related circuit components, and can be a transceiver or transceiver module. The display unit 1003 can be implemented by display screen-related components.

[0212] This application also provides a chip system. As shown in Figure 11, the chip system includes at least one processor 1101 and at least one interface circuit 1102. The processor 1101 and the interface circuit 1102 are interconnected via lines. For example, the interface circuit 1102 can be used to receive signals from other devices. As another example, the interface circuit 1102 can be used to send signals to other devices (e.g., the processor 1101). Exemplarily, the interface circuit 1102 can read instructions stored in memory and send those instructions to the processor 1101. When the instructions are executed by the processor 1101, the electronic device can perform the various steps performed by the electronic device in the above embodiments. Of course, the chip system may also include other discrete components, and this application embodiment does not specifically limit this.

[0213] Optionally, the chip system may contain one or more processors. These processors can be implemented in hardware or software. When implemented in hardware, the processor can be a logic circuit, an integrated circuit, etc. When implemented in software, the processor can be a general-purpose processor, implemented by reading software code stored in memory.

[0214] Optionally, the chip system may contain one or more memories. The memory may be integrated with the processor or disposed separately from it; this application does not limit this. For example, the memory may be a non-transitory memory, such as a read-only memory (ROM), which may be integrated with the processor on the same chip or disposed separately on different chips. This application does not specifically limit the type of memory or the arrangement of the memory and processor.

[0215] For example, the chip system may be a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a system-on-chip (SoC), a central processor unit (CPU), a network processor (NP), a digital signal processor (DSP), a micro controller unit (MCU), a programmable logic device (PLD), or other integrated chips.

[0216] It should be understood that each step in the above method embodiments can be completed by integrated logic circuits in the processor hardware or by instructions in software form. The method steps disclosed in the embodiments of this application can be directly manifested as being executed by a hardware processor, or being executed by a combination of hardware and software modules in the processor.

[0217] This application also provides a computer storage medium storing computer instructions. When the computer instructions are executed on an electronic device, the electronic device performs the action sequence generation method and/or the model training method described in the above method embodiments.

[0218] This application also provides a computer program product, which includes a computer program or instructions that, when executed on an electronic device, cause the electronic device to perform the action sequence generation method and/or the model training method described in the above method embodiments.

[0219] In addition, this application embodiment also provides an apparatus, which may specifically be a chip, component, or module. The apparatus may include a processor and a memory that are connected to each other. The memory stores computer-executable instructions. When the apparatus is running, the processor executes the computer-executable instructions stored in the memory to cause the apparatus to perform the action sequence generation method and/or the model training method in the above-described method embodiments. The electronic device, computer storage medium, computer program product, or chip provided in this embodiment are all used to execute the corresponding methods provided above. Therefore, for the beneficial effects they achieve, refer to the beneficial effects of the corresponding methods provided above; they are not repeated here.

[0220] Through the above description of the embodiments, those skilled in the art will understand that, for the sake of convenience and brevity, only the division of the above functional modules is used as an example. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.

[0221] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The embodiments can be combined with or referenced to each other without conflict. The apparatus embodiments described above are merely illustrative; for example, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.

[0222] The units described as separate components may or may not be physically separate. A component shown as a unit can be one or more physical units; that is, it can be located in one place or distributed in multiple different locations. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0223] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0224] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. This software product is stored in a storage medium and includes several instructions to cause a device (which may be a microcontroller, chip, etc.) or processor to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0225] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. An action sequence generation method, applied to a first device, the method comprising: acquiring action interaction information, wherein the action interaction information includes a first text and scene information of a corresponding first scene, and the first text indicates an interaction action between an object and the first scene; generating, based on the action interaction information, a single action of the object that matches the first text; and performing action supplementation based on the single action and the scene information of the first scene to generate a target action sequence of the object, wherein the target action sequence includes multiple consecutive actions that match the first text.

2. The method of claim 1, wherein, The step of supplementing actions based on the single action and the scene information of the first scene to generate the target action sequence of the object includes: The single action and the scene information of the first scene are used as input parameters for the target model; The target model is run, and based on the scene information of the first scene that matches the first text, the single action is supplemented to generate the target action sequence; wherein, the scene information of the first scene is used to guide the direction of the action generated by the target model. The target model is obtained by training the first model using training samples; the training samples include action sequence samples and scene information of the sample scenes.

3. The method of claim 2, wherein, The target model includes a point cloud encoder, a multilayer perceptron (MLP) model, and an action generator. The step of running the target model, based on scene information of the first scene matching the first text, performs action supplementation on the single action to generate the target action sequence, including: The point cloud encoder is used to extract features from the scene information of the first scene to obtain the spatial features of the first scene. The distribution parameters of the first action space that match the spatial features of the first scene are determined using an MLP model. The action generator supplements the single action based on the target implicit encoding and the spatial features of the first scene to generate the target action sequence; wherein the target implicit encoding is obtained by sampling the distribution parameters of the first action space.

4. The method according to any one of claims 1 to 3, characterized in that, The step of generating a single action for an object matching the first text based on the action interaction information includes: The action interaction information is used as the input parameter of the generative model; The generative model is run to render a scene image based on the scene information and interactive object information of the first scene. Based on the first text and the scene image, the object is reconstructed in three dimensions to obtain the single action. The interactive object information represents the information of the object that interacts with the first scene. The interactive object information is indicated by the first text or included in the action interaction information.

5. The method of claim 4, wherein, The object includes a human body; the rendering of the scene image based on the scene information and interactive object information of the first scene includes: Render N scene images from N perspectives based on the interactive object information and the scene information of the first scene; The step of performing 3D reconstruction of the object based on the first text and scene image to obtain the single action includes: Based on the first text, a two-dimensional human motion image is added to each of the N scene images to obtain N initial human images; The human body is reconstructed in three dimensions based on the N initial human body images to obtain the single action.

6. A model training method, applied to a second device, the method comprising: acquiring training samples, wherein the training samples include an action sequence sample and scene information of a sample scene, and the action sequence sample consists of multiple consecutive actions; inputting the training samples into a first model to train the first model and obtain a predicted action sequence corresponding to the action sequence sample; determining a loss function based on the action sequence sample and its corresponding predicted action sequence; and, if the loss function is less than a preset loss value, using the trained first model as a target model, wherein the target model is used to generate a target action sequence that matches a first text, and the first text indicates an interaction action between an object and a first scene.

7. The method of claim 6, wherein the first model includes a point cloud encoder, an encoder, and an action generator; and training the first model to obtain the predicted action sequence corresponding to the action sequence sample includes: extracting, by the point cloud encoder, features from the scene information of the sample scene to obtain the spatial features of the sample scene; determining, by the encoder, the distribution parameters of the action implicit space of the action sequence sample based on the action sequence sample and the mask vector corresponding to the action sequence sample, wherein the mask vector is the vector corresponding to a masked action in the action sequence sample, and the distribution parameters of the action implicit space of the action sequence sample represent the range of vectors corresponding to actions similar to the action sequence sample; and restoring, by the action generator, based on an implicit code and the scene information of the sample scene, the masked actions in the masked action sequence corresponding to the action sequence sample to generate the predicted action sequence, wherein the implicit code is obtained by sampling the distribution parameters of the action implicit space corresponding to the action sequence sample.

8. The method of claim 7, wherein, The first model further includes a multilayer perceptron (MLP) model; the method further includes: Based on the spatial characteristics of the sample scene, the distribution parameters of the sample scene are determined using an MLP model. The step of determining the distribution parameters of the implicit space of the action sequence sample corresponding to the action sequence sample through the encoder, based on the action sequence sample and the mask vector corresponding to the action sequence sample, includes: The encoder determines the distribution parameters of the implicit action space based on the action sequence samples and the mask vectors corresponding to the action sequence samples, combined with the distribution parameters of the sample scene; wherein the distribution parameters of the sample scene serve as constraints for determining the distribution parameters of the implicit action space.

9. An electronic device, comprising: The electronic device includes a memory and one or more processors; the memory and the processors are coupled; the memory is used to store computer program code, the computer program code including computer instructions; when the processor executes the computer instructions, the electronic device performs the action sequence generation method as described in any one of claims 1 to 5 or the model training method as described in any one of claims 6 to 8.

10. A computer-readable storage medium, characterized in that, It includes computer instructions that, when executed on an electronic device, cause the electronic device to perform the action sequence generation method as described in any one of claims 1 to 5 or the model training method as described in any one of claims 6 to 8.

11. A computer program product comprising a computer program, characterized in that, When the computer program is executed by the processor, it implements the action sequence generation method as described in any one of claims 1 to 5 or the model training method as described in any one of claims 6 to 8.