Image generation method, interaction method, device, agent and equipment

By decoupling the target mouth shape image generation process and combining lip movement, identity, and posture reference images, a virtual avatar is generated, solving the problem of balancing generation quality and efficiency on edge devices and achieving efficient virtual avatar generation.

CN122244245A · Pending · Publication Date: 2026-06-19 · BEIJING BAIDU NETCOM SCI & TECH CO LTD +1


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD
Filing Date: 2026-03-18
Publication Date: 2026-06-19


IPC: G06T13/40; G06T13/20; G06F3/04842; G06F3/04845; G10L21/10

AI Tagging

Application Domain

Speech analysis; Animation




AI Technical Summary

Technical Problem

Existing technologies struggle to balance the generation quality and efficiency of virtual avatars on edge devices due to limitations in device computing power.

Method used

By performing lip movement processing on a reference mouth shape image to generate a target mouth shape image, and combining it with identity and pose reference images, the target image is generated. This decouples the generation process of the target mouth shape image, reducing the generation difficulty and improving efficiency.

Benefits of technology

It improves the efficiency and accuracy of image generation, ensures a balance between quality and efficiency in the generated virtual avatars, and achieves a unified fusion of efficient lip movements, identity, and posture.

✦ Generated by Eureka AI based on patent content.

Figure CN122244245A_ABST

Abstract

This disclosure provides image generation methods, interaction methods, devices, intelligent agents, electronic devices, storage media, and program products, relating to the field of artificial intelligence technology, particularly deep learning, virtual avatars, digital humans, image processing, and human-computer interaction. The specific implementation scheme is as follows: based on audio data, a reference lip shape image is processed to obtain a target lip shape image, wherein the reference lip shape image includes reference mouth information, and the target lip shape image includes mouth information matching the lip movements in the audio data; and the target lip shape image, an identity reference image, and a posture reference image are processed to obtain a target image, wherein the identity reference image includes the identity information of the target object in the target image, and the posture reference image includes the posture information of the target object.


Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence technology, and in particular to the fields of deep learning, virtual avatars, digital humans, image processing, and human-computer interaction. Specifically, it provides an image generation method, an interaction method, a device, an intelligent agent, an electronic device, a storage medium, and a program product.

Background Technology

[0002] With the rapid development of digital human technology, edge-based digital humans have attracted much attention due to their unique advantages. These advantages include low latency, offline availability, privacy and security, and cost reduction and efficiency improvement. However, due to the limitations of computing power on edge devices, it is difficult to balance the generation quality and efficiency of virtual avatars.

Summary of the Invention

[0003] This disclosure provides an image generation method, an interaction method, an apparatus, an intelligent agent, an electronic device, a storage medium, and a program product.

[0004] According to one aspect of this disclosure, an image generation method is provided, comprising: performing lip movement processing on a reference lip shape image based on audio data to obtain a target lip shape image, wherein the reference lip shape image includes reference mouth information, and the target lip shape image includes mouth information matching the lip movement of the audio data; and processing the target lip shape image, an identity reference image, and a posture reference image to obtain a target image, wherein the identity reference image includes identity information of a target object in the target image, and the posture reference image includes posture information of the target object.

[0005] According to another aspect of this disclosure, an interaction method is provided, comprising: in response to receiving interaction information input at an interactive interface, generating feedback voice for feedback based on the interaction information; generating feedback video based on the feedback voice; and playing the feedback video at the interactive interface; wherein video frames in the feedback video are generated according to the method described above.

[0006] According to another aspect of this disclosure, an image generation apparatus is provided, comprising: an image processing module for performing lip movement processing on a reference lip shape image based on audio data to obtain a target lip shape image, wherein the reference lip shape image includes reference mouth information, and the target lip shape image includes mouth information matching the lip movement of the audio data; and an image generation module for processing the target lip shape image, an identity reference image, and a posture reference image to obtain a target image, wherein the identity reference image includes identity information of a target object in the target image, and the posture reference image includes posture information of the target object.

[0007] According to another aspect of this disclosure, an interactive device is provided, comprising: a voice generation module, configured to generate feedback voice for feedback based on the interaction information received at an interactive interface; a video generation module, configured to generate feedback video based on the feedback voice; and a video feedback module, configured to play the feedback video at the interactive interface; wherein video frames in the feedback video are generated according to the device described above.

[0008] According to another aspect of this disclosure, an intelligent agent is provided, comprising: an input module for receiving input information; a processing module for determining a target task based on the input information received by the input module, determining a large model based on the target task, and obtaining output information by calling the large model to execute the method described above; and an output module for outputting the output information obtained by the processing module.

[0009] According to another aspect of this disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described above.

[0010] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause a computer to perform the methods described above.

[0011] According to another aspect of this disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the method described above.

[0012] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Brief Description of the Drawings

[0013] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:

[0014] Figure 1 schematically shows an exemplary system architecture to which the image generation method, interaction method, and apparatus can be applied according to embodiments of the present disclosure;

[0015] Figure 2 schematically shows a flowchart of an image generation method according to an embodiment of the present disclosure;

[0016] Figure 3 schematically shows a data flow diagram for determining a target lip shape image based on audio coding features and image coding features according to an embodiment of the present disclosure;

[0017] Figure 4 schematically shows a model architecture diagram for determining a target mouth shape image based on audio coding features and image coding features according to an embodiment of the present disclosure;

[0018] Figure 5 schematically shows a data flow diagram for determining a target image based on a stitched image according to an embodiment of the present disclosure;

[0019] Figure 6 schematically shows a model architecture diagram for determining a target image based on stitched images according to an embodiment of the present disclosure;

[0020] Figure 7 schematically shows the division of voice data and image data according to an embodiment of the present disclosure;

[0021] Figure 8 schematically shows a flowchart of an interaction method according to an embodiment of the present disclosure;

[0022] Figure 9 schematically shows a block diagram of an image generation apparatus according to an embodiment of the present disclosure;

[0023] Figure 10 schematically shows a block diagram of an interactive device according to an embodiment of the present disclosure;

[0024] Figure 11 schematically shows the structure of an intelligent agent according to embodiments of the present disclosure; and

[0025] Figure 12 schematically shows a block diagram of an example electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

[0026] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0027] In related technologies, digital humans are usually generated through neural networks or 3D face modeling technology. However, generating digital humans through neural networks requires a large number of model parameters, and generating digital humans through 3D face modeling technology requires the analysis of a large number of 3D parameters. These methods therefore usually rely on the high computing power of the cloud to generate the virtual avatar, and the edge device then obtains the generated virtual avatar from the cloud. The processing time required is relatively long, and it is difficult to balance the generation quality and generation efficiency of virtual avatars.

[0028] In view of this, the present disclosure provides an image generation method, comprising: performing lip movement processing on a reference lip shape image based on audio data to obtain a target lip shape image, wherein the reference lip shape image includes reference mouth information, and the target lip shape image includes mouth information matching the lip movement of the audio data; and processing the target lip shape image, an identity reference image, and a posture reference image to obtain a target image, wherein the identity reference image includes the identity information of a target object in the target image, and the posture reference image includes the posture information of the target object.

[0029] The embodiments of this disclosure decouple the target image generation process. First, a target mouth shape image is generated, and then the target image is generated using this image. Since the generation of the target mouth shape image only requires lip movement processing of reference mouth information, compared to directly generating a complete image of the target object, the generation difficulty is reduced, and the generation efficiency and accuracy are improved. The process of generating the target image based on the target mouth shape image does not require reference to audio data. It can quickly generate a target image including the target object's identity information, posture information, and lip movement, further improving image generation efficiency. Therefore, the above method balances image generation quality and generation efficiency.

[0030] Figure 1 schematically depicts an exemplary system architecture to which the image generation method, interaction method, and apparatus can be applied according to embodiments of the present disclosure.

[0031] It should be noted that Figure 1 shows only one example of a system architecture to which embodiments of this disclosure can be applied, intended to help those skilled in the art understand the technical content of this disclosure. It does not imply that embodiments of this disclosure cannot be used in other devices, systems, environments, or scenarios. For example, in another embodiment, the exemplary system architecture may include only a terminal device, and the terminal device may implement the image generation method, interaction method, and apparatus provided in embodiments of this disclosure without interacting with a server.

[0032] As shown in Figure 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as a medium for providing a communication link between the terminal devices 101, 102, and 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links.

[0033] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Various communication client applications can be installed on terminal devices 101, 102, and 103, such as knowledge reading applications, web browser applications, search applications, instant messaging tools, email clients, and/or social platform software (examples only).

[0034] Terminal devices 101, 102, and 103 can be various electronic devices with displays and web browsing capabilities, including but not limited to smartphones, tablets, laptops, and desktop computers.

[0035] Server 105 can be a server that provides various services, such as a backend management server that supports the content browsed by users using terminal devices 101, 102, and 103 (for example only). The backend management server can analyze and process data such as received user requests, and feed back the processing results (such as videos generated based on user requests) to the terminal devices.

[0036] It should be noted that the image generation method and interaction method provided in the embodiments of this disclosure can generally be executed by terminal devices 101, 102, or 103. Accordingly, the image generation device and interaction device provided in the embodiments of this disclosure can also be disposed in terminal devices 101, 102, or 103.

[0037] Alternatively, the image generation method and interaction method provided in the embodiments of this disclosure can generally also be executed by server 105. Correspondingly, the image generation apparatus and interaction apparatus provided in the embodiments of this disclosure can generally be located in server 105. The image generation method and interaction method provided in the embodiments of this disclosure can also be executed by a server or server cluster that is different from server 105 and capable of communicating with terminal devices 101, 102, 103 and/or server 105. Correspondingly, the image generation apparatus and interaction apparatus provided in the embodiments of this disclosure can also be located in a server or server cluster that is different from server 105 and capable of communicating with terminal devices 101, 102, 103 and/or server 105.

[0038] It should be understood that the numbers of terminal devices, networks, and servers shown in Figure 1 are merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.

[0039] In the technical solutions disclosed herein, the collection, storage, use, processing, transmission, provision, disclosure, and application of any type of information, such as user personal information, comply with the provisions of relevant laws and regulations, necessary confidentiality measures have been taken, and they do not violate public order and good morals.

[0040] In the technical solution disclosed herein, the user's authorization or consent is obtained before acquiring or collecting the user's personal information.

[0041] It should be noted that the sequence numbers of the operations in the following methods are for descriptive purposes only and should not be considered as indicating the execution order of the operations. Unless explicitly stated otherwise, the method does not need to be executed in the exact order shown.

[0042] Figure 2 schematically shows a flowchart of an image generation method according to an embodiment of the present disclosure.

[0043] As shown in Figure 2, the method includes operations S210 to S220.

[0044] In operation S210, based on audio data, lip movement processing is performed on the reference lip shape image to obtain the target lip shape image.

[0045] In operation S220, the target mouth shape image, identity reference image, and posture reference image are processed to obtain the target image.
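The two operations can be sketched as a small orchestration function. This is only an illustrative skeleton: `generate_target_image`, `lip_model`, and `fusion_model` are hypothetical names, and the two models are passed in as callables because the text does not fix their implementation at this point. The sketch is meant to show the decoupling, namely that the audio is consumed only in S210.

```python
from typing import Any, Callable

Image = Any  # placeholder for whatever image representation is used
Audio = Any  # placeholder for the audio data


def generate_target_image(
    audio: Audio,
    reference_lip_image: Image,
    identity_reference: Image,
    pose_reference: Image,
    lip_model: Callable[[Audio, Image], Image],
    fusion_model: Callable[[Image, Image, Image], Image],
) -> Image:
    """Hypothetical two-stage pipeline mirroring operations S210 and S220."""
    # S210: the audio drives only the small lip region.
    target_lip_image = lip_model(audio, reference_lip_image)
    # S220: the second stage receives no audio, only three images.
    return fusion_model(target_lip_image, identity_reference, pose_reference)
```

Because the second stage has no audio input, it could in principle be swapped or optimized independently of the lip-driving stage.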

[0046] The audio data stores encoded audio information, and in embodiments of this disclosure, the audio information may include voice information.

[0047] The reference lip shape image includes reference mouth information, representing the mouth in its initial, undriven state. The target lip shape image includes mouth information that matches the lip movements in the audio data. In other words, the mouth in the target lip shape image is in the state of articulating the sound in the audio data.

[0048] The identity reference image includes the identity information of the target object in the target image. The target object can be a digital human. The identity information can include the appearance information of the digital human. The appearance information can include the appearance information of the digital human's head, such as hair color, facial features, skin color, etc. The appearance information can also include appearance information related to the digital human's body shape, such as height, body type, etc.

[0049] The pose reference image includes pose information of the target object, which represents the movements of the target object other than those of the mouth, such as the orientation of the face, the direction of the gaze, and limb movements.

[0050] For a single target object, multiple pose reference images can be set, with each pose reference image corresponding to a mouth shape image. Therefore, for a target mouth shape image, its corresponding pose reference image can be selected.

[0051] In the embodiments of this disclosure, the corresponding pose reference image and mouth shape image can be obtained by cropping a sample image of the target object. For example, for a sample image of the target object, the mouth part of the sample image can be cropped, and the cropped image can be used as the mouth shape image, while the remaining image can be used as the pose reference image.
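A minimal sketch of this cropping step follows, using nested lists as a stand-in for an image. The function name `split_sample_image`, the rectangular mouth box, and the choice to blank the mouth region with a `fill` value are assumptions for illustration; the text only states that the cropped mouth part and the remaining image are used.

```python
def split_sample_image(image, top, left, height, width, fill=0):
    """Split a sample image into (mouth_image, pose_reference).

    `image` is a list of rows of pixel values. The mouth box is given by its
    top-left corner and size. Blanking the mouth area with `fill` is an
    assumption; the source only says the remaining image is kept.
    """
    mouth = [row[left:left + width] for row in image[top:top + height]]
    pose = [list(row) for row in image]  # copy so the input is not mutated
    for r in range(top, top + height):
        for c in range(left, left + width):
            pose[r][c] = fill
    return mouth, pose
```

Calling the function once per sample image yields a mouth shape image and a pose reference image that correspond to each other.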

[0052] Using the target lip shape image, the identity reference image, and the posture reference image, an overall posture image whose lip movements match the audio data is determined as the target image. The identity information of this overall posture image is consistent with the identity information of the identity reference image.

[0053] According to embodiments of this disclosure, lip movement processing is performed on a reference lip shape image based on audio data, ensuring that the target lip shape image and audio data are matched in terms of timing and semantics. This guarantees a high degree of synchronization between lip movement and audio data, solving the problem of disconnect between traditional lip movement generation and audio. Furthermore, compared to generating a complete image of the target object, generating only a target lip shape image including lip movement results reduces the amount of data that needs to be generated, improving image generation efficiency. By jointly processing the target lip shape image, an identity reference image carrying identity information, and a posture reference image carrying posture information, the generated target image can simultaneously retain the target object's identity features, posture structure, and lip movement effect matching the audio data, achieving a unified fusion of lip movement, identity, and posture. This significantly improves the realism and completeness of the target image while ensuring image generation efficiency.

[0054] According to embodiments of this disclosure, operation S210 shown in Figure 2, in which lip movement processing is performed on the reference mouth shape image based on audio data to obtain the target mouth shape image, may include: processing multiple image coding features at different feature scales based on audio coding features to obtain the target mouth shape image. The audio coding features are obtained by encoding the audio features of the audio data, and the multiple image coding features are obtained by encoding the image features of the reference mouth shape image multiple times. The feature scale indicates the value of at least one metric of the feature semantics and the feature space.

[0055] The at least one metric of the feature space may include the spatial size of the image coding feature, the spatial depth of the image coding feature, etc.

[0056] Multiple image coding features at different feature scales can be obtained by serially encoding a reference lip shape image multiple times. Audio coding features can be obtained by encoding the audio features of the audio data once, or by encoding the audio features of the audio data multiple times, using the result of the last encoding as the audio coding feature.

[0057] For example, the reference mouth shape image is encoded to obtain the image coding features at the first feature scale, and then the image coding features at the first feature scale are encoded to obtain the image coding features at the second scale.
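Serial encoding can be illustrated with a toy feature pyramid in which each stage consumes the output of the previous one. In this sketch a 2x2 average pooling stands in for a learned encoder stage, which is an assumption made only to keep the example self-contained; `downsample2x` and `encode_serially` are hypothetical names.

```python
def downsample2x(feature):
    """Average each non-overlapping 2x2 block (stand-in for one encoder stage)."""
    h, w = len(feature), len(feature[0])
    return [
        [
            (feature[r][c] + feature[r][c + 1]
             + feature[r + 1][c] + feature[r + 1][c + 1]) / 4.0
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


def encode_serially(image, num_stages):
    """Return the output of every stage; stage i+1 encodes the output of stage i."""
    features = []
    current = image
    for _ in range(num_stages):
        current = downsample2x(current)
        features.append(current)
    return features
```

For a 16x16 input and four stages, the spatial sizes of the returned features are 8x8, 4x4, 2x2, and 1x1, so each later feature has a smaller feature scale than the one before it.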

[0058] Multiple image coding features at different feature scales can also be obtained by performing parallel encoding on the reference mouth shape image multiple times. The encoder configuration parameters used for each encoding are different. The configuration parameters may include the number of output channels of the encoder. When the encoding of the reference mouth shape image is achieved by convolution, the configuration parameters may also include the size of the convolution kernel used in the convolution operation.

[0059] Taking three convolution kernels of different sizes, 1*1, 3*3, and 5*5, as examples, the above convolution kernels are used to encode the reference mouth shape image respectively, and image coding features of three different feature scales can be obtained.
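The parallel variant can be sketched by applying kernels of size 1, 3, and 5 to the same input independently. The sketch below uses an averaging kernel in place of learned weights and applies no padding; both are assumptions. Without padding, the kernel size changes the spatial size of the output, whereas with size-preserving padding only the receptive field would differ.

```python
def box_filter_valid(image, k):
    """Unpadded convolution with a k x k averaging kernel (illustrative weights)."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h - k + 1):
        row = []
        for c in range(w - k + 1):
            window_sum = sum(image[r + i][c + j] for i in range(k) for j in range(k))
            row.append(window_sum / (k * k))
        out.append(row)
    return out


def encode_in_parallel(image, kernel_sizes=(1, 3, 5)):
    """Encode the same image once per kernel size; results are independent."""
    return {k: box_filter_valid(image, k) for k in kernel_sizes}
```

For an 8x8 input, the three outputs have spatial sizes 8x8, 6x6, and 4x4 under these assumptions.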

[0060] For each image coding feature at different feature scales, the image coding features can be processed based on the audio coding features to ensure that the mouth shape corresponding to the processed image coding features matches the speech corresponding to the audio coding features. After processing the image coding features, decoding the processed image coding features yields the target mouth shape image.

[0061] According to embodiments of this disclosure, audio data and a reference lip shape image are encoded separately to obtain audio-coded features and multi-scale image-coded features. Image-coded features at different scales can represent different levels of information such as lip details, contours, and spatial structures. Based on the audio-coded features, the multi-scale image-coded features are processed uniformly, allowing the semantic information of the audio data to permeate into the image-coded features at different levels. This achieves cross-scale, full-dimensional alignment between the audio signal and the visual features of the lip, significantly improving the accuracy of lip movement generation and the fidelity of lip detail reproduction compared to a single-scale fusion method.

[0062] According to embodiments of this disclosure, a target mouth shape image is obtained by processing multiple image coding features at different feature scales based on audio coding features, including: fusing the audio coding features and a first image coding feature among the multiple image coding features to obtain intermediate coding features, wherein the multiple image coding features further include multiple second image coding features, and the feature scale of the first image coding features is smaller than that of the multiple second image coding features; and decoding the intermediate coding features multiple times based on the multiple second image coding features to obtain the target mouth shape image.

[0063] Feature fusion can combine feature maps from multiple sources, with different depths, scales, and number of channels to obtain a new, more robust, and more comprehensive feature. Feature fusion methods can include feature concatenation, superposition of features, weighted fusion of features, and fusion of features through attention mechanisms.

[0064] By fusing the audio coding features with the first image coding features, an intermediate coding feature that contains information from both can be obtained.
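Three of the fusion methods listed above can be sketched on feature maps stored as `[channel][row][column]` nested lists. The function names are hypothetical, attention-based fusion is omitted for brevity, and the weighting scheme in `fuse_weighted` (a single scalar `alpha`) is an assumption.

```python
def fuse_concat(a, b):
    """Feature concatenation along the channel axis; spatial sizes must match."""
    return [[row[:] for row in channel] for channel in a + b]


def _elementwise(a, b, op):
    return [
        [[op(x, y) for x, y in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


def fuse_add(a, b):
    """Superposition: element-wise sum of two features with identical shapes."""
    return _elementwise(a, b, lambda x, y: x + y)


def fuse_weighted(a, b, alpha=0.5):
    """Weighted fusion with one scalar weight (an illustrative choice)."""
    return _elementwise(a, b, lambda x, y: alpha * x + (1.0 - alpha) * y)
```

Concatenation keeps both inputs intact and increases the channel count, while superposition and weighted fusion keep the channel count unchanged.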

[0065] The intermediate encoded features can be decoded multiple times based on multiple second image encoded features in ascending order of feature scale, so that the feature scale of the decoding result of each decoding gradually increases until the feature scale is restored to be consistent with the reference mouth shape image before encoding after the last decoding, thus obtaining the target mouth shape image.

[0066] Figure 3 schematically shows a data flow diagram for determining a target lip shape image based on audio coding features and image coding features according to an embodiment of the present disclosure.

[0067] As shown in Figure 3, audio features can be serially encoded multiple times to obtain the encoding result of each encoding, such as f_v1, f_v2, f_v3, and f_v4. The encoding result of the last encoding is used as the audio encoding feature, i.e., f_v4 is the audio encoding feature. Similarly, the reference lip shape image can be serially encoded multiple times to obtain multiple image encoding features, such as f_p1, f_p2, f_p3, and f_p4. The encoding result of the last encoding is used as the first image encoding feature, and the other image encoding features are used as the second image encoding features, i.e., f_p4 is the first image encoding feature, and f_p1, f_p2, and f_p3 are the second image encoding features.

[0068] In this embodiment, the audio coding feature f_v4 and the first image coding feature f_p4 have the same feature scale, and both are smaller than any of the second image coding features. By fusing the audio coding feature f_v4 and the first image coding feature f_p4 through feature concatenation, an intermediate coding feature f_mix can be obtained.

[0069] The second image-coded features can be arranged in ascending order of feature scale. Then, the second image-coded feature with the smallest feature scale that has not yet been used to decode the intermediate coded features is selected sequentially. This second image-coded feature is then used to decode the intermediate coded features or the decoding result obtained from the previous decoding. This process continues until decoding is completed based on the second image-coded feature with the largest feature scale, resulting in the target mouth shape image.

[0070] After obtaining the intermediate coding feature f_mix, the intermediate coding feature f_mix can be decoded to obtain the first intermediate decoding feature f_1.

[0071] When the first intermediate decoding feature f_1 has been obtained, none of the three second image coding features f_p1, f_p2, and f_p3 has yet been used for decoding. The second image coding feature with the smallest feature scale, f_p3, is therefore selected and used to decode the first intermediate decoding feature f_1 to obtain the second intermediate decoding feature f_2.

[0072] For the second intermediate decoding feature f_2, since the second image coding feature f_p3 has already been used for decoding, the second image coding features not yet used for decoding are f_p1 and f_p2. Of these two, the one with the smaller feature scale, f_p2, is selected and used to decode the second intermediate decoding feature f_2 to obtain the third intermediate decoding feature f_3.

[0073] Similarly, by using the second image coding feature f_p1 to decode the third intermediate decoding feature f_3, the target mouth shape image can be obtained.
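The selection order walked through above can be traced with a small sketch that works on feature names and scales only, not on actual tensors. The numeric scale values used in the usage note are hypothetical, as are the function names; the order of the steps follows the description of Figure 3.

```python
def decode_order(second_feature_scales):
    """Names of the second image coding features, smallest feature scale first."""
    return sorted(second_feature_scales, key=second_feature_scales.get)


def trace_decoding(second_feature_scales):
    """List each decoding step as (input, second feature or None, output)."""
    steps = [("f_mix", None, "f_1")]  # the first decoding uses no second feature
    order = decode_order(second_feature_scales)
    current = "f_1"
    for i, skip in enumerate(order):
        is_last = i == len(order) - 1
        output = "target_image" if is_last else f"f_{i + 2}"
        steps.append((current, skip, output))
        current = output
    return steps
```

With hypothetical scales `{"f_p1": 64, "f_p2": 32, "f_p3": 16}`, the trace is f_mix to f_1, then f_1 with f_p3 to f_2, then f_2 with f_p2 to f_3, and finally f_3 with f_p1 to the target image.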

[0074] According to embodiments of this disclosure, fusing audio coding features with first image coding features that are smaller in scale and more semantically focused can yield intermediate coding features that focus on the core information of audio and lip movement, ensuring the accuracy of lip movement driving. Based on multiple second image coding features with larger scale and more complete spatial structure, the intermediate coding features are decoded multiple times to gradually restore the spatial structure and detailed texture of the mouth image. This not only preserves the accuracy of audio-driven lip movement but also restores the structural integrity of the reference mouth image, avoiding problems such as deformation and blurring in lip movement generation.

[0075] According to embodiments of this disclosure, at least one decoding operation in multiple decoding operations on intermediate encoded features includes: upsampling the current intermediate decoded feature to obtain an upsampled intermediate decoded feature, wherein the current intermediate decoded feature is obtained by decoding the intermediate encoded feature at least once; and performing a depthwise separable convolution on a target second image encoded feature that matches the upsampled intermediate decoded feature and the upsampled intermediate decoded feature to obtain a next intermediate decoded feature, wherein the feature scale of the target second image encoded feature matches the feature scale of the upsampled intermediate decoded feature.

[0076] Upsampling can process feature maps to obtain feature maps with a larger feature scale. Upsampling methods can include deconvolution, unpooling, nearest neighbor interpolation, etc.

[0077] In the embodiments of this disclosure, the upsampling process can be the exact inverse of the encoding process, so that the feature scales of the multiple second image encoding features correspond one-to-one with the feature scales of the multiple upsampled intermediate decoding features.

[0078] Depthwise separable convolution is a lightweight convolution method that breaks an ordinary convolution down into two steps: depthwise convolution and pointwise convolution. This can significantly reduce the amount of computation and the number of model parameters while largely retaining feature extraction capability.
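As a rough illustration of the savings, the weight counts of an ordinary convolution and a depthwise separable convolution can be compared. The kernel size and channel counts below (3*3, 64 input channels, 128 output channels) are assumptions chosen for the example, and bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in an ordinary k*k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise k*k convolution plus a pointwise 1*1 convolution."""
    depthwise = k * k * c_in   # one k*k kernel per input channel
    pointwise = c_in * c_out   # 1*1 convolution mixing channels
    return depthwise + pointwise


std = standard_conv_params(3, 64, 128)
sep = separable_conv_params(3, 64, 128)
print(std, sep, round(sep / std, 3))  # 73728 8768 0.119
```

Under these assumed sizes, the separable form needs roughly 12% of the weights of the ordinary convolution.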

[0079] After each upsampling step, based on the feature scales of multiple second image coding features, a second image coding feature with the same feature scale as the intermediate decoded feature obtained after this upsampling can be selected as the target second image coding feature. The target second image coding feature is then subjected to a depthwise separable convolution with the intermediate decoded feature to obtain the next intermediate decoded feature.
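One decoding step of this kind can be sketched in plain Python on nested lists shaped [channel][row][column]. The sketch makes several assumptions that are not stated in the disclosure: nearest neighbor upsampling by a factor of 2, fusion by channel concatenation, zero padding, and the helper names upsample_nearest, depthwise_conv3x3, pointwise_conv, and decode_step. It is not the model of the disclosure, only an instance of "upsample, match scale, then depthwise separable convolution".

```python
def upsample_nearest(x, factor=2):
    """Nearest neighbor upsampling of a [C][H][W] nested list."""
    return [[[row[j // factor] for j in range(len(row) * factor)]
             for row in ch for _ in range(factor)]
            for ch in x]


def depthwise_conv3x3(x, kernels):
    """Apply one 3*3 kernel per channel (stride 1, zero padding)."""
    out = []
    for ch, k in zip(x, kernels):
        h, w = len(ch), len(ch[0])
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += ch[ii][jj] * k[di + 1][dj + 1]
                res[i][j] = acc
        out.append(res)
    return out


def pointwise_conv(x, weights):
    """1*1 convolution: weights[o][c] mixes input channel c into output channel o."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wt[c] * x[c][i][j] for c in range(len(x))) for j in range(w)]
             for i in range(h)]
            for wt in weights]


def decode_step(current, skip, dw_kernels, pw_weights):
    """Upsample, check that scales match, concatenate, then separable conv."""
    up = upsample_nearest(current)
    if len(up[0]) != len(skip[0]) or len(up[0][0]) != len(skip[0][0]):
        raise ValueError("skip feature scale must match the upsampled feature")
    fused = up + skip  # channel-wise concatenation (assumed fusion)
    return pointwise_conv(depthwise_conv3x3(fused, dw_kernels), pw_weights)


# Toy data: a 1-channel 2x2 decoded feature and a 1-channel 4x4 skip feature.
current = [[[1.0, 2.0], [3.0, 4.0]]]
skip = [[[1.0] * 4 for _ in range(4)]]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
nxt = decode_step(current, skip, [identity, identity], [[1.0, 1.0]])
print(nxt[0][0][0], nxt[0][3][3])  # 2.0 5.0
```

With identity depthwise kernels and unit pointwise weights, the output is simply the upsampled feature plus the skip feature, which makes the data flow easy to check by hand.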

[0080] Figure 4 illustrates a model architecture for determining a target mouth shape image based on audio coding features and image coding features according to an embodiment of the present disclosure.

[0081] As shown in Figure 4, a Unet structure model based on the MobileNet architecture can be used to determine the target mouth shape image based on audio coding features and image coding features.

[0082] In Figure 4, the black vector represents the audio coding feature f_v4, the gray vector represents the feature vector obtained after fusing the audio coding feature with the current image coding feature, the downward arrow represents downsampling, the upward arrow represents upsampling, the dashed arrow represents feature fusion, the rightward arrow labeled 3*3 conv represents 3*3 convolution, and the rightward arrow labeled 1*1 conv represents 1*1 convolution.

[0083] During the encoding process using the Unet structure model based on the MobileNet architecture, audio encoding features can be fused with the current image encoding features before encoding to obtain the next image encoding features. After completing multiple encodings and obtaining image encoding features f_p1, f_p2, f_p3, and f_p4, the image encoding feature f_p4 obtained from the last encoding can be used as the first image encoding feature, and the other image encoding features can be used as the second image encoding features. The first image encoding feature is then fused with the audio encoding features to obtain the intermediate encoding feature f_mix.

[0084] The first decoding in the multiple decoding of intermediate encoded features includes: upsampling the intermediate encoded feature f_mix to obtain the upsampled intermediate encoded feature; concatenating the target second image encoded feature that matches the upsampled intermediate encoded feature with the upsampled intermediate encoded feature to obtain the first intermediate decoded feature f_1.

[0085] In this embodiment, the upsampling process is the exact inverse of the encoding process. Therefore, the feature scale of the first intermediate decoding feature is consistent with the feature scale of the second image coding feature before the last encoding. The second image coding feature before the last encoding can be used as the target second image coding feature. That is, the feature vector obtained by feature fusion of image coding feature f_p3 and audio coding feature f_v4 is the target second image coding feature.

[0086] A depthwise separable convolution can be performed on the first intermediate decoded feature f_1. Specifically, this involves performing two rounds of convolution on the first intermediate decoded feature f_1 and then applying random dropout to obtain the feature vector corresponding to the first intermediate decoded feature f_1.

[0087] In this embodiment, the convolution kernels used in the two rounds of convolution processing can both be 3*3, and the dropout rate can be set to 0.1.

[0088] For the final decoding, the target second image coding feature used in this decoding can be determined based on the second image coding feature f_p1 and the audio coding feature f_v4. The feature vector corresponding to the second intermediate decoding feature f_2 is upsampled, and the upsampled result is fused with the target second image coding feature to obtain the third intermediate decoding feature f_3. Two rounds of convolution can then be applied to f_3 to complete the depthwise convolution, and pointwise convolution can be performed using a 1*1 convolution kernel, thereby achieving depthwise separable convolution of the intermediate decoding features and obtaining the target mouth shape image.

[0089] According to embodiments of this disclosure, upsampling the current intermediate decoding features can progressively enlarge the feature size to match the feature scale of the second image-coded features. Target second image-coded features matching the upsampled feature scale are selected, and efficient fusion of the two is achieved through depthwise separable convolution. This reduces computational overhead while preserving mouth structure information. After multiple iterations of decoding, a target mouth shape image with clear details, accurate structure, and natural lip movement can be generated, solving the problems of feature loss and blurred details in traditional decoding processes.

[0090] According to embodiments of this disclosure, lip movement processing is performed on a reference mouth shape image based on audio data using a mouth shape generation model to obtain a target mouth shape image. The mouth shape generation model is trained as follows: a first target loss value is obtained based on a first loss value between sample audio data and sample mouth shape images, a second loss value between sample mouth shape images and standard mouth shape images, and a third loss value between multiple temporally correlated sample mouth shape images and multiple temporally correlated standard mouth shape images. The sample mouth shape images are obtained by performing lip movement processing on sample reference mouth shape images based on sample audio data. The sample reference mouth shape images include reference lip information, and the standard mouth shape images include real lip information matching the sample audio data. The model parameters of the initial mouth shape generation model are adjusted using the first target loss value to obtain the mouth shape generation model.

[0091] The first loss value can be used to represent the difference between the speech content corresponding to the lip movement information represented by the sample mouth shape image and the speech content represented by the sample audio data. The second loss value can be used to represent the difference between the real lip movement information and the lip movement information represented by the sample mouth shape image.

[0092] Taking sample mouth shape images as an example, temporally correlated sample mouth shape images represent sample mouth shape images that are adjacent at multiple time points. In the embodiments of this disclosure, sample mouth shape images at three consecutive adjacent time points can be used as temporally correlated sample mouth shape images.

[0093] Based on the time corresponding to the sample mouth shape image, a time adjacent to and before that time and a time adjacent to and after that time can be determined. The sample mouth shape images at the above three times are used as temporally associated sample mouth shape images, and the prediction action difference result is determined based on the image pixels of the above sample mouth shape images.

[0094] Similar to sample mouth shape images, standard mouth shape images can also be used as temporally correlated standard mouth shape images at three consecutive adjacent time points. The standard action difference results can be determined based on the image pixels of each of the multiple standard mouth shape images.

[0095] By analyzing multiple consecutive frames of standard mouth shape images, the standard motion difference results used to describe mouth shape changes during actual movements can be determined. By analyzing multiple consecutive frames of sample mouth shape images, the predicted motion difference results used to describe mouth shape changes during predicted mouth shape movements can be determined.

[0096] The difference between the predicted action difference and the standard action difference can be used as the third loss value. The third loss value is used to represent the action difference between the predicted continuous lip-shape action and the standard continuous lip-shape action, thereby representing the constraint on the continuous lip-shape action process.
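The third loss value can be illustrated with a small sketch. The disclosure does not give a formula, so the choices here are assumptions: frames are flattened lists of pixel values, the "action difference" is the pixel-wise difference between consecutive frames, and the loss is the mean absolute gap between predicted and standard action differences. The names frame_diffs and temporal_loss are hypothetical.

```python
def frame_diffs(frames):
    """Pixel-wise differences between consecutive frames (flat pixel lists)."""
    return [[b - a for a, b in zip(f0, f1)]
            for f0, f1 in zip(frames, frames[1:])]


def temporal_loss(pred_frames, std_frames):
    """Mean absolute gap between predicted and standard frame-to-frame changes."""
    pred_d = frame_diffs(pred_frames)
    std_d = frame_diffs(std_frames)
    total, count = 0.0, 0
    for pd, sd in zip(pred_d, std_d):
        for p, s in zip(pd, sd):
            total += abs(p - s)
            count += 1
    return total / count


# Three temporally adjacent frames (previous, current, next), two pixels each.
pred = [[0, 0], [1, 1], [2, 2]]
std = [[0, 0], [1, 1], [3, 3]]
print(temporal_loss(pred, std))  # 0.5
```

In this toy case the predicted motion between the second and third frames is smaller than the standard motion, so the loss is nonzero even though the first two frames match exactly.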

[0097] The first target loss value is determined based on the first loss value, the second loss value, and the third loss value. By adjusting the model parameters of the initial mouth shape generation model to minimize the first target loss value, the mouth shape generation model can be obtained.

[0098] Since the optimization objective is to minimize the first target loss value, the speech content corresponding to the lip movement information in the mouth shape image generated by the adjusted mouth shape generation model is consistent with the speech content of the audio data. Furthermore, the lip movement information represented by the mouth shape image is consistent with the lip movement information represented by the standard mouth shape image, thus ensuring the output accuracy of a single frame of mouth shape image. In addition, during the process of minimizing the first target loss value, the third loss value can be reduced simultaneously, enabling a smooth transition between consecutively generated frames of mouth shape images and reducing jitter and abrupt changes in the generated mouth shape images.

[0099] According to embodiments of this disclosure, a first loss value, a second loss value, and a third loss value are constructed from different dimensions, such as the difference between the sample and standard mouth shape images and temporal coherence, and a first target loss value is obtained by combining them. These values constrain the single-frame lip movement accuracy, mouth shape reproduction accuracy, and temporal smoothness, respectively. The model parameters are optimized based on the first target loss value, enabling the mouth shape generation model to simultaneously learn lip movement accuracy, mouth shape reproduction accuracy, and motion smoothness. This avoids problems such as distortion and abrupt changes in the generated mouth shape images, and improves the stability and naturalness of the content output by the mouth shape generation model.

[0100] According to embodiments of this disclosure, the image generation method further includes: determining a first sub-loss value based on the image pixels of the sample mouth shape image and the standard mouth shape image respectively; determining a second sub-loss value based on the image features of the sample mouth shape image and the standard mouth shape image respectively; and obtaining a second loss value based on the first sub-loss value and the second sub-loss value.

[0101] The first sub-loss value can be determined by the difference between the pixels of the sample mouth shape image and the standard mouth shape image. For example, if the sample mouth shape image and the standard mouth shape image are the same size, the correspondence between multiple pixels in the sample mouth shape image and the standard mouth shape image can be determined first, and the difference between the pixel values of each corresponding pair of pixels can be accumulated to obtain the first sub-loss value.
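The accumulation described for the first sub-loss value can be sketched as below. Using the absolute value of each pixel difference is an assumption (the text only says the differences are accumulated), and the function name pixel_loss and the sample pixel values are hypothetical.

```python
def pixel_loss(sample, standard):
    """Sum of absolute differences over corresponding pixels of two images.

    Both images are nested lists [row][column] and must be the same size.
    """
    same_rows = len(sample) == len(standard)
    if not same_rows or any(len(a) != len(b) for a, b in zip(sample, standard)):
        raise ValueError("images must have the same size")
    return sum(abs(a - b)
               for row_a, row_b in zip(sample, standard)
               for a, b in zip(row_a, row_b))


sample_img = [[10, 20], [30, 40]]
standard_img = [[12, 20], [27, 40]]
print(pixel_loss(sample_img, standard_img))  # 5
```

Pixels are paired by position, which is why the two images need the same size before the differences are accumulated.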

[0102] The Visual Geometry Group (VGG) model can be used to process the sample mouth shape image and the standard mouth shape image separately to obtain the image features of each image. By using the second sub-loss value between the image features of the sample mouth shape image and the standard mouth shape image, the differences between the sample mouth shape image and the standard mouth shape image can be determined. These differences include the differences between features such as teeth and mouth shape.

[0103] According to embodiments of this disclosure, a first sub-loss is determined based on image pixels to constrain the pixel-level reproduction accuracy between the sample mouth shape image and the standard mouth shape image, ensuring accurate matching of mouth contours and positions. A second sub-loss is determined based on the image features of both to constrain lip movement consistency at the semantic level, avoiding stiff details and semantic deviations caused by focusing only on pixels. The second loss value obtained by combining the two can balance low-level pixel reproduction and high-level semantic alignment, further improving the accuracy and realism of lip movement generation.

[0104] According to embodiments of this disclosure, in operation S220 shown in Figure 2, processing the target mouth shape image, identity reference image, and posture reference image to obtain the target image may include: encoding, multiple times, the stitched image obtained by stitching the target mouth shape image, identity reference image, and posture reference image to obtain first encoded features and multiple second encoded features at different feature scales, wherein the feature scale of the first encoded features is smaller than that of the multiple second encoded features; and decoding the first encoded features multiple times based on the multiple second encoded features to obtain the target image.

[0105] The target mouth shape image, identity reference image, and posture reference image are stitched together to obtain a stitched image. The stitched image includes lip information from the target mouth shape image that matches the lip movements in the audio data, posture information of the target object from the posture reference image, and identity information of the target object from the identity reference image.

[0106] The stitched image can be encoded serially multiple times to obtain first-coded features and multiple second-coded features at different feature scales. For example, taking three encodings as an example, the stitched image is encoded for the first time to obtain the first second-coded feature, the first second-coded feature is encoded for the second time to obtain the second second-coded feature, and the second second-coded feature is encoded for the third time to obtain the first-coded feature.

[0107] Since the encoding is performed serially, the more times the encoding is performed, the smaller the feature scale of the resulting encoded feature becomes. That is, the feature scale of the first encoded feature is smaller than that of multiple second encoded features. According to the order of encoding, the feature scales of the multiple second encoded features are arranged from largest to smallest.
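The relationship between encoding count and feature scale can be sketched numerically. The disclosure does not state by how much each encoding shrinks the feature; the sketch assumes each encoding halves the spatial size and uses a hypothetical 256-pixel input, with three encodings as in the example of paragraph [0106].

```python
def serial_encode_scales(input_size, num_encodings, factor=2):
    """Feature scale after each serial encoding, assuming a fixed reduction factor."""
    scales = []
    size = input_size
    for _ in range(num_encodings):
        size //= factor  # each encoding consumes the previous encoding's output
        scales.append(size)
    return scales


scales = serial_encode_scales(256, 3)
second_scales, first_scale = scales[:-1], scales[-1]
print(second_scales, first_scale)  # [128, 64] 32
```

The last encoding yields the first encoded feature with the smallest scale, while the earlier outputs are the second encoded features, listed from largest to smallest in encoding order.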

[0108] The first coding feature can be decoded multiple times based on multiple second coding features in ascending order of feature scale, so that the feature scale of the decoding result of each decoding gradually increases until the target image is obtained after the last decoding.

[0109] Figure 5 illustrates a data flow graph for determining a target image based on a stitched image according to an embodiment of the present disclosure.

[0110] As shown in Figure 5, the right side of the figure shows, from top to bottom, the posture reference image, the identity reference image, and the target mouth shape image. After stitching the posture reference image, the identity reference image, and the target mouth shape image to obtain the stitched image, the stitched image can be encoded multiple times to obtain multiple encoded features f_e1, f_e2, f_e3, and f_e4. The encoding result f_e4 of the last encoding is used as the first encoded feature, and the other encoding results are used as the second encoded features.

[0111] The second-coded features can be arranged in ascending order of feature scale. Then, the second-coded feature with the smallest feature scale that has not yet been used to decode the first-coded features is selected as the target second-coded feature. The first-coded features or the decoding result obtained from the previous decoding are then decoded based on the target second-coded feature. This process continues until decoding is completed based on the second-coded feature with the largest feature scale, at which point the target image is obtained.

[0112] The first encoded feature f_e4 can be convolved with a 1*1 to obtain the intermediate result f_d1 corresponding to the first encoded feature f_e4. Alternatively, the first encoded feature f_e4 can be directly used as f_d1, and the intermediate result f_d1 can be upsampled to obtain the first decoded feature f_d2.

[0113] After obtaining the first decoded feature f_d2, the second encoded feature with the smallest feature scale that has not yet been used to decode the first encoded feature, i.e., f_e3, can be used as the target second encoded feature. The first decoded feature f_d2 is then decoded using the target second encoded feature f_e3 to obtain the second decoded feature f_d3. Similarly, in the last decoding, the second encoded feature f_e1 is used as the target second encoded feature to obtain the third decoded feature f_d4, from which the target image is obtained.

[0114] According to embodiments of this disclosure, the target mouth shape image, identity reference image, and posture reference image are stitched together and then encoded multiple times to obtain first encoded features and multiple second encoded features at different scales. The first encoded features with smaller feature scales can carry more core semantic information, while the second encoded features with larger feature scales can carry more spatial structural information. Therefore, through the first encoded features and multiple second encoded features, the features of the stitched image can be expressed from multiple dimensions. By progressively decoding the first encoded features based on the second encoded features, the three kinds of information (lip movement, identity information, and posture information) can be gradually fused, ensuring that the generated target image simultaneously meets the requirements of all three. This solves problems such as image distortion caused by feature conflicts during multi-source image fusion.

[0115] According to embodiments of this disclosure, at least one of the multiple decoding operations on the first encoded feature includes: upsampling the current decoded feature to obtain an upsampled decoded feature, wherein the current decoded feature is obtained by decoding the first encoded feature at least once; and performing a depthwise separable convolution on a target second encoded feature that matches the upsampled decoded feature and the upsampled decoded feature to obtain a next decoded feature, wherein the feature scale of the target second encoded feature matches the feature scale of the upsampled decoded feature.

[0116] In embodiments of this disclosure, the upsampling process can be the exact inverse of the encoding process, so that the feature scales of the multiple second encoded features correspond one-to-one with the feature scales of the multiple upsampled decoded features.

[0117] After each upsampling step, based on the feature scales of multiple second-coded features, a second-coded feature with the same feature scale as the upsampled decoded feature obtained after this upsampling can be selected as the target second-coded feature. Performing a depthwise separable convolution between the target second-coded feature and the upsampled decoded feature yields the next decoding feature.

[0118] Figure 6 illustrates a model architecture for determining a target image based on stitched images according to an embodiment of the present disclosure.

[0119] As shown in Figure 6, a Unet structure model based on the MobileNet architecture can be used to determine the target image based on the stitched image.

[0120] In Figure 6, downward arrows indicate downsampling, upward arrows indicate upsampling, dashed arrows indicate feature fusion, rightward arrows labeled 3*3 conv indicate 3*3 convolution, and rightward arrows labeled 1*1 conv indicate 1*1 convolution.

[0121] To explain the encoding and decoding process in this embodiment more clearly, a solid rightward arrow without any label in Figure 6 indicates that no operation is performed on the feature vector; it is only used to facilitate the description of the data flow of the vector.

[0122] In the process of encoding using the Unet structure model based on the MobileNet architecture, a serial encoding method can be adopted to encode the stitched image multiple times, obtaining multiple encoding features f_e1, f_e2, f_e3, and f_e4. The encoding feature obtained from the last encoding is used as the first encoding feature f_e4, and the other encoding features are used as the multiple second encoding features.

[0123] The first decoding step in the multiple decoding of the first encoded feature includes: performing a 1*1 convolution on the first encoded feature f_e4 to obtain an intermediate result f_d1 corresponding to the first encoded feature f_e4; alternatively, the first encoded feature f_e4 can be directly used as f_d1. Upsampling the intermediate result f_d1 yields an upsampled feature, which is then fused with the target second encoded feature to obtain the first decoded feature f_d2. In this embodiment, upsampling and encoding are completely opposite processes; therefore, the feature scale of the first decoded feature is consistent with the feature scale of the second encoded feature before the last encoding. Thus, the second encoded feature before the last encoding can be used as the target second encoded feature, i.e., the second encoded feature f_e3 can be used as the target second encoded feature.

[0124] The first decoded feature f_d2 can be subjected to depthwise separable convolution and dropout operations to obtain the feature vector corresponding to the first decoded feature f_d2.

[0125] In this embodiment, the convolution kernels used in the two rounds of convolution processing can both be 3*3, and the dropout can be set to 0.1.

[0126] For the last decoding, the feature vector corresponding to the second decoding feature f_d3 is upsampled, and the upsampled result is fused with the second encoded feature f_e1 to obtain the third decoding feature f_d4. Two rounds of convolution processing can be performed on the third decoding feature f_d4 to complete the depthwise convolution. Therefore, after obtaining the last decoding feature, pointwise convolution can be performed using a 1*1 convolution kernel to achieve depthwise separable convolution of the decoding features and obtain the target image.

[0127] According to embodiments of this disclosure, by upsampling to gradually enlarge the feature size of the decoded features, it is possible to adapt to the feature scale of multiple second encoded features. The second encoded features with matching feature scales are selected and the upsampled decoded features are fused by depthwise separable convolution. This can efficiently fuse lip movement, identity information, and pose information, while reducing the amount of computation and preserving image details and structural integrity. After multiple decodings, a target image with rich details, unified features, and natural visual appearance is generated, thus improving the overall generation quality.

[0128] According to embodiments of this disclosure, an image generation model is used to process a target mouth shape image, an identity reference image, and a pose reference image to obtain a target image. The image generation model is trained as follows: a second target loss value is obtained based on a fourth loss value between a sample object image and a standard object image, and a fifth loss value between multiple temporally correlated sample object images and multiple temporally correlated standard object images. The sample object image is obtained by processing a standard mouth shape image, a sample identity reference image, and a sample pose reference image. The mouth shape in the standard object image matches the mouth shape in the standard mouth shape image, the identity information of the sample object in the standard object image matches the identity information of the sample object in the sample identity reference image, and the pose information of the sample object in the standard object image matches the pose information of the sample object in the sample pose reference image. The model parameters of the initial image generation model are adjusted using the second target loss value to obtain the image generation model.

[0129] The fourth loss value can be used to represent the difference between the mouth shape, identity, and pose information in the sample object image, which is derived from the standard mouth shape image, sample identity reference image, and sample pose reference image, and the mouth shape information, identity information, and pose information in the standard object image.

[0130] Taking sample object images as an example, temporally associated sample object images represent sample object images that are adjacent at multiple time points. In the embodiments of this disclosure, sample object images at three consecutive adjacent time points can be used as temporally associated sample object images.

[0131] Based on the time corresponding to the sample object image, a time adjacent to and before that time, and a time adjacent to and after that time can be determined. The sample object images at the above three times are used as temporally associated sample object images, and the prediction object difference result is determined based on the image pixels of the above sample object images.

[0132] Similar to sample object images, standard object images can also be used as temporally correlated standard object images at three consecutive adjacent time points, and the standard object difference results can be determined based on the image pixels of each of the multiple standard object images.

[0133] By analyzing multiple consecutive frames of standard object images, standard object difference results can be determined to describe the overall changes of objects during the actual action process. By analyzing multiple consecutive frames of sample object images, predicted object difference results can be determined to describe the changes of objects during the predicted action process.

[0134] The difference between the predicted object difference result and the standard object difference result can be used as the fifth loss value. The fifth loss value is used to represent the action difference between the predicted continuous object action and the standard continuous object action, thereby representing the constraint on the continuous action process of the object.

[0135] The second target loss value is determined based on the fourth and fifth loss values. By adjusting the model parameters of the initial image generation model to minimize the second target loss value, the image generation model can be obtained.

[0136] Since the optimization objective is to minimize the second target loss value, the lip movement, mouth information, identity information, and pose information represented by the object image generated by the adjusted image generation model are consistent with the standard object image, thus ensuring the output accuracy of a single frame object image. Furthermore, in the process of minimizing the second target loss value, the fifth loss value can be reduced simultaneously, enabling a smooth transition between consecutively generated frames of object images and reducing jitter and jumps in the generated object images.

[0137] According to embodiments of this disclosure, a fourth loss value is constructed based on the single-frame difference between the sample object image and the standard object image, and a fifth loss value is constructed based on the temporal coherence between the sample object image and the standard object image, to obtain a second target loss value. This can simultaneously constrain the overall image fidelity and temporal smoothness, thereby improving the accuracy and reliability of target image generation.

[0138] According to embodiments of this disclosure, the image generation method further includes: determining a third sub-loss value based on the image pixels of the sample object image and the standard object image respectively; determining a fourth sub-loss value based on the image features of the sample object image and the standard object image respectively; and obtaining a fourth loss value based on the third sub-loss value and the fourth sub-loss value.

[0139] The third sub-loss value can be determined by the differences between the pixels of the sample object image and the standard object image. For example, if the sample object image and the standard object image are the same size, the correspondence between multiple pixels in the sample object image and the standard object image can be determined first, and the difference between the pixel values of each corresponding pair of pixels can be accumulated to obtain the third sub-loss value.

[0140] VGG can be used to process the sample object image and the standard object image separately to obtain their respective image features. By using the fourth sub-loss value between the image features of the sample object image and the standard object image, the differences between the sample object image and the standard object image in more high-frequency or more detailed information can be determined, such as the differences between features like mouth shape and eyes.

[0141] According to embodiments of this disclosure, a third sub-loss is determined based on image pixels to constrain the pixel-level reproduction accuracy between the sample object image and the standard object image, ensuring accurate matching of the overall object contour and position. A fourth sub-loss is determined based on the image features of both to constrain semantic-level detail consistency, avoiding stiff details and semantic deviations caused by focusing only on pixel consistency. The fourth loss value obtained by combining the two can balance low-level pixel reproduction and high-level semantic alignment, further improving the accuracy and realism of the generated object image.

[0142] According to embodiments of this disclosure, the image generation method further includes: splitting the speech data to obtain multiple initial sub-audio files; and combining the initial sub-audio files with a preceding initial sub-audio file and a following initial sub-audio file that are temporally correlated to obtain audio data.

[0143] In one embodiment of this disclosure, during the training of the initial lip-shape generation model and the initial image generation model, the speech data can be the audio portion of the video data. The audio portion of the video data is split in the time dimension according to a preset splitting granularity to obtain multiple initial sub-audio files. The image portion of the video data is split in the time dimension according to the same preset splitting granularity to obtain initial sub-video files that correspond one-to-one with the multiple initial sub-audio files.

[0144] Figure 7 schematically shows the division of voice data and image data according to an embodiment of the present disclosure.

[0145] As shown in Figure 7, the same video data is divided into an audio part and an image part. Both parts are split in the time dimension according to the same splitting granularity, yielding multiple initial sub-audios and multiple initial sub-videos at multiple time points. The initial sub-audio at each time point corresponds to the initial sub-video at that time point.

[0146] For the first initial sub-audio 710, the initial sub-audio that immediately precedes it in time can be taken as the previous initial sub-audio 720, and the initial sub-audio that immediately follows it in time can be taken as the next initial sub-audio 730.

[0147] By combining the first initial sub-audio 710, the previous initial sub-audio 720, and the next initial sub-audio 730, audio data 740 can be obtained.
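
The splitting and combining steps can be sketched as follows. The chunk length, the zero padding of a trailing partial chunk, and the use of silence at the first and last positions are assumptions; the disclosure does not state how boundary segments are handled.

```python
from typing import List


def split_audio(samples: List[float], chunk_len: int) -> List[List[float]]:
    """Split audio samples along time into fixed-length initial sub-audios.

    A trailing partial chunk is zero-padded (assumption).
    """
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        chunks.append(chunk + [0.0] * (chunk_len - len(chunk)))
    return chunks


def with_context(chunks: List[List[float]], index: int) -> List[float]:
    """Concatenate the previous, current, and next initial sub-audios.

    Missing neighbours at the boundaries are replaced by silence (assumption).
    """
    silence = [0.0] * len(chunks[index])
    previous = chunks[index - 1] if index > 0 else silence
    following = chunks[index + 1] if index + 1 < len(chunks) else silence
    return previous + chunks[index] + following


speech = [float(i) for i in range(1, 8)]   # hypothetical samples 1.0 .. 7.0
chunks = split_audio(speech, chunk_len=2)
audio_data = with_context(chunks, index=1)
```

Because each combined window overlaps its neighbours, adjacent windows share context, which is what smooths lip movement across sub-audio boundaries.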

[0148] Since multiple initial sub-audio files correspond one-to-one with multiple initial sub-video files, during the training process, multiple initial sub-video files that are in the same time period as the audio data can be determined based on the audio data, and standard lip shape images and standard object images can be determined from them. Therefore, a large number of training samples can be provided for the initial lip shape generation model and the initial image generation model, avoiding problems such as overfitting due to insufficient training samples.

[0149] In another embodiment of this disclosure, after the model training described above is completed and the trained models are used for image generation, the speech data is the speech that is to drive the lip movement of the images to be generated.

[0150] According to embodiments of this disclosure, speech data is divided into multiple initial sub-audio segments to achieve segmented processing of long speech. Combining temporally adjacent initial sub-audio segments avoids audio breaks and lip movement abruptness at sub-audio boundaries, resulting in smooth temporal sequence and semantic integrity of the obtained audio data. This provides a stable and coherent audio input for subsequent lip movement generation, improving the fluency of lip movement.

[0151] According to embodiments of this disclosure, the image generation method further includes: masking the mouth region of a target object in a reference video frame to obtain a pose reference image, wherein the reference video frame is obtained by splitting the reference video into frames.

[0152] The mouth region of the target object in multiple reference video frames of the reference video is masked separately, and the resulting pose reference image does not include lip movements.

[0153] After masking the mouth region of the target object in the reference video frame, the correspondence between the mouth shape image and the pose reference image can be constructed based on the mouth shape image of the target object's mouth region before masking.

[0154] Compared to generating a complete image of the target object, generating only the mouth shape image requires less computation and is more efficient, thus enabling the generation of more accurate mouth shape images. After generating the mouth shape image using the mouth shape generation model, the pose reference image can be quickly determined based on the correspondence between the mouth shape image and the pose reference image. Therefore, it is not necessary to directly generate a complete image of the target object; generating only the mouth shape image is sufficient to determine the complete image of the target object, improving both image generation efficiency and accuracy. Furthermore, by establishing the correspondence between the mouth shape image and the pose reference image, different pose reference images can be selected for subsequent image generation when different mouth shape images are used, avoiding the problem of inconsistent poses for different mouth shape images, which could affect the realism of the target image.

[0155] According to embodiments of this disclosure, performing masking processing on the mouth region of the target object in the reference video frame can remove interference from the original mouth information, so that the pose reference image retains only the non-mouth information such as the pose and facial structure of the target object, avoiding feature conflicts when fused with subsequent target mouth images, and improving the compatibility of fusion between pose information and lip movement information.
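
A minimal sketch of the masking step is shown below. The bounding-box representation, the fill value, and the function name `mask_mouth` are assumptions; in practice the mouth region would come from some face landmark detector, which is not specified here.

```python
from typing import List, Tuple

Frame = List[List[int]]            # assumed layout: rows of pixel values
Box = Tuple[int, int, int, int]    # (top, left, bottom, right), end-exclusive


def mask_mouth(frame: Frame, box: Box, fill: int = 0) -> Tuple[Frame, Frame]:
    """Return (pose reference image, cropped mouth image) for one video frame.

    The cropped mouth image is kept so that a correspondence between mouth
    shape images and pose reference images can be built afterwards.
    """
    top, left, bottom, right = box
    mouth = [row[left:right] for row in frame[top:bottom]]
    pose_reference = [row[:] for row in frame]   # copy, leave input untouched
    for r in range(top, bottom):
        for c in range(left, right):
            pose_reference[r][c] = fill          # erase original mouth pixels
    return pose_reference, mouth


# Hypothetical 4x4 frame with a 2x2 mouth region in the lower middle.
frame = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
pose_reference, mouth = mask_mouth(frame, box=(2, 1, 4, 3))
```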

[0156] Figure 8 schematically shows a flowchart of an interaction method according to an embodiment of this disclosure.

[0157] As shown in Figure 8, the method includes operations S810 to S830.

[0158] In operation S810, in response to receiving interactive information input on the interactive interface, a feedback voice is generated based on the interactive information.

[0159] In operation S820, a feedback video is generated based on the feedback voice.

[0160] In operation S830, the feedback video is played on the interactive interface.

[0161] The video frames in the feedback video are generated according to the image generation method described above.

[0162] Users can input interactive information into the interface, which can represent their needs, such as product information or news updates they want to learn about. This interactive information can be processed using a large model to determine suitable feedback, which can be in text format.

[0163] Text-to-Speech (TTS) technology can be used to process the text-based feedback information to obtain the feedback speech. The image generation method described above can be used to generate multiple target images based on the feedback speech, and these multiple target images can be stitched together in chronological order to obtain the feedback video.

[0164] The feedback video can be played on the interactive interface, so that feedback on the interactive information is presented to the user in video form.

[0165] According to embodiments of this disclosure, the feedback voice of the interactive information is further processed to generate a feedback video. The feedback video is used to provide feedback on the interactive information. Compared with traditional text interaction, the audiovisual interaction through feedback video improves the immersion and realism of the interaction and optimizes the user experience.
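
The flow of operations S810 to S830 can be sketched as a composition of stages. All four stage functions below are hypothetical stubs standing in for the large model, the TTS engine, the speech splitter, and the image generation method; none of them is an actual API of the disclosure.

```python
from typing import Callable, List

Voice = List[float]


def respond(
    interactive_info: str,
    generate_text: Callable[[str], str],
    synthesize_speech: Callable[[str], Voice],
    split_speech: Callable[[Voice], List[Voice]],
    generate_frame: Callable[[Voice], str],
) -> List[str]:
    """Return the frames of the feedback video in chronological order."""
    feedback_text = generate_text(interactive_info)        # large model
    feedback_voice = synthesize_speech(feedback_text)      # S810: TTS
    segments = split_speech(feedback_voice)
    return [generate_frame(seg) for seg in segments]       # S820: frames


# Hypothetical stubs used only to make the sketch executable.
def stub_text(info: str) -> str:
    return "answer to " + info


def stub_tts(text: str) -> Voice:
    return [float(len(word)) for word in text.split()]


def stub_split(voice: Voice) -> List[Voice]:
    return [voice[i:i + 2] for i in range(0, len(voice), 2)]


def stub_frame(segment: Voice) -> str:
    return f"frame({len(segment)} samples)"


feedback_video = respond("price of product", stub_text, stub_tts, stub_split, stub_frame)
```

Playing the returned frames on the interactive interface together with the feedback voice corresponds to operation S830.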

[0166] Figure 9 schematically shows a block diagram of an image generation apparatus according to an embodiment of the present disclosure.

[0167] As shown in Figure 9, the image generation apparatus 900 of this embodiment includes an image processing module 910 and an image generation module 920.

[0168] The image processing module 910 is used to perform lip movement processing on a reference mouth shape image based on audio data to obtain a target mouth shape image, wherein the reference mouth shape image includes reference mouth information and the target mouth shape image includes mouth information that matches the lip movement of the audio data.

[0169] The image generation module 920 is used to process the target mouth shape image, the identity reference image, and the posture reference image to obtain the target image. The identity reference image includes the identity information of the target object in the target image, and the posture reference image includes the posture information of the target object.

[0170] According to embodiments of the present disclosure, the image processing module 910 includes a feature processing submodule.

[0171] The feature processing submodule is used to process multiple image coding features at different feature scales based on audio coding features to obtain a target mouth shape image. The audio coding features are obtained by encoding the audio features of the audio data, and the multiple image coding features are obtained by encoding the image features of the reference mouth shape image multiple times. The feature scale is used to indicate the index value of at least one indicator in the feature semantics and feature space.

[0172] According to embodiments of this disclosure, the feature processing submodule includes a feature fusion unit and a feature decoding unit.

[0173] The feature fusion unit is used to fuse audio coding features and a first image coding feature from multiple image coding features to obtain intermediate coding features. The multiple image coding features also include multiple second image coding features, and the feature scale of the first image coding feature is smaller than that of the multiple second image coding features.

[0174] The feature decoding unit is used to decode the intermediate encoded features multiple times based on multiple second image encoded features to obtain the target mouth shape image.

[0175] According to embodiments of this disclosure, the feature decoding unit includes an upsampling subunit and a convolution subunit.

[0176] The upsampling subunit is used to upsample the current sub-intermediate decoded feature to obtain the upsampled intermediate decoded feature, wherein the current sub-intermediate decoded feature is obtained by decoding the intermediate encoded feature at least once.

[0177] The convolution subunit is used to perform depthwise separable convolution on the upsampled intermediate decoded feature and the target second image coding feature that matches it, to obtain the next intermediate decoded feature, wherein the feature scale of the target second image coding feature matches the feature scale of the upsampled intermediate decoded feature.
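
One decoding step performed by the upsampling subunit and the convolution subunit might look like the sketch below. Nearest-neighbour 2x upsampling, channel-wise concatenation of the two inputs, 3x3 depthwise kernels with zero padding, and all weight values are assumptions; the disclosure does not fix these details.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # assumed layout: [channel][row][col]
Kernel = List[List[float]]            # one 3x3 kernel


def upsample2x(fm: FeatureMap) -> FeatureMap:
    """Nearest-neighbour upsampling by a factor of two in both directions."""
    out = []
    for channel in fm:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]
            rows.extend([wide, wide[:]])
        out.append(rows)
    return out


def depthwise_separable_conv(
    fm: FeatureMap, depthwise: List[Kernel], pointwise: List[List[float]]
) -> FeatureMap:
    """Per-channel 3x3 convolution followed by a 1x1 channel-mixing step."""
    h, w = len(fm[0]), len(fm[0][0])
    filtered = []
    for channel, kernel in zip(fm, depthwise):   # depthwise stage
        plane = [[0.0] * w for _ in range(h)]
        for r in range(h):
            for c in range(w):
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w:   # zero padding
                            plane[r][c] += kernel[dr + 1][dc + 1] * channel[rr][cc]
        filtered.append(plane)
    return [                                     # pointwise stage
        [[sum(wt * filtered[i][r][c] for i, wt in enumerate(weights))
          for c in range(w)] for r in range(h)]
        for weights in pointwise
    ]


def decode_step(current, skip, depthwise, pointwise) -> FeatureMap:
    """Upsample, join with the matching-scale encoder feature, then convolve."""
    up = upsample2x(current)
    if (len(up[0]), len(up[0][0])) != (len(skip[0]), len(skip[0][0])):
        raise ValueError("feature scales do not match")
    return depthwise_separable_conv(up + skip, depthwise, pointwise)


identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
next_feature = decode_step(
    current=[[[1.0]]],                       # 1 channel, 1x1
    skip=[[[1.0, 2.0], [3.0, 4.0]]],         # 1 channel, 2x2
    depthwise=[identity, identity],
    pointwise=[[1.0, 1.0]],                  # 2 input channels -> 1 output
)
```

With identity depthwise kernels and unit pointwise weights, the output is simply the element-wise sum of the upsampled feature and the skip feature, which makes the data flow easy to check by hand.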

[0178] According to embodiments of this disclosure, the image generation apparatus 900 further includes a first loss determination module and a first parameter adjustment module.

[0179] The first loss determination module is used to obtain a first target loss value based on a first loss value between sample audio data and sample lip shape images, a second loss value between sample lip shape images and standard lip shape images, and a third loss value between multiple temporally correlated sample lip shape images and multiple temporally correlated standard lip shape images. The sample lip shape images are obtained by performing lip movement processing on sample reference lip shape images based on sample audio data. The sample reference lip shape images include reference lip information, and the standard lip shape images include real lip movement lip information that matches the sample audio data.

[0180] The first parameter adjustment module is used to adjust the model parameters of the initial mouth shape generation model using the first target loss value, so as to obtain the mouth shape generation model.
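
The combination of the three loss values into the first target loss value can be sketched as a weighted sum. The weights, the L1 distance, and the particular form of the temporal term (comparing frame-to-frame changes) are assumptions made for illustration; the first loss value between audio and lip shape images is treated here as a number computed elsewhere.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # assumed layout: a flattened lip shape image


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def temporal_loss(sample: List[Frame], standard: List[Frame]) -> float:
    """Assumed form: compare the change between consecutive frames."""
    loss = 0.0
    for t in range(1, len(sample)):
        delta_sample = [a - b for a, b in zip(sample[t], sample[t - 1])]
        delta_standard = [a - b for a, b in zip(standard[t], standard[t - 1])]
        loss += l1(delta_sample, delta_standard)
    return loss


def first_target_loss(
    first: float, second: float, third: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 0.5),  # assumed weights
) -> float:
    """Weighted sum of the first, second, and third loss values."""
    return weights[0] * first + weights[1] * second + weights[2] * third


# Hypothetical three-frame sequences of two-pixel images.
sample_frames = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
standard_frames = [[0.0, 0.0], [0.5, 0.0], [1.0, 1.0]]
second_value = sum(l1(s, t) for s, t in zip(sample_frames, standard_frames))
third_value = temporal_loss(sample_frames, standard_frames)
target = first_target_loss(first=0.25, second=second_value, third=third_value)
```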

[0181] According to embodiments of this disclosure, the first loss determination module includes a first sub-loss determination sub-module, a second sub-loss determination sub-module, and a first loss determination sub-module.

[0182] The first sub-loss determination submodule is used to determine the first sub-loss value based on the image pixels of the sample mouth shape image and the standard mouth shape image.

[0183] The second sub-loss determination submodule is used to determine the second sub-loss value based on the image features of the sample mouth shape image and the standard mouth shape image.

[0184] The first loss determination submodule is used to obtain the second loss value based on the first sub-loss value and the second sub-loss value.

[0185] According to embodiments of this disclosure, the image generation module 920 includes a feature encoding submodule and a feature decoding submodule.

[0186] The feature encoding submodule is used to encode the stitched image obtained by stitching together the target mouth shape image, the identity reference image, and the pose reference image multiple times to obtain first encoded features and multiple second encoded features at different feature scales. The feature scale of the first encoded features is smaller than that of the multiple second encoded features.

[0187] The feature decoding submodule is used to decode the first encoded features multiple times based on multiple second encoded features to obtain the target image.

[0188] According to embodiments of this disclosure, the feature decoding submodule includes an upsampling unit and a convolutional unit.

[0189] An upsampling unit is used to upsample the current decoding feature to obtain an upsampled decoding feature, wherein the current decoding feature is obtained by decoding the first encoded feature at least once.

[0190] The convolutional unit is used to perform depthwise separable convolution on the upsampled decoded feature and the target second encoded feature that matches it, to obtain the next decoded feature, wherein the feature scale of the target second encoded feature matches the feature scale of the upsampled decoded feature.
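
To indicate why depthwise separable convolution suits resource-constrained devices, the following sketch compares weight counts for a standard convolution and a depthwise separable one. The channel counts and kernel size are hypothetical values chosen for the example, and bias terms are ignored.

```python
def standard_conv_params(in_ch: int, out_ch: int, k: int = 3) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * in_ch * out_ch


def depthwise_separable_params(in_ch: int, out_ch: int, k: int = 3) -> int:
    """Weights of a k x k depthwise stage plus a 1 x 1 pointwise stage."""
    return k * k * in_ch + in_ch * out_ch


# Hypothetical layer: 64 input channels, 64 output channels, 3x3 kernel.
standard = standard_conv_params(64, 64)
separable = depthwise_separable_params(64, 64)
reduction = standard / separable
```

For this hypothetical layer the separable form needs 4672 weights instead of 36864, a reduction of roughly a factor of eight.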

[0191] According to embodiments of this disclosure, the image generation apparatus 900 further includes a second loss determination module and a second parameter adjustment module.

[0192] The second loss determination module is used to obtain the second target loss value by using the fourth loss value between the sample object image and the standard object image, and the fifth loss value between multiple sample object images and multiple standard object images that are temporally associated. The sample object image is obtained by processing the standard mouth shape image, the sample identity reference image, and the sample pose reference image. The mouth shape in the standard object image matches the mouth shape in the standard mouth shape image, the identity information of the sample object in the standard object image matches the identity information of the sample object in the sample identity reference image, and the pose information of the sample object in the standard object image matches the pose information of the sample object in the sample pose reference image.

[0193] The second parameter adjustment module is used to adjust the model parameters of the initial image generation model using the second target loss value to obtain the image generation model.

[0194] According to embodiments of this disclosure, the second loss determination module includes a third sub-loss determination sub-module, a fourth sub-loss determination sub-module, and a second loss determination sub-module.

[0195] The third sub-loss determination submodule is used to determine the third sub-loss value based on the image pixels of the sample object image and the standard object image.

[0196] The fourth sub-loss determination submodule is used to determine the fourth sub-loss value based on the image features of the sample object image and the standard object image.

[0197] The second loss determination submodule is used to obtain the fourth loss value based on the third and fourth sub-loss values.

[0198] According to embodiments of this disclosure, the image generation apparatus 900 further includes a speech splitting module and an audio combining module.

[0199] The speech splitting module is used to split speech data into multiple initial sub-audio files.

[0200] The audio combining module is used to combine an initial sub-audio with the temporally correlated preceding initial sub-audio and following initial sub-audio to obtain audio data.

[0201] According to embodiments of this disclosure, the image generation apparatus 900 further includes an image masking module.

[0202] The image masking module is used to mask the mouth region of the target object in the reference video frame to obtain a pose reference image. The reference video frame is obtained by splitting the reference video into frames.

[0203] Figure 10 schematically shows a block diagram of an interactive device according to an embodiment of the present disclosure.

[0204] As shown in Figure 10, the interactive device 1000 in this embodiment includes a voice generation module 1010, a video generation module 1020, and a video feedback module 1030.

[0205] The voice generation module 1010 is used to respond to the interactive information received from the interactive interface and generate feedback voice based on the interactive information.

[0206] The video generation module 1020 is used to generate feedback video based on the feedback voice.

[0207] The video feedback module 1030 is used to play feedback videos on the interactive interface.

[0208] Figure 11 shows a schematic block diagram of an artificial intelligence agent according to an embodiment of the present disclosure.

[0209] In embodiments of this disclosure, as shown in Figure 11, the AI agent 1100 may include an input module 1110, a processing module 1120, and an output module 1130.

[0210] The input module 1110 is used to receive input information.

[0211] The processing module 1120 is used to determine the target task based on the input information received by the input module, determine the large model based on the target task, and obtain output information by calling the large model to execute the image generation method or interactive method provided in the embodiments of this disclosure.

[0212] The output module 1130 is used to output the output information obtained by the processing module.
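
The three-module structure can be sketched as follows. The class name `Agent`, the task names, the rule used to determine the target task, and the model callables are all hypothetical and serve only to show how input, processing, and output might be chained.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Agent:
    # Hypothetical mapping from task name to a large-model call.
    models: Dict[str, Callable[[str], str]]

    def receive(self, raw: str) -> str:
        """Input module 1110: normalize incoming information."""
        return raw.strip()

    def process(self, info: str) -> str:
        """Processing module 1120: pick a task, then call the matching model."""
        # Toy rule (assumption): questions are treated as interaction requests.
        task = "interaction" if info.endswith("?") else "image_generation"
        return self.models[task](info)

    def output(self, result: str) -> str:
        """Output module 1130: hand the result back to the caller."""
        return result

    def run(self, raw: str) -> str:
        return self.output(self.process(self.receive(raw)))


agent = Agent(models={
    "interaction": lambda info: "feedback video for: " + info,
    "image_generation": lambda info: "target image for: " + info,
})
reply = agent.run("  what is new?  ")
```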

[0213] According to embodiments of this disclosure, the input module 1110 is responsible for receiving or sensing information such as queries, requests, instructions, signals, or data from the outside world (e.g., users or the external environment), and converting it into a format that the AI agent 1100 can understand and process. The input module 1110 is the primary link for the AI agent 1100 to interact with the outside world, enabling the AI agent 1100 to efficiently and accurately obtain necessary "sensory" information from the outside world and respond to this information.

[0214] In the example, input module 1110 can input the audio data, interactive information, etc. described above.

[0215] In the example, processing module 1120 is the core support for the AI agent 1100's ability to handle complex tasks. Processing module 1120 can execute the image generation method and interaction method described above.

[0216] In the example, the performance of processing module 1120 is closely related to the large model on which AI agent 1100 is based. To fully leverage the capabilities of the large model, the internal structure of processing module 1120 can be designed to be highly configurable and scalable to handle various types of tasks and requirements in real-world scenarios.

[0217] In the example, after the AI agent 1100 acquires the audio data, the processing module 1120 can use a large speech recognition model to process the audio data to obtain audio coding features, use the image generation model to process the audio coding features to obtain the target image, and then pass the target image to the output module 1130.

[0218] Understandably, while large language models possess excellent language understanding and generation capabilities, their ability to solve tasks, like that of humans, is limited without the aid of tools. Once the AI agent 1100 is given the ability to invoke tools, it can perform tasks such as using a calculator to complete mathematical calculations, using Python to perform data analysis, and using a search engine to look up weather forecasts.

[0219] In the example, output module 1130 can output the target image or feedback video described above.

[0220] The AI agent 1100 according to the embodiments of this disclosure can simply and effectively improve the level of intelligence, and enhance flexibility and versatility.

[0221] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0222] According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described above.

[0223] According to embodiments of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions, wherein the computer instructions are used to cause a computer to perform the method described above.

[0224] According to an embodiment of this disclosure, a computer program product includes a computer program that, when executed by a processor, implements the method described above.

[0225] Figure 12 shows a schematic block diagram of an example electronic device that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.

[0226] As shown in Figure 12, device 1200 includes a computing unit 1201, which can perform various appropriate actions and processes according to a computer program stored in read-only memory (ROM) 1202 or a computer program loaded from storage unit 1208 into random access memory (RAM) 1203. The RAM 1203 may also store various programs and data required for the operation of device 1200. The computing unit 1201, ROM 1202, and RAM 1203 are interconnected via bus 1204. Input/output (I/O) interface 1205 is also connected to bus 1204.

[0227] Multiple components in device 1200 are connected to input / output (I / O) interface 1205, including: input unit 1206, such as a keyboard, mouse, etc.; output unit 1207, such as various types of displays, speakers, etc.; storage unit 1208, such as a disk, optical disk, etc.; and communication unit 1209, such as a network card, modem, wireless transceiver, etc. Communication unit 1209 allows device 1200 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0228] The computing unit 1201 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1201 performs the various methods and processes described above, such as image generation methods or interactive methods. For example, in some embodiments, the image generation method or interactive method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of the computer program may be loaded and / or installed on device 1200 via ROM 1202 and / or communication unit 1209. When the computer program is loaded into RAM 1203 and executed by the computing unit 1201, one or more steps of the image generation method or interactive method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform an image generation method or an interaction method by any other suitable means (e.g., by means of firmware).

[0229] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0230] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0231] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0232] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0233] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0234] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, distributed system servers, or servers incorporating blockchain technology.

[0235] It should be understood that the various forms of processes shown above can be used to rearrange, add, or delete steps. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.

[0236] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. An image generation method, comprising: performing lip movement processing on a reference mouth shape image based on audio data to obtain a target mouth shape image, wherein the reference mouth shape image includes reference mouth information, and the target mouth shape image includes mouth information that matches the lip movement of the audio data; and processing the target mouth shape image, an identity reference image, and a posture reference image to obtain a target image, wherein the identity reference image includes the identity information of the target object in the target image, and the posture reference image includes the posture information of the target object.

2. The method according to claim 1, wherein performing lip movement processing on the reference mouth shape image based on the audio data to obtain the target mouth shape image includes: processing, based on audio coding features, multiple image coding features at different feature scales to obtain the target mouth shape image, wherein the audio coding features are obtained by encoding the audio features of the audio data, the multiple image coding features are obtained by encoding the image features of the reference mouth shape image multiple times, and the feature scale is used to indicate the index value of at least one indicator in the feature semantics and feature space.

3. The method according to claim 2, wherein processing multiple image coding features at different feature scales based on audio coding features to obtain the target mouth shape image includes: fusing the audio coding features and a first image coding feature from the multiple image coding features to obtain intermediate coding features, wherein the multiple image coding features further include multiple second image coding features, and the feature scale of the first image coding feature is smaller than that of the multiple second image coding features; and decoding the intermediate coding features multiple times based on the multiple second image coding features to obtain the target mouth shape image.

4. The method according to claim 3, wherein at least one of the multiple decoding operations performed on the intermediate coding feature comprises: upsampling a current intermediate decoding feature to obtain an upsampled intermediate decoding feature, wherein the current intermediate decoding feature is obtained by decoding the intermediate coding feature at least once; and performing depthwise separable convolution on the upsampled intermediate decoding feature and a target second image coding feature matched with the upsampled intermediate decoding feature to obtain a next intermediate decoding feature, wherein the feature scale of the target second image coding feature matches the feature scale of the upsampled intermediate decoding feature.
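By way of non-limiting illustration, one such decoding operation can be sketched in plain Python on nested lists laid out as [channel][row][column]. The sketch makes several assumptions that the claim does not state: nearest-neighbour upsampling by a factor of 2, channel-wise concatenation as the way the two features are combined, a 3x3 depthwise kernel with zero padding, and the hypothetical helper names `upsample_nearest`, `depthwise_conv3x3`, `pointwise_conv`, and `decode_step`.

```python
def upsample_nearest(x, factor=2):
    """Nearest-neighbour upsampling of a [C][H][W] nested list."""
    return [[[v for v in row for _ in range(factor)]
             for row in ch for _ in range(factor)]
            for ch in x]


def depthwise_conv3x3(x, kernels):
    """Depthwise step: one 3x3 kernel per channel, zero padding, stride 1."""
    out = []
    for ch, k in zip(x, kernels):
        h, w = len(ch), len(ch[0])
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                s = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            s += ch[ii][jj] * k[di + 1][dj + 1]
                res[i][j] = s
        out.append(res)
    return out


def pointwise_conv(x, weights):
    """Pointwise (1x1) step: weights[o][c] mixes input channel c into output o."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wo[c] * x[c][i][j] for c in range(len(x)))
              for j in range(w)]
             for i in range(h)]
            for wo in weights]


def decode_step(current, skip, dw_kernels, pw_weights):
    """Upsample, pair with the scale-matched encoder feature, then convolve."""
    up = upsample_nearest(current)
    # The skip feature must have the same spatial size as the upsampled feature.
    if len(up[0]) != len(skip[0]) or len(up[0][0]) != len(skip[0][0]):
        raise ValueError("feature scales do not match")
    fused = up + skip  # assumption: channel-wise concatenation
    return pointwise_conv(depthwise_conv3x3(fused, dw_kernels), pw_weights)
```

A depthwise separable convolution splits a standard convolution into a per-channel spatial filter followed by a 1x1 channel mix, which generally needs fewer multiply-accumulate operations than a full convolution with the same kernel size and channel counts.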

5. The method according to any one of claims 1 to 4, wherein the lip movement processing is performed on the reference mouth shape image based on the audio data using a mouth shape generation model to obtain the target mouth shape image, and the mouth shape generation model is trained by: obtaining a first target loss value based on a first loss value between sample audio data and a sample mouth shape image, a second loss value between the sample mouth shape image and a standard mouth shape image, and a third loss value between a plurality of temporally correlated sample mouth shape images and a plurality of temporally correlated standard mouth shape images, wherein the sample mouth shape image is obtained by performing lip movement processing on a sample reference mouth shape image based on the sample audio data, the sample reference mouth shape image includes reference lip information, and the standard mouth shape image includes lip information of real lip movements matching the sample audio data; and adjusting model parameters of an initial mouth shape generation model using the first target loss value to obtain the mouth shape generation model.
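By way of non-limiting illustration, the combination of the three loss values can be sketched as a weighted sum. The weighted-sum form, the weights, and the particular temporal term (an L1 distance between frame-to-frame differences) are assumptions made for this sketch; the claim only requires that the first target loss value be based on the three loss values. Frames are represented as flat lists of pixel values, and the first loss value (audio versus image) is taken as a precomputed number.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def temporal_loss(sample_frames, standard_frames):
    """Assumed third loss: compare frame-to-frame changes of both sequences."""
    terms = []
    for t in range(1, len(sample_frames)):
        d_sample = [a - b for a, b in zip(sample_frames[t], sample_frames[t - 1])]
        d_standard = [a - b for a, b in zip(standard_frames[t], standard_frames[t - 1])]
        terms.append(l1(d_sample, d_standard))
    return sum(terms) / len(terms)


def first_target_loss(sync_loss, recon_loss, temp_loss, weights=(1.0, 1.0, 1.0)):
    """Assumed combination: weighted sum of the first, second, and third losses."""
    w1, w2, w3 = weights
    return w1 * sync_loss + w2 * recon_loss + w3 * temp_loss
```

For example, with sample frames `[[0.0, 0.0], [1.0, 1.0]]` and standard frames `[[0.0, 0.0], [2.0, 2.0]]`, the frame-to-frame changes are 1.0 and 2.0 per pixel, so `temporal_loss` returns 1.0.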

6. The method according to claim 5, further comprising: determining a first sub-loss value based on image pixels of the sample mouth shape image and image pixels of the standard mouth shape image; determining a second sub-loss value based on image features of the sample mouth shape image and image features of the standard mouth shape image; and obtaining the second loss value based on the first sub-loss value and the second sub-loss value.
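By way of non-limiting illustration, a pixel-level sub-loss and a feature-level sub-loss can be combined as follows. The use of an L1 distance for both terms, the `feature_weight` parameter, and the caller-supplied `extract_features` function are assumptions for this sketch; in practice the feature extractor might be a pretrained network, which the claim does not specify.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def second_loss(sample_img, standard_img, extract_features, feature_weight=1.0):
    """Assumed form: pixel sub-loss plus weighted feature sub-loss."""
    # First sub-loss: compare raw pixel values.
    pixel_sub_loss = mean_abs_diff(sample_img, standard_img)
    # Second sub-loss: compare features produced by the hypothetical extractor.
    feature_sub_loss = mean_abs_diff(
        extract_features(sample_img), extract_features(standard_img)
    )
    return pixel_sub_loss + feature_weight * feature_sub_loss
```

The pixel term penalizes per-pixel deviation, while the feature term penalizes differences that only show up in a higher-level representation, so two images with the same features but shifted pixels are still distinguished.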

7. The method according to any one of claims 1 to 6, wherein processing the target mouth shape image, the identity reference image, and the pose reference image to obtain the target image comprises: encoding, multiple times, a stitched image obtained by stitching together the target mouth shape image, the identity reference image, and the pose reference image, to obtain a first coding feature and a plurality of second coding features at different feature scales, wherein the feature scale of the first coding feature is smaller than the feature scales of the plurality of second coding features; and decoding the first coding feature multiple times based on the plurality of second coding features to obtain the target image.
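By way of non-limiting illustration, the stitching and repeated encoding can be sketched on nested lists laid out as [channel][row][column]. Two assumptions are made here that the claim does not state: stitching is taken to be channel-wise concatenation, and each learned encoding stage is replaced by 2x2 average pooling so that the sketch stays runnable without a trained model. The helper names `stitch`, `avg_pool2`, and `encode_multiscale` are hypothetical.

```python
def stitch(*images):
    """Assumed stitching: concatenate [C][H][W] images along the channel axis."""
    return [ch for img in images for ch in img]


def avg_pool2(x):
    """Halve height and width by 2x2 average pooling (stand-in for an encoder stage)."""
    return [[[(ch[2 * i][2 * j] + ch[2 * i][2 * j + 1]
               + ch[2 * i + 1][2 * j] + ch[2 * i + 1][2 * j + 1]) / 4.0
              for j in range(len(ch[0]) // 2)]
             for i in range(len(ch) // 2)]
            for ch in x]


def encode_multiscale(stitched, levels=3):
    """Encode repeatedly; return (smallest-scale feature, larger-scale features)."""
    feats = []
    x = stitched
    for _ in range(levels):
        x = avg_pool2(x)
        feats.append(x)
    # The last feature has the smallest scale and plays the role of the first
    # coding feature; the earlier ones serve as the second coding features.
    return feats[-1], feats[:-1]
```

With three 8x8 single-channel inputs and `levels=3`, the stitched image has three channels, the second coding features are 4x4 and 2x2, and the first coding feature is 1x1.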

8. The method according to claim 7, wherein at least one of the multiple decoding operations performed on the first coding feature comprises: upsampling a current decoding feature to obtain an upsampled decoding feature, wherein the current decoding feature is obtained by decoding the first coding feature at least once; and performing depthwise separable convolution on the upsampled decoding feature and a target second coding feature matched with the upsampled decoding feature to obtain a next decoding feature, wherein the feature scale of the target second coding feature matches the feature scale of the upsampled decoding feature.

9. The method according to any one of claims 1 to 8, wherein the target image is obtained by processing the target mouth shape image, the identity reference image, and the pose reference image using an image generation model, and the image generation model is trained by: obtaining a second target loss value based on a fourth loss value between a sample object image and a standard object image and a fifth loss value between a plurality of temporally correlated sample object images and a plurality of temporally correlated standard object images, wherein the sample object image is obtained by processing the standard mouth shape image, a sample identity reference image, and a sample pose reference image; the mouth shape in the standard object image matches the mouth shape in the standard mouth shape image; the identity information of a sample object in the standard object image matches the identity information of the sample object in the sample identity reference image; and the pose information of the sample object in the standard object image matches the pose information of the sample object in the sample pose reference image; and adjusting model parameters of an initial image generation model using the second target loss value to obtain the image generation model.

10. The method according to claim 9, further comprising: determining a third sub-loss value based on image pixels of the sample object image and image pixels of the standard object image; determining a fourth sub-loss value based on image features of the sample object image and image features of the standard object image; and obtaining the fourth loss value based on the third sub-loss value and the fourth sub-loss value.

11. The method according to any one of claims 1 to 10, further comprising: splitting speech data into a plurality of initial sub-audio segments; and obtaining the audio data by combining an initial sub-audio segment with the preceding and following initial sub-audio segments that are temporally correlated with that initial sub-audio segment.
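By way of non-limiting illustration, the splitting and context combination can be sketched as follows. The fixed segment length, the choice of one neighbouring segment on each side, the inclusion of the centre segment itself in the combination, and the helper names `split_audio` and `with_context` are assumptions for this sketch.

```python
def split_audio(samples, seg_len):
    """Split a flat list of audio samples into consecutive segments."""
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]


def with_context(segments, idx, context=1):
    """Join segment idx with up to `context` neighbours on each side."""
    lo = max(0, idx - context)                  # clamp at the start of the audio
    hi = min(len(segments), idx + context + 1)  # clamp at the end of the audio
    return [s for seg in segments[lo:hi] for s in seg]
```

With `samples = list(range(8))` and `seg_len = 2`, the segments are `[0, 1]`, `[2, 3]`, `[4, 5]`, `[6, 7]`, and `with_context(segments, 1)` returns `[0, 1, 2, 3, 4, 5]`. Supplying neighbouring audio in this way gives the mouth shape of each frame access to the sounds immediately before and after it.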

12. The method according to any one of claims 1 to 11, further comprising: masking the mouth region of the target object in a reference video frame to obtain the pose reference image, wherein the reference video frame is obtained by splitting a reference video into frames.
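By way of non-limiting illustration, masking the mouth region can be sketched on a single-channel frame stored as a list of rows. The rectangular `box` given as (top, left, bottom, right), the constant fill value, and the helper name `mask_mouth` are assumptions; how the mouth region is located (for example by a face landmark detector) is outside this sketch.

```python
def mask_mouth(frame, box, fill=0.0):
    """Return a copy of the frame with the rectangular mouth box filled."""
    top, left, bottom, right = box  # bottom and right are exclusive
    return [[fill if top <= i < bottom and left <= j < right else v
             for j, v in enumerate(row)]
            for i, row in enumerate(frame)]
```

Because the mouth pixels are removed, the masked frame in this sketch still carries head position and orientation but no longer carries the original mouth shape.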

13. An interaction method, comprising: in response to receiving interaction information input on an interactive interface, generating a feedback voice based on the interaction information; generating a feedback video based on the feedback voice; and playing the feedback video on the interactive interface, wherein video frames in the feedback video are generated by the method according to any one of claims 1 to 12.

14. An image generation apparatus, comprising: an image processing module configured to perform lip movement processing on a reference mouth shape image based on audio data to obtain a target mouth shape image, wherein the reference mouth shape image includes reference lip information, and the target mouth shape image includes lip information matching the lip movements indicated by the audio data; and an image generation module configured to process the target mouth shape image, an identity reference image, and a pose reference image to obtain a target image, wherein the identity reference image includes identity information of a target object in the target image, and the pose reference image includes pose information of the target object.

15. An interaction device, comprising: a voice generation module configured to generate, in response to receiving interaction information input on an interactive interface, a feedback voice based on the interaction information; a video generation module configured to generate a feedback video based on the feedback voice; and a video feedback module configured to play the feedback video on the interactive interface, wherein video frames in the feedback video are generated by the apparatus according to claim 14.

16. An intelligent agent, comprising: an input module configured to receive input information; a processing module configured to determine a target task based on the input information received by the input module, determine a large model based on the target task, and execute the method of any one of claims 1 to 13 by calling the large model to obtain output information; and an output module configured to output the output information obtained by the processing module.

17. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 13.

18. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 13.

19. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 13.