Speaker face generation method and device, electronic equipment and storage medium
By splicing and optimizing video and audio features, and combining affine transformation and keypoint loss function, high-quality and natural speaking face videos are generated, solving the problems of video clarity and lip-sync in existing technologies, and achieving more realistic speaking face synthesis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA INNOVATION TECH CO LTD
- Filing Date
- 2023-07-26
- Publication Date
- 2026-06-30
Figure CN116844215B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of audio and video technology, and in particular to a method, apparatus, electronic device, and storage medium for generating a speaking face.
Background Technology
[0002] Speaking face synthesis is a technique that combines the voice of a specific speaker with the facial image of a target person to generate a synthetic face video with the speaker's voice characteristics. This technology has been widely used in many industries, such as virtual anchors, news broadcasting, live-streaming e-commerce, and digital human avatars. Because people are extremely sensitive to asynchrony between video and audio, easily perceiving even a 0.05-second time difference, current speaking face synthesis technologies primarily focus on solving the lip-sync problem, striving for natural and realistic lip movements. Furthermore, as technology advances, demands for video clarity keep rising; generating high-quality, natural-looking speaking face videos remains a significant challenge in speaking face synthesis.
[0003] To address the aforementioned issues, existing speaking face synthesis technologies mainly fall into two categories. First, end-to-end synthesis techniques utilize deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to capture richer speech and facial expression features. However, key technologies are still needed to solve lip-syncing issues. Second, methods based on intermediate vector representations divide the entire process into two stages. First, an intermediate representation vector, such as facial landmarks or heatmaps, is predicted from audio features. Then, a generative adversarial network (GAN) uses this intermediate representation vector as conditional input to synthesize the final face. This method utilizes intermediate vectors to represent speech features, which can solve complex cross-modal modeling. However, the two-stage process inevitably results in the loss of some feature information, leading to lower facial clarity. In conclusion, speaking face synthesis technology still has significant room for improvement in terms of video clarity and the naturalness and realism of lip movements.
Summary of the Invention
[0004] To address at least one of the aforementioned technical problems, this disclosure provides a speaking face generation method, apparatus, electronic device, and storage medium.
[0005] One aspect of this disclosure provides a method for generating a speaking face, comprising: concatenating facial features corresponding to an image set in an original video and sound features corresponding to a speech frame set in audio data to obtain a concatenated feature sequence, wherein the image set includes an image to be processed and multiple reference images sequentially preceding and following it, and the speech frame set includes a target frame corresponding to the image to be processed and multiple reference speech frames sequentially preceding and following it; invoking an affine transformation module to perform deformation optimization processing on the concatenated feature sequence to generate an optimized feature sequence; and constructing a predicted face image with a lip shape corresponding to the target frame based on facial key points mapped by the optimized feature sequence.
[0006] In some implementations, before concatenating the facial features corresponding to the image set in the original video and the sound features corresponding to the speech frame set in the audio data to obtain the concatenated feature sequence, the method includes: using the full-face facial features of multiple reference images to supplement the regional facial features of the image to be processed, forming the facial features corresponding to the image set; and using the speech features of multiple reference speech frames to supplement the speech features of the target frame, forming the sound features corresponding to the speech frame set.
[0007] In some implementations, the facial features of the region are obtained by: performing lip masking to expose the region of interest of the prototype person's face in the reference image; and calling a facial feature extraction tool to extract facial features from the region of interest to obtain the facial features of the region, wherein the facial features of the region are used to characterize the personalized facial parameters of the prototype person.
[0008] In some implementations, before concatenating the facial features corresponding to the image set in the original video and the sound features corresponding to the speech frame set in the audio data to obtain the concatenated feature sequence, the method includes: extracting the image to be processed from the video data, and retrieving multiple images that are sequentially preceding the image to be processed and multiple images that are sequentially following the image to be processed, and using the multiple images as the reference images with respect to the image to be processed; and extracting the target frame from the audio data, and retrieving multiple speech frames that are sequentially preceding the target frame and multiple speech frames that are sequentially following the target frame, and using the multiple speech frames as the reference speech frames with respect to the target frame, wherein the reference images and the reference speech frames correspond to each other, and the image to be processed corresponds to the target frame.
[0009] In some implementations, the method further includes: performing high-definition processing on the image to be processed to obtain a high-definition face image.
[0010] In some implementations, after constructing a face prediction image with a lip shape corresponding to the target frame based on the facial key points mapped by the optimized feature sequence, the process includes: calling a key point loss function to compare the face prediction image with the high-definition face image, obtaining a key point loss value between the two, and optimizing the face prediction image based on the key point loss value until the face prediction image is the same as or similar to the high-definition face image, wherein the key point loss value is used to characterize the difference between the face prediction image and the high-definition face image between a preset number of facial key points.
[0011] In some implementations, the method further includes: attaching the predicted face image, which is the same as or similar to the high-definition face image, to the image to be processed in the video data.
[0012] Another aspect of this disclosure provides an apparatus for generating a speaking face, comprising: a feature splicing module for splicing facial features corresponding to an image set in an original video and sound features corresponding to a speech frame set in audio data to obtain a spliced feature sequence, wherein the image set includes an image to be processed and multiple reference images located before and after it in a temporal sequence, and the speech frame set includes a target frame corresponding to the image to be processed and multiple reference speech frames located before and after it in a temporal sequence; an optimization module for calling an affine transformation module to perform deformation optimization processing on the spliced feature sequence to generate an optimized feature sequence; and a prediction image construction module for constructing a predicted face image with a lip shape corresponding to the target frame based on facial key points mapped by the optimized feature sequence.
[0013] Another aspect of this disclosure provides an electronic device comprising: a memory storing execution instructions; and a processor executing the execution instructions stored in the memory, such that the processor performs the method for generating a speaking face as described in any of the preceding embodiments.
[0014] Another aspect of this disclosure provides a readable storage medium storing executable instructions that, when executed by a processor, are used to implement the method for generating a speaking face as described in any of the above embodiments.
Attached Figure Description
[0015] The accompanying drawings illustrate exemplary embodiments of the present disclosure and, together with the description thereof, serve to explain the principles of the present disclosure. These drawings are included to provide a further understanding of the present disclosure and are incorporated in and constitute a part of this specification.
[0016] Figure 1 is a flowchart illustrating a method for generating a speaking face according to an exemplary embodiment of this disclosure.
[0017] Figure 2 is an architectural diagram of a method for generating a speaking face according to an exemplary embodiment of this disclosure.
[0018] Figure 3 is a flowchart illustrating the facial and voice feature processing of an exemplary embodiment of this disclosure.
[0019] Figure 4 is a block diagram of a speaking face generation apparatus according to an exemplary embodiment of the present disclosure.
Detailed Implementation
[0020] The present disclosure will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the scope of the disclosure. Furthermore, it should be noted that, for ease of description, only the parts relevant to the present disclosure are shown in the accompanying drawings.
[0021] It should be noted that, where there is no conflict, the embodiments and features described in this disclosure can be combined with each other. The technical solutions of this disclosure will now be described in detail with reference to the accompanying drawings and embodiments.
[0022] Unless otherwise stated, the exemplary implementations / embodiments shown are to be understood as providing exemplary features of various details that provide ways in which the technical concepts of this disclosure can be implemented in practice. Therefore, unless otherwise stated, the features of various implementations / embodiments may be additionally combined, separated, interchanged and / or rearranged without departing from the technical concepts of this disclosure.
[0023] Cross-hatching and / or shading in the accompanying drawings is generally used to clarify the boundaries between adjacent components. Thus, unless otherwise stated, the presence or absence of cross-hatching or shading does not convey or indicate any preference or requirement for the specific material, material properties, dimensions, proportions, commonalities between the illustrated components, or any other characteristics, properties, etc., of the components. Furthermore, in the accompanying drawings, the dimensions and relative dimensions of components may be exaggerated for clarity and / or descriptive purposes. When exemplary embodiments can be implemented differently, a specific process sequence may be performed in a different order than that described. For example, two consecutively described processes may be performed substantially simultaneously or in the reverse order of their description. Furthermore, the same reference numerals denote the same components.
[0024] When a component is referred to as being "on" or "above" another component, "connected to," or "joined to" another component, the component may be directly on, directly connected to, or directly joined to the other component, or there may be intermediate components. However, when a component is referred to as being "directly on" another component, "directly connected to," or "directly joined to" another component, there are no intermediate components. Therefore, the term "connection" can refer to a physical connection, an electrical connection, etc., and may or may not have intermediate components.
[0025] The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting. As used herein, unless the context clearly indicates otherwise, the singular forms “a” and “the” are intended to include the plural forms as well. Furthermore, when the terms “comprising” and / or “including” and variations thereof are used in this specification, they indicate the presence of the stated features, integers, steps, operations, parts, components, and / or groups thereof, but do not exclude the presence or addition of one or more other features, integers, steps, operations, parts, components, and / or groups thereof. It should also be noted that, as used herein, the terms “substantially,” “about,” and other similar terms are used as terms of approximation rather than as terms of degree, and thus account for the inherent deviations in measured, calculated, and / or provided values that would be recognized by one of ordinary skill in the art.
[0026] Figure 1 is a flowchart illustrating a method for generating a speaking face according to an exemplary embodiment of this disclosure. Figure 2 is an architectural diagram of a method for generating a speaking face according to an exemplary embodiment of this disclosure. Figure 3 is a flowchart illustrating the facial and voice feature processing of an exemplary embodiment of this disclosure. The steps of the speaking face generation method S100 of this disclosure are described below with reference to Figures 1 to 3.
[0027] Step S102: Concatenate the facial features corresponding to the image set in the original video and the sound features corresponding to the speech frame set in the audio data to obtain the concatenated feature sequence.
[0028] The original video is any video that requires dubbing, featuring a fully visible human face as a prototype. This allows for the rendering of the lip area of the prototype to achieve the dubbing effect. The original video is an integration of multiple consecutive images; dubbing the original video is essentially a process of rendering the lip movements of each individual image.
[0029] The image set includes the image to be processed and multiple reference images sequentially preceding and following it. These reference images correspond to the actual facial features of the prototype in the image to be processed, as well as prior data. The image to be processed is the target image for lip-sync rendering; depending on the dubbing requirements and / or the duration of the audio data, it can be any image from the original video. The reference images are multiple images in the original video sequentially preceding and following the image to be processed. During rendering, using multiple frames before and after the image to be processed as reference images enriches the facial features obtained and makes it easier to capture the personalized lip movements and emotions of the prototype during the current speech, thus making the subsequent prediction results more realistic to the prototype's lip movements. Furthermore, it avoids the situation where individual prediction images obtained when each image in the original video is used as the image to be processed are too independent, improving the correlation between results when predicting consecutive images.
[0030] For example, using the two frames preceding the image to be processed as two reference images, and the two frames following the image to be processed as two further reference images, yields an image set consisting of four reference images and one image to be processed. In other words, the image set does not encompass all images from the original video, but rather a collection of multiple images selected before and after the image to be processed. Of course, the reference images do not need to be temporally adjacent to the image to be processed; they can be spaced at a certain distance. However, the person in the reference images should be the same prototype character as in the image to be processed.
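The frame selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_image_set`, the `stride` parameter (for reference images that are not adjacent), and the clamping of indices at the start and end of the video are hypothetical choices, since the disclosure does not say how video boundaries are handled.

```python
def build_image_set(target_index, total_frames, num_before=2, num_after=2, stride=1):
    """Return (reference_indices, target_index) for one image to be processed.

    Assumption: indices that would fall outside the video are clamped to the
    nearest valid frame. The disclosure does not specify boundary handling.
    """
    if not 0 <= target_index < total_frames:
        raise ValueError("target_index is outside the video")
    # Offsets of the frames before the target, then the frames after it.
    offsets = [-k * stride for k in range(num_before, 0, -1)]
    offsets += [k * stride for k in range(1, num_after + 1)]
    references = [min(max(target_index + off, 0), total_frames - 1) for off in offsets]
    return references, target_index
```

With the defaults, frame 10 of a 100-frame video is paired with reference frames 8, 9, 11 and 12, which matches the four-reference example in the text; passing `stride=2` selects frames 6, 8, 12 and 14 instead.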
[0031] Audio data refers to the sound signals that correspond to the virtual character video. It contains certain semantic information, personalized voice parameters, and emotional attitudes, and different audio data are usually represented by different lip movements. Typically, when outputting plain audio data, it is difficult to provide the other party with a visual counterpart; without lip movements as a reference, it also increases the difficulty for the other party to understand the meaning. Therefore, it is necessary to dub the audio data into the original video, using the lip movements of the prototype character in the original video to achieve the audio-to-video conversion.
[0032] Audio data is usually a collection of multiple speech frames. We can use the speech frame that corresponds to the image to be processed in time sequence as the target frame. As the image to be processed changes, any speech frame in the audio data can be used as the target frame.
[0033] A speech frame set is a collection of multiple speech frames extracted from audio data. It includes the target frame corresponding to the image to be processed, and multiple reference speech frames that are sequentially located before and after the target frame. The target frame corresponds temporally to the image to be processed; the reference speech frames preceding the target frame correspond temporally to the reference images preceding the image to be processed; and the reference speech frames following the target frame correspond temporally to the reference images following the image to be processed. When extracting speech features from the target frame, it is necessary to combine the speech features of the other reference speech frames in the preceding and following contexts to form speech features that can relate to the context and represent complete sound information.
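One way to picture the temporal correspondence between images and speech frames is to cut the audio into one slice per video frame. The sketch below assumes a 25 fps video and 16 kHz mono audio held as a flat sequence of samples; these values, and the helper names `audio_span_for_frame` and `build_speech_frame_set`, are illustrative assumptions and are not taken from the disclosure.

```python
def audio_span_for_frame(frame_index, fps=25, sample_rate=16000):
    """Return the half-open [start, end) sample range covering one video frame."""
    start = round(frame_index * sample_rate / fps)
    end = round((frame_index + 1) * sample_rate / fps)
    return start, end


def build_speech_frame_set(audio, image_indices, fps=25, sample_rate=16000):
    """Slice one speech frame per image index, keeping the same order.

    `image_indices` lists the reference images and the image to be processed
    in temporal order, so the returned speech frames line up with them
    one-to-one.
    """
    frames = []
    for index in image_indices:
        start, end = audio_span_for_frame(index, fps, sample_rate)
        frames.append(audio[start:end])
    return frames
```

For an image set built from frames 8 to 12, the third slice is the target frame and the remaining four are the reference speech frames.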
[0034] The concatenated feature sequence is the result of concatenating facial features from an image set with audio features from a speech frame set, one-to-one. The concatenated feature sequence includes multiple sequentially arranged features derived from the concatenation of facial and audio features. This sequence can characterize the key facial points of the prototype, providing data support for obtaining face prediction images.
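A minimal sketch of the one-to-one concatenation follows, assuming each facial feature and each sound feature is already a flat list of numbers. The function name `concatenate_features` is hypothetical, and a real model would most likely concatenate tensors along a channel dimension rather than Python lists.

```python
def concatenate_features(face_features, sound_features):
    """Pair each facial feature with the sound feature at the same time step.

    Both inputs are sequences ordered in time; the result is a sequence of
    the same length whose items are face values followed by sound values.
    """
    if len(face_features) != len(sound_features):
        raise ValueError("image set and speech frame set must have equal length")
    return [list(face) + list(sound) for face, sound in zip(face_features, sound_features)]
```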
[0035] Step S104: Call the affine transformation module to perform deformation optimization processing on the spliced feature sequence to generate an optimized feature sequence.
[0036] The affine transformation module is the execution unit for affine transformations. Affine transformations are two-dimensional geometric transformations that can change the position, orientation, and size of points on a plane through linear transformations and translations. Affine transformations mainly consist of several basic operations: rotation, translation, scaling, and shearing. These basic operations can be combined and represented using matrix multiplication. The affine transformation module can optimize the spliced feature sequences to solve the deformation problem of the predicted image formed based on the spliced feature sequences, making the predicted image more realistic.
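The basic operations named above can be written as 3x3 homogeneous matrices and combined by matrix multiplication. The sketch below only illustrates this general definition on 2D points; it does not reproduce how the affine transformation module acts on the spliced feature sequence, which the disclosure does not detail. All function names are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def translation(tx, ty):
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]


def scaling(sx, sy):
    return [[sx, 0, 0], [0, sy, 0], [0, 0, 1]]


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def shearing(kx):
    """Horizontal shear: x is shifted in proportion to y."""
    return [[1, kx, 0], [0, 1, 0], [0, 0, 1]]


def apply_affine(m, point):
    """Apply a 3x3 homogeneous affine matrix to a 2D point (x, y)."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])
```

Because the rightmost matrix acts first, `matmul(translation(1, 2), scaling(2, 2))` scales a point and then translates it, so the point (1, 1) maps to (3, 4).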
[0037] An optimized feature sequence is a set of optimized features arranged in sequence. The optimized features are the result of optimizing the spliced features. Each contains a set of facial and voice features, which can represent the key facial points of the prototype and provide data support for obtaining face prediction images.
[0038] Step S106: Based on the facial key points mapped by the optimized feature sequence, construct a face prediction image with a lip shape corresponding to the target frame.
[0039] Facial keypoints are two-dimensional coordinate points extracted from the image to be processed and multiple related reference images, used to represent the positional information of each facial feature. Since the lip region in the image to be processed is covered by a mask, the input facial features do not include lip region features of the image to be processed. In other words, facial keypoints of the lip region need to be predicted based on the optimized feature sequence.
[0040] Lip shape is a movement of the lip area constructed based on facial key points. Different target frames will correspond to different lip shapes based on semantics, personalized sound parameters and emotions.
[0041] The face prediction image fills in the masked lip region. That is, the lip movement matching the corresponding target frame is rendered, and the resulting lip prediction is attached to the lip region of the prototype person in the image to be processed, so that the prototype person in the resulting face prediction image has a lip region with the corresponding mouth shape.
[0042] In some implementations, before step S102, the method further includes: calling the full-face facial features of multiple reference images to supplement the regional facial features of the image to be processed, forming facial features corresponding to the image set; and calling the speech features of multiple reference speech frames to supplement the speech features of the target frame, forming sound features corresponding to the speech frame set.
[0043] Full-face facial features are the results of face detection and keypoint detection on the full-face region of each reference image. The corresponding features include keypoint features of the prototype's full-face region, with each keypoint containing its positional data. Additionally, full-face facial features also include personalized facial parameters of the prototype's full-face region, such as facial geometric features and color features. Using the full-face facial features of each reference image as prior data for subsequent steps in predicting lip region features in the image to be processed improves the detail of the lip region feature prediction results and the correlation between preceding and subsequent images.
[0044] Regional facial features refer to the facial regions other than the lips in the image to be processed. Since this disclosure mainly focuses on lip shape prediction in the lip region of the image to be processed, the lip region needs to be masked during data preprocessing. For example, the pixel values of the lip region in the image to be processed are adjusted to 0 to obscure its original lip features. Regional facial features are the feature recognition results of the remaining facial regions after the lip region is obscured. They include other personalized facial parameters of the prototype person other than the lip region, such as facial geometric features, pose features, and color features, providing data support for the subsequent natural and realistic stitching of the predicted lip region and non-lip region.
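The masking step can be illustrated with a small sketch that sets the pixels of a rectangular lip region to 0. The rectangular box and the name `mask_lip_region` are assumptions made for illustration; the disclosure does not state how the lip region is located or what shape the mask takes.

```python
def mask_lip_region(image, box):
    """Return a copy of `image` with the lip box zeroed out.

    image: list of rows, each a list of pixel values.
    box:   (top, left, bottom, right), half-open, in pixel coordinates.
           Assumed to come from a separate lip or face detector.
    """
    top, left, bottom, right = box
    masked = [row[:] for row in image]  # copy so the input stays untouched
    for r in range(top, bottom):
        for c in range(left, right):
            masked[r][c] = 0
    return masked
```

The remaining non-zero pixels are the region from which the regional facial features would then be extracted.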
[0045] Specifically, the method for obtaining regional facial features is as follows: performing lip masking to expose the region of interest of the prototype person's face in the reference image; and calling a facial feature extraction tool to extract facial features from the region of interest to obtain regional facial features, wherein the regional facial features are used to characterize the personalized facial parameters of the prototype person.
[0046] In addition, the speech features corresponding to the target frame and the reference speech frame are all deep speech features. That is, each speech frame of the audio data is processed by an automatic speech recognition system based on deep learning. First, the system converts each speech frame into a corresponding text representation, and then extracts the acoustic feature representation corresponding to the text representation. Based on this, each speech frame can be converted into speech features that are easy to train and recognize.
[0047] In some embodiments, before step S102, the method further includes: extracting the image to be processed from the video data, retrieving multiple images that are sequentially preceding the image to be processed and multiple images that are sequentially following the image to be processed, and using the multiple images as reference images about the image to be processed; and extracting the target frame from the audio data, retrieving multiple audio frames that are sequentially preceding the target frame and multiple audio frames that are sequentially following the target frame, and using the multiple audio frames as reference audio frames about the target frame.
[0048] Typically, two frames preceding the image to be processed and two frames following it are used as reference images to ensure continuity between the images. Of course, the reference images do not need to be sequentially adjacent to the image to be processed, and can be set to other numbers as needed; there are no restrictions here.
[0049] Similarly, based on the acquisition results of the image set, corresponding time sequences and corresponding numbers of speech frames are extracted from the audio data, where the reference image and the reference speech frame correspond to each other, and the image to be processed and the target frame correspond to each other.
[0050] In some implementations, before performing steps S102 to S106, publicly available datasets or videos of public speeches can be used as training sample videos to train the speaking face generation model. The speaking face generation model is at least a neural network model used to perform the aforementioned steps S102 to S106.
[0051] In some embodiments, the method S100 for generating a speaking face further includes: performing high-definition processing on the image to be processed to obtain a high-definition face image.
[0052] Because the image to be processed, as extracted from the original video, is not sufficiently clear, high-definition processing is applied to it, and the result is used as the reference object for the face prediction image. Training the model to fit its predictions to this reference object improves the clarity of the predictions produced by the trained speaking face generation model.
[0053] Further, after step S106, the process includes: calling a keypoint loss function to compare the face prediction image with the high-definition face image, obtaining the keypoint loss value between the two, and optimizing the face prediction image based on the keypoint loss value until the face prediction image is the same as or similar to the high-definition face image, wherein the keypoint loss value is used to characterize the difference between a preset number of facial keypoints between the face prediction image and the high-definition face image.
[0054] The keypoint loss function is used to check the deviation of key data such as the perceptual loss value and the keypoint loss value. The keypoint loss value includes deviation constraints on multiple keypoints extracted from the face (e.g., 68 keypoints), in particular on the keypoints in the lip region (e.g., keypoints 49 to 68).
[0055] When the key point loss value between the predicted face image and the high-definition face image is less than or equal to the preset deviation threshold, it proves that the two images show the same or similar faces. This means that the clarity and realism of the predicted face image have met the requirements, and thus the training of the speaking face generation model is complete.
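A keypoint loss with extra emphasis on the lip region, together with the threshold check, might look like the sketch below. The weighted mean squared distance, the `lip_weight` value, and the function names are assumptions; the disclosure does not give the exact formula, and the perceptual loss term it mentions is omitted here. Keypoints 49 to 68 in one-based numbering correspond to zero-based indices 48 to 67.

```python
def keypoint_loss(predicted, reference, lip_indices=range(48, 68), lip_weight=2.0):
    """Weighted mean squared distance between corresponding 2D keypoints.

    predicted, reference: equal-length lists of (x, y) keypoints.
    lip_indices:          zero-based indices weighted by `lip_weight`
                          (assumed value); all other keypoints weigh 1.0.
    """
    if len(predicted) != len(reference):
        raise ValueError("keypoint lists must have the same length")
    lip = set(lip_indices)
    total = 0.0
    weight_sum = 0.0
    for i, ((px, py), (rx, ry)) in enumerate(zip(predicted, reference)):
        weight = lip_weight if i in lip else 1.0
        total += weight * ((px - rx) ** 2 + (py - ry) ** 2)
        weight_sum += weight
    return total / weight_sum


def training_complete(loss_value, deviation_threshold):
    """True once the loss is less than or equal to the preset threshold."""
    return loss_value <= deviation_threshold
```

Under this weighting, the same displacement costs twice as much on a lip keypoint as on any other keypoint, which is one simple way to express the stronger constraint on the lip region.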
[0056] In some implementations, the method further includes: attaching a predicted face image that is the same as or similar to the high-definition face image to the image to be processed in the video data.
[0057] Furthermore, the predicted face images, which are based on each frame of the original video as the images to be processed, are arranged in chronological order to obtain an image sequence. Then, each predicted face image in the image sequence is sequentially pasted onto the corresponding position in the original video so that the prototype character in the original video performs the speaking action of the audio data, thereby obtaining a dubbed video.
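The paste-back step can be sketched as follows, assuming each prediction is a rectangular patch with a known top-left position that fits inside its frame. The tuple layout `(frame_index, patch, top, left)` and the function names are illustrative assumptions; a production pipeline would more likely blend the patch edges than overwrite pixels directly.

```python
def paste_patch(frame, patch, top, left):
    """Return a copy of `frame` with `patch` written at (top, left).

    Assumption: the patch lies fully inside the frame.
    """
    out = [row[:] for row in frame]
    for r, patch_row in enumerate(patch):
        out[top + r][left:left + len(patch_row)] = patch_row
    return out


def dub_video(frames, predictions):
    """Paste each predicted patch onto its frame, in chronological order.

    frames:      list of frames, each a list of pixel rows.
    predictions: list of (frame_index, patch, top, left) tuples.
    """
    out = [[row[:] for row in frame] for frame in frames]
    for frame_index, patch, top, left in sorted(predictions, key=lambda p: p[0]):
        out[frame_index] = paste_patch(out[frame_index], patch, top, left)
    return out
```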
[0058] In some implementations, the process also includes feature encoding of each feature in the full-face facial features, regional facial features, voice features, and optimized feature sequences, in order to convert each feature into a low-dimensional or simplified feature representation, thereby extracting the effective data of these features and reducing computational costs.
[0059] The proposed method for generating speaking faces uses the high-resolution processed image as a reference during model training, resulting in high-fidelity and high-definition face prediction images, thus enhancing the viewing experience. Furthermore, it extracts image features by using multiple frames preceding and following the image to be processed as reference images, and extracts sound features by using multiple speech frames preceding and following the target frame as reference speech frames. This ensures the continuity between the face prediction image and the preceding and following scenes, improving the richness of facial feature acquisition and further guaranteeing the realism and naturalness of the prediction results.
[0060] Figure 4 is a block diagram of a speaking face generation apparatus according to an exemplary embodiment of the present disclosure.
[0061] As shown in Figure 4, this disclosure proposes a speaking face generation device 1000, comprising: a feature splicing module 1002, used to splice the face features corresponding to the image set in the original video and the sound features corresponding to the speech frame set in the audio data to obtain a spliced feature sequence, wherein the image set includes the image to be processed and multiple reference images located before and after it in time sequence, and the speech frame set includes the target frame corresponding to the image to be processed and multiple reference speech frames located before and after it in time sequence; an optimization module 1004, used to call an affine transformation module to perform deformation optimization processing on the spliced feature sequence to generate an optimized feature sequence; and a prediction image construction module 1006, used to construct a face prediction image with a lip shape corresponding to the target frame based on the facial key points mapped by the optimized feature sequence.
[0062] The modules of the speaking face generation device 1000 are provided to execute the corresponding steps of the speaking face generation method. For their operating principles and steps, refer to the description above; they are not repeated here.
[0063] The apparatus may include corresponding modules that perform one or more steps in the flowchart above. Therefore, each step or several steps in the flowchart above may be performed by a corresponding module, and the apparatus may include one or more of these modules. A module may be one or more hardware modules specifically configured to perform a corresponding step, or implemented by a processor 1300 configured to perform a corresponding step, or stored in a computer-readable medium for implementation by the processor 1300, or implemented by some combination thereof.
[0064] The hardware structure can be implemented using a bus architecture. The bus architecture can include any number of interconnecting buses 1100 and bridges, depending on the specific application and overall design constraints of the hardware. Bus 1100 connects together various circuits, including one or more processors 1300, memory 1300, and/or hardware modules. Bus 1100 can also connect various other circuits 1400 such as peripherals, voltage regulators, power management circuitry, and external antennas.
[0065] Bus 1100 can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, etc. Bus 1100 can be categorized as an address bus, data bus, control bus, etc. For ease of representation, only one connection line is used in the figure, but this does not imply that there is only one bus or one type of bus 1100.
[0066] Any process or method description in the flowcharts or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing a particular logical function or process, and the scope of preferred embodiments of this disclosure includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which embodiments of this disclosure pertain. Processor 1300 performs the various methods and processes described above. For example, the method embodiments of this disclosure may be implemented as software programs tangibly contained in a machine-readable medium, such as memory 1300. In some embodiments, part or all of the software program may be loaded and/or installed via memory 1300 and/or a communication interface. When the software program is loaded into memory 1300 and executed by processor 1300, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, processor 1300 may be configured to perform one of the methods described above by any other suitable means (e.g., by means of firmware).
[0067] The logic and/or steps represented in the flowchart or otherwise described herein may be embodied in any readable storage medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including processor 1300, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them).
[0068] The speaking face generation device of the present disclosure uses the high-definition processing result of the image to be processed as the reference object for the model training process, so that the face prediction image generated by the model has high fidelity and high definition, which improves the viewing experience of the prediction results. Furthermore, it extracts image features by using multiple frames before and after the image to be processed as reference images, and extracts sound features by using multiple speech frames before and after the target frame as reference speech frames. This ensures continuity between the face prediction image and the preceding and following scenes, enriches the facial features that are acquired, and further ensures the realism and naturalness of the prediction results.
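The comparison against the high-definition reference can be illustrated with a key point loss, which the claims describe as characterizing the difference between the face prediction image and the high-definition face image at a preset number of facial key points. The sketch below assumes that this loss is the mean Euclidean distance between corresponding 2D key points. The function name keypoint_loss and that distance measure are illustrative assumptions, since the disclosure does not give the formula.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def keypoint_loss(predicted: Sequence[Point], reference: Sequence[Point]) -> float:
    """Mean Euclidean distance between corresponding facial key points.

    `predicted` holds key points of the face prediction image and `reference`
    holds key points of the high-definition face image. The averaging and the
    distance measure are assumptions made for this sketch.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("both inputs need the same non-zero number of key points")
    total = sum(
        math.hypot(px - rx, py - ry)
        for (px, py), (rx, ry) in zip(predicted, reference)
    )
    return total / len(predicted)
```

A value of zero means the two sets of key points coincide. A training loop would reduce this value until the prediction is the same as or similar to the reference.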
[0069] For the purposes of this specification, a "readable storage medium" can be any apparatus capable of containing, storing, communicating, propagating, or transmitting a program for use by or in conjunction with an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires (electronic device), a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Furthermore, a readable storage medium can even be paper or another suitable medium on which a program is printed, since the program can be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing it as necessary, and then stored in memory.
[0070] It should be understood that various parts of this disclosure can be implemented in hardware, software, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0071] Those skilled in the art will understand that all or part of the steps of the methods described above can be implemented by a program instructing related hardware, and the program can be stored in a readable storage medium. When executed, the program performs one or a combination of the steps of the method embodiments.
[0072] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into a single processing module, or each unit can exist physically separately, or two or more units can be integrated into a single module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a readable storage medium. The storage medium can be a read-only memory, a disk, or an optical disk, etc.
[0073] Those skilled in the art should understand that the above embodiments are merely for illustrating the present disclosure and are not intended to limit the scope of the disclosure. Those skilled in the art can make other changes or modifications based on the above disclosure, and these changes or modifications still fall within the scope of the present disclosure.
Claims
1. A method for generating a speaking face, characterized in that it comprises:
extracting the image to be processed from the video data, retrieving multiple images that sequentially precede the image to be processed and multiple images that sequentially follow the image to be processed, and using the multiple images as reference images for the image to be processed;
extracting the target frame from the audio data, retrieving multiple speech frames that sequentially precede the target frame and multiple speech frames that sequentially follow the target frame, and using the multiple speech frames as reference speech frames for the target frame, wherein the reference image corresponds to the reference speech frame, and the image to be processed corresponds to the target frame;
calling the full-face facial features of the multiple reference images to supplement the regional facial features of the image to be processed, forming facial features corresponding to the image set in the original video, wherein the full-face facial features include key point features of the full-face region of the prototype person and personalized facial parameters of the full-face region of the prototype person, the personalized facial parameters include facial geometric features and color features, and the image set includes the image to be processed and multiple reference images located before and after it in chronological order;
calling the speech features of the multiple reference speech frames to supplement the speech features of the target frame, forming sound features corresponding to the speech frame set;
splicing the facial features and the sound features to obtain a spliced feature sequence, wherein the speech frame set includes the target frame corresponding to the image to be processed and multiple reference speech frames located before and after it in chronological order;
calling the affine transformation module to perform deformation optimization processing on the spliced feature sequence to generate an optimized feature sequence; and
constructing, based on the facial key points mapped by the optimized feature sequence, a face prediction image with a lip shape corresponding to the target frame.
2. The method for generating a speaking face according to claim 1, characterized in that the regional facial features are obtained by: applying lip masking to expose the region of interest on the face of the prototype person in the reference image; and calling a facial feature extraction tool to extract facial features from the region of interest on the face to obtain the regional facial features, wherein the regional facial features are used to characterize the personalized facial parameters of the prototype person.
3. The method for generating a speaking face according to claim 1, characterized in that it further comprises: performing high-definition processing on the image to be processed to obtain a high-definition face image.
4. The method for generating a speaking face according to claim 3, characterized in that, after constructing a face prediction image with a lip shape corresponding to the target frame based on the facial key points mapped by the optimized feature sequence, the method further comprises: calling the key point loss function to compare the face prediction image with the high-definition face image to obtain the key point loss value between the two; and optimizing the face prediction image based on the key point loss value until the face prediction image is the same as or similar to the high-definition face image, wherein the key point loss value is used to characterize the difference between the face prediction image and the high-definition face image at a preset number of facial key points.
5. The method for generating a speaking face according to claim 4, characterized in that it further comprises: overlaying the face prediction image that is the same as or similar to the high-definition face image onto the image to be processed in the video data.
6. A device for generating a speaking face, characterized in that it comprises:
a feature splicing module, used to extract the image to be processed from the video data, retrieve multiple images that sequentially precede the image to be processed and multiple images that sequentially follow the image to be processed, and use the multiple images as reference images for the image to be processed;
the feature splicing module being further used to extract the target frame from the audio data, retrieve multiple speech frames that sequentially precede the target frame and multiple speech frames that sequentially follow the target frame, and use the multiple speech frames as reference speech frames for the target frame, wherein the reference image corresponds to the reference speech frame, and the image to be processed corresponds to the target frame;
the feature splicing module being further used to call the full-face facial features of the multiple reference images to supplement the regional facial features of the image to be processed, forming facial features corresponding to the image set in the original video, wherein the full-face facial features include key point features of the full-face region of the prototype person and personalized facial parameters of the full-face region of the prototype person, the personalized facial parameters include facial geometric features and color features, and the image set includes the image to be processed and multiple reference images located before and after it in chronological order;
the feature splicing module being further used to call the speech features of the multiple reference speech frames to supplement the speech features of the target frame, forming sound features corresponding to the speech frame set;
the feature splicing module being further used to splice the facial features and the sound features to obtain a spliced feature sequence, wherein the speech frame set includes the target frame corresponding to the image to be processed and multiple reference speech frames located before and after it in chronological order;
an optimization module, used to call the affine transformation module to perform deformation optimization processing on the spliced feature sequence to generate an optimized feature sequence; and
a prediction image construction module, used to construct a face prediction image with a lip shape corresponding to the target frame based on the facial key points mapped by the optimized feature sequence.
7. An electronic device, characterized in that it comprises: a memory storing execution instructions; and a processor that executes the execution instructions stored in the memory, causing the processor to perform the method for generating a speaking face according to any one of claims 1 to 5.
8. A readable storage medium, characterized in that the readable storage medium stores execution instructions which, when executed by a processor, implement the method for generating a speaking face according to any one of claims 1 to 5.