Face video synthesis system

The face video synthesis system achieves real-time lip-syncing and interactive facial video playback using CPU-based cloud processing and advanced facial expression techniques, addressing the limitations of conventional systems and enhancing user interaction.

JP2026100264A · Pending · Publication Date: 2026-06-19 · NOMURA RESEARCH INSTITUTE

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
NOMURA RESEARCH INSTITUTE
Filing Date
2024-12-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Conventional face video synthesis systems struggle with real-time lip-syncing to audio, requiring high-performance GPUs and failing to create realistic interactions with users.

Method used

A face video synthesis system that utilizes a cloud-based server with CPU processing, incorporating a speech recognition unit, text generation using a Large Language Model, and face video synthesis unit to create and play back synchronized facial videos in real-time without a dedicated GPU, using techniques like LivePortrait for natural facial expressions.

Benefits of technology

Enables real-time synthesis and playback of facial videos that lip-sync to audio, allowing interactive and realistic conversations with users, reducing the need for costly hardware and enhancing user engagement.


Abstract

[Problem] To synthesize and play back, in real time, a face video that is lip-synced to audio. [Solution] The system includes: a voice feature acquisition unit 80 that acquires text from a training video 4 by speech recognition and then acquires sample voice features 12 by speech synthesis; a material video creation unit 20 that creates a material video 21 by transferring the training video 4 onto an avatar's face image; a speech recognition unit 30 that acquires text from the user's speech by speech recognition; a text generation unit 40 that generates text of the avatar's response from the acquired text using LLM2; a speech synthesis unit 50 that creates a playback target voice from the response text by speech synthesis and acquires playback target voice features; and a face video synthesis unit 60 that determines the most suitable material video 21 based on predetermined conditions including the similarity between the playback target voice features and the sample voice features 12, and plays it together with the playback target voice.

Description

Technical Field

[0001] The present invention relates to a technique for synthesizing a moving image of a human face, and more particularly to a technique effective for applying to a face moving image synthesis system that synthesizes a moving image in which the movement of the mouth is synchronized (lip-sync) with the voice.

Background Art

[0002] There is known a lip-sync technique that makes it look as if the person is speaking by synthesizing a face image in which the mouth moves in accordance with the voice.

[0003] For example, in Japanese Patent Application Laid-Open No. 2003-58908 (Patent Document 1), first shape data regarding the shape of the mouth when uttering each vowel is stored for each type of vowel, and types of consonants having common points in the shape of the mouth when pronouncing are classified into the same group, and second shape data regarding the shape of the mouth when uttering the consonants classified into the group is stored for each group, and the sound of the word is separated into vowels or consonants, and for each separated vowel or consonant, control of the movement of the face image is performed based on the first shape data corresponding to the vowel or the second shape data corresponding to the group in which the consonant is classified.

Prior Art Documents

Patent Documents

[0004] Patent Document 1: Japanese Patent Application Laid-Open No. 2003-58908

Summary of the Invention

Problems to be Solved by the Invention

[0005] According to the prior art, when synthesizing a lip-synced face moving image, it is said that the amount of data regarding the shape can be reduced as compared with the conventional case, and the face moving image can be synthesized more realistically.

[0006] However, conventional technology does not anticipate the real-time synthesis and playback of facial videos that lip-sync to voice, such as actually interacting with the user using synthesized facial videos. Furthermore, real-time synthesis requires high-performance equipment such as a dedicated GPU (Graphics Processing Unit), which is costly.

[0007] Therefore, the object of the present invention is to provide a face video synthesis system that synthesizes and plays back face videos that are lip-synced to audio in real time.

[0008] The aforementioned and other objectives and novel features of the present invention will become apparent from this specification and the accompanying drawings.

Means for Solving the Problem

[0009] A brief overview of some of the representative inventions disclosed in this application is as follows:

[0010] A representative embodiment of the present invention is a face video synthesis system that synthesizes and outputs an avatar face image synchronized with the user's speech, and comprises: a speech feature acquisition unit that acquires text from a plurality of training videos in advance by speech recognition and further synthesizes the text into speech to acquire predetermined feature information related to the speech of each training video as sample speech features; and a material video creation unit that creates a material video based on a transfer video in which the facial expressions of the people in each training video have been transferred onto the avatar face image.

[0011] Furthermore, the system includes: a speech recognition unit that acquires text from the user's speech using speech recognition; a text generation unit that generates text of the avatar's response from the text acquired by the speech recognition unit using a Large Language Model (LLM); a speech synthesis unit that creates a playback target voice from the response text using speech synthesis and acquires predetermined feature information related to the playback target voice as playback target voice features; and a face video synthesis unit that determines the material video that best matches the playback target voice based on predetermined conditions including the similarity between the playback target voice features and each of the sample voice features, and plays the material video together with the playback target voice.

Effects of the Invention

[0012] The effects obtained by some of the representative inventions disclosed in this application can be briefly explained as follows:

[0013] That is, according to a representative embodiment of the present invention, it becomes possible to synthesize and play back a facial video that lip-syncs to audio in real time.

Brief Description of the Drawings

[0014]
[Figure 1] This figure outlines an example configuration of a face video synthesis system, which is Embodiment 1 of the present invention.
[Figure 2] This figure provides an overview of an example of operation in Embodiment 1 of the present invention.
[Figure 3] This flowchart outlines an example of the processing flow for the pre-preparation step in Embodiment 1 of the present invention.
[Figure 4] This figure provides an overview of an example of sample text in Embodiment 1 of the present invention.
[Figure 5] This figure provides an overview of an example of speech features in Embodiment 1 of the present invention.
[Figure 6] This figure outlines an example of a feature quantity added to the audio features in Embodiment 1 of the present invention.
[Figure 7] This is a flowchart showing an overview of an example of the processing flow of the synthesis / playback step in Embodiment 1 of the present invention.
[Figure 8] This is a diagram showing an overview of an example of the synthesis / playback processing method in Embodiment 1 of the present invention.
[Figure 9] This is a diagram showing an overview of an example of switching of material videos in Embodiment 1 of the present invention.
[Figure 10] This is a diagram showing an overview of a configuration example of a face video synthesis system according to Embodiment 2 of the present invention.
[Figure 11] This is a diagram showing an overview of an example of creating a material video in Embodiment 2 of the present invention.
[Figure 12] This is a flowchart showing an overview of an example of the processing flow of the pre-preparation step in Embodiment 2 of the present invention.
[Figure 13] This is a diagram showing an overview of an example of creating a material video from a transferred video in Embodiment 2 of the present invention.
[Figure 14] This is a diagram showing an overview of an example of creating a connecting video between emotions in Embodiment 2 of the present invention.
[Figure 15] This is a diagram showing an overview of the real-time control of the emotion of an "avatar" in Embodiment 2 of the present invention.
[Figure 16] This is a diagram showing an overview of the real-time control of the emotion of an "avatar" in Embodiment 2 of the present invention.

Embodiments for Carrying Out the Invention

[0015] Hereinafter, embodiments of the present invention will be described in detail based on the drawings. In all the drawings for explaining the embodiments, the same parts are generally denoted by the same reference numerals, and the repeated explanations thereof are omitted. On the other hand, for the parts described with reference numerals in a certain drawing, they will not be shown again in the explanations of other drawings, but may be referred to with the same reference numerals.

[0016] (Embodiment 1) <Overview> Figure 2 is a diagram illustrating an example of operation using the face video synthesis system, which is Embodiment 1 of the present invention. As shown in the figure, in this embodiment, a user can have a real-time voice conversation with a "person" displayed on a Web browser 31 on a user terminal 3 such as a PC (Personal Computer). The "person" displayed on the Web browser 31 is an "avatar" made of a face video synthesized by the face video synthesis system of this embodiment, and by lip-syncing the mouth movements to the speech uttered during the conversation, an environment is created that makes it seem as if the user is conversing with a "person".

[0017] The facial video synthesis method in the facial video synthesis system of this embodiment is divided into two steps: preparation and synthesis / playback.

[0018] In the preparation step, speech synthesis is performed on various sample texts, and their characteristic information is acquired. Then, for each synthesized speech, a video is created in advance with synchronized mouth movements using known lip-syncing techniques, and this video is used as source material.

[0019] In the synthesis and playback step, the system generates text to be output (spoken), synthesizes speech for it, searches for video footage from the pre-prepared source material that matches the synthesized speech, rearranges it, and displays and plays it together with the synthesized speech. In this embodiment, the synthesis and playback of face videos is achieved in real time by performing the series of processes in the synthesis and playback step immediately.

[0020] <System Configuration> Figure 1 is a diagram illustrating an example configuration of a face video synthesis system according to Embodiment 1 of the present invention. The face video synthesis system 1 is composed of, for example, server equipment or a virtual server built on a cloud computing service, and uses a CPU (not shown) to execute middleware such as an OS (Operating System), DBMS (Database Management System), and Web server programs, which are loaded into memory from a storage device such as an HDD (Hard Disk Drive) or SSD (Solid State Drive), as well as software running on it, thereby realizing various functions related to face video synthesis.

[0021] The face video synthesis system 1 includes, for example, a sample voice creation unit 10, a material video creation unit 20, a voice recognition unit 30, a text generation unit 40, a voice synthesis unit 50, a face video synthesis unit 60, and a synthesis processing control unit 70, all of which are implemented as software.

[0022] The sample voice creation unit 10, in the preparation step, uses LLM2, a large-scale language model such as ChatGPT (registered trademark, hereinafter the same), to create various sample texts. Furthermore, it has the function to create sample synthesized voices 11 for each sample text using a known speech synthesis or text-to-speech service or library (for example, Microsoft's Azure (registered trademark, hereinafter the same) TextToSpeech (TTS)), and to acquire sample voice features 12, which are characteristic information of each sample synthesized voice 11. The contents of the sample voice features 12 will be described later.

[0023] The material video creation unit 20 has a function in the preparation step to create material videos 21 in which the mouth movements are synchronized with each sample synthesized voice 11, based on each sample synthesized voice 11 and reference images of the face and head movements of a person who will be an "avatar," using a known lip-sync service or library (for example, the open-source software (OSS) SadTalker).

[0024] The speech recognition unit 30 has the function of acquiring voice data related to the user's speech obtained via a microphone function (not shown) on the user terminal 3 during the synthesis and playback step, and converting it into text using a known speech recognition service or library. The text generation unit 40 has the function of generating text in response to the text created by the speech recognition unit 30 using LLM2. The speech synthesis unit 50 has the function of performing speech synthesis on the response text generated by LLM2 using a known speech synthesis or text-to-speech service or library.

[0025] The face video synthesis unit 60 searches the material videos 21 created by the material video creation unit 20 for the one that best matches the voice synthesized by the voice synthesis unit 50. It then rearranges the material video according to the synthesized voice and displays and plays it on the user terminal 3 together with the synthesized voice. The synthesis processing control unit 70 provides control and user interface functions for displaying and playing the synthesized face video on the user terminal 3.

[0026] In the example shown in Figure 1, all components related to the preparation step and the synthesis / playback step are provided by a single server system. However, the system is not limited to this configuration, and may be configured with multiple server systems providing these components in a distributed manner. Furthermore, to avoid network delays between the user terminal 3 and the system, the execution functions of each component related to the synthesis / playback step, including the source video 21 and sample audio feature 12 data, may be downloaded to the user terminal 3 first, and the synthesis / playback processing may be executed in a local environment on a web browser 31 (not shown) on the user terminal 3. This makes it possible to perform real-time face video synthesis using the CPU of the user terminal 3 without requiring a dedicated GPU.

[0027] <Processing Flow (Preparation)> Figure 3 is a flowchart outlining an example of the processing flow of the pre-preparation step in Embodiment 1 of the present invention. First, the sample voice creation unit 10 creates various sample texts using LLM2 (S01). In this embodiment, sample texts are obtained by instructing LLM2 with prompts to generate texts that cover various variations in mouth movements and shapes when spoken. Specifically, for example, LLM2 is made to generate a total of about 250 texts of three different lengths.

[0028] Figure 4 is a diagram illustrating an example of sample text in Embodiment 1 of the present invention. In this embodiment, approximately 250 example texts (sentences) are created in three lengths: short (approximately 5 characters), medium (approximately 20 characters), and long (approximately 50 characters), with approximately 100 short, 100 medium, and 50 long texts. Note that the length of each type of text, the number of types, and the number of texts to be created are not limited to these and can be adjusted as appropriate. Furthermore, the length of the text does not need to strictly match the length of each type (e.g., 5 characters or 20 characters in this embodiment), but only needs to fall within a certain range for each type.

[0029] Returning to Figure 3, the approximately 250 sample texts created in step S01 are then subjected to speech synthesis using known speech synthesis or text-to-speech services or libraries to obtain sample synthesized speech 11 (S02), and sample speech features 12 are obtained for each sample synthesized speech 11 (S03). For this process, known speech synthesis or text-to-speech services or libraries such as Azure TTS can be used.

[0030] Figure 5 is a diagram illustrating an example of a sample speech feature 12 in Embodiment 1 of the present invention. In the example in Figure 5, the sample speech feature 12 includes the content of the sample text, the start and end times of the pronunciation in the sample synthesized speech 11, the duration of the pronunciation (in seconds), and IDs that identify the mouth shapes. In this embodiment, this information is stored in a JSON (JavaScript Object Notation) file, but the storage format is not limited to this.

[0031] The start and end times of pronunciation can be calculated as times relative to the beginning of the sample synthesized speech 11, for example, from the start and end timestamps (absolute times) of each word output during speech synthesis of the sample text. The duration of pronunciation is the interval between these two times.
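
The calculation described in this paragraph can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `WordStamp` type, the `relative_timing` function, and the use of seconds as the timestamp unit are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class WordStamp:
    """One word with absolute start/end times (hypothetical structure)."""
    text: str
    start: float  # absolute start time, assumed to be in seconds
    end: float    # absolute end time, assumed to be in seconds


def relative_timing(words, audio_start):
    """Return (start, end, duration) of the utterance relative to audio_start."""
    if not words:
        raise ValueError("no word timestamps")
    start = min(w.start for w in words) - audio_start
    end = max(w.end for w in words) - audio_start
    return start, end, end - start


# Made-up timestamps for a two-word utterance.
words = [WordStamp("good", 100.5, 100.75), WordStamp("morning", 100.75, 101.5)]
start, end, duration = relative_timing(words, audio_start=100.0)
```

With these made-up values the utterance starts 0.5 s into the audio and lasts 1.0 s.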

[0032] The mouth shape IDs represent the transition of a person's mouth shape when pronouncing the target text, expressed as an array of IDs. A mouth shape (viseme) is a visual representation of the mouth corresponding to a phoneme, an individual unit of sound in spoken language. Each mouth shape corresponds to one or more phonemes (different phonemes may share the same mouth shape). In the aforementioned Azure TTS, a "visemeID" can be obtained for the synthesized speech, which makes each mouth shape identifiable by its ID.

[0033] Returning to Figure 3, after acquiring sample synthesized speech 11 and sample speech features 12 related to the approximately 250 sample texts created, the material video creation unit 20 receives input or specification of reference images of a person's face to be an "avatar" and reference videos of head movements (S04), and uses a known lip-sync service or library to create a material video 21 in which the mouth movements of the reference images and reference videos are synchronized with each sample synthesized speech 11 (S05).

[0034] Then, the features of the created source video 21 are obtained and added to the JSON file of sample audio features 12 (S06), and the processing of the pre-preparation step is completed. As the features of the source video 21, for example, the coordinates of facial landmarks (important feature points for extracting facial features, such as the positions of the eyes and nose) in the source video 21 are obtained using known facial recognition technology. In this embodiment, for example, the coordinates of facial landmarks in the frame images at the start and end of the pronunciation in the source video 21 are obtained using a Python (registered trademark, hereinafter the same) library.

[0035] Figure 6 is a diagram illustrating an example of features added to the sample speech feature 12 in Embodiment 1 of the present invention. As shown in the figure, in this embodiment, for example, the coordinates of the left eye, right eye, nose, and mouth are obtained as landmarks at the start and end of pronunciation, respectively. This information is added to the JSON file of the sample speech feature 12 as described above.
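
As a rough illustration of how such a record might look once the landmark coordinates have been added, the sketch below builds one entry and serializes it to JSON. The field names (`text`, `start`, `end`, `duration`, `viseme_ids`, `landmarks`), the helper functions, and all values are hypothetical; the source only states which kinds of information are stored, not the schema.

```python
import json


def make_feature_record(text, start, end, viseme_ids):
    """Build one sample-speech-feature record (field names are hypothetical)."""
    return {
        "text": text,
        "start": start,
        "end": end,
        "duration": end - start,
        "viseme_ids": list(viseme_ids),
    }


def add_landmarks(record, at_start, at_end):
    """Return a copy of the record extended with landmark coordinates."""
    extended = dict(record)
    extended["landmarks"] = {"start": at_start, "end": at_end}
    return extended


# Made-up values: pixel coordinates at the first and last frame of the utterance.
record = make_feature_record("Good morning", 0.5, 1.5, [0, 20, 4, 21, 0])
record = add_landmarks(
    record,
    at_start={"left_eye": [210, 180], "right_eye": [300, 180],
              "nose": [255, 240], "mouth": [255, 300]},
    at_end={"left_eye": [212, 182], "right_eye": [302, 182],
            "nose": [257, 242], "mouth": [257, 303]},
)
serialized = json.dumps(record, ensure_ascii=False, indent=2)
```

Storing the landmarks for both the first and the last frame is what later allows the end of one clip to be compared with the start of the next.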

[0036] <Processing Flow (Synthesis and Playback)> Figure 7 is a flowchart outlining an example of the processing flow of the synthesis and playback step in Embodiment 1 of the present invention. First, the system receives voice input related to the user's speech via the microphone function of the user terminal 3 (S11). Then, the voice recognition unit 30 converts the acquired voice data into text using a known voice recognition service or library (S12). Subsequently, based on the created text, the text generation unit 40 generates text of the response content using LLM2 (S13).

[0037] Then, the speech synthesis unit 50 performs speech synthesis on the response text generated by LLM2 using known speech synthesis and text-to-speech services and libraries (S14), and acquires speech features (speech features for playback) for the synthesized speech (speech to be played back) in the same way as in the example in Figure 5 (S15). Based on the similarity between these speech features for playback and the sample speech features 12 related to each material video 21 acquired in the preparation step, the face video synthesis unit 60 searches for the material video 21 that best matches the speech to be played back (S16), and plays the obtained material video 21 in synchronization with the speech to be played back to play back the face video (S17). By performing this series of processes immediately, face video synthesis and playback can be performed in real time.

[0038] Furthermore, before starting processing in step S11, as described above, the execution functions of each part related to the synthesis and playback step, including the source video 21 and sample audio feature 12 data, may be downloaded in advance to a Web browser 31 (not shown) on the user terminal 3, and the synthesis and playback processing may be executed in the local environment of the user terminal 3. In this case, for example, an IndexedDB (a key-value database built on the Web browser 31) can be built on the Web browser 31, and the downloaded source video 21 data and the JSON data of the sample audio feature 12 can be stored therein, making it possible to access this data at high speed.

[0039] Figure 8 is a diagram illustrating an example of a synthesis and playback processing method in Embodiment 1 of the present invention. In this embodiment, considering real-time performance, the generation of the response text in step S13 of Figure 7 uses streaming output that outputs one character at a time. When a sentence break such as punctuation or a question mark is detected, the speech synthesis process from step S14 onwards and the acquisition of speech features in step S15 are immediately performed on the text up to that point. The subsequent processes, including the search for the source video 21 in step S16 and the playback of the face video and audio in step S17, are asynchronously processed using a FIFO (First-In-First-Out) queue for the exchange of processing results at the sentence break level, thereby increasing the response speed.
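
The streaming and queueing scheme described above can be sketched as follows. This is a simplified model under stated assumptions: the set of sentence-break characters, the function names, and the `synthesize` and `play` callbacks are hypothetical stand-ins for steps S14 to S17, and a single consumer thread stands in for the asynchronous search and playback.

```python
import queue
import threading

# Assumed set of sentence-break characters; the source names only
# "punctuation or a question mark".
SENTENCE_BREAKS = "。！？.!?"


def split_stream(chars):
    """Yield a sentence each time a break character arrives in the stream."""
    buf = []
    for ch in chars:
        buf.append(ch)
        if ch in SENTENCE_BREAKS:
            chunk = "".join(buf).strip()
            buf.clear()
            if chunk.strip(SENTENCE_BREAKS):  # skip chunks that are only breaks
                yield chunk
    tail = "".join(buf).strip()
    if tail:
        yield tail


def run_pipeline(chars, synthesize, play):
    """Synthesize sentence by sentence; play results in FIFO order on a worker."""
    fifo = queue.Queue()
    done = object()  # sentinel that tells the consumer to stop

    def consumer():
        while True:
            item = fifo.get()
            if item is done:
                break
            play(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for sentence in split_stream(chars):
        fifo.put(synthesize(sentence))  # stands in for steps S14 and S15
    fifo.put(done)
    worker.join()


played = []
run_pipeline(iter("Hello. How are you? Fine"),
             synthesize=lambda s: s.upper(), play=played.append)
```

Because the queue preserves order, playback of the first sentence can begin while later sentences are still being generated and synthesized.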

[0040] In this embodiment, when searching for the source video 21 that best matches the synthesized speech in step S16, the coordinates of the voice features and facial landmarks are used to obtain a source video 21 that connects naturally to the most recently played source video 21 and whose mouth movements also match the synthesized speech. Specifically, the similarity between the voice features (voice features to be played) of the generated synthesized speech (voice to be played) and the utterance duration and mouth shape transitions of the sample voice features 12 of each source video 21 is scored using a predetermined method, and the degree of discrepancy between the coordinates of the facial landmark in the final frame of the most recently played source video 21 and the coordinates of the facial landmark in the first frame of each source video 21 is scored using a predetermined method, and the most suitable source video 21 is determined based on the scores from these two perspectives.
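
The source leaves both scoring methods as "predetermined", so the following is only one possible sketch of the two-perspective selection. The duration ratio, the use of `difflib.SequenceMatcher` for mouth shape sequences, the mean Euclidean landmark distance, the equal weights, and the `gap_weight` parameter are all assumptions made for illustration.

```python
import math
from difflib import SequenceMatcher


def speech_similarity(target, sample):
    """Score in [0, 1] from utterance duration and mouth-shape sequence."""
    longer = max(target["duration"], sample["duration"])
    shorter = min(target["duration"], sample["duration"])
    duration_score = shorter / longer if longer > 0 else 1.0
    viseme_score = SequenceMatcher(
        None, target["viseme_ids"], sample["viseme_ids"]).ratio()
    return 0.5 * duration_score + 0.5 * viseme_score


def landmark_gap(prev_end, next_start):
    """Mean Euclidean distance between corresponding landmarks."""
    return sum(math.dist(prev_end[k], next_start[k]) for k in prev_end) / len(prev_end)


def pick_material(target, candidates, prev_end=None, gap_weight=0.01):
    """Pick the candidate with the best similarity minus landmark penalty."""
    def score(c):
        s = speech_similarity(target, c)
        if prev_end is not None:
            s -= gap_weight * landmark_gap(prev_end, c["landmarks"]["start"])
        return s
    return max(candidates, key=score)


# Made-up data: "a" and "b" match the speech equally, but "a" starts with the
# face shifted 100 px from where the previous clip ended.
base = {"left_eye": (210, 180), "right_eye": (300, 180),
        "nose": (255, 240), "mouth": (255, 300)}
shifted = {k: (x + 100, y) for k, (x, y) in base.items()}
candidates = [
    {"id": "a", "duration": 1.0, "viseme_ids": [1, 2, 3], "landmarks": {"start": shifted}},
    {"id": "b", "duration": 1.0, "viseme_ids": [1, 2, 3], "landmarks": {"start": base}},
    {"id": "c", "duration": 3.0, "viseme_ids": [7, 8], "landmarks": {"start": base}},
]
target = {"duration": 1.0, "viseme_ids": [1, 2, 3]}
best = pick_material(target, candidates, prev_end=base)
```

In this example candidate "b" is selected, because it matches the speech as well as "a" does while also continuing from the previous clip's final face position.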

[0041] Furthermore, in this embodiment, when playing the next material video 21 following the previous material video 21, the synthesis processing control unit 70, etc., inserts a pre-prepared short, silent "transition video" with the mouth closed in between. This reduces the discrepancy between the previous material video 21 and the next material video 21, allowing the facial images to be played back smoothly.

[0042] In this case, when sequentially switching between and playing source video 21 and silent video, a white screen (a screen with nothing displayed) may briefly appear during the switch, making it possible for the user to recognize that the video has been switched. In contrast, in this embodiment, two areas for playing videos are provided on the web browser 31, and these are used to switch between videos, making the breaks even less noticeable.

[0043] Figure 9 is a diagram illustrating an example of switching source video 21 in Embodiment 1 of the present invention. In the upper diagram, the web browser 31 has two areas: a display area 32 and a hidden area 33 (implemented by the hidden attribute of HTML (HyperText Markup Language)). The display area 32 indicates that source video 21_1 is currently playing. At this time, the synthesis processing control unit 70 or the like preloads source video 21_2 to be played next into the hidden area 33.

[0044] When playback of source video 21_1 finishes, as shown in the middle diagram, the synthesis processing control unit 70 or the like overlays the next source video 21_2, which has been loaded in the hidden area 33, on top of source video 21_1 in the display area 32 and starts playback. Then, as shown in the lower diagram, after playback of source video 21_2 begins, the next source video to be played, 21_3, is loaded into the hidden area 33. By repeating this series of processes, the breaks between source videos 21 and silent videos can be made less noticeable. In addition, situations in which the next source video 21 to be played cannot be prepared in time due to network delays or overload on the user terminal 3 can be avoided. Note that in the example in Figure 9, the display area 32 and the hidden area 33 are arranged vertically, but this is not the only possible arrangement.
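
The swap logic of this two-area scheme can be modeled as below. In the system described, the areas are HTML elements in the web browser 31; this sketch models only the ordering of preload and swap operations, and the class and function names are hypothetical.

```python
class DoubleBufferedPlayer:
    """Models a visible playback area plus a hidden preload area."""

    def __init__(self):
        self.visible = None  # clip currently shown to the user
        self.hidden = None   # clip preloaded out of sight

    def preload(self, clip):
        self.hidden = clip

    def swap(self):
        """Show the preloaded clip; the hidden area becomes free again."""
        if self.hidden is None:
            raise RuntimeError("nothing preloaded")
        self.visible, self.hidden = self.hidden, None
        return self.visible


def play_all(clips):
    """Play clips in order, always preloading the next one during playback."""
    player = DoubleBufferedPlayer()
    played = []
    pending = iter(clips)
    first = next(pending, None)
    if first is None:
        return played
    player.preload(first)
    while player.hidden is not None:
        played.append(player.swap())   # playback of this clip starts here
        upcoming = next(pending, None)
        if upcoming is not None:
            player.preload(upcoming)   # load the next clip while this one plays
    return played


order = play_all(["21_1", "transition", "21_2", "transition", "21_3"])
```

The point of the arrangement is that a clip is never shown until it has already been loaded, so the switch itself involves no loading delay.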

[0045] As described above, according to the first embodiment of the present invention, the facial video synthesis system 1 generates text in response to the user's utterance, synthesizes the corresponding audio (audio to be played back) and acquires its audio features, searches the material videos 21 created in advance for the video that best matches the audio to be played back based on the audio features, rearranges them, and displays and plays them back together with the audio to be played back. This makes it possible to synthesize and play back, in real time, facial videos whose mouth movements are synchronized with the audio to be played back.

[0046] In this embodiment, the text that forms the basis of the audio to be synthesized into the facial video (the audio to be played back) is generated by the text generation unit 40 using LLM2 (as a response to the user's utterance), but it is not limited to this, and any text can be used as the target.

[0047] (Embodiment 2) <Overview> In the face video synthesis system 1 of Embodiment 1 of the present invention described above, as shown in Figure 8, when playing back the synthesized face video and audio, a short, silent "transition video" is inserted between the previous source video 21 and the next source video 21 to reduce the discrepancy between the source videos 21 and to play back the face images so that they connect smoothly. However, depending on the face image in the final frame of the previous source video 21 and the face image in the first frame of the next source video 21, the discrepancy may be noticeable even when a silent video is inserted in between. Therefore, in the face video synthesis system of Embodiment 2 of the present invention, the method for creating the source videos 21 is improved to further reduce the discrepancy between preceding and succeeding source videos 21.

[0048] Furthermore, in this embodiment, in order to enable the "avatar" to express emotions and facial expressions according to the content of the conversation with the user, various emotion-specific source videos 21 are created in advance, and the system recognizes the emotions during the conversation and their changes, and searches for, selects, and plays the source video 21 corresponding to the recognized emotion. This makes it possible to control the "avatar's" emotional expressions in real time.

[0049] In this embodiment, in order to enable the creation of source videos 21 that express various emotions, the known lip-sync techniques used in Embodiment 1, such as SadTalker (techniques that synchronize the mouth movements of a face image with sound), are replaced as the model for creating the source videos 21 by known animation techniques that create videos by animating the facial expressions of still images of faces. One example is the open-source software LivePortrait, which transfers the facial movements in an input face video to a still image of another target face to generate a face video. Face videos obtained using this technique are based on the actual facial movements and expressions of people, and are therefore characterized by natural and realistic facial and mouth movements and expressions.

[0050] <System Configuration> Figure 10 is a diagram illustrating an example configuration of a face video synthesis system according to Embodiment 2 of the present invention. It is generally similar to the example configuration of the face video synthesis system 1 shown in Figure 1 of Embodiment 1 described above, and only the differences will be explained below. The face video synthesis system 1 of this embodiment does not have the sample voice creation unit 10 of the face video synthesis system 1 of Embodiment 1 shown in Figure 1, and newly includes a voice feature acquisition unit 80 and an emotion recognition unit 90. Accordingly, the functions and operations of the other units are also modified and changed as appropriate.

[0051] In this embodiment, during the preparation step, the material video creation unit 20 creates material videos 21 using known animation techniques such as LivePortrait, as described above. As will be described in detail later, a large number of facial videos of people speaking with various expressions are created and input as training videos 4, and each training video 4 is transferred to the facial image of the target "avatar" to create a large number of material videos 21.

[0052] Figure 11 is a diagram illustrating an example of creating source video in Embodiment 2 of the present invention. The left side of the figure shows multiple face videos (training videos 4) of a man talking about various things, and the right side of the figure shows face videos (source video 21) created by transferring the facial movements and expressions from the face videos onto an image of a female "avatar". In this way, a large number of source videos 21 of an "avatar" talking are created from a large number of training videos 4 of a person talking.

[0053] The voice feature acquisition unit 80 has the function of acquiring sample audio features 12 for each training video 4, similar to those shown in Figure 5 of Embodiment 1. In this embodiment, because an animation technique that creates face videos from still images, such as LivePortrait, is used as the model for creating the source video 21, the sample audio features 12 cannot be obtained by the same method as in Embodiment 1.

[0054] In this embodiment, the voice feature acquisition unit 80 uses a known speech recognition service or library (for example, Microsoft's Azure SpeechToText (STT)) to convert the speech in the teacher video 4 into text, and also acquires the start and end timestamps (absolute time) of each word. From these timestamps, it calculates the start and end of each pronunciation in seconds, relative to the beginning of the teacher video 4, and the duration between them. It also acquires mouth shape information by inputting the obtained text into Azure TTS and performing speech synthesis. In this way, sample voice features 12 similar to those in Figure 5 of Embodiment 1 can be obtained from the teacher video 4 as well.
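
The conversion from absolute word timestamps to times relative to the beginning of the teacher video can be sketched as follows. This is a minimal illustration only: the `WordTiming` record, the `to_relative_timings` function, and the `(word, start, end)` input tuples are hypothetical names and formats chosen for the example, not the output format of any particular speech recognition service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WordTiming:
    """Timing of one word, in seconds from the start of the teacher video."""
    word: str
    start_s: float
    end_s: float
    duration_s: float


def to_relative_timings(video_start, words):
    """Convert absolute (word, start, end) timestamps to relative seconds.

    video_start: datetime at which the teacher video begins (assumed known).
    words: iterable of (word, abs_start, abs_end) with datetime values.
    """
    timings = []
    for word, abs_start, abs_end in words:
        start_s = (abs_start - video_start).total_seconds()
        end_s = (abs_end - video_start).total_seconds()
        timings.append(WordTiming(word, start_s, end_s, end_s - start_s))
    return timings


# Usage with made-up timestamps.
video_start = datetime(2024, 12, 9, 10, 0, 0)
recognized = [
    ("hello", video_start + timedelta(milliseconds=500),
     video_start + timedelta(milliseconds=900)),
    ("there", video_start + timedelta(milliseconds=1000),
     video_start + timedelta(milliseconds=1400)),
]
for timing in to_relative_timings(video_start, recognized):
    print(timing)
```

The mouth shape information obtained by speech synthesis would be stored alongside these timings; it is omitted here because its format is not specified in this section.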

[0055] The emotion recognition unit 90 takes as input the text into which the speech recognition unit 30 has converted the user's speech, and uses the LLM 2 to recognize and estimate, in real time, the emotion (joy, sadness, anger, neutral, etc.) of the "avatar" acting as the conversation partner. Based on the recognized emotion of the "avatar", the face video synthesis unit 60 searches for and plays the corresponding material video 21.

[0056] <Processing Flow (Preparation)> Figure 12 is a flowchart outlining an example of the processing flow of the preparation step in Embodiment 2 of the present invention. Unlike the example in Figure 3 of Embodiment 1, a large number of videos (for example, 2,000 to 3,000 videos in total, several hundred MB in size) of an arbitrary person speaking (for example, a person in charge of the service or operation of the face video synthesis system 1, a developer, or an administrator) are first shot to create the teacher videos 4 (S11). As described later, creating the teacher videos 4 with various emotional expressions makes it possible to create material videos 21 corresponding to each emotion. It is desirable for the speaker to make slight movements, such as of the shoulders, so that it is clear that the person is speaking. In addition, material videos 21 showing nodding or back-channel responses (listening) are also created.

[0057] Next, the voice feature acquisition unit 80 transcribes each teacher video 4 into text using Azure STT or the like, obtains a timestamp for each word, and calculates the start and end times of each pronunciation in the teacher video 4 and the duration between them (S12). The obtained text is then input into Azure TTS, and speech synthesis is performed to acquire mouth shape information (S13). This information is output as the sample voice features 12.

[0058] Next, the material video creation unit 20 inputs each teacher video 4 into LivePortrait or the like to create a transfer video in which the facial movements and expressions in the teacher video 4 are transferred onto the face image of a predetermined "avatar" (S14).

[0059] As shown in the example in Figure 8 of Embodiment 1, in the face video synthesis and playback step, the material videos 21 found to be most suitable, based on the similarity to the voice features (playback target voice features) of the synthesized voice to be output (playback target voice) and on the naturalness of the connection between material videos 21, are joined and played back. At that point, a short, silent "transition video" is inserted between the preceding material video 21 and the following material video 21 to reduce the misalignment between them so that the face images connect smoothly. However, depending on the face image in the last frame of the preceding material video 21 and the face image in the first frame of the following one, the misalignment of the face may be noticeable even with a silent video inserted in between.

[0060] Therefore, in this embodiment, to further reduce the misalignment between material videos 21, the transfer video created from the teacher video 4 is edited by cutting out only the section in which speech is actually uttered and deleting the unnecessary parts before and after it (S15). Predetermined face images are then set as fixed frames at the start and end of the video (S16), and the result is output as a material video 21. Fixing the start and end of each material video 21 to the fixed frames eliminates the misalignment of the face position when material videos 21 are joined. Even in this case, in the face video synthesis and playback step, if the output voice (text) contains delimiters such as commas or question marks, a short, silent "transition video" may be inserted between the preceding and following material videos 21 to adjust the timing, as in the example in Figure 8 of Embodiment 1. When all material videos 21 have been output, the preparation step is complete.
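
One way the cut in step S15 could be located is sketched below. It assumes, purely for illustration, that the spoken section is bounded by the first word's start time and the last word's end time obtained in S12, and that the video is available as a list of frames at a known frame rate. The text does not state how the section is actually determined, and the function names are hypothetical.

```python
import math


def speech_frame_range(word_timings, fps, total_frames):
    """Return (first, last_exclusive) frame indices of the spoken section.

    word_timings: list of (start_s, end_s) pairs relative to the video start.
    fps: frames per second of the transfer video.
    total_frames: number of frames in the transfer video.
    """
    if not word_timings:
        raise ValueError("no speech found in this video")
    start_s = min(start for start, _ in word_timings)
    end_s = max(end for _, end in word_timings)
    # Round outward so that no spoken frame is lost, then clamp to the video.
    first = max(0, math.floor(start_s * fps))
    last = min(total_frames, math.ceil(end_s * fps))
    return first, last


def cut_spoken_section(frames, word_timings, fps):
    """Keep only the frames in which speech occurs (sketch of step S15)."""
    first, last = speech_frame_range(word_timings, fps, len(frames))
    return frames[first:last]


# Usage: a 4-second clip at 25 fps in which speech runs from 1.0 s to 2.0 s.
frames = list(range(100))  # stand-ins for decoded frames
cut = cut_spoken_section(frames, [(1.0, 1.5), (1.75, 2.0)], fps=25)
print(cut[0], cut[-1], len(cut))
```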

[0061] Figure 13 is a diagram illustrating an example of creating a material video 21 from a transfer video in Embodiment 2 of the present invention. The upper part of the figure shows a transfer video created from a teacher video 4 using LivePortrait or the like. The lower part shows a cut video, obtained by cutting out the section in which the speaker is actually speaking, being made into a material video 21 by setting fixed frames consisting of the same predetermined face image at its beginning and end. Frame interpolation is performed between the fixed frames and the cut video using known tools or libraries so that the fixed frames at the beginning and end connect smoothly to the cut video.
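
The assembly of a material video from a cut video and a fixed frame can be sketched as follows. The text only says that known tools or libraries perform the frame interpolation, so the linear cross-fade used here is a simple stand-in, not the method of the embodiment. Frames are modeled as flat lists of pixel values, and all names are hypothetical.

```python
def blend(frame_a, frame_b, t):
    """Linearly blend two frames; t=0 gives frame_a, t=1 gives frame_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(frame_a, frame_b)]


def interpolate(frame_a, frame_b, count):
    """Return `count` intermediate frames strictly between the two frames."""
    return [blend(frame_a, frame_b, (i + 1) / (count + 1)) for i in range(count)]


def make_material_video(cut_frames, fixed_frame, n_interp=3):
    """Wrap a cut video in fixed start/end frames with interpolated joins."""
    if not cut_frames:
        raise ValueError("cut video is empty")
    head = [fixed_frame] + interpolate(fixed_frame, cut_frames[0], n_interp)
    tail = interpolate(cut_frames[-1], fixed_frame, n_interp) + [fixed_frame]
    return head + cut_frames + tail


# Usage with tiny two-pixel "frames".
fixed = [0.0, 0.0]
cut = [[4.0, 8.0], [2.0, 2.0]]
video = make_material_video(cut, fixed)
print(len(video), video[0], video[1], video[-1])
```

Because every material video built this way starts and ends on the same fixed frame, any two of them can be joined without a jump in face position, which is the property the fixed frames are meant to provide.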

[0062] <Emotional Expression> As described above, in this embodiment, creating separate material videos 21 for various emotions makes it possible to recognize emotions and their changes during a conversation in real time, and to search for, select, and play the material video 21 corresponding to the recognized emotion. The material videos 21 for each emotion (joy, sadness, anger, neutral, ...) are created, for example, from a large number of teacher videos 4 shot for each emotion, using the method shown in the examples in Figures 12 and 13. The fixed frames at the start and end of each material video 21 are based on the facial expression for the target emotion.

[0063] When multiple material videos 21 are joined, material videos 21 representing different emotions may be joined, depending on how the emotion of the "avatar" changes during the conversation. In this embodiment, so that material videos 21 of different emotions connect smoothly, the short, silent "transition videos" inserted between the preceding and following material videos 21 are created so as to bridge the two emotions.

[0064] Figure 14 is a diagram illustrating an example of creating a video that connects emotions in Embodiment 2 of the present invention. The upper part of the figure shows a short, silent video created using LivePortrait or the like. The lower part shows a material video 21 being created by cutting a predetermined section (for example, a section obtained by dividing the video at fixed time intervals) from the silent video and setting fixed frames consisting of the face images of the emotions to be connected at its beginning and end.

[0065] In the example in Figure 14, the first fixed frame represents "joy" and the last fixed frame represents "anger", creating a transition video from "joy" to "anger". Such transition videos are created for each combination and pattern of emotional transitions. For these transition videos as well, frame interpolation is performed between the fixed frames and the silent cut video using known tools or libraries so that each emotion's fixed frame connects smoothly to the cut video.
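
The set of transition videos to prepare can be enumerated as in the sketch below. It assumes that "each combination and pattern" means one transition video per ordered pair of distinct emotions, which is an interpretation rather than something the text states. The four-emotion list is only the subset named in this section, and the function name is hypothetical.

```python
from itertools import permutations

# Emotions named in the text; the actual list may be longer.
EMOTIONS = ["joy", "sadness", "anger", "neutral"]


def transition_plan(emotions):
    """Map each ordered emotion pair to the fixed frames of its transition video."""
    return {
        (src, dst): {"first_frame": src, "last_frame": dst}
        for src, dst in permutations(emotions, 2)
    }


plan = transition_plan(EMOTIONS)
print(len(plan))               # 12 ordered pairs for 4 emotions
print(plan[("joy", "anger")])  # the Figure 14 example
```

With n emotions this yields n × (n − 1) transition videos, so the preparation cost grows quadratically with the number of emotions supported.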

[0066] Figures 15 and 16 illustrate real-time control of the emotions of the "avatar" in Embodiment 2 of the present invention. In the example in Figure 15, while the user is speaking to the "avatar" displayed in a web browser, (1) the face video synthesis system 1 receives the speech as input and the speech recognition unit 30 performs speech recognition. Then, (2) the emotion recognition unit 90 infers the emotion of the "avatar" from the recognized text. Next, (3) the face video synthesis unit 60 selects and synthesizes a material video 21 in which the "avatar" nods with the recognized emotion, and plays it in the web browser. This allows the "avatar" to listen to the user while expressing emotion.

[0067] Once the user has finished speaking, (4) the text generation unit 40 generates the response of the "avatar" from the recognized text, and (5) the speech synthesis unit 50 synthesizes speech for the response text. Then, (6) the face video synthesis unit 60 selects and synthesizes the material video 21 that matches the synthesized speech, its voice features, and the emotion, and (7) plays it in the web browser while outputting the synthesized speech. This allows the "avatar" to speak to the user while expressing emotion.
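
Step (6) can be pictured as a two-stage lookup: restrict the candidates to the recognized emotion, then take the one whose sample voice features are most similar to the playback target voice features. The sketch below is illustrative only. It represents the features as a sequence of hypothetical mouth-shape IDs and uses `difflib.SequenceMatcher` as a stand-in similarity measure, since this section does not define the actual measure. It also ignores the other selection condition mentioned in the text, the naturalness of the connection between material videos.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class MaterialVideo:
    name: str
    emotion: str
    mouth_shapes: list  # hypothetical mouth-shape IDs from the sample voice features


def similarity(shapes_a, shapes_b):
    """Stand-in similarity in [0, 1] between two mouth-shape sequences."""
    return SequenceMatcher(None, shapes_a, shapes_b).ratio()


def pick_material(target_shapes, emotion, library):
    """Choose the most similar material video among those with the given emotion."""
    candidates = [video for video in library if video.emotion == emotion]
    if not candidates:
        raise LookupError(f"no material video for emotion {emotion!r}")
    return max(candidates, key=lambda v: similarity(target_shapes, v.mouth_shapes))


library = [
    MaterialVideo("joy_001", "joy", [1, 2, 3, 4]),
    MaterialVideo("joy_002", "joy", [1, 2, 9, 9]),
    MaterialVideo("anger_001", "anger", [1, 2, 3, 4]),
]
print(pick_material([1, 2, 3, 4], "joy", library).name)    # joy_001
print(pick_material([1, 2, 3, 4], "anger", library).name)  # anger_001
```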

[0068] Figure 16 shows an example of expressing emotions by selecting and synthesizing material videos 21 in real time according to the changes in the emotion of the "avatar" estimated by the emotion recognition unit 90. In the example in Figure 16, the material videos 21 are played in the order of the thick arrows. When the user starts speaking, the emotion is initially recognized as "joy", so a material video 21 of the avatar nodding with a "joy" expression is played. When the emotion recognized by the emotion recognition unit 90 then changes from "joy" to "anger" based on the content of the user's speech, a transition video showing the change from "joy" to "anger" is played, followed by a material video 21 of the avatar nodding with an "anger" expression. Finally, when the user finishes speaking, a material video 21 of the avatar responding with an "anger" expression is synthesized and played. In this way, emotions can be expressed in real time in step with the changes in the emotion of the "avatar".
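
The playback order in the Figure 16 example can be reproduced with a small sequencing routine that inserts a transition video whenever the recognized emotion changes. This is a sketch under simplifying assumptions: each event is a hypothetical `(action, emotion)` pair, and clips are identified by plain strings rather than actual video files.

```python
def build_playlist(events):
    """Turn (action, emotion) events into an ordered list of clip labels.

    A transition clip is inserted before any clip whose emotion differs
    from that of the previously played clip.
    """
    playlist = []
    previous = None
    for action, emotion in events:
        if previous is not None and emotion != previous:
            playlist.append(f"transition:{previous}->{emotion}")
        playlist.append(f"{action}:{emotion}")
        previous = emotion
    return playlist


# The Figure 16 sequence: nod with joy, emotion changes to anger, then respond.
events = [("nod", "joy"), ("nod", "anger"), ("respond", "anger")]
print(build_playlist(events))
# ['nod:joy', 'transition:joy->anger', 'nod:anger', 'respond:anger']
```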

[0069] As described above, according to the face video synthesis system 1 of Embodiment 2 of the present invention, fixing the start and end frames of each material video 21 eliminates the misalignment of the face position at the joins between material videos 21. Furthermore, material videos 21 corresponding to various emotions are created in advance, and the system recognizes emotions and their changes during the conversation in real time and searches for, selects, and plays the material video 21 corresponding to the recognized emotion. This makes it possible to control the emotional expression of the "avatar" in real time.

[0070] The invention made by the present inventors has been described in detail above based on embodiments, but the present invention is of course not limited to those embodiments and can be modified in various ways without departing from its essence. The embodiments are described in detail to explain the present invention clearly, and the invention is not necessarily limited to configurations that include every element described. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to that of one embodiment. Parts of the configuration of each embodiment can also be added to, deleted, or replaced with other configurations.

[0071] Furthermore, each of the above configurations, functions, processing units, and processing means may be implemented in hardware, in whole or in part, for example, by designing them as integrated circuits. Alternatively, each of the above configurations, functions, and means may be implemented in software by having the processor interpret and execute programs that implement each function. Information such as programs, tables, and files that implement each function can be stored in memory, hard disks, SSDs, or other recording devices, or in recording media such as IC cards, SD cards, or DVDs.

[0072] Furthermore, the control lines and information lines shown in the diagrams above are those deemed necessary for explanation and do not necessarily represent all control lines and information lines present in an actual implementation. In practice, almost all components can be considered to be interconnected.

[Industrial Applicability]

[0073] This invention can be used in a face video synthesis system that synthesizes video in which mouth movements are synchronized to audio.

[Explanation of Symbols]

[0074] 1…Face video synthesis system, 2…LLM, 3…User terminal, 4…Teacher video, 10…Sample voice creation unit, 11…Sample synthesized voice, 12…Sample voice features, 20…Material video creation unit, 21, 21_1 to 21_3…Material videos, 30…Speech recognition unit, 31…Web browser, 32…Display area, 33…Hidden area, 40…Text generation unit, 50…Speech synthesis unit, 60…Face video synthesis unit, 70…Synthesis processing control unit, 80…Voice feature acquisition unit, 90…Emotion recognition unit

Claims

1. A face video synthesis system that synthesizes and outputs a face video of an avatar synchronized with a user's speech, the system comprising:
a voice feature acquisition unit that acquires, in advance, text from each of a plurality of teacher videos by speech recognition, performs speech synthesis on the text, and thereby acquires predetermined feature information relating to the audio of each teacher video as sample voice features;
a material video creation unit that creates material videos based on transfer videos in which the facial expressions of the person in each teacher video are transferred onto a face image of the avatar;
a speech recognition unit that acquires text from the user's speech by speech recognition;
a text generation unit that generates text of the avatar's response from the text acquired by the speech recognition unit, using a large language model (hereinafter referred to as "LLM");
a speech synthesis unit that creates a playback target voice from the response text by speech synthesis and acquires predetermined feature information relating to the playback target voice as playback target voice features; and
a face video synthesis unit that determines the material video best suited to the playback target voice based on predetermined conditions including the similarity between the playback target voice features and each of the sample voice features, and plays that material video together with the playback target voice.

2. The face video synthesis system according to claim 1, wherein the first and last frames of each material video are set to the same predetermined face image.

3. The face video synthesis system according to claim 1, wherein the processing by the speech synthesis unit and the face video synthesis unit is performed for each sentence-delimited unit in the response text.

4. The face video synthesis system according to claim 1, further comprising a synthesis processing control unit, wherein a web browser on a terminal that plays the playback target voice and the material videos has a display area in which a first material video is played and a hidden area into which a second material video to be played next is loaded, and the synthesis processing control unit, after playback of the first material video ends, overlays the second material video onto the display area and starts its playback, and loads a third material video to be played next into the hidden area.

5. The face video synthesis system according to claim 1, wherein the face video synthesis unit, when playing the material videos together with the playback target voice, inserts and plays a predetermined silent material video between a first material video and a second material video to be played next.

6. The face video synthesis system according to claim 1, further comprising an emotion recognition unit that infers the emotion of the avatar from the text acquired by the speech recognition unit, wherein the material video creation unit creates material videos for a plurality of emotions based on teacher videos for the plurality of emotions, and the face video synthesis unit determines the material video best suited to the playback target voice from among the material videos corresponding to the emotion inferred by the emotion recognition unit.

7. The face video synthesis system according to claim 6, wherein the face video synthesis unit, when the emotion inferred by the emotion recognition unit changes from a first emotion to a second emotion, inserts and plays, between the material video for the first emotion and the material video for the second emotion, a predetermined silent video in which a face image for the first emotion is set as the first frame and a face image for the second emotion is set as the last frame.