Conversion program, conversion device, conversion system, and conversion method

The conversion system addresses the challenge of translating speech in videos by using a processor-based system with machine learning to maintain tone and duration, ensuring accurate and natural speech translation.

JP7883812B1 · Active · Publication Date: 2026-07-02 · TITAN INTELLIGENCE CO LTD

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Patents
Current Assignee / Owner
TITAN INTELLIGENCE CO LTD
Filing Date
2026-01-15
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Existing technologies fail to effectively translate and maintain the tone of voice when converting speech from one language to another in videos, leading to inaccuracies and a loss of natural speech characteristics.

Method used

A conversion system that includes a processor with speech voice acquisition, first language utterance text acquisition, second language utterance text translation, and second language utterance voice output functions, utilizing machine learning models to maintain the tone and length of the original speech while translating between languages.

Benefits of technology

The system accurately translates speech while preserving the vocal tone and duration, ensuring a natural impression in the translated video, reducing the risk of mistranslation and maintaining the original voice characteristics.



Abstract

[Problem] To generate a video in which a speaker's utterance is translated from a first language to a second language. [Solution] The conversion program causes a processor to implement: a speech voice acquisition function that acquires speech voice containing the speaker's utterances in the first language; a first language utterance text acquisition function that acquires, based on the speech voice, first language utterance text, which is a text representation of the speaker's utterances in the first language; a second language utterance text translation function that translates the first language utterance text into second language utterance text; and a second language utterance voice output function that outputs speech voice containing the translated utterances based on the second language utterance text.

Description

Technical Field

[0001] At least one embodiment of the present invention relates to a conversion program, a conversion device, a conversion system, and a conversion method.

Background Art

[0002] Patent Document 1 describes a machine translation method. The machine translation method generates a first translated document by translating a source document using a neural network, determines a correction target phrase from the phrases included in the source document based on the analysis result of the first translated document, replaces the correction target phrase with a high-frequency phrase in the training data used for training the neural network to correct the source document, and generates a second translated document by translating the corrected source document using a neural network.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Speakers appear in videos such as animations, live-action movies, or posted videos. It would be beneficial to be able to generate a video in which the speech of the speaker, such as the dialogue, is translated from the first language to the second language, as it would enable the distribution of videos in other languages.

[0005] The objective of at least one embodiment of the present invention is to solve the above problems and provide a conversion program, a conversion device, a conversion system, and a conversion method that can generate a video in which the speech of the speaker is translated from the first language to the second language.

Means for Solving the Problems

[0006] In a non-limiting view, a conversion program according to one embodiment of the present invention provides a processor with the following functions: a speech voice acquisition function for acquiring speech voice containing the speaker's utterances in a first language; a first language utterance text acquisition function for acquiring first language utterance text, which is a text representation of the speaker's utterances in a first language, based on the speech voice; a second language utterance text translation function for translating the first language utterance text into second language utterance text; and a second language utterance voice output function for outputting speech voice containing the translated utterances based on the second language utterance text.

[0007] In a non-limiting view, a conversion device according to one embodiment of the present invention is a conversion device equipped with a processor, comprising: a speech voice acquisition function for acquiring speech voice including speech of a speaker in a first language; a first language speech text acquisition function for acquiring first language speech text, which is a text representation of the speaker's speech in a first language, based on the speech voice; a second language speech text translation function for translating the first language speech text into speech text in a second language; and a second language speech voice output function for outputting speech voice including the translated speech based on the second language speech text.

[0008] In a non-limiting view, a conversion system according to one embodiment of the present invention is a conversion system comprising at least a processor, wherein the processor is equipped with: a speech voice acquisition function for acquiring speech voice including a speaker's utterance in a first language; a first language utterance text acquisition function for acquiring first language utterance text, which is a text representation of the speaker's utterance in a first language based on the speech voice; a second language utterance text translation function for translating the first language utterance text into speech text in a second language; and a second language utterance voice output function for outputting speech voice including the translated utterance based on the second language utterance text.

[0009] In a non-limiting view, a conversion method according to one embodiment of the present invention is a conversion method using a device having a processor, comprising: a speech voice acquisition step of acquiring speech voice including speech of a speaker in a first language; a first language speech text acquisition step of acquiring first language speech text, which is a text representation of the speaker's speech in a first language based on the speech voice; a second language speech text translation step of translating the first language speech text into speech text in a second language; and a second language speech voice output step of outputting speech voice including the translated speech based on the second language speech text.

Effects of the Invention

[0010] Each embodiment of the present invention resolves one or more of the shortcomings.

Brief Description of the Drawings

[0011]
[Figure 1] A block diagram showing an example configuration of a conversion system corresponding to at least one embodiment of the present invention.
[Figure 2] A block diagram showing the configuration of a server corresponding to at least one embodiment of the present invention.
[Figure 3] A flowchart showing an example of the processing of a conversion system corresponding to at least one embodiment of the present invention.

Modes for Carrying Out the Invention

[0012] Hereinafter, examples of embodiments of the present invention will be described with reference to the drawings. The various components in each embodiment described below can be combined as appropriate, provided that no inconsistencies arise. Furthermore, some aspects described in one embodiment may be omitted in other embodiments. Also, operations and processes unrelated to the characteristic features of each embodiment may be omitted. Moreover, the order of the various processes constituting the various flows and sequences described below is not necessarily in any particular order, provided that no inconsistencies arise in the processing content.

[0013] The following explanation uses, as an example, a conversion program executed on a server, which is an example of a computer. However, the computer may be another device, such as a user terminal.

[0014] Figure 1 is a block diagram showing an example configuration of a conversion system corresponding to at least one embodiment of the present invention. The conversion system 1 comprises a server 10 and user terminals 20 used by users of the conversion system 1. User terminals 20A, 20B, and 20C are examples of user terminals 20. The configuration of the conversion system 1 is not limited thereto. For example, the conversion system 1 may be configured so that a single user terminal is used by multiple users. The conversion system 1 may also comprise multiple servers.

[0015] Server 10 and user terminal 20 are examples of computers. Server 10 and user terminal 20 are each connected to a communication network 30, such as the Internet, in a way that allows communication. The connection between the communication network 30 and server 10, and the connection between the communication network 30 and user terminal 20, may be a wired or wireless connection. For example, user terminal 20 may connect to the communication network 30 by performing data communication via a wireless communication line with a base station managed by a telecommunications carrier.

[0016] The conversion system 1, comprising a server 10 and a user terminal 20, realizes various functions for executing various processes in response to user operations.

[0017] Server 10 comprises a processor 11, memory 12, and storage device 13. The processor 11 is, for example, a central processing unit such as a CPU (Central Processing Unit) that performs various calculations and controls. If Server 10 is equipped with a GPU (Graphics Processing Unit), some of the various calculations and controls may be performed by the GPU. Server 10 uses the data read into memory 12 to perform various information processing with the processor 11 and stores the obtained processing results in storage device 13 as needed.

[0018] The storage device 13 has a function as a storage medium for storing various types of information. The configuration of the storage device 13 is not particularly limited, but from the perspective of reducing the processing load on the user terminal 20, it may have a configuration capable of storing all the various types of information necessary for the control performed by the conversion system 1. Examples of such include HDDs and SSDs. However, the storage device for storing various types of information only needs to have a storage area in a state accessible by the server 10, and for example, it may have a configuration with a dedicated storage area outside the server 10.

[0019] The user terminal 20 is managed by the user. Examples of the user terminal 20 include mobile phone terminals, smartphones, PDAs (Personal Digital Assistants), personal computers, and tablets.

[0020] The user terminal 20 is connected to the communication network 30 and includes hardware and software for executing various processes by communicating with the server 10. Each of the plurality of user terminals 20 may also be configured to be able to communicate directly with each other without going through the server 10.

[0021] The user terminal 20 may have a built-in display device. Alternatively, a display device may be connected to the user terminal 20 by wire or wirelessly. Since the display device has a very common configuration, illustration thereof is omitted here. The display device is an example of an output device.

[0022] The user terminal 20 includes a processor 21, a memory 22, and a storage device 23. The processor 21 is, for example, a central processing unit such as a CPU (Central Processing Unit) that performs various operations and controls. Also, when the user terminal 20 includes a GPU (Graphics Processing Unit), part of the various operations and controls may be performed by the GPU. The user terminal 20 uses the data read into the memory 22 to execute various information processes by the processor 21, and stores the obtained processing results in the storage device 23 as needed. The storage device 23 has a function as a storage medium for storing various information.

[0023] An input device may be built into the user terminal 20. Alternatively, an input device may be connected to the user terminal 20 by wire or wirelessly. The input device receives operation inputs from the user. In response to the operation inputs from the user, the processor included in the server 10 or the processor included in the user terminal 20 executes various control processes. Examples of the input device include a touch panel screen, keyboard, mouse, game pad, joystick, and other controllers provided in smartphones and tablets.

[0024] In addition, the user terminal 20 may include other output devices such as a speaker. The other output devices output voice, vibration, and other various information to the user.

[0025] For example, a video in the first language may be transmitted from the user terminal 20 to the server 10. The server 10 generates a video in the second language based on the received video in the first language. Note that the first language and the second language are different languages. For example, the first language may be Japanese and the second language may be English. The first language and the second language may be other languages than these.

[0026] Figure 2 is a block diagram showing the configuration of a server corresponding to at least one embodiment of the present invention. Server 10 comprises a speech voice acquisition unit 101, a first language speech text acquisition unit 102, a second language speech text translation unit 103, and a second language speech voice output unit 104. The processor in Server 10 refers to a conversion program held in a storage device and executes the program to functionally realize the speech voice acquisition unit 101, the first language speech text acquisition unit 102, the second language speech text translation unit 103, and the second language speech voice output unit 104.

[0027] The speech acquisition unit 101 has the function of acquiring speech audio that includes the speaker's utterances in their first language. The first language utterance text acquisition unit 102 has the function of acquiring first language utterance text, which is a text representation of the speaker's utterances in their first language, based on the speech audio. The second language utterance text translation unit 103 has the function of translating the first language utterance text into second language utterance text. The second language utterance audio output unit 104 has the function of outputting speech audio that includes the translated utterances based on the second language utterance text.

[0028] Figure 3 is a flowchart showing an example of processing of a conversion system corresponding to at least one embodiment of the present invention.

[0029] The speech acquisition unit 101 acquires speech audio that includes the speaker's utterances in their first language (St101). The first language utterance text acquisition unit 102 acquires first language utterance text, which is a text transcription of the speaker's utterances in their first language, based on the speech audio (St102). The second language utterance text translation unit 103 translates the first language utterance text into second language utterance text (St103). The second language utterance audio output unit 104 outputs speech audio that includes the translated utterances based on the second language utterance text (St104).
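The flow of steps St101 to St104 can be illustrated with a minimal sketch. The class name `ConversionPipeline`, the field names, and the stub stages below are hypothetical and are not taken from this disclosure; an actual implementation would plug a separation and Speech To Text model, a translation engine, and a speech generation model into the four slots.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConversionPipeline:
    # Each stage is injected as a callable so that the four functional
    # units can be swapped independently (all names are illustrative).
    acquire_speech: Callable[[str], bytes]       # unit 101, step St101
    transcribe: Callable[[bytes], str]           # unit 102, step St102
    translate: Callable[[str], str]              # unit 103, step St103
    synthesize: Callable[[str, bytes], bytes]    # unit 104, step St104

    def run(self, video_path: str) -> bytes:
        speech = self.acquire_speech(video_path)      # St101
        first_text = self.transcribe(speech)          # St102
        second_text = self.translate(first_text)      # St103
        # The original speech is passed along as a reference for tone.
        return self.synthesize(second_text, speech)   # St104


# Stub stages so that the sketch runs without any trained model.
pipeline = ConversionPipeline(
    acquire_speech=lambda path: b"<audio from " + path.encode() + b">",
    transcribe=lambda audio: "こんにちは",
    translate=lambda text: {"こんにちは": "Hello"}.get(text, text),
    synthesize=lambda text, reference: text.encode("utf-8"),
)
result = pipeline.run("clip.mp4")
```

Passing the stages in as callables keeps the sketch independent of any particular model or library, which matches the non-limiting wording of the embodiment.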

[0030] The speech voice acquisition unit 101 may acquire the speech voice from, for example, a video. The video may be of various types, such as an animation, a live-action film, or a user-submitted video. A speaker, who is the source of the utterances, appears in the video: for example, a fictional character in an animation or an actor in a live-action film. The video contains the speaker's utterances in the first language.

[0031] The speech voice acquisition unit 101 may acquire the speech voice directly, rather than separating and extracting it from the video.

[0032] <Identify the speech segment and acquire the speech audio> The characters in a video do not speak continuously from the beginning to the end of the video, but usually speak intermittently. Therefore, in step St101, the speech voice acquisition unit 101 identifies, as the speech section, the section of the video that contains the speech voice from the start to the end of the speaker's utterance. The speech voice acquisition unit 101 may identify the speech section as the time from when a voice of a specific tone is detected until that voice is no longer detected. Alternatively, a margin may be added before and after: the speech section may run from a predetermined margin time before the voice of the specific tone is detected until the predetermined margin time has elapsed after the voice is no longer detected. The speech voice acquisition unit 101 acquires the speech voice from the video in the identified speech section.
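As a minimal sketch of the speech section identification with margin time, assume that an upstream detector has already produced one boolean flag per fixed-length frame indicating whether the voice of the specific tone was detected. The function name `find_speech_sections`, the frame representation, and the merging of sections whose margins touch are assumptions made for illustration.

```python
def find_speech_sections(voiced, frame_sec, margin_sec=0.0):
    """Return (start, end) times in seconds for runs of voiced frames.

    voiced     -- one boolean per frame (True = target voice detected)
    frame_sec  -- duration of one frame in seconds
    margin_sec -- margin added before and after each run
    """
    total = len(voiced) * frame_sec
    sections = []
    start = None
    # A trailing False acts as a sentinel that closes a run at the end.
    for i, flag in enumerate(list(voiced) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            begin = max(0.0, start * frame_sec - margin_sec)
            end = min(total, i * frame_sec + margin_sec)
            if sections and begin <= sections[-1][1]:
                # The margins made this run touch the previous one: merge.
                sections[-1] = (sections[-1][0], end)
            else:
                sections.append((begin, end))
            start = None
    return sections


flags = [False, False, True, True, True, False,
         False, False, False, True, True, False]
sections = find_speech_sections(flags, frame_sec=0.5, margin_sec=0.5)
```

Sections are clipped to the bounds of the video so that a margin never extends before time zero or past the final frame.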

[0033] <Separating the speaker's speech from the spoken audio> Speech audio containing a speaker's utterances may also contain sounds other than the speaker's utterances. For example, background music or utterances by persons other than the specific speaker may be included. Here, the conversion system 1 of this disclosure translates utterances in a first language into utterances in a second language. To improve the accuracy of the translation, in step St102, the first language utterance text acquisition unit 102 separates and acquires the speaker's utterances from the speech audio. The first language utterance text acquisition unit 102 may extract the utterances of a specific speaker from the speech audio, or conversely, it may exclude sounds other than the speaker's utterances from the speech audio.

[0034] The first language speech text acquisition unit 102 may, for example, perform the above-mentioned speech separation by inputting the speech into a trained model that has been machine-learned to separate speech sounds that are mixed together.

[0035] The speech sounds obtained by separating utterances are sometimes referred to as single-utterance speech sounds. The first-language speech text acquisition unit 102 may acquire first-language speech text, which is a text representation of the speaker's utterances in their first language, based on the single-utterance speech sounds.

[0036] <Speaker Identification> For example, there are cases where characters A, B, and C in a video speak at different times. There are also cases where characters A and B speak lines simultaneously. In step St102, the first language speech text acquisition unit 102 identifies the speaker from the speech audio containing speech from multiple speakers, and extracts, for example, only the speech of character A. Such speaker identification may also be performed by inputting the speech audio into a model that has been trained by machine learning to separate and extract the speech of characters A, B, and C from speech audio containing speech from multiple speakers.

[0037] <Retrieving spoken text> The first language utterance text acquisition unit 102 applies Speech To Text (STT) to the isolated utterances of the speaker. This allows the first language utterance text acquisition unit 102 to acquire first language utterance text, which is a transcription of the speaker's utterances in their first language, based on the isolated utterances. Since Speech To Text is a known technology, a detailed explanation is omitted.

[0038] <Translation into a second language> The second language utterance text translation unit 103 translates the first language utterance text into the second language utterance text. A translation engine or a pre-trained Text-to-Text model may be used for this translation.

[0039] Here, the conversion system 1 of this disclosure ultimately outputs a video in which the speaker speaks in the second language, based on a video in which the speaker speaks in the first language. The translation from the first language to the second language must therefore match the length of time, i.e., the duration, for which the speaker speaks in the source video.

[0040] Therefore, the second language utterance text translation unit 103 translates the first language utterance text into the second language utterance text so that the translated second language utterance text fits within a predetermined length. For example, when using a pre-trained Text-to-Text model, the second language utterance text translation unit 103 may obtain a second language utterance text that fits within a predetermined length by inputting the first language utterance text and a parameter indicating the predetermined length into the pre-trained model.

[0041] The term "length" here may refer to the length of the audio after the text has been converted to speech (for example, in seconds). For example, a predetermined length may be such that the speech audio containing second language utterances based on second language speech text (i.e., speech audio obtained by converting second language speech text to speech) fits within the speech duration of the first language utterance of the speaker that formed the basis of the first language speech text.

[0042] The term "length" here may refer to the number of characters in the text data. The predetermined length may also be a length determined according to the length of the first language utterance text.

[0043] If an algorithm other than Text to Text is used for translation from the first language to the second language, the second language utterance text translation unit 103 may, for example, output multiple candidate translations of the first language utterance text in the second language, and from among the output candidate translations, select a candidate that fits within a predetermined length and identify it as the second language utterance text.
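The candidate selection described above can be sketched as follows. The function name `pick_candidate` is hypothetical, and preferring the longest candidate that still fits is an assumption made for illustration; the disclosure only requires that the selected candidate fit within the predetermined length. The `length_of` parameter allows the length to be a character count or an estimated audio duration.

```python
def pick_candidate(candidates, max_length, length_of=len):
    """Pick a translation candidate that fits within max_length.

    length_of maps a candidate to its length; by default the number of
    characters, but an estimated duration in seconds also works.
    Returns None when no candidate fits.
    """
    fitting = [c for c in candidates if length_of(c) <= max_length]
    if not fitting:
        return None
    # Assumption: among fitting candidates, the longest one is preferred
    # because it is likely to drop the least content.
    return max(fitting, key=length_of)


candidates = [
    "Good morning to you all",   # 23 characters
    "Good morning, everyone",    # 22 characters
    "Morning",                   # 7 characters
]
by_characters = pick_candidate(candidates, max_length=22)

# Length as an estimated duration, assuming roughly 15 characters per
# second of speech (an illustrative figure, not from the disclosure).
by_duration = pick_candidate(
    candidates, max_length=1.0, length_of=lambda s: len(s) / 15.0
)
```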

[0044] It is also conceivable that the translation process from first-language spoken text to second-language spoken text could be performed without adjusting the length. Furthermore, the translated text could be converted to audio, and the audio speed could be adjusted, for example, to 1.2x or 0.8x, to fit the desired length. However, with this method, the speaker's utterance in the second-language video would be noticeably faster or slower than the speaker's utterance in the first-language video, making it impossible to maintain the tone of voice from the original video. To give viewers the same natural impression in the second-language video as in the first-language video, it is preferable to adjust the length during the translation process itself, rather than adjusting the speed of the finished audio to fit the desired length.

[0045] The second language speech output unit 104 outputs speech audio that includes translated speech based on the second language speech text.

[0046] <Maintaining vocal tone> The second language speech output unit 104 inputs the second language speech text translated from the first language speech text, together with the aforementioned single-utterance speech audio, into a Diffusion-model-based artificial intelligence to obtain second language speech audio that maintains the tone of voice of the speaker in the first language. It then outputs the speech audio.

[0047] Vocal tone refers to the timbre or pitch of a voice. Components of vocal tone include, for example, volume, pitch, intonation, voice quality, and changes in these components over time. Emotions conveyed through the voice, such as sad voices, happy voices, shouts, and whispers, are also included in vocal tone. The conversion system 1 of this disclosure converts spoken speech from a first language to a second language while maintaining (or, as closely as possible, imitating) the vocal tone having the above-mentioned components.

[0048] Furthermore, simply mimicking one of the above components alone does not necessarily mean that the voice tone is maintained. The voice tone is considered maintained if at least two or more components are similar to the speech sounds of the first language.
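The criterion that at least two components must be similar can be sketched as a simple check. Representing each component as a single numeric value, the component names, and the 10% relative tolerance are all assumptions made for illustration; the disclosure does not specify how similarity of a component is measured.

```python
def tone_maintained(source, target, tolerance=0.1, min_matches=2):
    """Return True if at least min_matches components are similar.

    source, target -- dicts mapping a component name to a numeric value
    tolerance      -- allowed relative difference per component
    """
    matches = 0
    for name, src_value in source.items():
        if name not in target:
            continue  # a component missing from the target cannot match
        scale = max(abs(src_value), 1e-9)  # avoid division by zero
        if abs(target[name] - src_value) / scale <= tolerance:
            matches += 1
    return matches >= min_matches


first_language = {"volume_db": -20.0, "pitch_hz": 200.0, "speech_rate": 5.0}
close_dub = {"volume_db": -21.0, "pitch_hz": 210.0, "speech_rate": 8.0}
distant_dub = {"volume_db": -21.0, "pitch_hz": 300.0, "speech_rate": 8.0}

close_ok = tone_maintained(first_language, close_dub)      # volume and pitch match
distant_ok = tone_maintained(first_language, distant_dub)  # only volume matches
```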

[0049] One example of an objective indicator for defining voice tone is a spectrogram. The conversion system 1 of this disclosure may generate second language speech audio whose spectrogram approximates that of the first language speech audio by converting the single-utterance speech audio of the first language into a spectrogram and inputting it into a Diffusion-model-based artificial intelligence. In other words, voice tone is maintained by processing the spectrogram of the second language speech audio so that it approximates the spectrogram of the first language speech audio.

[0050] Furthermore, maintaining voice tone may be implemented using methods other than the Diffusion model. For example, a Generative Adversarial Network (GAN) may be used to maintain voice tone between the first language speech and the second language speech.

[0051] The second language speech output unit 104 may generate a video of a speaker speaking in the second language by combining the second language speech generated as described above with the speech section in the video from which the first language speech was obtained. It may also output the generated video. The audio used in the video has been separated as described above. Therefore, the second language speech output unit 104 may combine the second language speech obtained as described above with other audio obtained through separation (for example, speech by other speakers or background music) and combine the resulting synthesized audio with the video.
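Combining the second language speech with the other separated audio can be sketched as sample-wise mixing. The sketch assumes mono audio held as lists of floats in the range -1.0 to 1.0 at a common sampling rate; the function name `mix_into_background`, the clipping, and the handling of speech that runs past the end of the track are illustrative choices, not part of the disclosure.

```python
def mix_into_background(background, dubbed, start_index):
    """Overlay dubbed speech samples onto a background track.

    background  -- samples of the remaining audio (music, other speakers)
    dubbed      -- samples of the second language speech
    start_index -- sample index at which the speech section begins
    """
    mixed = list(background)  # leave the input track unmodified
    for offset, sample in enumerate(dubbed):
        i = start_index + offset
        if i >= len(mixed):
            break  # speech running past the end of the track is dropped
        # Sum the two signals and clip to the valid sample range.
        mixed[i] = max(-1.0, min(1.0, mixed[i] + sample))
    return mixed


background = [0.25] * 6
dubbed = [0.5, 0.5, 1.0]
mixed = mix_into_background(background, dubbed, start_index=2)
```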

[0052] <Changing voice tone> Conversely, the tone of voice may be deliberately changed during translation. In other words, the second language speech output unit 104 may output a second language speech that exhibits a predetermined tone of voice, different from the tone of voice of the speaker in the first language. To achieve this, the second language speech output unit 104 may, for example, input a speech voice that exhibits a different tone of voice from the speaker to be translated in the original video or audio, and a second language speech text translated from the first language speech text, into the Diffusion model's artificial intelligence to obtain a second language speech voice with a changed tone of voice. Then, it outputs the speech voice. As a result, it is possible to generate speech voices where the meaning of the speech is the same as in the first language, but the language of the speech and the tone of voice are different from those in the first language, or to generate videos that combine such speech voices.

[0053] For the sake of explanation, the present invention will be described below as a function of the conversion system 1. A conversion device may have such a function. The function of the conversion system 1 may be implemented as a conversion program. Each processing step performed by the conversion system 1 may be performed by a conversion method.

[0054] <Calculation of translation accuracy> When text in a first language is translated into text in a second language, someone unfamiliar with the second language cannot determine whether the translation is correct. Therefore, the conversion program, conversion device, conversion system, and conversion method disclosed herein improve the accuracy of mistranslation detection by combining multiple language models, semantic vectors, and proper noun dictionaries.

[0055] The conversion system 1 outputs a score indicating the accuracy of the translation result from the first language to the second language. This allows the accuracy of the translation to be calculated as a score. It also improves the accuracy of mistranslation detection when errors occur. Furthermore, it allows for the identification of areas in the translation result that require attention, such as entrusting areas with low scores to language specialists.

[0056] Suppose a first-language text is translated into a second-language text. The first-language text may be a first-language spoken text, but it may also be any other type of text. The second-language text may be a second-language spoken text, but it may also be any other type of text.

[0057] The conversion system 1 inputs the first language text and the second language text into a language model and outputs a score indicating the accuracy of the translation result. The language model may be, for example, an LLM. The LLM may be given a prompt instructing it to output a score.

[0058] The conversion system 1 may also input a proper noun dictionary into the LLM. The proper noun dictionary may be, for example, a translation table showing how a certain proper noun expressed in the first language is expressed in the second language.
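One way to supply the two texts and the proper noun dictionary to an LLM is to assemble them into a single prompt. The sketch below only builds the prompt string; the function name `build_scoring_prompt`, the wording of the instructions, and the 0 to 100 scale are assumptions, and no particular LLM or API is implied.

```python
def build_scoring_prompt(source_text, translated_text, proper_nouns=None):
    """Build a prompt asking a language model to score a translation.

    proper_nouns -- optional dict mapping a first language proper noun
                    to its expected second language rendering
    """
    lines = [
        "Rate how accurately the translation preserves the meaning of the source.",
        "Answer with a single number from 0 to 100.",
        "",
        f"Source: {source_text}",
        f"Translation: {translated_text}",
    ]
    # Only list dictionary entries that actually occur in the source text.
    relevant = {
        src: dst
        for src, dst in (proper_nouns or {}).items()
        if src in source_text
    }
    if relevant:
        lines.append("")
        lines.append("Proper nouns must be rendered as follows:")
        for src, dst in relevant.items():
            lines.append(f"- {src} -> {dst}")
    return "\n".join(lines)


prompt = build_scoring_prompt(
    "東京タワーに行きました。",
    "I went to Tokyo Tower.",
    proper_nouns={"東京タワー": "Tokyo Tower", "大阪城": "Osaka Castle"},
)
```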

[0059] The score may, for example, be a score indicating the degree of semantic agreement between the first language text and the second language text. The score may also identify sections where the meaning differs between the first language text and the second language text, and indicate the percentage of sections with differing meanings relative to the input text.

[0060] Furthermore, the conversion system 1 may also use cycle consistency. Specifically, after translating the first language text into the second language text as described above, the conversion system 1 translates the second language text back into the first language to obtain re-translated text. The conversion system 1 then compares the first language text with the re-translated text and uses the degree of agreement obtained from this comparison as a score of translation quality.
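The round-trip comparison can be sketched with the Python standard library. The function name `cycle_consistency_score` and the stub translators are hypothetical. The character-level similarity from `difflib.SequenceMatcher` is used here only as a simple stand-in for the degree of agreement; it measures surface overlap rather than meaning, so a semantic comparison could be substituted.

```python
from difflib import SequenceMatcher


def cycle_consistency_score(source_text, translate, back_translate):
    """Score a translation by translating it back and comparing.

    translate      -- callable: first language text -> second language text
    back_translate -- callable: second language text -> first language text
    Returns a similarity in [0.0, 1.0]; 1.0 means the round trip
    reproduced the source text exactly.
    """
    second_text = translate(source_text)
    round_trip = back_translate(second_text)
    return SequenceMatcher(None, source_text, round_trip).ratio()


# Stub translators standing in for real translation engines.
forward = {"good morning": "おはよう"}.get
faithful_back = {"おはよう": "good morning"}.get
lossy_back = {"おはよう": "morning"}.get

faithful_score = cycle_consistency_score("good morning", forward, faithful_back)
lossy_score = cycle_consistency_score("good morning", forward, lossy_back)
```

A low score indicates that information was lost or altered somewhere in the round trip, although it cannot tell whether the forward or the backward translation caused the difference.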

[0061] The conversion system 1 may use multiple language models. The conversion system 1 may calculate the final score by combining multiple scores output from multiple language models. The combination here may be, for example, addition or multiplication. The combination may also be a predetermined operation other than addition or multiplication.

[0062] Furthermore, weighting may be applied when combining multiple scores output from multiple language models. For example, the conversion system 1 stores a history of errors in the translated text output by the language models in a memory device. Based on the error history, the conversion system 1 determines the weight values so that language models with fewer errors are assigned larger weights, and language models with many errors are assigned smaller weights. The method of applying weights may be other than that described above. The conversion system 1 calculates a weighted score for a language model by multiplying the score output by the language model by the weight values described above. The conversion system 1 combines the multiple weighted scores output by multiple language models to calculate the final score.
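The weighted combination can be sketched as follows. The disclosure states only that models with fewer errors receive larger weights; the specific formula used here (weight proportional to 1 / (1 + error count), normalized to sum to 1) and the use of addition to combine the weighted scores are assumptions made for illustration. The function and model names are hypothetical.

```python
def weights_from_error_history(error_counts):
    """Derive one weight per language model from its past error count.

    Assumption: weight is proportional to 1 / (1 + errors), so fewer
    errors give a larger weight; weights are normalized to sum to 1.
    """
    raw = {name: 1.0 / (1 + errors) for name, errors in error_counts.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def combine_scores(scores, weights):
    """Multiply each model's score by its weight and add the results."""
    return sum(scores[name] * weights[name] for name in scores)


error_history = {"model_a": 0, "model_b": 1}   # past errors per model
weights = weights_from_error_history(error_history)
scores = {"model_a": 0.9, "model_b": 0.6}      # scores for one translation
final_score = combine_scores(scores, weights)
```

With these inputs, model_a receives a weight of 2/3 and model_b a weight of 1/3, so the final score is pulled toward the score of the model with the cleaner history.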

[0063] <Accuracy calculation for multilingual speech synthesis> Translation is also used for dubbing movies and animations. Dubbing can be done using multilingual speech synthesis. However, users often do not know the target language. Here, "users" refers to the dubbing workers and the users of the dubbing system. If users do not know the target language, there is a risk that they will not be able to judge the accuracy of the output at both the translation and dubbing generation stages, and that inaccurate dubbed versions of movies, etc., will be released.

[0064] Therefore, the conversion system 1 of this disclosure evaluates the accuracy of the dubbed audio. Specifically, it does so as follows:

[0065] The conversion system 1 has a function to convert speech to text. Speech to Text (STT) may be used for this conversion. The audio before dubbing is treated as the first language audio, and the audio after dubbing is treated as the second language audio. The conversion system 1 obtains the text converted from the first language audio and the text converted from the second language audio. The text for the first language audio may be prepared in advance or may be obtained using STT. The text may be a sequence of phonemes, a sequence of hiragana, a sequence of katakana, or kanji.

[0066] The conversion system 1 inputs at least one of the text in the first language and the text in the second language to the language model. The language model outputs a value from which the accuracy of the text in the second language can be derived. The conversion system 1 may further input at least one of the speech in the first language and the speech in the second language to the language model.

[0067] The accuracy mentioned above may be the accuracy indicating whether the text in the first language has been correctly translated into the text in the second language. The accuracy of the text in the second language may be the accuracy of the entire text in the second language, or the accuracy of a portion of the text in the second language. The conversion system 1 may calculate the accuracy mentioned above based on a predetermined known mathematical formula or algorithm.

[0068] Since the conversion system 1 outputs an accuracy score, the accuracy of the dubbed audio can be evaluated, thus reducing the risk of outputting an inaccurate dub. Furthermore, the accuracy score can be used as a criterion for determining whether the dubbing work is complete. Additionally, the accuracy score can be used as a criterion for deciding whether or not to release the dubbed audio.
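
A minimal sketch of the flow in paragraphs [0065] to [0068] is shown below. The `stt` and `accuracy_model` callables are hypothetical placeholders for a speech-to-text engine and a language model, and the 0.8 release threshold is an arbitrary assumed value; none of these come from the specification.

```python
from dataclasses import dataclass
from typing import Callable, Optional

SpeechToText = Callable[[bytes], str]          # hypothetical STT interface
AccuracyModel = Callable[[str, str], float]    # hypothetical: returns 0..1


@dataclass(frozen=True)
class DubCheck:
    source_text: str
    dubbed_text: str
    accuracy: float
    releasable: bool


def check_dub(
    source_audio: bytes,
    dubbed_audio: bytes,
    stt: SpeechToText,
    accuracy_model: AccuracyModel,
    threshold: float = 0.8,                    # assumed release threshold
    source_text: Optional[str] = None,
) -> DubCheck:
    """Transcribe both tracks, score the dub, and apply a release criterion."""
    # The first language text may be prepared in advance; otherwise use STT.
    if source_text is None:
        source_text = stt(source_audio)
    dubbed_text = stt(dubbed_audio)
    accuracy = accuracy_model(source_text, dubbed_text)
    return DubCheck(source_text, dubbed_text, accuracy, accuracy >= threshold)
```

The `releasable` flag corresponds to using the accuracy score as a criterion for completing the dubbing work or releasing the dubbed audio, so that a user who does not know the second language can still act on the result.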

[0069] <Non-verbal speech detection and transfer> In dubbing audio for films, animations, and other media, some elements are substantially common to the first and second languages. Examples include breathing, shouting, and laughter. In this specification, these are sometimes referred to as non-verbal sounds. Because non-verbal sounds are substantially common to the first and second languages, using the original first language non-verbal sounds as they are, without translation, yields an expression closer to the original. Therefore, the conversion system 1 of this disclosure generates dubbed audio using the original first language non-verbal sounds. Specifically, this is done as follows.

[0070] The conversion system 1 extracts non-verbal speech from the speech of the first language.

[0071] Note that the audio may be the audio contained in a video. The same applies to the other descriptions of non-verbal speech detection and transfer in this specification, unless doing so creates a contradiction.

[0072] The conversion system 1 obtains the second language audio dubbed from the first language audio. The conversion system 1 adds the extracted non-verbal audio to the second language audio.

[0073] This allows the dubbed audio to maintain the same breathing, shouting, and laughter as the original audio.
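
The transfer step of paragraphs [0070] to [0073] might be sketched as follows. This is an illustration under stated assumptions: both tracks are mono sample lists in the range -1.0 to 1.0 at the same sample rate and are already time-aligned, the non-verbal segments have been located by some detector not shown here, and "adding" is interpreted as additive mixing with clipping.

```python
from typing import Sequence

Segment = tuple[int, int]  # (start_sample, end_sample), end exclusive


def transfer_non_verbal(
    original: Sequence[float],
    dubbed: Sequence[float],
    segments: Sequence[Segment],
) -> list[float]:
    """Mix the non-verbal segments of `original` into a copy of `dubbed`."""
    out = list(dubbed)
    # Pad with silence if the dubbed track is shorter than the original.
    if len(out) < len(original):
        out.extend([0.0] * (len(original) - len(out)))
    for start, end in segments:
        if not 0 <= start <= end <= len(original):
            raise ValueError(f"segment {(start, end)} is out of range")
        for i in range(start, end):
            # Additive mix, clipped to the valid sample range.
            out[i] = max(-1.0, min(1.0, out[i] + original[i]))
    return out
```

In practice the segments would come from a non-verbal sound detector (for breathing, shouting, laughter, and the like), and a short cross-fade at the segment boundaries would likely be needed to avoid audible clicks.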

[0074] <Matching the audio with lip-syncing> Lip-syncing is a technique used in animation and other media to synchronize the mouth movements of characters with their voices (dialogue). Generally, the mouth movements are synchronized with the voice. In contrast, this invention synchronizes the voice with the mouth movements. Specifically, this is as follows:

[0075] The conversion system 1 acquires constraint information. This constraint information may include information indicating the shape of the mouth, the temporal position of vowels, and the number of seconds that vowels last. The conversion system 1 may extract the constraint information from a video that includes mouth movements. A known algorithm may be used for the extraction.

[0076] The conversion system 1 performs Text to Speech (TTS) using the text of the second language to be converted to speech and the constraint information mentioned above as input, and obtains speech in the second language based on the constraint information. This makes it possible to generate speech that matches the mouth movements.

[0077] Note that the text entered into TTS may be a string of phonemes derived from speech. For example, if there is a video that contains mouth movements to be lip-synced, the conversion system 1 will obtain a string of phonemes corresponding to the speech contained in that video.

[0078] Furthermore, the conversion system 1 may also input a video containing mouth movements into the TTS.

[0079] With a large language model (LLM), different outputs can be obtained each time the same prompt is used. Taking advantage of this property, the conversion system 1 may output speech multiple times according to the constraint information. Each of the speech outputs (dialogue) is based on the above constraint information. Therefore, although they have the same meaning, various elements may differ among them. For example, the wording of the dialogue may differ. The way of breathing and the positions where words are drawn out may differ. The reading speed of the speech may differ. The timing of the speech (so-called pauses) may differ.

[0080] Therefore, from the multiple output voices, the conversion system 1 selects those that satisfy predetermined selection criteria and chooses the final voice that matches the mouth movements.

[0081] The predetermined selection criteria may be based on one or more considerations. These considerations may include, for example, the degree of agreement in the timing at which vowels appear, the degree of agreement between the sound and the mouth shape, and the degree of agreement between the sounds of the first language and the sounds of the second language.
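
One way to picture the selection of paragraphs [0079] to [0081] is the sketch below, which uses only the vowel timing criterion. The `Candidate` structure, the 0.08 second tolerance, and the hit-ratio scoring are all assumptions introduced for illustration; a real system could combine several of the considerations listed above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    vowel_onsets: tuple[float, ...]  # vowel start times in seconds


def timing_agreement(
    target: Sequence[float],
    onsets: Sequence[float],
    tolerance: float = 0.08,         # assumed tolerance in seconds
) -> float:
    """Fraction of target vowel times with a candidate onset within tolerance."""
    if not target:
        return 1.0
    hits = sum(1 for t in target if any(abs(t - o) <= tolerance for o in onsets))
    return hits / len(target)


def select_best(target: Sequence[float], candidates: Sequence[Candidate]) -> Candidate:
    """Pick the candidate whose vowel timing agrees best with the constraint."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda c: timing_agreement(target, c.vowel_onsets))
```

Here `target` plays the role of the vowel timing in the constraint information extracted from the mouth movements, and each `Candidate` corresponds to one of the multiple speech outputs generated from the same text.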

[0082] As described above, each embodiment of the present application solves one or more of the shortcomings. Note that the effects of each embodiment are non-limiting effects or examples of effects.

[0083] In each of the embodiments described above, the user terminal 20 and the server 10 execute the various processes described above according to various control programs (e.g., conversion programs) stored in their own storage devices. Furthermore, other computers, not limited to the user terminal 20 and the server 10, may also execute the various processes described above according to various control programs (e.g., conversion programs) stored in their own storage devices.

[0084] Furthermore, the configuration of the conversion system 1 is not limited to the configuration described as an example of the embodiment above. For example, the server may perform some or all of the processes described as being performed by the user terminal, or the user terminal may perform some or all of the processes described as being performed by the server. Alternatively, the user terminal may be equipped with some or all of the storage unit (memory device) provided in the server. In other words, the conversion system 1 may be configured such that either one of the user terminal and the server provides some or all of the functions provided by the other.

[0085] Furthermore, the program may be configured to implement some or all of the functions described above as examples of each embodiment in a standalone device that does not include a communication network.

[0086] [Note] The above-described embodiments are written in such a way that at least the following invention can be put into practice by a person with ordinary skill in the art to which the invention pertains.

[0087] [1] A conversion program that causes a processor to implement: a speech acquisition function that acquires speech audio containing a speaker's utterance in a first language; a first language utterance text acquisition function that acquires, based on the speech audio, first language utterance text, which is a text transcription of the speaker's utterance in the first language; a second language utterance text translation function that translates the first language utterance text into second language utterance text; and a second language speech output function that outputs speech audio containing the translated utterance based on the second language utterance text.

[0088] According to the above conversion program, it is possible to generate a video in which the speaker's utterance is translated from the first language to the second language.

[0089] [2] The conversion program according to [1], wherein the second language speech output function outputs speech in the second language while maintaining the tone of voice of the speaker in the first language. This makes it possible to generate a video in which the speaker's speech is translated from the first language to the second language while maintaining the speaker's voice tone from the original video.

[0090] [3] The conversion program according to [1], wherein the second language speech output function outputs speech in the second language that exhibits a predetermined tone of voice different from the tone of voice of the speaker in the first language. This makes it possible to generate a video in which the speaker's speech is translated from the first language to the second language while changing the speaker's voice tone in the original video to a desired voice tone.

[0091] [4] The conversion program according to [1], wherein the first language utterance text acquisition function acquires the first language utterance text, which is a text representation of the speaker's utterance in the first language, based on single utterance audio obtained by separating the speaker's utterance in the first language from the speech audio. This allows accurate conversion into speech in the second language even if other audio is present in the original video.

[0092] [5] The conversion program according to [1], wherein the speech acquisition function identifies, in a video containing the speech audio, the section from the start to the end of the speaker's speech as a speech section, and acquires the speech audio within the identified speech section from the video. This makes it possible to extract the parts of a video in which the speaker is speaking and convert them into speech in the second language.

[0093] [6] The conversion program according to [1], wherein the second language utterance text translation function translates the first language utterance text into the second language utterance text such that the translated second language utterance text has a predetermined length.

[0094] [7] The conversion program according to [6], wherein the predetermined length is a length such that the speech audio containing the second language utterance based on the second language utterance text fits within the speaking time of the speaker's first language utterance on which the first language utterance text is based.

[0095] [8] The conversion program according to [6], wherein the predetermined length is determined according to the length of the first language utterance text.

[0096] These conversion programs can translate from the first language to the second language in a way that matches the duration of the speaker's speech in the source video.
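
The length constraint of items [6] to [8] could be illustrated as follows. The sketch assumes that several candidate translations are already available, that speaking time can be roughly estimated from character count at a fixed rate (the 7 characters per second default is an arbitrary assumed value), and that the longest candidate that still fits is preferred. None of these choices is prescribed by the specification.

```python
from typing import Optional, Sequence


def estimated_duration(text: str, chars_per_second: float = 7.0) -> float:
    """Rough speaking time in seconds, assuming a fixed speaking rate."""
    return len(text) / chars_per_second


def pick_translation(
    candidates: Sequence[str],
    max_seconds: float,
    chars_per_second: float = 7.0,
) -> Optional[str]:
    """Return the longest candidate that fits within max_seconds, else None."""
    fitting = [
        c for c in candidates
        if estimated_duration(c, chars_per_second) <= max_seconds
    ]
    # Assumption: among fitting candidates, the longest keeps the most content.
    return max(fitting, key=len) if fitting else None
```

Here `max_seconds` would be the speaking time of the first language utterance measured from the source video. A return value of `None` signals that a shorter translation needs to be requested.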

[0097] [9] A conversion device comprising a processor, wherein the processor has: a speech acquisition function that acquires speech audio containing a speaker's utterance in a first language; a first language utterance text acquisition function that acquires, based on the speech audio, first language utterance text, which is a text transcription of the speaker's utterance in the first language; a second language utterance text translation function that translates the first language utterance text into second language utterance text; and a second language speech output function that outputs speech audio containing the translated utterance based on the second language utterance text.

[0098] The above-described conversion device can generate a video in which the speaker's utterance is translated from the first language to the second language.

[0099] [10] A conversion system comprising at least a processor, wherein the processor implements: a speech acquisition function that acquires speech audio containing a speaker's utterance in a first language; a first language utterance text acquisition function that acquires, based on the speech audio, first language utterance text, which is a text transcription of the speaker's utterance in the first language; a second language utterance text translation function that translates the first language utterance text into second language utterance text; and a second language speech output function that outputs speech audio containing the translated utterance based on the second language utterance text.

[0100] According to the above conversion system, it is possible to generate a video in which the speaker's utterance is translated from the first language to the second language.

[0101] [11] A conversion method performed by a device having a processor, the method comprising: a speech acquisition step of acquiring speech audio containing a speaker's utterance in a first language; a first language utterance text acquisition step of acquiring, based on the speech audio, first language utterance text, which is a text transcription of the speaker's utterance in the first language; a second language utterance text translation step of translating the first language utterance text into second language utterance text; and a second language speech output step of outputting speech audio containing the translated utterance based on the second language utterance text.

[0102] According to the above conversion method, it is possible to generate a video in which the speaker's utterance is translated from the first language to the second language.

[Industrial applicability]

[0103] According to one embodiment of the present invention, the conversion program, conversion device, conversion system, and conversion method are useful for generating a video in which a speaker's speech is translated from a first language to a second language.

[Explanation of Symbols]

[0104]
1 Conversion system
10 Server
11 Processor
12 Memory
13 Storage device
20, 20A, 20B User terminal
21 Processor
22 Memory
23 Storage device
30 Communication network
101 Speech acquisition unit
102 First language utterance text acquisition unit
103 Second language utterance text translation unit
104 Second language speech output unit

Claims

1. A conversion program that causes a processor to implement: a speech acquisition function that acquires speech audio containing a speaker's utterance in a first language; a first language utterance text acquisition function that acquires, based on the speech audio, first language utterance text, which is a text representation of the speaker's utterance in the first language; a second language utterance text translation function that translates the first language utterance text into second language utterance text; and a second language speech output function that outputs speech audio containing the translated utterance based on the second language utterance text, wherein the conversion program adds non-verbal speech extracted from the speech containing the utterance in the first language to the speech containing the translated utterance.

2. The conversion program according to claim 1, wherein the second language speech output function outputs speech in the second language while maintaining the tone of voice of the speaker in the first language.

3. The conversion program according to claim 1, wherein the second language speech output function outputs speech in the second language that exhibits a predetermined tone of voice different from the tone of voice of the speaker in the first language.

4. The conversion program according to claim 1, wherein the first language utterance text acquisition function acquires the first language utterance text, which is a text representation of the speaker's utterance in the first language, based on single utterance audio obtained by separating the speaker's utterance in the first language from the speech audio.

5. The conversion program according to claim 1, wherein the speech acquisition function identifies, in a video containing the speech audio, the section from the start to the end of the speaker's speech as a speech section, and acquires the speech audio within the identified speech section from the video.

6. The conversion program according to claim 1, wherein the second language utterance text translation function translates the first language utterance text into the second language utterance text such that the translated second language utterance text fits within a predetermined length.

7. The conversion program according to claim 6, wherein the predetermined length is a length such that the speech audio containing the second language utterance based on the second language utterance text fits within the speaking time of the speaker's first language utterance on which the first language utterance text is based.

8. The conversion program according to claim 6, wherein the predetermined length is determined according to the length of the first language utterance text.

9. A conversion device comprising a processor, wherein the processor has: a speech acquisition function that acquires speech audio containing a speaker's utterance in a first language; a first language utterance text acquisition function that acquires, based on the speech audio, first language utterance text, which is a text representation of the speaker's utterance in the first language; a second language utterance text translation function that translates the first language utterance text into second language utterance text; and a second language speech output function that outputs speech audio containing the translated utterance based on the second language utterance text, and wherein the conversion device adds non-verbal speech extracted from the speech containing the utterance in the first language to the speech containing the translated utterance.

10. A conversion system comprising at least a processor, wherein the processor implements: a speech acquisition function that acquires speech audio containing a speaker's utterance in a first language; a first language utterance text acquisition function that acquires, based on the speech audio, first language utterance text, which is a text representation of the speaker's utterance in the first language; a second language utterance text translation function that translates the first language utterance text into second language utterance text; and a second language speech output function that outputs speech audio containing the translated utterance based on the second language utterance text, and wherein the conversion system adds non-verbal speech extracted from the speech containing the utterance in the first language to the speech containing the translated utterance.

11. A conversion method performed by a device having a processor, the method comprising: a speech acquisition step of acquiring speech audio containing a speaker's utterance in a first language; a first language utterance text acquisition step of acquiring, based on the speech audio, first language utterance text, which is a text representation of the speaker's utterance in the first language; a second language utterance text translation step of translating the first language utterance text into second language utterance text; and a second language speech output step of outputting speech audio containing the translated utterance based on the second language utterance text, wherein the conversion method adds non-verbal speech extracted from the speech containing the utterance in the first language to the speech containing the translated utterance.