Electronic device, method, and non-transitory computer-readable storage medium for generating animation of character
By employing AI models to analyze voice signals and context, the device generates lifelike character animations that address the issue of static faces, improving user interaction.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- NCSOFT CORP
- Filing Date
- 2024-12-23
- Publication Date
- 2026-07-02
AI Technical Summary
Existing systems fail to generate natural-looking animations of characters speaking voice signals, as they often display static faces that appear unnatural to users.
An electronic device uses artificial intelligence models, particularly natural language processing, to identify the meaning and context of spoken words, along with volume and pitch, to generate animations that accurately depict the facial expressions of characters.
The solution effectively creates animations that reflect the intended facial expressions and emotions of characters, enhancing the user experience by making character interactions more lifelike and engaging.
Description
Electronic device, method, and non-transitory computer-readable storage medium for generating animation of a character
[0001] The present disclosure relates to an electronic device, a method, and a non-transitory computer-readable storage medium for generating animations of characters.
[0002] Artificial intelligence is a technology for simulating human (or biological) neural activities such as perception and / or inference, and can be implemented as hardware, software, or a combination thereof designed to perform calculations for simulating neural activities.
[0003] The information described above may be provided as related art for the purpose of aiding understanding of the present disclosure. No claim or determination is made as to whether any of the foregoing may be applied as prior art related to the present disclosure.
[0004] An electronic device is described. The electronic device may include a memory for storing instructions. The electronic device may include at least one processor including a processing circuit. When the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to obtain a voice signal for the utterance of a character and text information regarding the voice signal. When the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to identify a word having a designated meaning among the words included in the text information by providing the text information to an artificial intelligence model. When the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to identify the time at which the word is uttered using the voice signal. When the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to obtain data indicating the time based on identifying the time. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to generate an animation representing the face of a character speaking the voice signal by providing the voice signal and the data to a trained model. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to control the output of the voice signal while displaying the animation.
[0005] A method is provided. The method may be executed by an electronic device. The method may include an operation of acquiring a voice signal for the utterance of a character and text information regarding said voice signal. The method may include an operation of identifying a word having a designated meaning among the words included in said text information by providing said text information to an artificial intelligence model. The method may include an operation of identifying a time when said word is to be uttered using said voice signal. The method may include an operation of acquiring data indicating said time based on identifying said time. The method may include an operation of generating an animation representing the face of a character uttering said voice signal by providing said voice signal and said data to a trained model. The method may include an operation of controlling said voice signal to be output while displaying said animation.
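As a non-limiting illustration, the sequence of operations summarized above can be sketched in Python as a thin orchestration function. The function name and the three callables (`identify_words`, `locate_times`, `animate`) are hypothetical stand-ins for the artificial intelligence model, the timing step, and the trained model; none of these names or signatures appear in the disclosure, and the final step of outputting the voice signal while displaying the animation is left out.

```python
from typing import Any, Callable, List, Sequence, Tuple


def generate_face_animation(
    voice_signal: bytes,
    text: str,
    identify_words: Callable[[str], Sequence[str]],              # hypothetical: AI model
    locate_times: Callable[[bytes, Sequence[str]], Sequence[float]],  # hypothetical: timing step
    animate: Callable[[bytes, List[Tuple[str, float]]], Any],    # hypothetical: trained model
) -> Any:
    # Identify the words having a designated meaning from the text information.
    words = identify_words(text)
    # Identify, from the voice signal, the time at which each such word is uttered.
    times = locate_times(voice_signal, words)
    # Pair each word with its time: the "data indicating the time".
    data = list(zip(words, times))
    # Provide the voice signal and the data to the trained model.
    return animate(voice_signal, data)
```

Passing the models in as callables keeps the sketch independent of where they run (on the device or on a server).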
[0006] A non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium may store one or more programs. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to obtain a voice signal for the utterance of a character and text information regarding said voice signal. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to identify a word having a designated meaning among the words included in said text information by providing said text information to an artificial intelligence model. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to identify the time when said word is uttered using said voice signal. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to obtain data indicating said time based on identifying said time. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to generate an animation representing the face of a character speaking said voice signal by providing said voice signal and said data to a trained model. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to control the output of said voice signal while displaying said animation.
[0007] FIG. 1 illustrates an example of controlling to output a voice signal while displaying a stationary character.
[0008] FIG. 2 illustrates an example of a character's facial expression included in an animation created using a different voice signal and a person's facial expression that produced a voice corresponding to a voice signal.
[0009] FIG. 3 is a simplified block diagram of an exemplary electronic device.
[0010] FIG. 4 is a flowchart illustrating exemplary operations of an electronic device for generating an animation using a voice signal and text information about the voice signal.
[0011] FIG. 5 illustrates an example of obtaining first data that indicates the time when a word having a designated meaning is uttered using a voice signal and text information about the voice signal.
[0012] FIG. 6 illustrates an example of obtaining second data that indicates an emotion regarding text information using text information regarding a voice signal.
[0013] FIG. 7 is a flowchart illustrating exemplary operations of an electronic device for generating and displaying an animation representing a character's face.
[0014] FIG. 8 illustrates an example of generating an animation that expresses a character's facial expression.
[0015] FIG. 9 illustrates an example of generating an animation that expresses a character's face by blending an animation that expresses a character's facial expression and an animation that expresses the shape of the character's mouth.
[0016] FIG. 10 illustrates an example of controlling the output of a voice signal while displaying an animation expressing a character's face.
[0017] The electronic device (or external electronic device) according to the various embodiments disclosed in this document may be of various forms. The electronic device may include, for example, a portable communication device (e.g., a smartphone), a computer device, a portable multimedia device, a portable medical device, a camera, a wearable device, a server, or a consumer electronics device. The electronic device (or external electronic device) according to the embodiments of this document is not limited to the devices described above.
[0018] The various embodiments of this document and the terms used therein are not intended to limit the technical features described in this document to specific embodiments, and should be understood to include various modifications, equivalents, or substitutions of such embodiments. In connection with the description of the drawings, similar reference numerals may be used for similar or related components. The singular form of a noun corresponding to an item may include one or multiple items unless the relevant context clearly indicates otherwise. In this document, phrases such as "A or B," "at least one of A and B," "at least one of A or B," "A, B or C," "at least one of A, B and C," and "at least one of A, B, or C" may each include any one of the items listed together in the corresponding phrase, or any possible combination thereof. Terms such as "1st," "2nd," "first," or "second" may be used simply to distinguish a component from another corresponding component and do not limit the components in any other aspect (e.g., importance or order). Where any (e.g., 1st) component is referred to as "coupled" or "connected" to another (e.g., 2nd) component, with or without the terms "functionally" or "communicatively," it means that the component may be connected to the other component directly (e.g., via a wire), wirelessly, or through a third component.
[0019] As used in this document, the term "module" may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit. A module may be a component formed integrally, or a minimum unit of a component or part thereof that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).
[0020] Various embodiments of this document may be implemented as software (e.g., a program) comprising one or more instructions stored in a storage medium readable by a machine (e.g., an electronic device (100)). For example, a processor of the machine (e.g., an electronic device (100)) may call at least one of the one or more instructions stored from the storage medium and execute it. This enables the machine to operate to perform at least one function according to at least one called instruction. One or more instructions may include code generated by a compiler or code that can be executed by an interpreter. The storage medium readable by the machine may be provided in the form of a non-transitory storage medium. Here, "non-transitory" simply means that the storage medium is a tangible device and does not contain a signal (e.g., electromagnetic waves), and this term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily in the storage medium.
[0021] According to one embodiment, the method according to the various embodiments disclosed herein may be provided as included in a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or distributed online (e.g., download or upload) through an application store (e.g., Play Store™ or App Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product may be temporarily stored or temporarily created on a device-readable storage medium, such as the memory of a manufacturer's server, an application store's server, or a relay server.
[0022] According to various embodiments, each component (e.g., module or program) of the described components may include a single entity or multiple entities. According to various embodiments, one or more of the aforementioned components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the multiple components in the same or similar manner as they were performed by the corresponding component among the multiple components prior to integration. According to various embodiments, operations performed by a module, program, or other component may be executed sequentially, in parallel, iteratively, or heuristically; one or more of the operations may be executed in a different order or omitted; or one or more other operations may be added.
[0023] FIG. 1 illustrates an example of controlling to output a voice signal while displaying a stationary character.
[0024] Referring to FIG. 1, the electronic device (100) may be described as a device available for displaying a character (120). For example, the electronic device (100) may be one of various forms of mobile devices such as smartphones having various form factors (e.g., bar-type smartphones, foldable-type smartphones, or rollable-type smartphones), tablets, wearable devices, cellular phones, personal computers (PCs) (e.g., laptops and / or desktops), and / or other similar computing devices that include circuits (or circuitry) for displaying the character (120).
[0025] For example, the electronic device (100) may include a display (110) (e.g., the display (330) of FIG. 3) and a speaker (115) (e.g., the speaker (340) of FIG. 3). For example, in state (105), the electronic device (100) may display a character (120) through the display (110) (or an external electronic device connected to the electronic device (100)). For example, the character (120) may include a player character (PC), a non-player character (NPC), and / or an avatar. For example, the player character and the avatar may include visual objects that perform actions available within the game environment based on user input. For example, the non-player character may represent a character set during the design of the virtual space provided by the server. For example, the non-player character may be referred to as a monster. The descriptions below exemplify characters, but are merely illustrative.
[0026] For example, the electronic device (100) can be controlled to output a voice signal (125) while displaying the character (120), and the voice signal (125) can be output through a speaker (115) (or an external electronic device connected to the electronic device (100)). For example, the voice signal (125) can be described as voice, audio, and / or speech uttered by the character (120). For example, because the voice signal (125) is not seen as being uttered by the character (120), the static character (120) (or the face of the character (120)) displayed while the voice signal (125) is being output may appear unnatural to the user. For example, a solution may be required to resolve the problem of the static character (120) (or the face of the character (120)) appearing unnatural to the user.
[0027] To resolve these problems, it may be required to display the face of the character (120) speaking the voice signal (125) while outputting the voice signal (125). For example, to display the face of the character (120) speaking the voice signal (125), it may be required to generate an animation that expresses the face of the character (120) speaking the voice signal (125). For example, the animation that expresses the face of the character (120) speaking the voice signal (125) may include an animation that expresses the facial expression of the character (120) speaking the voice signal (125). For example, the electronic device (100) may use text information regarding the voice signal (125) to generate an animation that expresses the facial expression of the character (120) speaking the voice signal (125). Creating an animation that expresses the facial expression of a character (120) speaking a voice signal (125) using text information about a voice signal (125) is exemplified in the description of FIG. 2.
[0028] FIG. 2 illustrates an example of a character's facial expression included in an animation created using a different voice signal and a person's facial expression that produced a voice corresponding to a voice signal.
[0029] Referring to FIG. 2, the electronic device (100) can generate an animation (215) expressing the facial expression of a character speaking the voice signal (200) using the voice signal (200) and text information (210) for the voice signal (200). For example, the voice signal (200) may be acquired (or recorded) through a microphone included in the electronic device (100) (or an external electronic device connected to the electronic device (100)). For example, the voice signal (200) may be acquired from a voice spoken by a person (205). For example, the voice signal (200) may be described as a voice signal to be spoken by a character. For example, the text information (210) for the voice signal (200) may be described as text information containing letters corresponding to the syllables of the voice signal (200).
[0030] For example, the electronic device (100) can identify words included in text information (210) for a voice signal (200). For example, the electronic device (100) can identify a designated word among the words included in text information (210) for a voice signal (200). For example, the electronic device (100) can identify a category containing the designated word. For example, an animation (215) expressing the facial expression of a character speaking the voice signal (200) can express the face of a character having an expression mapped to the category of the designated word at the time the character speaks the designated word. For example, because the number of categories is limited, the number of facial expressions mapped to each category may be limited. For example, the facial expression of the person (205) speaking the voice signal (200) may not be included in the limited number of facial expressions. For example, generating an animation (215) that expresses the facial expression of a character speaking a voice signal (200) using limited facial expressions may result in errors.
[0031] For example, the meaning of the designated word may vary depending on the context of the text information (210) for the voice signal (200). For example, if the meaning assigned to the designated word and the meaning the word actually has in the text information (210) are different, the facial expression mapped to the category of the designated word may differ from the facial expression of the person (205) speaking the designated word. For example, determining the facial expression of a character in the animation (215) based only on whether the designated word is included in the text information (210) for the voice signal (200) may result in an error.
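As a non-limiting illustration of the limitation described in the two preceding paragraphs, the following Python sketch maps designated words to categories and categories to expressions using fixed tables. The words, categories, and expression names are hypothetical; the disclosure does not enumerate any. Because the lookup ignores context, a sincere sentence and a sarcastic sentence containing the same word receive the same expression.

```python
# Hypothetical lookup tables; the disclosure does not list concrete entries.
WORD_TO_CATEGORY = {"great": "joy", "wow": "surprise", "ugh": "disgust"}
CATEGORY_TO_EXPRESSION = {
    "joy": "smile",
    "surprise": "raised_brows",
    "disgust": "frown",
}


def expressions_by_lookup(text: str) -> list[tuple[str, str]]:
    """Return (word, expression) pairs using only the fixed tables."""
    result = []
    for token in text.lower().split():
        word = token.strip(".,!?")  # drop trailing punctuation
        category = WORD_TO_CATEGORY.get(word)
        if category is not None:
            result.append((word, CATEGORY_TO_EXPRESSION[category]))
    return result


sincere = expressions_by_lookup("That was a great match!")
sarcastic = expressions_by_lookup("Oh great, the server crashed again.")
# Both calls yield the "smile" expression for "great", although the second
# sentence would likely be spoken with a different facial expression.
```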
[0032] To resolve these errors regarding the facial expressions of a character expressed in the animation (215), it may be required to identify the meaning of a designated word included in the text information (210) for the voice signal (200). For example, an artificial intelligence model (e.g., a natural language processing (NLP) model) may be used to identify the meaning of a designated word included in the text information (210) for the voice signal (200). For example, an electronic device (100) may generate an animation expressing the facial expression of a character speaking the voice signal (200) based on the meaning of a designated word included in the text information (210) for the voice signal (200) identified using an artificial intelligence model (e.g., a natural language processing (NLP) model). For example, the electronic device (100) may perform the actions exemplified in the description of FIGS. 4 through 10 to generate an animation expressing the facial expression of a character speaking the voice signal (200). The electronic device (100) may include components for performing the above operations. The components may be exemplified within the description of FIG. 3.
[0033] FIG. 3 is a simplified block diagram of an exemplary electronic device.
[0034] Referring to FIG. 3, the electronic device (300) may be one of various types of mobile devices, such as smartphones having various form factors (e.g., bar-type smartphones, foldable-type smartphones, or rollable-type smartphones), tablets, wearable devices, cellular phones, personal computers (PCs) (e.g., laptops and / or desktops), and / or other similar computing devices. For example, the electronic device (300) may include or correspond to the electronic device (100) of FIGS. 1 and 2. For example, the electronic device (300) may include at least one processor (310), memory (320), display (330), and / or speaker (340).
[0035] At least one processor (310) may include a processing circuit. For example, at least one processor (310) may include a CPU (central processing unit) (e.g., including a processing circuit). For example, at least one processor (310) may include a GPU (graphic processing unit) (e.g., including a processing circuit) and / or an NPU (neural processing unit) (e.g., including a processing circuit). For example, at least one processor (310) may be described as an application processor. For example, at least one processor (310) may be configured to control a memory (320), a display (330), and / or a speaker (340). At least one processor (310) may be configured to execute instructions stored in memory (320) individually or collectively to cause the electronic device (100) to perform at least some of the operations illustrated in the description of FIGS. 1 and 2. At least one processor (310) may be configured to execute instructions stored in memory (320) to cause the electronic device (300) to perform at least some of the operations illustrated in the description of FIGS. 4 through 10.
[0036] For example, the term “processor” as used herein, including in the claims, may include various processing circuits comprising at least one processor, and one or more of said at least one processor may be configured to perform the various functions described below in a distributed manner, individually and / or collectively. As used below, where “processor,” “at least one processor,” and “one or more processors” are described as being configured to perform various functions, these terms encompass, for example, but not limited to, situations where one processor performs some of the cited functions and another processor(s) perform other parts of the cited functions, and also situations where one processor can perform all of the cited functions. Additionally, said at least one processor may include a combination of processors that perform the enumerated / disclosed various functions, for example, in a distributed manner. At least one processor may execute program instructions to achieve or perform the various functions.
[0037] The memory (320) may include one or more storage media. For example, the memory (320) may store various data used by at least one component of the electronic device (300) (e.g., at least one processor (310), display (330), and / or speaker (340)). For example, the data may include input data or output data for software and related commands. The memory (320) may include volatile memory or non-volatile memory.
[0038] The display (330) can output visualized information to a user under the control of at least one processor (310). For example, the display (330) may include a flat panel display (FPD) and / or electronic paper. The FPD may include a liquid crystal display (LCD), a plasma display panel (PDP), and / or one or more light emitting diodes (LEDs). For example, the LEDs may include organic LEDs (OLEDs). For example, the display (330) may be used to output animations.
[0039] The speaker (340) can output an acoustic signal to the outside of the electronic device (300). For example, the speaker (340) can convert an electrical signal into sound. For example, the speaker (340) can be connected directly or wirelessly to the electronic device (300) and can output sound through an external electronic device (e.g., a speaker or headphones). For example, the speaker (340) can be used to output a voice signal.
[0040] The electronic device (300) illustrated in the description of FIG. 3 may perform at least some of the operations illustrated in the descriptions of FIGS. 4 through 10. For example, the operations illustrated in the descriptions of FIGS. 4 through 10 may be caused by (or within) the electronic device (300) under the control of at least one processor (310).
[0041] FIG. 4 is a flowchart illustrating exemplary operations of an electronic device for generating an animation using a voice signal and text information about the voice signal.
[0042] Referring to FIG. 4, in operation 400, at least one processor (310) can acquire a voice signal for the speech of a character (e.g., voice signal (515) of FIG. 5). For example, the voice signal can be described as a voice signal to be spoken by a character. For example, the character may include a player character (PC), a non-player character (NPC), and / or an avatar. For example, the character may include a 2D (two-dimensional) character and a 3D (three-dimensional) character.
[0043] For example, the voice signal may include a voice signal to be used within media content (e.g., movies, games, and / or the metaverse). For example, the voice signal may be referred to as dialogue of a character within the media content. For example, the voice signal may be used to create an animation representing the face of a character speaking the voice signal. For example, the voice signal may be acquired (or recorded) through a microphone included within the electronic device (300) (or through an external electronic device). As an example, without limitation, the voice signal may be received from an external electronic device or generated within the electronic device (300). For example, the voice signal may be acquired by performing processing (or tuning) on the voice signal acquired through the microphone (or the external electronic device). However, it is not limited thereto.
[0044] For example, at least one processor (310) can obtain text information for a voice signal. For example, text information for a voice signal may be described as text information containing characters corresponding to the syllables of the voice signal. For example, text information for a voice signal may include natural language corresponding to the voice signal. For example, text information for a voice signal may be obtained based on user input. As an example, but not limited to, text information for a voice signal may be received from an external electronic device or generated using the voice signal. For example, at least one processor (310) can obtain text information for a voice signal by providing the voice signal to a speech-to-text (STT) model (e.g., the speech-to-text (STT) model (520) of FIG. 5). However, it is not limited thereto.
[0045] In operation 410, at least one processor (310) can identify a word having a designated meaning among the words included in the text information for a voice signal. For example, at least one processor (310) can provide (or input) the text information for the voice signal to an artificial intelligence model (e.g., the NLP model (505) of FIG. 5). For example, the artificial intelligence model may be included in an electronic device (300) or in a server. For example, by providing (or inputting) the text information for the voice signal to the artificial intelligence model, at least one processor (310) can identify a word having a designated meaning among the words included in the text information for the voice signal. For example, a word having a designated meaning may include an interjection. For example, using the artificial intelligence model to identify a word having a designated meaning among the words included in the text information for the voice signal will be illustrated in the description of FIG. 5.
[0046] In operation 420, at least one processor (310) can identify the time when a word having a designated meaning is uttered using a voice signal. For example, a word having a designated meaning can be described as a word included in text information for the voice signal identified in operation 410. For example, at least one processor (310) can identify the times when each of the words included in the voice signal is uttered by providing the voice signal to a STT model. For example, at least one processor (310) can identify the time when a word having a designated meaning is uttered by identifying a word having a designated meaning among the words included in the voice signal. For example, the time when a word having a designated meaning is uttered may include the time when a word having a designated meaning is uttered by a character. Identifying the time when a word having a designated meaning is uttered will be exemplified in the description of FIG. 5.
[0047] In operation 430, at least one processor (310) may acquire first data indicating the time when the word having the designated meaning is uttered. For example, the first data may be used to generate an animation expressing the facial expression of a character uttering a voice signal. For example, the first data may further include information on the word having the designated meaning. For example, the animation may express the face of a character having an expression corresponding to the word having the designated meaning at the time according to the first data. For example, acquiring the first data is illustrated in the description of FIG. 5.
[0048] FIG. 5 illustrates an example of obtaining first data that indicates the time when a word having a designated meaning is uttered using a voice signal and text information about the voice signal.
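As a non-limiting illustration, the first data of operation 430 can be sketched in Python as a list of records built from the output of operations 410 and 420. The record types `WordTiming` and `FirstData`, their field names, and the choice to identify designated words by token position rather than by spelling are assumptions made for this sketch; the disclosure does not specify a data layout. The sketch also assumes that the token positions reported by the artificial intelligence model line up with the per-word timings.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass(frozen=True)
class WordTiming:
    word: str
    start_s: float  # assumed unit: seconds from the start of the voice signal


@dataclass(frozen=True)
class FirstData:
    word: str      # information on the word having the designated meaning
    time_s: float  # time at which that word is uttered


def build_first_data(
    designated_indices: Iterable[int],   # token positions flagged in operation 410
    word_timings: Sequence[WordTiming],  # per-word times found in operation 420
) -> List[FirstData]:
    # Selecting by position keeps apart two occurrences of the same spelling
    # that have different meanings in context.
    return [
        FirstData(word_timings[i].word, word_timings[i].start_s)
        for i in sorted(set(designated_indices))
    ]


timings = [
    WordTiming("Well", 0.0), WordTiming("the", 0.6), WordTiming("well", 0.8),
    WordTiming("is", 1.1), WordTiming("dry", 1.3),
]
# Only the first "Well" (an interjection) is flagged, not the noun at index 2.
first_data = build_first_data([0], timings)
```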
[0049] Referring to FIG. 5, at least one processor (310) may provide (or input) text information (500) regarding a voice signal to an artificial intelligence model (e.g., an NLP model (505)). For example, the NLP model (505) may be contained within an electronic device (300) or within a server. For example, the NLP model (505) may be described as an artificial intelligence (AI) model that analyzes (or interprets) and processes human language. For example, the NLP model (505) may include a machine learning model, a deep learning model, and / or an AI model. For example, the NLP model (505) may include a natural language understanding (NLU) model.
[0050] For example, an NLP model (505) provided (or input) with text information (500) about a voice signal can perform syntactic analysis and / or semantic analysis on the text information (500) about the voice signal. For example, the NLP model (505) can extract words from the text information (500) about the voice signal using linguistic features (e.g., grammatical elements) of morphemes and / or phrases included in the text information (500) about the voice signal. For example, the NLP model (505) can identify the context of the text information (500) about the voice signal using the words extracted from the text information (500) about the voice signal. For example, at least one processor (310) can determine the meaning of the words extracted from the text information (500) about the voice signal according to the identified context of the text information (500) about the voice signal. For example, at least one processor (310) can determine the meaning of the extracted words by performing word sense disambiguation or word sense induction on the words extracted from text information (500) for a voice signal.
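As a non-limiting illustration of word sense disambiguation, the following Python sketch picks the sense of a word whose hand-written cue set overlaps most with the surrounding words. This is a simplified, Lesk-style overlap heuristic substituted here for illustration only; it is not the NLP model (505), and the `SENSES` inventory and its cue words are hypothetical.

```python
# Hypothetical sense inventory: each sense lists cue words that tend to
# appear near the word when it is used in that sense.
SENSES = {
    "well": {
        "interjection": {"oh", "hmm", "so", "anyway", "guess", "think"},
        "noun": {"water", "dry", "deep", "bucket", "dig"},
    }
}


def disambiguate(word: str, context_words: list[str]) -> str:
    """Return the sense whose cue set shares the most words with the context."""
    senses = SENSES[word.lower()]
    context = {w.lower().strip(".,!?") for w in context_words}
    # Ties resolve to the sense listed first in the inventory.
    return max(senses, key=lambda sense: len(senses[sense] & context))


noun_case = disambiguate("well", "The well is dry".split())
interjection_case = disambiguate("well", "Well, I guess so".split())
```

A trained model would replace the hand-written cue sets, but the principle is the same: the surrounding words decide which meaning a given occurrence carries.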
[0051] For example, at least one processor (310) can identify a word (510) having a designated meaning among the words included in the text information (500) for the voice signal by providing (or inputting) text information (500) for the voice signal to an NLP model (505). For example, the word (510) having a designated meaning may include an interjection. For example, the word (510) having a designated meaning may differ from the designated word because it is identified based on the context of the text information (500) for the voice signal. For example, the same words included in the text information (500) for the voice signal may have different meanings depending on the context of the text information (500) for the voice signal. For example, at least one processor (310) can generate an animation expressing the face of a character having an expression corresponding to the designated meaning by identifying the word (510) having a designated meaning among the words included in the text information (500) for the voice signal.
[0052] For example, at least one processor (310) may provide (or input) a voice signal (515) to a speech-to-text (STT) model (520). For example, the STT model (520) may be included in an electronic device (300) or in a server. For example, the STT model (520) may be described as an AI model for converting an input voice signal into text information. For example, the STT model (520) may include an automatic speech recognition (ASR) model. For example, the STT model (520) may include a machine learning model, a deep learning model, and / or an AI model.
[0053] For example, at least one processor (310) can obtain text information about a voice signal (515) by providing (or inputting) a voice signal (515) to an STT model (520). For example, the text information about the voice signal (515) obtained using the STT model (520) may correspond to, or substantially correspond to, text information (500) about the voice signal. For example, at least one processor (310) can further obtain data (525) about the times when each of the words included in the voice signal (515) is uttered by providing (or inputting) a voice signal (515) to an STT model (520). For example, the data (525) about the times when each of the words included in the voice signal (515) is uttered may be referred to as information about the times of the words included in the voice signal (515).
[0054] For example, at least one processor (310) can identify words included in text information for a voice signal (515) obtained using an STT model (520). For example, at least one processor (310) can identify a word (510) having a designated meaning among the words included in the voice signal (515). For example, at least one processor (310) can identify the time when the word (510) having a designated meaning is to be spoken by using data (525) regarding the times when each of the words included in the voice signal (515) is to be spoken. For example, at least one processor (310) can obtain first data (530) indicating the time when the word (510) having a designated meaning is to be spoken and / or the word (510) having a designated meaning, based on identifying the time when the word (510) having a designated meaning is to be spoken.
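As an illustration only, the selection of first data (530) from the per-word timing data (525) can be sketched as follows. The class names, field names, and the idea that the NLP model (505) returns word positions are assumptions made for this sketch and are not taken from the disclosure. The sketch also assumes that the words in the text information (500) align one-to-one with the words recognized by the STT model (520).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordTiming:
    """One entry of the per-word timing data (525); field names are hypothetical."""
    word: str
    start_sec: float
    end_sec: float


@dataclass(frozen=True)
class FirstDataEntry:
    """One entry of the first data (530): a designated-meaning word and its time."""
    word: str
    start_sec: float
    end_sec: float


def build_first_data(timings, designated_indices):
    """Select timing entries for the words flagged as having a designated meaning.

    `designated_indices` stands in for the NLP model's output (an assumption):
    positions of the flagged words in the text. Positions rather than strings
    are used so that two occurrences of the same word stay distinct, since the
    context decides whether an occurrence has the designated meaning.
    """
    entries = []
    for index in sorted(designated_indices):
        if not 0 <= index < len(timings):
            raise IndexError(f"no timing for word index {index}")
        timing = timings[index]
        entries.append(FirstDataEntry(timing.word, timing.start_sec, timing.end_sec))
    return entries


timings = [
    WordTiming("wow", 0.00, 0.40),
    WordTiming("that", 0.55, 0.70),
    WordTiming("is", 0.70, 0.80),
    WordTiming("great", 0.80, 1.20),
]
# Suppose the NLP model flagged only the interjection at position 0.
first_data = build_first_data(timings, designated_indices={0})
```

Carrying both the word and its interval keeps the sketch consistent with the statement that the first data may indicate the time and may further include information on the word itself.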
[0055] Referring again to FIG. 4, in operation 440, at least one processor (310) can obtain second data (e.g., second data (600) of FIG. 6) indicating an emotion regarding text information. For example, at least one processor (310) can provide (or input) text information regarding a voice signal to an artificial intelligence model (e.g., NLP model (505) of FIG. 5). For example, the artificial intelligence model may be contained within an electronic device (300) or contained within a server. For example, at least one processor (310) can obtain second data (e.g., second data (600) of FIG. 6) indicating an emotion regarding text information by providing (or inputting) text information regarding a voice signal to the artificial intelligence model.
[0056] For example, at least one processor (310) may perform operation 440 after performing operations 410 to 430, or perform operations 410 to 430 after performing operation 440. For example, at least one processor (310) may perform operation 440 while performing operations 410 to 430, or perform operations 410 to 430 while performing operation 440. For example, obtaining second data indicating sentiment regarding text information using an artificial intelligence model is exemplified in the description of FIG. 6.
[0057] FIG. 6 illustrates an example of obtaining second data that indicates an emotion regarding text information using text information regarding a voice signal.
[0058] Referring to FIG. 6, at least one processor (310) can obtain second data indicating an emotion regarding text information using text information regarding a voice signal. For example, at least one processor (310) can provide (or input) text information (500) regarding a voice signal to an artificial intelligence model (e.g., an NLP model (505)). For example, the NLP model (505) may be contained within an electronic device (300) or within a server. For example, the NLP model (505) may correspond to the NLP model (505) of FIG. 5.
[0059] For example, an NLP model (505) provided (or input) with text information (500) about a voice signal may perform structural parsing and / or dependency parsing on the text information (500) about the voice signal. For example, the NLP model (505) may perform sentiment analysis (or opinion mining) on the text information (500) about the voice signal for which structural parsing and / or dependency parsing has been performed. For example, at least one processor (310) may identify the sentiment regarding the text information (500) about the voice signal based on performing sentiment analysis (or opinion mining). For example, at least one processor (310) can use an NLP model (505) to identify the type of emotion (e.g., joy, sadness, anger, surprise, neutral) and / or level of the text information (500) for the voice signal. For example, the level of emotion for the text information (500) for the voice signal may indicate the intensity of the emotion for the text information (500) for the voice signal. For example, the emotion for the text information (500) for the voice signal may be classified into N levels (e.g., 3 levels). For example, N may be a natural number.
[0060] For example, the emotion regarding text information (500) regarding a voice signal may correspond to the emotion regarding the voice signal. For example, the emotion regarding text information (500) regarding a voice signal may be determined according to the context of text information (500) regarding a voice signal. For example, at least one processor (310) may acquire second data (600) indicating the emotion regarding text information (500) regarding a voice signal based on identifying the emotion regarding text information (500) regarding a voice signal. For example, the second data (600) may include the type and / or level of the emotion regarding text information (500) regarding a voice signal.
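As a non-authoritative sketch, the second data (600) can be pictured as a small record holding an emotion type and one of N levels. The per-emotion scores in the range 0 to 1 are an assumed stand-in for whatever the NLP model (505) outputs after sentiment analysis, and the names `SecondData`, `quantize_level`, and `build_second_data` are hypothetical.

```python
from dataclasses import dataclass

# Emotion types named as examples in the description.
EMOTION_TYPES = ("joy", "sadness", "anger", "surprise", "neutral")


@dataclass(frozen=True)
class SecondData:
    """Emotion type plus intensity level (1 = weakest, n_levels = strongest)."""
    emotion_type: str
    level: int


def quantize_level(intensity, n_levels=3):
    """Map an intensity in [0, 1] onto an integer level in 1..n_levels."""
    if n_levels < 1:
        raise ValueError("n_levels must be a natural number")
    clamped = min(max(intensity, 0.0), 1.0)
    return min(int(clamped * n_levels) + 1, n_levels)


def build_second_data(scores, n_levels=3):
    """Pick the highest-scoring emotion and quantize its score into a level.

    `scores` maps emotion type to a score in [0, 1]; this format is an
    assumption of the sketch, not something the description specifies.
    """
    unknown = set(scores) - set(EMOTION_TYPES)
    if unknown:
        raise ValueError(f"unknown emotion types: {sorted(unknown)}")
    emotion_type = max(scores, key=scores.get)
    return SecondData(emotion_type, quantize_level(scores[emotion_type], n_levels))


second_data = build_second_data({"joy": 0.9, "neutral": 0.1})
```

With the default of three levels, a score of 0.9 for joy falls into the highest level.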
[0061] For example, at least one processor (310) can generate an animation expressing the facial expression of a character speaking a voice signal using the first data and the second data. For example, data indicating the volume and pitch of the voice signal may be further used to generate an animation expressing the facial expression of a character speaking a voice signal. For example, the animation expressing the facial expression of a character speaking a voice signal may be used to generate an animation expressing the face of a character speaking a voice signal. Generating and displaying an animation expressing the face of a character speaking a voice signal is exemplified in the description of FIG. 7.
[0062] FIG. 7 is a flowchart illustrating exemplary operations of an electronic device for generating and displaying an animation representing a character's face.
[0063] Referring to FIG. 7, in operation 700, at least one processor (310) can identify the volume of a voice signal using a voice signal (e.g., voice signal (515) of FIG. 5). For example, the volume of the voice signal can be used as a means to supplement the emotion regarding text information (e.g., text information (500) of FIG. 6) of the second data (e.g., second data (600) of FIG. 6). For example, the emotion of the character speaking the voice signal determined using only text information may be inaccurate. For example, the emotion regarding the text information may vary depending on the volume of the voice signal. For example, the level of emotion regarding the text information may be proportional to the volume of the voice signal. For example, as the volume of the voice signal increases, the level of emotion regarding the text information may increase, or as the volume of the voice signal decreases, the level of emotion regarding the text information may decrease. For example, at least one processor (310) can identify the volume of a voice signal to supplement the sentiment regarding the text information of the second data.
[0064] For example, at least one processor (310) can identify the pitch of the voice signal using the voice signal. For example, the pitch of the voice signal can be used as a means to supplement the sentiment regarding the text information of the second data. For example, the sentiment regarding the text information may vary depending on the pitch of the voice signal. For example, the level of sentiment regarding the text information may be proportional to the pitch of the voice signal. For example, as the pitch of the voice signal increases, the level of sentiment regarding the text information may increase, or as the pitch of the voice signal decreases, the level of sentiment regarding the text information may decrease. For example, at least one processor (310) can identify the pitch of the voice signal to supplement the sentiment regarding the text information of the second data.
[0065] For example, at least one processor (310) can obtain third data indicating the volume of the voice signal and the pitch of the voice signal based on identifying the volume of the voice signal and the pitch of the voice signal.
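The description does not state how volume and pitch are measured. One common approach, shown here purely as an assumption, is to use root-mean-square amplitude as the volume and the peak of the autocorrelation as a pitch estimate. The names `ThirdData`, `rms_volume`, and `estimate_pitch_hz`, and the 80 to 400 Hz search range, are hypothetical choices for this sketch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ThirdData:
    """Volume and pitch of a voice signal; field names are hypothetical."""
    volume: float
    pitch_hz: float


def rms_volume(samples):
    """Root-mean-square amplitude, used here as a simple loudness proxy."""
    if not samples:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(samples, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency from the autocorrelation peak.

    Only lags whose frequency lies in [fmin, fmax] are searched.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    if min_lag < 1 or max_lag < min_lag:
        raise ValueError("signal too short for the requested pitch range")
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        corr = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag


# Synthetic stand-in for a voice signal: a 200 Hz tone, 0.1 s at 8 kHz.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]
third_data = ThirdData(rms_volume(tone), estimate_pitch_hz(tone, SAMPLE_RATE))
```

Real speech would normally be analyzed frame by frame rather than as one block; the single-block form is kept only to make the two quantities concrete.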
[0066] In operation 710, at least one processor (310) may provide (or input) the first data (e.g., the first data (530) of FIG. 5), the second data (e.g., the second data (600) of FIG. 6), and the third data to a trained model (e.g., the trained model (800) of FIG. 8). For example, by providing (or inputting) the first data, the second data, and the third data to the trained model, at least one processor (310) may generate a first animation (e.g., the first animation (810) of FIG. 8) that expresses the facial expression of a character speaking a voice signal (e.g., the voice signal (515) of FIG. 5). Generating the first animation using the trained model is illustrated in the description of FIG. 8.
[0067] FIG. 8 illustrates an example of generating an animation that expresses a character's facial expression.
[0068] Referring to FIG. 8, at least one processor (310) may provide (or input) a voice signal (e.g., voice signal (515) of FIG. 5) to a trained model (800). For example, at least one processor (310) may provide (or input) a first data (530), a second data (605), and a third data (805) to the trained model (800). For example, the trained model (800) may include a machine learning model, a deep learning model, and / or a generative artificial intelligence (AI) model. For example, the trained model (800) may be contained within an electronic device (300) or within a server. For example, the trained model (800) may be described as a model trained to generate an animation expressing the facial expression of a character speaking the voice signal using data indicating a word having a designated meaning, data indicating an emotion, and / or data indicating the volume and pitch of the voice signal.
[0069] For example, the trained model (800) may include an artificial neural network model comprising multiple layers and / or operations (or computations). As a non-limiting example, the trained model (800) may include one of a feedforward neural network (FNN), a deep neural network (DNN), a convolutional neural network (CNN), a region-based convolutional neural network (R-CNN), a region proposal network (RPN), a recurrent neural network (RNN), a stacking-based deep neural network (S-DNN), a state-space dynamic neural network (S-SDNN), a deconvolution network, a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), deep Q-networks, a fully convolutional network, a long short-term memory (LSTM) network, a classification network, or a combination of two or more of these. For example, the trained model (800) can be trained on input data (e.g., data indicating words with designated meanings, data indicating emotions, and / or data indicating the volume and pitch of a voice signal), and can perform operations (or inference) based on receiving input data to obtain output data (e.g., an animation expressing the facial expression of a character speaking a voice signal). The trained model (800) on the electronic device (300) may include a hardware structure and, additionally or alternatively, a software structure.
[0070] For example, at least one processor (310) can generate a first animation (810) that expresses the facial expression of a character speaking a voice signal by providing (or inputting) the voice signal, the first data (530), the second data (605), and the third data (805) to a trained model (800). As an example without limitation, the character's facial expression may include the character's eyebrow movement, eye blinking, head movement, and / or lip movement. However, it is not limited thereto. For example, in the first animation (810), the character may have an expression corresponding to the word of the specified meaning of the first data (530) at the time of speaking the word of the specified meaning of the first data (530). For example, the expression corresponding to the word of the specified meaning may be described as a natural expression that the character may have while speaking the word. For example, the expression corresponding to the word of the specified meaning may be determined by the trained model (800). For example, since the word with a specified meaning is determined by an NLP model (e.g., the NLP model (505) of FIG. 5), the facial expression corresponding to the word with a specified meaning may correspond to, or substantially correspond to, the facial expression of a person at the time of speaking the word with a specified meaning within the voice signal. For example, the facial expression corresponding to the word with a specified meaning of the first data (530) possessed by a character in the first animation (810) can be described as non-verbal communication.
[0071] For example, in the first animation (810), a character may have an expression corresponding to the emotion of the second data (605). For example, the expression corresponding to the emotion of the second data (605) may include an expression corresponding to the type of emotion of the second data (605) (e.g., joy, sadness, anger, surprise, neutral). For example, the expression corresponding to the emotion of the second data (605) may include an expression corresponding to the level of the emotion of the second data (605). For example, the level of the emotion of the second data (605) may be determined based further on the volume of the voice signal of the third data (805). For example, the level of the emotion of the second data (605) may increase as the volume of the voice signal of the third data (805) increases, or decrease as the volume of the voice signal of the third data (805) decreases. For example, the level of the emotion of the second data (605) may be determined based further on the pitch of the voice signal of the third data (805). For example, the level of the emotion of the second data (605) may increase as the pitch of the voice signal of the third data (805) increases, or decrease as the pitch of the voice signal of the third data (805) decreases. For example, the facial expression corresponding to the emotion of the second data (605) may be expressed throughout the period in which the character speaks the voice signal within the first animation (810). For example, the facial expression corresponding to the emotion of the second data (605) may be described as a background expression.
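The description leaves the adjustment of the emotion level to the trained model (800) and only states that the level may rise or fall with volume and pitch. As one possible explicit reading of that relationship, the sketch below scales a text-derived level by the ratio of the measured volume and pitch to reference values. The function name `adjust_level`, the reference values, and the equal weighting of volume and pitch are all assumptions of this sketch.

```python
def adjust_level(base_level, volume, pitch_hz, ref_volume, ref_pitch_hz, n_levels=3):
    """Scale a text-derived emotion level by relative volume and pitch.

    A voice louder or higher than the reference raises the level, and a
    quieter or lower voice reduces it. The result is clamped to 1..n_levels.
    The 50/50 weighting of volume and pitch is an arbitrary illustrative choice.
    """
    if ref_volume <= 0 or ref_pitch_hz <= 0:
        raise ValueError("reference values must be positive")
    gain = 0.5 * (volume / ref_volume) + 0.5 * (pitch_hz / ref_pitch_hz)
    scaled = round(base_level * gain)
    return max(1, min(n_levels, scaled))


# Same text-derived level (2), three different deliveries.
neutral_delivery = adjust_level(2, volume=0.2, pitch_hz=200, ref_volume=0.2, ref_pitch_hz=200)
excited_delivery = adjust_level(2, volume=0.4, pitch_hz=300, ref_volume=0.2, ref_pitch_hz=200)
subdued_delivery = adjust_level(2, volume=0.05, pitch_hz=100, ref_volume=0.2, ref_pitch_hz=200)
```

A trained model would presumably learn a richer mapping; the linear gain is used only to make the stated proportionality visible.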
[0072] Referring again to FIG. 7, in operation 720, at least one processor (310) may provide a voice signal (e.g., voice signal (515) of FIG. 5) to another trained model. For example, the other trained model may include a machine learning model, a deep learning model, and / or a generative artificial intelligence (AI) model. For example, the other trained model may be contained within an electronic device (300) or within an external server. For example, the other trained model may be described as a model trained to generate an animation expressing the shape of the mouth of a character speaking the voice signal using the voice signal. For example, at least one processor (310) may train the other trained model by providing (or inputting) voice signals and videos including the face of a person speaking the voice signal to the other trained model.
[0073] For example, the other trained model may include an artificial neural network model comprising multiple layers and / or operations. By way of non-limiting example, the other trained model may include one of a feedforward neural network (FNN), a deep neural network (DNN), a convolutional neural network (CNN), a region-based convolutional neural network (R-CNN), a region proposal network (RPN), a recurrent neural network (RNN), a stacking-based deep neural network (S-DNN), a state-space dynamic neural network (S-SDNN), a deconvolution network, a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), deep Q-networks, a fully convolutional network, a long short-term memory (LSTM) network, a classification network, or a combination of two or more of these. For example, the other trained model may be trained on input data (e.g., voice signals) and may perform computation (or inference) based on receiving input data to obtain output data (e.g., an animation expressing the shape of the mouth of a character speaking a voice signal). The other trained model on the electronic device (300) may include a hardware structure and, additionally or alternatively, a software structure.
[0074] For example, at least one processor (310) can generate a second animation (e.g., the second animation (900) of FIG. 9) that represents the shape of the mouth of a character speaking a voice signal by providing (or inputting) a voice signal to another trained model. For example, the second animation can be described as a lip sync (or lip synchronization) animation of the voice signal. For example, the type of character in the second animation can be determined by the type of voice signal (e.g., young male). For example, the type of character in the second animation can correspond to the type of voice signal.
[0075] For example, at least one processor (310) may perform operation 720 while performing operations 410 to 440, operation 700, and operation 710, or perform operations 410 to 440, operation 700, and operation 710 while performing operation 720. For example, at least one processor (310) may perform operations 410 to 440, operation 700, and operation 710 and then perform operation 720, or perform operation 720 and then perform operations 410 to 440, operation 700, and operation 710.
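Because the two animation paths do not depend on each other's output, they can run in either order or concurrently. The following sketch shows the concurrent case using Python's standard thread pool. The two generator functions are placeholders invented for this sketch; they return dummy records rather than calling any real model.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_expression_animation(voice_signal, text):
    """Placeholder for operations 410 to 440, 700, and 710 (first animation)."""
    return {"kind": "expression", "frames": len(voice_signal), "text": text}


def generate_lip_sync_animation(voice_signal):
    """Placeholder for operation 720 (second animation)."""
    return {"kind": "lip_sync", "frames": len(voice_signal)}


def generate_both(voice_signal, text):
    """Run both paths concurrently and return (first, second) animations."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(generate_expression_animation, voice_signal, text)
        second = pool.submit(generate_lip_sync_animation, voice_signal)
        # result() blocks until each path finishes, so blending can follow.
        return first.result(), second.result()


first_animation, second_animation = generate_both([0.0] * 10, "wow that is great")
```

Running the calls one after the other would give the same pair of results, which matches the statement that either ordering is permitted.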
[0076] In operation 730, at least one processor (310) can blend a first animation (e.g., the first animation (810) of FIG. 8) and a second animation. For example, at least one processor (310) can generate a third animation (e.g., the third animation (905) of FIG. 9) representing the face of a character speaking a voice signal by blending the first animation and the second animation. For example, the character in the third animation may correspond to the character in the first animation and the character in the second animation. Generating a third animation representing the face of a character speaking a voice signal by blending the first animation and the second animation is exemplified in the description of FIG. 9.
[0077] FIG. 9 illustrates an example of generating an animation that expresses a character's face by blending an animation that expresses a character's facial expression and an animation that expresses the shape of the character's mouth.
[0078] Referring to FIG. 9, at least one processor (310) can generate a third animation (905) representing the face of a character speaking a voice signal (515) by blending a first animation (810) and a second animation (900). For example, the third animation (905) may represent the face of a character including the shape of the mouth of the character speaking a voice signal and the facial expression of the character speaking a voice signal. For example, at least one processor (310) can generate the third animation (905) by changing the weight of the first animation (810) and the weight of the second animation (900).
[0079] For example, the character in the third animation (905) may have the facial expression of the character in the first animation (810) according to the weight of the first animation (810). For example, the character in the third animation (905) may have the shape of the mouth of the character in the second animation (900) according to the weight of the second animation (900). For example, the character in the third animation (905) may correspond to the character in the first animation (810) and the character in the second animation (900). For example, since the third animation (905) expresses both the shape of the mouth and the facial expression of the character speaking a voice signal, the face of the character speaking a voice signal in the third animation (905) may be more natural than the face of the character speaking a voice signal in the first animation (810) and the face of the character speaking a voice signal in the second animation (900).
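The description does not define the animation format or the blending formula. A minimal sketch, assuming that each animation is a list of frames and each frame maps a facial control channel to a value between 0 and 1, is a weighted additive blend. The channel names `brow_raise` and `jaw_open` and the functions `blend_frames` and `blend_animations` are hypothetical.

```python
def blend_frames(first, second, first_weight, second_weight):
    """Blend two frames channel by channel, clamping each result to [0, 1].

    A channel missing from one frame is treated as 0.0 in that frame.
    """
    channels = sorted(set(first) | set(second))
    return {
        ch: min(1.0, max(0.0, first_weight * first.get(ch, 0.0)
                         + second_weight * second.get(ch, 0.0)))
        for ch in channels
    }


def blend_animations(first_anim, second_anim, first_weight=1.0, second_weight=1.0):
    """Blend an expression animation and a lip-sync animation frame by frame."""
    if len(first_anim) != len(second_anim):
        raise ValueError("animations must have the same number of frames")
    return [
        blend_frames(a, b, first_weight, second_weight)
        for a, b in zip(first_anim, second_anim)
    ]


expression_anim = [{"brow_raise": 0.8, "jaw_open": 0.1}]
lip_sync_anim = [{"jaw_open": 0.6}]
third_anim = blend_animations(expression_anim, lip_sync_anim,
                              first_weight=0.5, second_weight=1.0)
```

Lowering the first weight softens the expression while leaving the mouth shape driven mainly by the lip-sync animation, which is one way to read "changing the weight" of each animation.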
[0080] Referring again to FIG. 7, in operation 740, at least one processor (310) may control a display (330) (or a display of an external electronic device) to display a third animation (e.g., the third animation (905) of FIG. 9). For example, at least one processor (310) may control a speaker (340) (or a speaker of an external electronic device) to output a voice signal (e.g., the voice signal (515) of FIG. 5) while the third animation is displayed. For example, at least one processor (310) may control the display of the third animation while the voice signal is output. For example, at least one processor (310) may present the face of a character speaking a voice signal by outputting the voice signal while displaying the third animation (or by displaying the third animation while outputting the voice signal). Controlling the output of a voice signal while displaying the third animation is exemplified in the description of FIG. 10.
[0081] FIG. 10 illustrates an example of controlling the output of a voice signal while displaying an animation expressing a character's face.
[0082] Referring to FIG. 10, the state (1000) can be described as a state in which the third animation (905) is displayed. In the state (1000), at least one processor (310) can control the output of a voice signal (515) while the third animation (905) is displayed. For example, at least one processor (310) can display the third animation (905) and output the voice signal (515) simultaneously.
[0083] For example, at the time when a phoneme in the voice signal (515) is output, the shape of the mouth of the character uttering the phoneme in the third animation (905) is displayed, so that the character in the third animation (905) may be seen uttering the voice signal (515). For example, at least one processor (310) may provide a character uttering the voice signal (515) by displaying the third animation (905) while the voice signal (515) is output.
[0084] For example, at the time when a word having a designated meaning in the voice signal (515) is output, a character in the third animation (905) may have an expression corresponding to the word having the designated meaning. For example, while the voice signal (515) is output, a character in the third animation (905) may have an expression corresponding to an emotion contained in the voice signal (515). For example, at least one processor (310) may perform non-verbal communication by displaying a character having an expression corresponding to the voice signal (515) while the voice signal (515) is output.
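The interplay between the word-level expression and the emotion-level background expression during playback can be pictured with a simple lookup. In this sketch, which is an assumption rather than the disclosed mechanism, the word-level expression overrides the background expression while the designated-meaning word is being output. The interval format and the expression names are hypothetical.

```python
def expression_at(time_sec, word_intervals, background_expression):
    """Return the expression to show at a given playback time.

    `word_intervals` is a list of (start_sec, end_sec, expression_name)
    tuples for words having a designated meaning. Outside every interval
    the background (emotion-level) expression applies.
    """
    for start_sec, end_sec, expression_name in word_intervals:
        if start_sec <= time_sec < end_sec:
            return expression_name
    return background_expression


word_intervals = [(0.0, 0.4, "surprised")]
during_word = expression_at(0.2, word_intervals, background_expression="joy")
after_word = expression_at(0.9, word_intervals, background_expression="joy")
```

In the described device this selection is embodied in the generated animation itself; the lookup only makes the timing relationship explicit.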
[0085] According to another embodiment, at least one processor (310) may transmit a voice signal (515) and a third animation (905) to an external electronic device. For example, the external electronic device may be one of various types of mobile devices, such as smartphones having various form factors (e.g., bar-type smartphones, foldable-type smartphones, or rollable-type smartphones), tablets, wearable devices, cellular phones, personal computers (PCs) (e.g., laptops and / or desktops), and / or other similar computing devices. For example, the external electronic device may receive the voice signal (515) and the third animation (905) from the electronic device (300). For example, the external electronic device may output the voice signal (515) while displaying the third animation (905) through the display of the external electronic device. For example, at least one processor (310) can control an external electronic device to output a voice signal (515) while displaying the third animation (905) by transmitting a voice signal (515) and a third animation (905) to the external electronic device.
[0086] In FIG. 10, only the display of a third animation (905) while outputting a voice signal (515) is illustrated, but while outputting the voice signal (515), a first animation (e.g., the first animation (810) of FIG. 8) and / or a second animation (e.g., the second animation (900) of FIG. 9) may be displayed. For example, at least one processor (310) may provide a character that speaks the voice signal (515) by outputting the first animation and / or the second animation while outputting the voice signal (515). For example, at least one processor (310) may provide a character having an expression corresponding to the voice signal (515) by controlling the output of the first animation and / or the second animation while outputting the voice signal (515).
[0087] As described above, the electronic device may include a memory for storing instructions. The electronic device may include at least one processor including a processing circuit. When the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to acquire a voice signal for the utterance of a character and text information regarding the voice signal. When the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to identify a word having a designated meaning among the words included in the text information by providing the text information to an artificial intelligence model. When the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to identify the time at which the word is uttered using the voice signal. When the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to acquire data indicating the time based on identifying the time. When the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to generate an animation representing the face of a character speaking the voice signal by providing the voice signal and the data to a trained model. When the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to control the output of the voice signal while displaying the animation.
[0088] For example, when the above instructions are executed individually or collectively by the at least one processor, they may cause the electronic device to acquire data indicating an emotion regarding the text information by providing the text information to the artificial intelligence model. When the above instructions are executed individually or collectively by the at least one processor, they may cause the electronic device to generate another animation expressing the face of the character that expresses the emotion and utters the voice signal with an expression corresponding to the word at that time by further providing the data indicating the emotion to the trained model. When the above instructions are executed individually or collectively by the at least one processor, they may cause the electronic device to control the output of the voice signal while displaying the other animation.
[0089] For example, the data indicating the emotion may include data regarding the type of emotion and data regarding the level of the emotion.
[0090] For example, the above instructions may cause the electronic device to acquire data indicating the volume and pitch of the voice signal using the voice signal when executed individually or collectively by the at least one processor. The above instructions may cause the electronic device to generate the other animation expressing the face of the character uttering the voice signal with the expression corresponding to the word at a given time, by further providing the data indicating the volume and pitch to the trained model when executed individually or collectively by the at least one processor.
[0091] For example, the level of the emotion may be proportional to the volume of the voice signal or the pitch of the voice signal.
[0092] For example, the animation may be a first animation. When the instructions are executed individually or collectively by the at least one processor, they may cause the electronic device to generate a second animation expressing the shape of the mouth of the character speaking the voice signal by providing the voice signal to another trained model. When the instructions are executed individually or collectively by the at least one processor, they may cause the electronic device to generate a third animation expressing the face of the character speaking the voice signal with the character's facial expression and the shape of the character's mouth by blending the first animation and the second animation.
[0093] For example, when the instructions are executed individually or collectively by the at least one processor, they may cause the electronic device to obtain data regarding the times when each of the words included in the voice signal is uttered by providing the voice signal to a speech-to-text (STT) model. When the instructions are executed individually or collectively by the at least one processor, they may cause the electronic device to identify the time when the word having the designated meaning is uttered by using the data regarding the times.
[0094] As described above, the method may be executed within an electronic device having memory and a processor. The method may include an operation of acquiring a voice signal for character utterance and text information regarding said voice signal. The method may include an operation of identifying a word having a designated meaning among the words included in said text information by providing said text information to an artificial intelligence model. The method may include an operation of identifying the time at which said word is uttered using said voice signal. The method may include an operation of acquiring data indicating said time based on identifying said time. The method may include an operation of generating an animation representing the face of a character uttering said voice signal by providing said voice signal and said data to a trained model. The method may include an operation of controlling said voice signal to be output while displaying said animation.
[0095] For example, the above method may include an operation of obtaining data indicating an emotion regarding the text information by providing the text information to the artificial intelligence model. The above method may include an operation of generating another animation expressing the face of the character that expresses the emotion and utters the voice signal with an expression corresponding to the word at the time by further providing the data indicating the emotion to the trained model. The above method may include an operation of controlling the output of the voice signal while displaying the other animation.
[0096] For example, the data indicating the emotion may include data regarding the type of emotion and data regarding the level of the emotion.
[0097] For example, the method may include an operation of obtaining, using the voice signal, data indicating the volume and pitch of the voice signal. The method may include an operation of generating, by further providing the data indicating the volume and pitch to the trained model, the other animation expressing the emotion at a level according to the volume and pitch and representing the face of the character uttering the voice signal with the facial expression corresponding to the word at the time point.
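As one possible illustration of obtaining volume and pitch data from a voice signal, the sketch below uses root-mean-square amplitude as the volume measure and an autocorrelation peak as the pitch estimate. Both techniques are assumptions chosen for the example; the disclosure does not state how volume or pitch is measured. The function names and the 80 to 400 Hz search range are hypothetical.

```python
import math


def rms_volume(samples):
    """Root-mean-square amplitude, used here as a simple volume measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch(samples, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate pitch in Hz as sample_rate / lag, where lag maximizes
    the (unnormalized) autocorrelation within the [fmin, fmax] range."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(
            samples[i] * samples[i + lag] for i in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic test signal: a 200 Hz sine wave with amplitude 0.5, 0.1 s long.
rate = 8_000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
volume = rms_volume(tone)
pitch = estimate_pitch(tone, rate)
print(round(volume, 3), round(pitch, 1))
```

Real speech is far less regular than a sine wave, so a production system would typically analyze short overlapping windows and use a more robust pitch tracker.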
[0098] For example, the level of the emotion may be proportional to the volume of the voice signal or the pitch of the voice signal.
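A minimal sketch of emotion data carrying a type and a level, with the level proportional to a measured volume or pitch, might look as follows. The `EmotionData` class, the `scale_emotion` function, and the `gain` constant are hypothetical names introduced for illustration; the disclosure states only the proportional relationship, not a formula or a value range.

```python
from dataclasses import dataclass


@dataclass
class EmotionData:
    kind: str     # type of emotion, e.g. "anger" (label set is assumed)
    level: float  # level (intensity) of the emotion


def scale_emotion(kind, measure, gain):
    """Return emotion data whose level is proportional to `measure`.

    `measure` may be the volume or the pitch of the voice signal, and
    `gain` is an assumed proportionality constant.
    """
    if measure < 0 or gain < 0:
        raise ValueError("measure and gain must be non-negative")
    return EmotionData(kind=kind, level=gain * measure)


quiet = scale_emotion("anger", measure=0.2, gain=2.0)
loud = scale_emotion("anger", measure=0.4, gain=2.0)
print(quiet.level, loud.level)
```

Doubling the measured volume doubles the level, which is the proportional behavior described above.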
[0099] For example, the animation may be a first animation. The method may include an operation of generating a second animation expressing the shape of the mouth of the character speaking the voice signal by providing the voice signal to another trained model. The method may include an operation of generating, by blending the first animation and the second animation, a third animation representing the face of the character speaking the voice signal with the character's facial expression and the shape of the character's mouth.
[0100] For example, the method may include an operation of obtaining data regarding the time points at which each of the words included in the voice signal is uttered by providing the voice signal to a speech-to-text (STT) model. The method may include an operation of identifying, using the data regarding the time points, the time point at which the word having the designated meaning is uttered.
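One way such blending could be realized is a per-channel weighted mix of two frame sequences, where each frame maps a facial channel name to a weight. This is an assumption made for illustration: the disclosure does not specify the animation representation or the blending rule, and the channel names, `MOUTH_CHANNELS`, and `mouth_weight` below are hypothetical.

```python
MOUTH_CHANNELS = {"jaw_open", "lips_pucker"}  # hypothetical channel names


def blend_frames(expr_frame, mouth_frame, mouth_weight=0.8):
    """Blend one frame of the first (expression) animation with one frame
    of the second (mouth shape) animation.

    Mouth channels are mixed toward the mouth animation by `mouth_weight`;
    all other channels are taken from the expression animation.
    """
    blended = {}
    for channel in expr_frame.keys() | mouth_frame.keys():
        e = expr_frame.get(channel, 0.0)
        m = mouth_frame.get(channel, 0.0)
        w = mouth_weight if channel in MOUTH_CHANNELS else 0.0
        blended[channel] = (1 - w) * e + w * m
    return blended


def blend_animations(first, second, mouth_weight=0.8):
    """Produce the third animation frame by frame."""
    if len(first) != len(second):
        raise ValueError("animations must have the same number of frames")
    return [blend_frames(a, b, mouth_weight) for a, b in zip(first, second)]


first = [{"brow_raise": 0.6, "jaw_open": 0.1}]  # facial expression
second = [{"jaw_open": 0.9}]                    # mouth shape
third = blend_animations(first, second)
print(third[0])
```

Here the brow channel keeps its expression value while the jaw channel is dominated by the mouth animation, so the result carries both the facial expression and the mouth shape.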
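The lookup of the time point for a designated word can be sketched as below, assuming the STT model returns word-level timestamps as `(word, start_seconds, end_seconds)` tuples. That output shape and the function name `designated_word_times` are assumptions for illustration, not the interface of any particular STT system.

```python
def designated_word_times(stt_words, designated):
    """Return (word, start_seconds) for each uttered word that matches a
    word already identified as having a designated meaning.

    stt_words: list of (word, start_seconds, end_seconds) tuples, an
    assumed shape for word-level STT output.
    """
    hits = []
    for word, start, _end in stt_words:
        normalized = word.lower().strip(".,!?")
        if normalized in designated:
            hits.append((normalized, start))
    return hits


# Hypothetical STT output for the utterance "I really hate this."
stt_output = [
    ("I", 0.00, 0.12),
    ("really", 0.12, 0.48),
    ("hate", 0.48, 0.80),
    ("this.", 0.80, 1.05),
]
times = designated_word_times(stt_output, {"hate"})
print(times)  # prints: [('hate', 0.48)]
```

If the same word is uttered more than once, every occurrence is returned, so a caller can decide which time points to pass on to the animation step.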
[0101] As described above, the non-transient computer-readable storage medium may store one or more programs. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to obtain a voice signal for a character's utterance and text information regarding the voice signal. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to identify a word having a designated meaning among the words included in the text information by providing the text information to an artificial intelligence model. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to identify, using the voice signal, the time point at which the word is uttered. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to obtain data indicating the time point based on identifying the time point. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to generate an animation representing the face of a character speaking the voice signal by providing the voice signal and the data to a trained model. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to control the voice signal to be output while the animation is displayed.
[0102] For example, the one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to obtain data indicating an emotion regarding the text information by providing the text information to the artificial intelligence model. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to generate, by further providing the data indicating the emotion to the trained model, another animation representing the face of the character that expresses the emotion and utters the voice signal with a facial expression corresponding to the word at the time point. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to control the voice signal to be output while the other animation is displayed.
[0103] For example, the data indicating the emotion may include data regarding the type of emotion and data regarding the level of the emotion.
[0104] For example, the one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to acquire, using the voice signal, data indicating the volume and pitch of the voice signal. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to generate, by further providing the data indicating the volume and pitch to the trained model, the other animation expressing the emotion at a level according to the volume and pitch and representing the face of the character uttering the voice signal with the facial expression corresponding to the word at the time point.
[0105] For example, the level of the emotion may be proportional to the volume of the voice signal or the pitch of the voice signal.
[0106] For example, the animation may be a first animation. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to generate a second animation expressing the shape of the mouth of the character speaking the voice signal by providing the voice signal to another trained model. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to generate, by blending the first animation and the second animation, a third animation representing the face of the character speaking the voice signal with the character's facial expression and the shape of the character's mouth.
[0107] For example, the one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to obtain data regarding the time points at which each of the words included in the voice signal is uttered by providing the voice signal to a speech-to-text (STT) model. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to identify, using the data regarding the time points, the time point at which the word having the designated meaning is uttered.
[0108] The device described above may be implemented as a hardware component, a software component, and / or a combination of a hardware component and a software component. For example, the device and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing unit may execute an operating system (OS) and one or more software applications executed on said operating system. Additionally, the processing unit may access, store, manipulate, process, and generate data in response to the execution of the software. For ease of understanding, the processing unit may be described as being used as a single unit, but those skilled in the art will understand that the processing unit may include multiple processing elements and / or multiple types of processing elements. For example, the processing unit may include multiple processors or one processor and one controller. In addition, other processing configurations, such as parallel processors, are also possible.
[0109] Software may include computer programs, code, instructions, or a combination of one or more of these, and may configure a processing unit to operate as desired or instruct the processing unit independently or collectively. Software and / or data may be embodied in any type of machine, component, physical device, computer storage medium, or device so as to be interpreted by the processing unit or to provide instructions or data to the processing unit. Software may be distributed over networked computer systems and may be stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.
[0110] The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. In this case, the medium may continuously store a computer-executable program, or temporarily store it for execution or download. Additionally, the medium may be any of various recording or storage means in the form of a single piece of hardware or a combination of several pieces of hardware, and is not limited to a medium directly connected to a computer system but may be distributed over a network. Examples of media may include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and media configured to store program instructions, including ROM, RAM, and flash memory. Other examples of media may include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software.
[0111] Although the embodiments have been described above with reference to limited examples and drawings, those skilled in the art can make various modifications and variations from the description above. For example, suitable results may be achieved even if the described techniques are performed in a different order than described, and / or the components of the described system, structure, device, circuit, etc. are combined or assembled in a form different from described, or replaced or substituted by other components or equivalents.
[0112] Therefore, other implementations, other embodiments, and equivalents to the claims set forth below are also within the scope of the claims. According to one embodiment, the method according to the various embodiments disclosed herein may be provided as a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or distributed online (e.g., download or upload) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product may be temporarily stored or temporarily created in a device-readable storage medium, such as the memory of a manufacturer's server, an application store's server, or a relay server.
[0113] According to various embodiments, each of the components described above (e.g., a module or a program) may include a single entity or multiple entities, and some of the multiple entities may be separately disposed in other components. According to various embodiments, one or more of the aforementioned components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the multiple components in the same or a similar manner as they were performed by the corresponding component among the multiple components prior to integration. According to various embodiments, operations performed by a module, a program, or another component may be executed sequentially, in parallel, iteratively, or heuristically; one or more of the operations may be executed in a different order or omitted; or one or more other operations may be added.
Claims
1. An electronic device comprising: memory storing instructions and including one or more storage media; and at least one processor comprising processing circuitry, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: acquire a voice signal for a character's speech and text information regarding the voice signal; identify a word having a designated meaning among the words included in the text information by providing the text information to an artificial intelligence model; identify, using the voice signal, a time point at which the word is uttered; obtain data indicating the time point based on identifying the time point; generate an animation representing the face of a character speaking the voice signal by providing the voice signal and the data to a trained model; and control the voice signal to be output while the animation is displayed.
2. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: obtain data indicating an emotion regarding the text information by providing the text information to the artificial intelligence model; generate, by further providing the data indicating the emotion to the trained model, another animation representing the face of the character that expresses the emotion and utters the voice signal with a facial expression corresponding to the word at the time point; and control the voice signal to be output while the other animation is displayed.
3. The electronic device of claim 2, wherein the data indicating the emotion includes data regarding a type of the emotion and data regarding a level of the emotion.
4. The electronic device of claim 2, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: obtain, using the voice signal, data indicating a volume of the voice signal and a pitch of the voice signal; and generate, by further providing the data indicating the volume and the pitch to the trained model, the other animation expressing the emotion at a level according to the volume and the pitch and representing the face of the character uttering the voice signal with the facial expression corresponding to the word at the time point.
5. The electronic device of claim 4, wherein the level of the emotion is proportional to the volume of the voice signal or the pitch of the voice signal.
6. The electronic device of claim 1, wherein the animation is a first animation, and wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: generate a second animation expressing the shape of the mouth of the character speaking the voice signal by providing the voice signal to another trained model; and generate, by blending the first animation and the second animation, a third animation representing the face of the character uttering the voice signal, the face having the facial expression of the character and the shape of the mouth of the character.
7. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: obtain data regarding the time points at which each of the words included in the voice signal is uttered by providing the voice signal to a speech-to-text (STT) model; and identify, using the data regarding the time points, the time point at which the word having the designated meaning is uttered.
8. A method executed by an electronic device including memory and a processor, the method comprising: an operation of obtaining a voice signal for a character's speech and text information regarding the voice signal; an operation of identifying a word having a designated meaning among the words included in the text information by providing the text information to an artificial intelligence model; an operation of identifying, using the voice signal, a time point at which the word is uttered; an operation of acquiring data indicating the time point based on identifying the time point; an operation of generating an animation representing the face of a character speaking the voice signal by providing the voice signal and the data to a trained model; and an operation of controlling the voice signal to be output while the animation is displayed.
9. The method of claim 8, further comprising: an operation of obtaining data indicating an emotion regarding the text information by providing the text information to the artificial intelligence model; an operation of generating, by further providing the data indicating the emotion to the trained model, another animation representing the face of the character uttering the voice signal with a facial expression corresponding to the emotion and a facial expression corresponding to the word at the time point; and an operation of controlling the voice signal to be output while the other animation is displayed.
10. The method of claim 9, wherein the data indicating the emotion includes data regarding a type of the emotion and data regarding a level of the emotion.
11. The method of claim 10, further comprising: an operation of acquiring, using the voice signal, data indicating a volume of the voice signal and a pitch of the voice signal; and an operation of generating, by further providing the data indicating the volume and the pitch to the trained model, the other animation expressing the emotion at a level according to the volume and the pitch and representing the face of the character uttering the voice signal with the facial expression corresponding to the word at the time point.
12. The method of claim 11, wherein the level of the emotion is proportional to the volume of the voice signal or the pitch of the voice signal.
13. The method of claim 8, wherein the animation is a first animation, the method further comprising: an operation of generating a second animation expressing the shape of the mouth of the character speaking the voice signal by providing the voice signal to another trained model; and an operation of generating, by blending the first animation and the second animation, a third animation representing the face of the character uttering the voice signal, the face having the facial expression of the character and the shape of the mouth of the character.
14. The method of claim 8, further comprising: an operation of obtaining data regarding the time points at which each of the words included in the voice signal is uttered by providing the voice signal to a speech-to-text (STT) model; and an operation of identifying, using the data regarding the time points, the time point at which the word having the designated meaning is uttered.
15. A non-transient computer-readable storage medium storing one or more programs, wherein the one or more programs include instructions that, when executed by an electronic device having memory and a processor, cause the electronic device to: acquire a voice signal for a character's speech and text information regarding the voice signal; identify a word having a designated meaning among the words included in the text information by providing the text information to an artificial intelligence model; identify, using the voice signal, a time point at which the word is uttered; obtain data indicating the time point based on identifying the time point; generate an animation representing the face of a character speaking the voice signal by providing the voice signal and the data to a trained model; and control the voice signal to be output while the animation is displayed.
16. The non-transient computer-readable storage medium of claim 15, wherein the one or more programs include instructions that, when executed by the electronic device, cause the electronic device to: obtain data indicating an emotion regarding the text information by providing the text information to the artificial intelligence model; generate, by further providing the data indicating the emotion to the trained model, another animation representing the face of the character uttering the voice signal with a facial expression corresponding to the emotion and a facial expression corresponding to the word at the time point; and control the voice signal to be output while the other animation is displayed.
17. The non-transient computer-readable storage medium of claim 16, wherein the data indicating the emotion includes data regarding a type of the emotion and data regarding a level of the emotion.
18. The non-transient computer-readable storage medium of claim 17, wherein the one or more programs include instructions that, when executed by the electronic device, cause the electronic device to: obtain, using the voice signal, data indicating a volume of the voice signal and a pitch of the voice signal; and generate, by further providing the data indicating the volume and the pitch to the trained model, the other animation expressing the emotion at a level according to the volume and the pitch and representing the face of the character uttering the voice signal with the facial expression corresponding to the word at the time point.
19. The non-transient computer-readable storage medium of claim 15, wherein the animation is a first animation, and wherein the one or more programs include instructions that, when executed by the electronic device, cause the electronic device to: generate a second animation expressing the shape of the mouth of the character speaking the voice signal by providing the voice signal to another trained model; and generate, by blending the first animation and the second animation, a third animation representing the face of the character uttering the voice signal, the face having the facial expression of the character and the shape of the mouth of the character.
20. The non-transient computer-readable storage medium of claim 15, wherein the one or more programs include instructions that, when executed by the electronic device, cause the electronic device to: obtain data regarding the time points at which each of the words included in the voice signal is uttered by providing the voice signal to a speech-to-text (STT) model; and identify, using the data regarding the time points, the time point at which the word having the designated meaning is uttered.