Electronic device and method for acquiring image for audio signal, and non-transitory computer-readable storage medium
By processing an audio signal to generate an image or video of the external environment, the device can provide visual information even when it has no camera or cannot capture the environment with one.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: SAMSUNG ELECTRONICS CO LTD
- Filing Date: 2025-10-23
- Publication Date: 2026-06-18
Abstract
Description
Electronic device, method, and non-transitory computer-readable storage medium for acquiring an image for an audio signal
[0001] The present disclosure relates to an electronic device, a method, and a non-transitory computer-readable storage medium for acquiring an image for an audio signal.
[0002] Artificial intelligence is a technology for simulating human (or biological) neural activities such as perception and / or inference, and can be implemented as hardware, software, or a combination thereof designed to perform calculations for simulating neural activities.
[0003] The information described above may be provided as related art for the purpose of aiding understanding of the present disclosure. No claim or determination is made as to whether any of the foregoing may be applied as prior art related to the present disclosure.
[0004] An electronic device is described. The electronic device may include at least one processor comprising a processing circuit, a microphone, and a memory comprising one or more storage media configured to store one or more programs including instructions configured to be executed individually or collectively by the at least one processor. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to acquire an audio signal using the microphone. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to identify a context indicated by the audio signal based on an event that generates an image for the audio signal. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to identify a speaker of at least one voice signal included in the audio signal based on the event. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to identify at least one image containing a representation of the speaker among images stored in the electronic device and images registered in association with the user's account of the electronic device. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to generate a prompt including first information about the speaker and second information about the context. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to provide the prompt and the at least one image to a trained model. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to acquire the image for the audio signal generated by the trained model using the prompt and the at least one image.
[0005] A method is described. The method may be performed within an electronic device including a microphone. The method may include an operation of acquiring an audio signal using the microphone. The method may include an operation of identifying a context indicated by the audio signal based on an event of generating an image for the audio signal. The method may include an operation of identifying an utterer of at least one voice signal included in the audio signal based on the event. The method may include an operation of identifying at least one image containing a representation of the utterer among images stored in the electronic device and images registered in association with the user's account of the electronic device. The method may include an operation of generating a prompt including first information about the utterer and second information about the context. The method may include an operation of providing the prompt and the at least one image to a trained model. The method may include an operation of acquiring the image for the audio signal generated by the trained model using the prompt and the at least one image.
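As a non-limiting illustration, the order of the operations described above can be sketched in Python. Every callable passed to `acquire_image_for_audio` (`identify_context`, `identify_utterer`, `find_reference_images`, `build_prompt`, `generate_image`) is a hypothetical stand-in introduced only for this sketch; the disclosure does not name such functions, and the stub values below are invented sample data.

```python
from typing import Callable, Sequence


def acquire_image_for_audio(
    audio: bytes,
    identify_context: Callable[[bytes], str],
    identify_utterer: Callable[[bytes], str],
    find_reference_images: Callable[[str], Sequence[str]],
    build_prompt: Callable[[str, str], str],
    generate_image: Callable[[str, Sequence[str]], str],
) -> str:
    """Chain the operations of the method; all callables are hypothetical stand-ins."""
    context = identify_context(audio)            # context indicated by the audio signal
    utterer = identify_utterer(audio)            # utterer of the voice signal
    references = find_reference_images(utterer)  # stored/registered images of the utterer
    prompt = build_prompt(utterer, context)      # first and second information
    return generate_image(prompt, references)    # stand-in for the trained model


# Usage with trivial stubs in place of real recognizers and a real generative model.
result = acquire_image_for_audio(
    b"\x00\x01",
    identify_context=lambda audio: "Jane made three sandcastles.",
    identify_utterer=lambda audio: "Jane",
    find_reference_images=lambda name: [f"gallery/{name.lower()}_01.jpg"],
    build_prompt=lambda who, what: f"Make a picture of {who}. {what}",
    generate_image=lambda prompt, refs: f"image<{prompt}|{len(refs)} refs>",
)
```

Passing each stage in as a callable keeps the sketch independent of any particular recognizer or model, consistent with the description leaving those choices open.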
[0006] A non-transitory computer-readable storage medium is described. The non-transitory computer-readable storage medium may store one or more programs. The one or more programs may include instructions that, when executed by an electronic device including a microphone, cause the electronic device to acquire an audio signal using the microphone. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to identify a context indicated by the audio signal based on an event that generates an image for the audio signal. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to identify a speaker of at least one voice signal included in the audio signal based on the event. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to identify at least one image containing a representation of the speaker among images stored in the electronic device and images registered in association with the user's account of the electronic device. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to generate a prompt including first information about the speaker and second information about the context. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to provide the prompt and the at least one image to a trained model. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to acquire the image for the audio signal generated by the trained model using the prompt and the at least one image.
[0007] An electronic device is described. The electronic device may include at least one processor comprising a processing circuit, a microphone, and a memory comprising one or more storage media for storing one or more programs including instructions configured to be executed individually or collectively by the at least one processor. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to acquire an audio signal using the microphone. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to identify a context indicated by the audio signal based on an event that generates a video for the audio signal. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to identify a speaker of at least one voice signal included in the audio signal based on the event. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to identify at least one image containing a representation of the speaker among images stored in the electronic device and images registered in association with the user's account of the electronic device. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to generate a prompt including first information about the speaker and second information about the context. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to provide the prompt and the at least one image to a trained model. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to acquire the video for the audio signal generated by the trained model using the prompt and the at least one image.
[0008] A method is described. The method may be performed within an electronic device including a microphone. The method may include an operation of acquiring an audio signal using the microphone. The method may include an operation of identifying a context indicated by the audio signal based on an event of generating a video for the audio signal. The method may include an operation of identifying an utterer of at least one voice signal included in the audio signal based on the event. The method may include an operation of identifying at least one image containing a representation of the utterer among images stored in the electronic device and images registered in association with the user's account of the electronic device. The method may include an operation of generating a prompt including first information about the utterer and second information about the context. The method may include an operation of providing the prompt and the at least one image to a trained model. The method may include an operation of acquiring the video for the audio signal generated by the trained model using the prompt and the at least one image.
[0009] A non-transitory computer-readable storage medium is described. The non-transitory computer-readable storage medium may store one or more programs. The one or more programs may include instructions that, when executed by an electronic device including a microphone, cause the electronic device to acquire an audio signal using the microphone. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to identify a context indicated by the audio signal based on an event that generates a video for the audio signal. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to identify a speaker of at least one voice signal included in the audio signal based on the event. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to identify at least one image containing a representation of the speaker among images stored in the electronic device and images registered in association with the user's account of the electronic device. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to generate a prompt including first information about the speaker and second information about the context. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to provide the prompt and the at least one image to a trained model. The one or more programs may include instructions that, when executed by the electronic device, cause the electronic device to acquire the video for the audio signal generated by the trained model using the prompt and the at least one image.
[0010] FIG. 1 illustrates an example of acquiring an audio signal generated from outside an electronic device.
[0011] FIG. 2 is a simplified block diagram of an exemplary electronic device.
[0012] FIG. 3 is a flowchart illustrating exemplary operations of an electronic device for acquiring an image for an audio signal.
[0013] FIG. 4 illustrates an example of generating a prompt using a first trained model.
[0014] FIG. 5 illustrates an example of identifying at least one image.
[0015] FIG. 6 illustrates an example of generating an image for an audio signal using a second trained model.
[0016] FIG. 7 illustrates an example of generating a video for an audio signal using a third trained model.
[0017] FIG. 8 illustrates an example of an image for an audio signal and / or a video for an audio signal.
[0018] FIG. 9 illustrates examples of components of an electronic device.
[0019] FIG. 10 is a block diagram of an electronic device in a network environment according to various embodiments.
[0020] FIG. 11 illustrates an example of a generative artificial intelligence system according to one embodiment.
[0021] Hereinafter, embodiments of the present disclosure are described in detail with reference to the drawings so that those skilled in the art can easily practice them. However, the present disclosure may be embodied in various different forms and is not limited to the embodiments described herein. In relation to the description of the drawings, the same or similar reference numerals may be used for identical or similar components. Furthermore, in the drawings and related descriptions, descriptions of well-known functions and configurations may be omitted for clarity and brevity.
[0022] FIG. 1 illustrates an example of acquiring an audio signal generated from outside an electronic device.
[0023] Referring to FIG. 1, the electronic device (100) may be described as a device available for generating an image for an audio signal. For example, the electronic device (100) may be one of various forms of mobile devices such as smartphones having various form factors (e.g., bar-type smartphones, foldable-type smartphones, or rollable-type smartphones), tablets, wearable devices, cellular phones, personal computers (PCs) (e.g., laptops and / or desktops), and / or other similar computing devices that include circuits (or circuitry) for generating an image for an audio signal.
[0024] State (110) can be described as a state in which an image of the external environment of the electronic device (100) is to be acquired. In state (110), the electronic device (100) may not include a camera or may not be able to capture the external environment using a camera. In that case, the electronic device (100) may not be able to acquire an image of the external environment, and a user attempting to acquire such an image may experience inconvenience. A solution may therefore be required to resolve the inconvenience experienced by the user.
[0025] For example, to resolve this inconvenience, the electronic device (100) can acquire an image of the external environment by using an audio signal (115) generated from the external environment. For example, the electronic device (100) may include a microphone (105) (e.g., the microphone (230) of FIG. 2). For example, the electronic device (100) can acquire an audio signal (115) generated from the external environment by using the microphone (105). For example, the audio signal (115) may include at least one voice signal generated from an utterer (120). For example, the electronic device (100) can acquire an image of the external environment including a representation of the utterer (120) by using the at least one voice signal. For example, the audio signal (115) may include an audio signal generated from the background (125) of the external environment. For example, the electronic device (100) can acquire an image of the external environment including a representation of the background (125) by using the audio signal generated from the background (125) of the external environment. The electronic device (100) can perform operations that are exemplified within the description of FIGS. 3 through 8 to acquire an image of the external environment. The electronic device (100) may include components for performing said operations. Said components may be exemplified within the description of FIG. 2.
[0026] FIG. 2 is a simplified block diagram of an exemplary electronic device.
[0027] Referring to FIG. 2, the electronic device (200) may be one of various types of mobile devices, such as smartphones having various form factors (e.g., bar-type smartphones, foldable-type smartphones, or rollable-type smartphones), tablets, wearable devices, cellular phones, personal computers (PCs) (e.g., laptops and / or desktops), and / or other similar computing devices. For example, the electronic device (200) may include or correspond to the electronic device (100) of FIG. 1. For example, the electronic device (200) may include at least a part of the electronic device (1001) of FIG. 10 or correspond to at least a part of the electronic device (1001) of FIG. 10. For example, the electronic device (200) may include at least one processor (210) (e.g., processor (1020) of FIG. 10), memory (220) (e.g., memory (1030) of FIG. 10), and microphone (230) (e.g., audio module (1070) of FIG. 10).
[0028] At least one processor (210) may include a processing circuit. For example, at least one processor (210) may include a CPU (central processing unit) (e.g., including a processing circuit). For example, at least one processor (210) may include a GPU (graphics processing unit) (e.g., including a processing circuit) and / or an NPU (neural processing unit) (e.g., including a processing circuit). For example, at least one processor (210) may be described as an application processor. For example, at least one processor (210) may be configured to control a memory (220) and a microphone (230). At least one processor (210) may be configured to execute instructions stored in the memory (220) individually or collectively to cause the electronic device (200) to perform at least some of the operations illustrated in the description of FIG. 1. At least one processor (210) may be configured to execute instructions stored in the memory (220) to cause the electronic device (200) to perform at least some of the operations illustrated in the description of FIGS. 3 through 8.
[0029] For example, the term “processor” as used herein, including in the claims, may include various processing circuits comprising at least one processor, and one or more of said at least one processor may be configured to perform the various functions described below in a distributed manner, individually and / or collectively. As used below, where “processor,” “at least one processor,” and “one or more processors” are described as being configured to perform various functions, these terms encompass, for example, but not limited to, situations where one processor performs some of the cited functions and another processor(s) perform other parts of the cited functions, and also situations where one processor can perform all of the cited functions. Additionally, said at least one processor may include a combination of processors that perform the enumerated / disclosed various functions, for example, in a distributed manner. At least one processor may execute program instructions to achieve or perform the various functions.
[0030] The memory (220) may include one or more storage media. For example, the memory (220) may store various data used by at least one component of the electronic device (200) (e.g., at least one processor (210) and / or microphone (230)). For example, the data may include input data or output data for software and related commands. The memory (220) may include volatile memory or non-volatile memory.
[0031] The microphone (230) may be used to convert sound into an electrical signal. For example, the microphone (230) may be configured to acquire an audio signal generated from outside the electronic device (200). For example, the microphone (230) may include a digital microphone, an electret condenser microphone (ECM), or a micro-electro-mechanical system (MEMS) microphone. However, it is not limited thereto.
[0032] The electronic device (200) illustrated in the description of FIG. 2 may perform at least some of the operations illustrated in the descriptions of FIGS. 3 through 8. For example, the operations illustrated in the descriptions of FIGS. 3 through 8 may be caused by (or within) the electronic device (200) under the control of at least one processor (210).
[0033] FIG. 3 is a flowchart illustrating exemplary operations of an electronic device for acquiring an image for an audio signal.
[0034] Referring to FIG. 3, in operation 300, at least one processor (210) can acquire an audio signal (e.g., audio signal (115) of FIG. 1) using a microphone (230). For example, the audio signal may be generated from outside the electronic device (200) (or around the electronic device (200)). For example, the audio signal may include at least one voice signal generated by a speaker. For example, the audio signal may be used to generate an image (or video) for the audio signal. For example, the image (or video) for the audio signal may include an image (or video) of the external environment of the electronic device (200) at the time the audio signal is acquired. For example, at least one processor (210) can acquire an audio signal using a microphone (230) to acquire an image (or video) of the external environment of the electronic device (200).
[0035] In operation 310, at least one processor (210) can detect an event of generating an image for an audio signal. For example, the event of generating an image for an audio signal may be detected while the audio signal is being acquired or after the audio signal has been acquired. For example, the event of generating an image for an audio signal may be detected based on identifying that the audio signal is being acquired within a specified area (or region, or place, or location) or identifying that the audio signal is being acquired within a specified time interval (or date). For example, the event of generating an image for an audio signal may be detected based on identifying that the acquired audio signal contains a voice signal of a specified utterer or speaker or that the acquired audio signal contains an audio signal having a specified context (or content).
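As a non-limiting illustration, the detection conditions described above (specified area, specified time interval, specified speaker, specified context) can be sketched as a simple predicate. The `EventRule` fields, the keyword matching used to stand in for "specified context", and the choice that any single condition triggers the event are assumptions made for this sketch and are not stated in the disclosure.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import Iterable, Optional, Set


@dataclass
class EventRule:
    """Hypothetical configuration of the conditions that trigger image generation."""
    areas: Set[str] = field(default_factory=set)       # specified areas
    window_start: Optional[time] = None                # specified time interval
    window_end: Optional[time] = None
    speakers: Set[str] = field(default_factory=set)    # specified speakers
    keywords: Set[str] = field(default_factory=set)    # stand-in for a specified context


def event_detected(
    rule: EventRule,
    area: str,
    acquired_at: datetime,
    speakers: Iterable[str],
    transcript: str,
) -> bool:
    """Return True if any one configured condition holds (an assumption of this sketch)."""
    in_area = area in rule.areas
    in_window = (
        rule.window_start is not None
        and rule.window_end is not None
        and rule.window_start <= acquired_at.time() <= rule.window_end
    )
    has_speaker = bool(rule.speakers & set(speakers))
    text = transcript.lower()
    has_context = any(keyword.lower() in text for keyword in rule.keywords)
    return in_area or in_window or has_speaker or has_context


rule = EventRule(
    areas={"beach"},
    window_start=time(13, 0),
    window_end=time(15, 0),
    speakers={"jane"},
    keywords={"sandcastle"},
)
```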
[0036] For example, at least one processor (210) can identify the speaker of at least one voice signal included in the audio signal based on an event that generates an image for the audio signal. For example, the electronic device (200) may further include a display. For example, at least one processor (210) may display a notification (or window, or pop-up window) through the display indicating the detection of an event that generates an image for the audio signal based on the detection of an event that generates an image for the audio signal. For example, at least one processor (210) may further display a window (or UI (user interface)) through the display inquiring whether to generate an image for the audio signal. For example, at least one processor (210) may receive (or identify) user input to generate an image for the audio signal through the window inquiring whether to generate an image for the audio signal. For example, at least one processor (210) may receive (or identify) user input to generate an image for the audio signal after an event that generates an image for the audio signal is detected. For example, at least one processor (210) can identify the speaker of at least one voice signal included in the audio signal based on user input generating an image for the audio signal.
[0037] According to another embodiment, at least one processor (210) can identify the speaker of at least one voice signal included in the audio signal based on the detection of an event that generates an image for the audio signal. For example, at least one processor (210) can identify the speaker of at least one voice signal included in the audio signal without user input generating an image for the audio signal (or without receiving user input).
[0038] For example, at least one processor (210) can obtain at least one voice signal contained within the audio signal by using the audio signal based on an event that generates an image for the audio signal. For example, at least one processor (210) can extract at least one voice signal from the audio signal. For example, when the audio signal contains a plurality of voice signals, the voice signals may be generated by a plurality of speakers, respectively. For example, at least one processor (210) can identify the respective speaker of each of the at least one voice signal contained within the audio signal.
[0039] For example, at least one processor (210) can obtain information representing the features of at least one voice signal using at least one voice signal. For example, the information representing the features of at least one voice signal may be referred to as feature information, feature vector, embedding vector, and / or acoustic feature. For example, at least one processor (210) can identify the speaker of at least one voice signal using the information representing the features of at least one voice signal. For example, at least one processor (210) can identify the speaker of at least one voice signal by comparing the information representing the features of at least one voice signal with the information representing the features of each voice signal stored in the electronic device (200). As a non-limiting example, at least one processor (210) can identify the speaker of at least one voice signal by matching the speaker based on the similarity between the feature vector (or embedding vector) of at least one voice signal and the feature vector (or embedding vector) of each voice signal stored in the electronic device (200). However, it is not limited thereto.
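As a non-limiting illustration, matching a speaker by the similarity between embedding vectors can be sketched as follows. Cosine similarity, the `0.8` threshold, and the three-dimensional sample vectors are assumptions chosen for this sketch; the disclosure specifies neither a similarity measure nor a threshold, and real embeddings would come from a speaker-embedding model that is not shown here.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speaker(
    embedding: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # hypothetical acceptance threshold
) -> Optional[str]:
    """Return the enrolled speaker most similar to the embedding, or None if none reaches the threshold."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Invented sample embeddings standing in for voice features stored on the device.
enrolled = {"sophia": [1.0, 0.0, 0.0], "jane": [0.0, 1.0, 0.0]}
```

Returning `None` below the threshold lets the caller treat an unmatched voice as an unknown speaker instead of forcing the nearest enrolled name.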
[0040] According to another embodiment, at least one processor (210) can obtain an audio signal from which noise contained within the audio signal has been removed by performing noise cancellation on the audio signal. For example, at least one processor (210) can identify the speaker of at least one voice signal contained within the audio signal by using the noise-canceled audio signal. For example, at least one processor (210) can increase the accuracy of identifying the speaker of at least one voice signal contained within the audio signal by performing noise cancellation on the audio signal.
[0041] In operation 320, at least one processor (210) can identify the context indicated by the audio signal based on an event that generates an image for the audio signal. For example, the event that generates an image for the audio signal may correspond to the event that generates an image for the audio signal of operation 310. For example, at least one processor (210) can identify the context indicated by the audio signal based on the detection of the event that generates an image for the audio signal, or identify the context indicated by the audio signal based on user input that generates an image for the audio signal.
[0042] For example, the context indicated by the audio signal may include the context indicated by at least one voice signal contained within the audio signal. For example, the context indicated by at least one voice signal may be referred to as a conversational context. For example, at least one processor (210) may obtain at least one voice signal contained within the audio signal by using the audio signal based on an event that generates an image for the audio signal. For example, at least one processor (210) may extract at least one voice signal contained within the audio signal from the audio signal. For example, at least one processor (210) may obtain text for at least one voice signal by performing speech-to-text (STT) and / or automatic speech recognition (ASR) on at least one voice signal. For example, at least one processor (210) may identify the context indicated by at least one voice signal by using the text for at least one voice signal.
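As a non-limiting illustration, obtaining text for each voice signal and assembling it into a conversational context can be sketched as follows. The `transcribe` callable is a hypothetical stand-in for whatever STT or ASR engine is used, and representing the context as speaker-labeled transcript lines is an assumption of this sketch; the disclosure does not specify how the context is derived from the text.

```python
from typing import Callable, List, Tuple


def conversational_context(
    voice_segments: List[Tuple[str, bytes]],
    transcribe: Callable[[bytes], str],
) -> str:
    """Transcribe each (speaker, samples) segment and join the results into one context string."""
    lines = []
    for speaker, samples in voice_segments:
        text = transcribe(samples).strip()
        if text:  # skip segments for which no speech was recognized
            lines.append(f"{speaker}: {text}")
    return "\n".join(lines)


# A fake recognizer that maps sample bytes to fixed text, used only for demonstration.
fake_results = {
    b"seg-1": "I made three sandcastles.",
    b"seg-2": "  I am decorating them with shells. ",
    b"seg-3": "",
}
context = conversational_context(
    [("jane", b"seg-1"), ("sophia", b"seg-2"), ("jane", b"seg-3")],
    transcribe=lambda samples: fake_results[samples],
)
```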
[0043] For example, at least one processor (210) can obtain the remaining audio signal, excluding at least one voice signal included in the audio signal, by using the audio signal based on an event that generates an image for the audio signal. For example, the remaining audio signal, excluding at least one voice signal included in the audio signal, may be referred to as an audio signal indicating a background sound. For example, at least a portion of the remaining audio signal, excluding at least one voice signal included in the audio signal, may have a context. For example, the context indicated by the audio signal may include the context indicated by the remaining audio signal, excluding at least one voice signal included in the audio signal. For example, at least one processor (210) can obtain text for the remaining audio signal by performing STT and / or ASR on the remaining audio signal, excluding at least one voice signal included in the audio signal. For example, at least one processor (210) can identify the context indicated by the remaining audio signal by using the text for the remaining audio signal.
[0044] For example, at least one processor (210) can obtain an audio signal from which noise contained within the audio signal has been removed by performing noise cancellation on the audio signal. For example, at least one processor (210) can identify the context indicated by the audio signal using the noise-canceled audio signal. For example, at least one processor (210) can increase the accuracy of identifying the context indicated by the audio signal by performing noise cancellation on the audio signal.
[0045] For example, at least one processor (210) may perform operation 320 while performing operation 310, or perform operation 310 and then perform operation 320, or perform operation 320 and then perform operation 310.
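As a non-limiting illustration, the three orderings described above can be sketched as follows. The callables `op_310` and `op_320` are hypothetical stand-ins for the two operations, and using a thread pool for the overlapping case is an assumption of this sketch; the disclosure does not state how the operations are scheduled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def run_overlapped(
    audio: bytes,
    op_310: Callable[[bytes], str],
    op_320: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Perform operation 320 while operation 310 is in progress."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_310 = pool.submit(op_310, audio)
        future_320 = pool.submit(op_320, audio)
        return future_310.result(), future_320.result()


def run_in_order(
    audio: bytes,
    op_310: Callable[[bytes], str],
    op_320: Callable[[bytes], str],
    first_310: bool = True,
) -> Tuple[str, str]:
    """Perform the two operations one after the other, in either order."""
    if first_310:
        result_310 = op_310(audio)
        result_320 = op_320(audio)
    else:
        result_320 = op_320(audio)
        result_310 = op_310(audio)
    return result_310, result_320


# Stubs: each stand-in reads only the audio, so the order does not change the results.
stub_310 = lambda audio: f"speaker-from-{len(audio)}-bytes"
stub_320 = lambda audio: f"context-from-{len(audio)}-bytes"
```

The sketch works in any order because neither stand-in consumes the other's output, which is what the passage's interchangeable ordering suggests.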
[0046] In operation 330, at least one processor (210) may generate a prompt including first information about the speaker of at least one voice signal and second information about the context indicated by the audio signal. For example, a first trained model may be used to generate the prompt. Generating the prompt using the first trained model is illustrated in the description of FIG. 4.
[0047] FIG. 4 illustrates an example of generating a prompt using a first trained model.
[0048] Referring to FIG. 4, the first trained model (400) may be contained within an electronic device (200) or within a server outside the electronic device (200). For example, the first trained model (400) may be described as a model trained to generate a prompt. For example, the first trained model (400) may include the prompt design component (1121) of FIG. 11. For example, the first trained model (400) may include a machine learning model and / or a deep learning model.
[0049] For example, at least one processor (210) may provide the first trained model (400) with first information (405) about the speaker of at least one voice signal included in the audio signal and second information (410) about the context indicated by the audio signal. For example, at least one processor (210) may obtain the first information (405) (e.g., “sophia” and / or “jane”) based on identifying the speaker of at least one voice signal included in the audio signal. For example, at least one processor (210) may obtain the second information (410) (e.g., “jane made three sandcastles, and sophia is decorating the sandcastles with shells.”) based on identifying the context indicated by the audio signal. For example, at least one processor (210) may obtain a prompt (425) generated by the first trained model (400) using the first information (405) and the second information (410). For example, the prompt (425) (e.g., “Make a picture of Sophia and Jane playing in the sand, Jane made three sandcastles, and Sophia is decorating the sandcastles with shells.”) may include the first information (405) and the second information (410).
[0050] For example, the electronic device (200) may further include a GPS (global positioning system) receiver. For example, at least one processor (210) may use the GPS receiver to identify the location (or place) where the audio signal was acquired. For example, the location (or place) where the audio signal was acquired may be described as the location (or place) of the electronic device (200) at the time the audio signal was acquired. For example, at least one processor (210) may acquire third information (415) (e.g., “○○ Beach”) regarding the location (or place) where the audio signal was acquired. For example, at least one processor (210) may acquire the third information (415) based further on the audio signal. For example, at least one processor (210) may identify the location (or place) where the audio signal was acquired by using an audio signal indicating a background sound included in the audio signal, or identify the location (or place) where the audio signal was acquired based on the context of the audio signal. For example, at least one processor (210) can obtain third information (415) based further on an identified location (or place) using an audio signal indicating a background sound included in the audio signal (and / or based on the context of the audio signal).
[0051] For example, at least one processor (210) may further provide third information (415) to the first trained model (400). For example, at least one processor (210) may obtain a prompt (425) generated by the first trained model (400) using the third information (415) further. For example, the prompt (425) (e.g., “Make a picture of Sophia and Jane playing in the sand at ○○ Beach, Jane made three sandcastles, and Sophia is decorating the sandcastles with shells.”) may include the first information (405), the second information (410), and / or the third information (415).
[0052] For example, at least one processor (210) can identify the time and / or date at which the audio signal was acquired. For example, at least one processor (210) can acquire fourth information (420) (e.g., “2:00 PM” and / or “June”) regarding the time and / or date at which the audio signal was acquired. For example, at least one processor (210) can acquire the fourth information (420) based further on the audio signal. For example, at least one processor (210) can identify the time and / or date at which the audio signal was acquired by using an audio signal indicating a background sound included in the audio signal, or can identify the time and / or date at which the audio signal was acquired based on the context of the audio signal. For example, at least one processor (210) can acquire the fourth information (420) based further on the identified date and / or time by using an audio signal indicating a background sound included in the audio signal (and / or based on the context of the audio signal).
[0053] For example, at least one processor (210) may further provide the fourth information (420) to the first trained model (400). For example, at least one processor (210) may obtain a prompt (425) generated by the first trained model (400) using the fourth information (420) further. For example, the prompt (425) (e.g., “Make a picture of Sophia and Jane playing in the sand at ○○ Beach at 2 PM in June, Jane made three sandcastles, and Sophia is decorating the sandcastles with shells.”) may include the first information (405), the second information (410), the third information (415), and / or the fourth information (420).
[0054] For example, at least one processor (210) can generate the prompt (425) based on the audio signal without receiving a user input for generating the prompt (425).
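The prompt assembly described in paragraphs [0049] to [0054] can be sketched as follows. This is only a minimal illustration: in the disclosure the prompt (425) is generated by the first trained model (400), whereas here a plain string template stands in for that model. The names `PromptInputs` and `build_prompt`, and the exact wording of the template, are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PromptInputs:
    """Hypothetical container for the information items of FIG. 4."""
    speakers: List[str]                   # first information (405)
    context: str                          # second information (410)
    location: Optional[str] = None        # third information (415), optional
    time_and_date: Optional[str] = None   # fourth information (420), optional


def build_prompt(info: PromptInputs) -> str:
    """Assemble a text prompt from whichever information items are available.

    The string template is an assumption standing in for the first
    trained model (400); optional items are added only when present.
    """
    head = "Make a picture of " + " and ".join(info.speakers)
    if info.location:
        head += f" at {info.location}"
    if info.time_and_date:
        head += f" at {info.time_and_date}"
    return f"{head}. {info.context}"


prompt = build_prompt(
    PromptInputs(
        speakers=["Sophia", "Jane"],
        context="Jane made three sandcastles, and Sophia is decorating the sandcastles with shells.",
        location="the beach",
        time_and_date="2 PM in June",
    )
)
```

Because the third and fourth information are optional fields, the same function covers the prompts of [0049], [0051], and [0053] without any user input.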
[0055] Referring again to FIG. 3, in operation 340, at least one processor (210) may identify (or acquire) at least one image containing a representation of the speaker of at least one voice signal included in the audio signal. For example, the at least one image containing the representation of the speaker may include at least one image stored in the electronic device (200), at least one image registered in association with a user account of the electronic device (200), and / or at least one image retrieved via the Internet. For example, in FIG. 3, operation 340 is depicted to be performed after operation 330 is performed, but operation 340 may be performed while operation 330 is performed or before operation 330 is performed. Identifying (or acquiring) at least one image containing the representation of the speaker is illustrated in the description of FIG. 5.
[0056] Figure 5 illustrates an example of identifying at least one image.
[0057] Referring to FIG. 5, the state (500) can be described as a state for identifying at least one image (505) containing a representation of the speaker. In the state (500), at least one processor (210) can identify a plurality of images stored in the electronic device (200). For example, at least one processor (210) can identify representations of users included in each of the plurality of images stored in the electronic device (200). For example, at least one processor (210) can identify whether each of the users corresponds to the speaker. For example, the users and the speaker may each have an identifier. For example, at least one processor (210) can identify a user among the users who has an identifier corresponding to the speaker's identifier. For example, a user who has an identifier corresponding to the speaker's identifier may be described as the speaker. For example, at least one processor (210) can identify at least one image (505) containing a representation of a user having an identifier corresponding to the speaker's identifier. For example, at least one processor (210) can identify at least one image (505) containing a representation of the speaker among the plurality of images stored in the electronic device (200).
[0058] For example, at least one processor (210) can identify a plurality of images linked to a user account of the electronic device (200). For example, at least one processor (210) can identify at least one image (505) containing a representation of the speaker among the plurality of images linked to the user account of the electronic device (200). For example, the plurality of images linked to the user account of the electronic device (200) can be described as a plurality of images stored in a cloud server of the user account of the electronic device (200). For example, at least one processor (210) can identify representations of users included in each of the plurality of images linked to the user account of the electronic device (200). For example, at least one processor (210) can identify whether each of the users corresponds to the speaker. For example, the users and the speaker may each have an identifier. For example, at least one processor (210) can identify a user among the users who has an identifier corresponding to the identifier of the speaker. For example, a user who has an identifier corresponding to the identifier of the speaker may be described as the speaker. For example, at least one processor (210) can identify at least one image (505) containing a representation of a user who has an identifier corresponding to the identifier of the speaker. For example, at least one processor (210) can identify at least one image (505) containing a representation of the speaker among the plurality of images registered in association with the user account of the electronic device (200).
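The identifier matching of paragraphs [0057] and [0058] can be illustrated with a short sketch. It assumes that each stored image already carries the identifiers of the users represented in it; how those identifiers are obtained is not specified here. `StoredImage`, `images_of_speaker`, and the sample paths and identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class StoredImage:
    """Hypothetical record for an image on the device or linked to the user account."""
    path: str
    person_ids: FrozenSet[str]  # identifiers of the users represented in the image


def images_of_speaker(images: List[StoredImage], speaker_id: str) -> List[StoredImage]:
    """Return the images in which a represented user has the speaker's identifier."""
    return [image for image in images if speaker_id in image.person_ids]


device_images = [
    StoredImage("device/001.jpg", frozenset({"id-sophia", "id-jane"})),
    StoredImage("device/002.jpg", frozenset({"id-other"})),
]
cloud_images = [
    StoredImage("cloud/101.jpg", frozenset({"id-jane"})),
]

# The same filter is applied to both image sources and the results are combined.
matches = images_of_speaker(device_images + cloud_images, "id-jane")
matched_paths = [image.path for image in matches]
```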
[0059] According to another embodiment, at least one processor (210) can identify the location (or place) where the audio signal was acquired using a GPS receiver. For example, the location (or place) where the audio signal was acquired can be described as the location (or place) of the electronic device (200) at the time the audio signal was acquired. For example, at least one processor (210) can identify the time and / or date when the audio signal was acquired. For example, based on the context of the audio signal, the location (or place) where the audio signal was acquired, the time when the audio signal was acquired, and / or the date when the audio signal was acquired, at least one processor (210) can identify at least one image (510) containing a representation of the location (or place) where the audio signal was acquired among a plurality of images stored in the electronic device (200) (or a plurality of images registered in association with the user account of the electronic device (200)).
[0060] State (515) can be described as a state for obtaining at least one image (525) based on a search over the Internet. In state (515), at least one processor (210) can search over the Internet for the speaker and / or the location (or place) where the audio signal was obtained. For example, at least one processor (210) can search over the Internet for the speaker based on identifying the speaker of at least one voice signal contained within the audio signal. For example, at least one processor (210) can obtain at least one image (525) containing a representation of the speaker based on searching over the Internet for the speaker. For example, text corresponding to the speaker's name can be used as a search term to search for the speaker over the Internet.
[0061] For example, at least one processor (210) can search the location (or place) where the audio signal was obtained via the Internet based on the context of the audio signal, the location where the audio signal was obtained, the time when the audio signal was obtained, and / or the date when the audio signal was obtained. For example, at least one processor (210) can obtain at least one image (525) including a representation of the location (or place) where the audio signal was obtained based on searching the location (or place) where the audio signal was obtained via the Internet. For example, at least one processor (210) can determine a search term to be used to search the location (or place) where the audio signal was obtained via the Internet using the context of the audio signal, the location where the audio signal was obtained, the time when the audio signal was obtained, and / or the date when the audio signal was obtained.
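As a rough illustration of how a search term for the location might be determined from the items listed in [0061], the sketch below joins the available items into one query string. The disclosure does not state how the items are combined or prioritized; preferring the GPS-derived place over the place inferred from the context is an assumption of this sketch, and `location_search_term` is a hypothetical name.

```python
from typing import Optional


def location_search_term(
    context_place: Optional[str],
    gps_place: Optional[str],
    time_of_day: Optional[str] = None,
    month: Optional[str] = None,
) -> str:
    """Build a search term for the place where the audio signal was acquired.

    Assumption: a GPS-derived place name, when available, takes priority
    over a place name inferred from the context of the audio signal.
    """
    place = gps_place or context_place
    if place is None:
        raise ValueError("no place is available to search for")
    extras = [item for item in (time_of_day, month) if item]
    return " ".join([place] + extras)


term = location_search_term("beach", None, time_of_day="afternoon", month="June")
```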
[0062] Referring again to FIG. 3, in operation 350, at least one processor (210) may provide the prompt and at least one image containing a representation of the speaker to a second trained model. For example, at least one processor (210) may acquire an image for the audio signal generated by the second trained model using the prompt and the at least one image. Acquiring an image for the audio signal using the second trained model is exemplified in the description of FIG. 6.
[0063] Figure 6 illustrates an example of generating an image for an audio signal using a second trained model.
[0064] Referring to FIG. 6, the second trained model (600) may be contained within the electronic device (200) or within a server outside the electronic device (200). For example, the second trained model (600) may be described as a model trained to generate images for audio signals. For example, the second trained model (600) may include the generative AI (artificial intelligence) model (1130) of FIG. 11. For example, the second trained model (600) may include a machine learning model and / or a deep learning model.
[0065] For example, the second trained model (600) may include an artificial neural network model comprising multiple layers and / or operations (or computations). As a non-limiting example, at least one trained model may include one of a feedforward neural network (FNN), a deep neural network (DNN), a convolutional neural network (CNN), a region with convolutional neural network (R-CNN), a region proposal network (RPN), a recurrent neural network (RNN), a stacking-based deep neural network (S-DNN), a state-space dynamic neural network (S-SDNN), a deconvolution network, a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), deep Q-networks, a fully convolutional network, a long short-term memory (LSTM) network, a classification network, or a combination of two or more of these. For example, the second trained model (600) can be trained on input data and can obtain output data by performing operations (or inference) based on receiving input data. The second trained model (600) may include a hardware structure, and may additionally (or alternatively) include a software structure.
[0066] For example, at least one processor (210) may provide the prompt (425) and at least one image (605) containing a representation of the speaker to the second trained model (600). For example, the at least one image (605) containing a representation of the speaker may be described as at least one image identified among a plurality of images stored in the electronic device (200), at least one image identified among a plurality of images registered in association with a user account of the electronic device (200), or at least one image obtained through an internet search. For example, at least one processor (210) may obtain an image (615) for the audio signal generated by the second trained model (600) using the prompt (425) and the at least one image (605).
[0067] For example, the image (615) for the audio signal may include a representation of the speaker of at least one voice signal included in the audio signal. For example, since at least one image (605) including a representation of the speaker is used to obtain the image (615) for the audio signal, the representation of the speaker included in the image (615) for the audio signal may correspond to the representation of the speaker included in at least one image (605). For example, the speaker represented in the image (615) for the audio signal may correspond to the speaker of at least one voice signal included in the audio signal.
[0068] For example, the image (615) for the audio signal may include representations based on information included in the prompt (425) (e.g., first information (405), second information (410), third information (415), and / or fourth information (420) of FIG. 4). For example, the image (615) for the audio signal may include a representation of the speaker based on first information about the speaker included in the prompt (425). For example, the image (615) for the audio signal may include a representation of the context of the audio signal (e.g., the speaker's actions, background, and / or objects) based on second information about the context of the audio signal included in the prompt (425). For example, the image (615) for the audio signal may include a representation of the location (or place) where the audio signal was acquired based on third information about the location (or place) included in the prompt (425). For example, the image (615) for the audio signal may include a representation of the time and / or date at which the audio signal was acquired, based on the fourth information regarding the time and / or date included in the prompt (425).
[0069] For example, at least one processor (210) may further provide at least one image (610) containing a representation of a location (or place) to the second trained model (600). For example, the at least one image (610) containing a representation of a location (or place) may be described as at least one image identified among a plurality of images stored in the electronic device (200), at least one image identified among a plurality of images registered in association with a user account of the electronic device (200), or at least one image obtained through an internet search. For example, at least one processor (210) may obtain an image (615) for the audio signal generated by the second trained model (600) using the prompt (425), the at least one image (605), and the at least one image (610).
[0070] For example, the image (615) for the audio signal may include a representation of the location (or place) where the audio signal was acquired. For example, the location (or place) where the audio signal was acquired may be described as the location (or place) of the electronic device (200) at the time the audio signal was acquired. For example, the location (or place) where the audio signal was acquired may be identified using a GPS receiver or through the context of the audio signal. For example, since at least one image (610) including a representation of the location (or place) is used to acquire the image (615) for the audio signal, the representation of the location (or place) included in the image (615) for the audio signal may correspond to the representation of the location (or place) included in at least one image (610). For example, the location (or place) represented in the image (615) for the audio signal may correspond to the location (or place) where the audio signal was acquired.
[0071] For example, the image (615) for the audio signal can represent the external environment of the electronic device (200) at the time the audio signal is acquired. For example, at least one processor (210) can acquire an image representing the external environment of the electronic device (200) without photographing the external environment of the electronic device (200) at the time the audio signal is acquired by acquiring the image (615) for the audio signal.
[0072] For example, at least one processor (210) can acquire a video for an audio signal using an image (615) for an audio signal. Acquiring a video for an audio signal using an image (615) for an audio signal is exemplified in the description of FIG. 7.
[0073] Figure 7 illustrates an example of generating a video for an audio signal using a third trained model.
[0074] Referring to FIG. 7, the third trained model (700) may be contained within the electronic device (200) or within a server outside the electronic device (200). For example, the third trained model (700) may be described as a model trained to generate video for audio signals. For example, the third trained model (700) may include the generative AI (artificial intelligence) model (1130) of FIG. 11. For example, the third trained model (700) may include a machine learning model and / or a deep learning model. For example, the third trained model (700) may correspond to the second trained model (600) of FIG. 6 or may be different from the second trained model (600).
[0075] For example, the third trained model (700) may include an artificial neural network model comprising multiple layers and / or operations (or computations). As a non-limiting example, at least one trained model may include one of a feedforward neural network (FNN), a deep neural network (DNN), a convolutional neural network (CNN), a region with convolutional neural network (R-CNN), a region proposal network (RPN), a recurrent neural network (RNN), a stacking-based deep neural network (S-DNN), a state-space dynamic neural network (S-SDNN), a deconvolution network, a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), deep Q-networks, a fully convolutional network, a long short-term memory (LSTM) network, a classification network, or a combination of two or more of these. For example, the third trained model (700) can be trained on input data and can obtain output data by performing operations (or inference) based on receiving input data. The third trained model (700) may include a hardware structure, and may additionally (or alternatively) include a software structure.
[0076] For example, at least one processor (210) may provide an audio signal (705) and an image (615) for the audio signal to a third trained model (700). For example, the audio signal (705) may be described as an audio signal obtained using a microphone. For example, at least one processor (210) may obtain a video (715) for the audio signal generated by the third trained model (700) using the audio signal (705) and the image (615) for the audio signal.
[0077] For example, the video (715) for the audio signal may have the image (615) for the audio signal as an initial value. For example, the video (715) for the audio signal may have the image (615) for the audio signal as a keyframe image. For example, the video (715) for the audio signal may include an animation of an object (e.g., a representation of a speaker) contained within the image (615) for the audio signal. For example, within the video (715) for the audio signal, the object (e.g., a representation of a speaker) contained within the image (615) for the audio signal may act according to the audio signal (705) or speak according to at least one voice signal contained within the audio signal (705). For example, the video (715) for the audio signal may be referred to as a sync (or synchronization) video of the audio signal (705). For example, the video (715) for the audio signal may include an animation of a speaker uttering at least one voice signal included in the audio signal (705). For example, the animation of a speaker uttering at least one voice signal included in the audio signal (705) may be referred to as a lip sync (or lip synchronization) animation of at least one voice signal included in the audio signal (705).
[0078] For example, at least one processor (210) may further provide another image (710) for the audio signal to the third trained model (700). For example, the other image (710) for the audio signal may be obtained using the second trained model (600). For example, the operations of the electronic device (200) for obtaining the other image (710) for the audio signal may correspond to the operations of the electronic device (200) for obtaining the image (615) for the audio signal. For example, at least one processor (210) may obtain a video (715) for the audio signal generated by the third trained model (700) using the audio signal (705), the image (615) for the audio signal, and the other image (710) for the audio signal. For example, at least one image for the audio signal (e.g., the image (615) for the audio signal and the other image (710) for the audio signal) may be used to obtain the video (715) for the audio signal.
[0079] For example, the video (715) for the audio signal can represent the external environment of the electronic device (200) at the time when the audio signal (705) is acquired. For example, at least one processor (210) can acquire the video (715) for the audio signal, thereby acquiring a video representing the external environment of the electronic device (200) without capturing the external environment of the electronic device (200) at the time when the audio signal (705) is acquired.
[0080] According to another embodiment, at least one processor (210) may provide the third trained model (700) with a prompt (e.g., prompt (425) of FIG. 4), at least one image containing a representation of the speaker (e.g., at least one image (605) of FIG. 6), and / or at least one image containing a representation of a location (or place) (e.g., at least one image (610) of FIG. 6). For example, at least one processor (210) may acquire a video (715) for the audio signal generated using the prompt, the at least one image containing a representation of the speaker, and / or the at least one image containing a representation of a location (or place). For example, at least one processor (210) may bypass (or skip, or refrain from) acquiring the image (615) for the audio signal and acquire the video (715) for the audio signal.
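The two ways of obtaining the video (715), either through the image (615) as in [0066] and [0076] or directly as in [0080], can be sketched as a single function with two branches. The callables below are stubs standing in for the second trained model (600) and the third trained model (700); their signatures, and all names in the sketch, are hypothetical. The disclosure does not state whether the audio signal is also provided on the direct path, so the sketch omits it there.

```python
from typing import Callable, List, Optional

# Hypothetical signatures. Images, audio, and video are opaque byte strings here.
ImageModel = Callable[[str, List[bytes]], bytes]
VideoModel = Callable[[Optional[bytes], Optional[str], List[bytes]], bytes]


def generate_video(
    audio: bytes,
    prompt: str,
    reference_images: List[bytes],
    image_model: ImageModel,
    video_model: VideoModel,
    skip_image: bool = False,
) -> bytes:
    """Return a video for the audio signal via one of the two described paths."""
    if skip_image:
        # Path of [0080]: the prompt and reference images go straight to the video model.
        return video_model(None, prompt, reference_images)
    # Path of [0066] and [0076]: generate an image first, then animate it with the audio.
    image = image_model(prompt, reference_images)
    return video_model(audio, None, [image])


# Stub models so the sketch runs without any real generative model.
def stub_image_model(prompt: str, images: List[bytes]) -> bytes:
    return b"image:" + prompt.encode("utf-8")


def stub_video_model(audio: Optional[bytes], prompt: Optional[str], images: List[bytes]) -> bytes:
    source = b"audio" if audio is not None else b"prompt"
    return b"video-from-" + source + b":" + str(len(images)).encode("ascii")


two_step = generate_video(b"pcm", "Make a picture", [b"face"], stub_image_model, stub_video_model)
direct = generate_video(
    b"pcm", "Make a picture", [b"face", b"beach"], stub_image_model, stub_video_model, skip_image=True
)
```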
[0081] An image (615) for an audio signal and / or a video (715) for an audio signal is exemplified in the description of FIG. 8.
[0082] FIG. 8 illustrates an example of an image for an audio signal and / or a video for an audio signal.
[0083] Referring to FIG. 8, an image for an audio signal and / or a video for an audio signal (800) may include a representation (805) of a speaker of at least one voice signal included in the audio signal. For example, the audio signal may include at least one voice signal caused by a plurality of speakers. For example, the image for an audio signal and / or a video for an audio signal (800) may include representations (805-1, 805-2) of a plurality of speakers. For example, a speaker represented in the image for an audio signal and / or a video for an audio signal (800) may correspond to a speaker of at least one voice signal included in the audio signal.
[0084] For example, an image for an audio signal and / or a video for an audio signal (800) may include a representation of the context of the audio signal. For example, a representation of a speaker (805) included in the image for an audio signal and / or a video for an audio signal (800) may perform an action indicated by the context of the audio signal. For example, an image for an audio signal and / or a video for an audio signal (800) may include a representation of a background (810). For example, a representation of a background (810) included in the image for an audio signal and / or a video for an audio signal (800) may be determined based on the context of the audio signal, the location (or place) where the audio signal was acquired, the time when the audio signal was acquired, and / or the date when the audio signal was acquired. For example, the background represented by the image for an audio signal and / or a video for an audio signal (800) may correspond to the background of the external environment of the electronic device (200) at the time when the audio signal was acquired.
[0085] For example, an image for an audio signal and / or a video for an audio signal (800) may represent the external environment of the electronic device (200) at the time the audio signal is acquired. For example, representations included in the image for an audio signal and / or a video for an audio signal (800) may correspond to, or substantially correspond to, the external environment of the electronic device (200) at the time the audio signal is acquired. For example, by acquiring the image for an audio signal and / or the video for an audio signal (800), at least one processor (210) may acquire an image representing the external environment of the electronic device (200) and / or a video representing the external environment of the electronic device (200) without capturing the external environment of the electronic device (200) at the time the audio signal is acquired.
[0086] Figure 9 illustrates examples of components of an electronic device.
[0087] Referring to FIG. 9, the electronic device (200) may include an audio analysis unit (900), an event detection unit (920), a prompt generation unit (925), an image acquisition unit (930), an image generation unit (950), and / or a video generation unit (955). For example, the audio analysis unit (900), the event detection unit (920), the prompt generation unit (925), the image acquisition unit (930), the image generation unit (950), and / or the video generation unit (955) may support the function of processing audio signals through an algorithm stored in memory (220). For example, although the term 'unit' is used, the audio analysis unit (900), the event detection unit (920), the prompt generation unit (925), the image acquisition unit (930), the image generation unit (950), and / or the video generation unit (955) may perform the functions described below in software and / or as functional blocks.
[0088] For example, the audio analysis unit (900) may include an audio signal enhancement unit (905), an audio signal extraction unit (910), and / or a context identification unit (915). For example, the audio signal enhancement unit (905) may perform noise removal on the audio signal. For example, the audio signal enhancement unit (905) may obtain a noise-removed audio signal by performing noise removal on the audio signal. For example, noise removal on the audio signal may be described as preprocessing on the audio signal. For example, the audio signal enhancement unit (905) may perform operations 310 and 320 of FIG. 3. For example, the audio signal extraction unit (910) may extract at least one voice signal from the noise-removed audio signal. For example, the audio signal extraction unit (910) may identify the speaker of at least one voice signal included in the audio signal. For example, the audio signal extraction unit (910) can perform operation 310 of FIG. 3. For example, the context identification unit (915) can identify the context of the noise-removed audio signal. For example, the context identification unit (915) can perform operation 320 of FIG. 3.
[0089] For example, the event detection unit (920) can detect an event that generates an image for an audio signal. For example, the event detection unit (920) can identify (or receive) a user input that generates an image for an audio signal. For example, the event detection unit (920) can perform operations 310 and 320 of FIG. 3.
[0090] For example, the prompt generation unit (925) can generate a prompt including first information about the speaker and second information about the context. For example, the prompt generation unit (925) can generate a prompt using a first trained model (e.g., the first trained model of FIG. 4). For example, the prompt may further include third information about the location. For example, the prompt may further include fourth information about the time and / or date. For example, the prompt generation unit (925) can perform operation 330 of FIG. 3.
[0091] For example, the image acquisition unit (930) may include an electronic device search unit (935), a cloud server search unit (940), and / or an internet search unit (945). For example, the electronic device search unit (935) may identify at least one image containing a representation of the speaker among a plurality of images stored in the electronic device (200). For example, the cloud server search unit (940) may identify at least one image containing a representation of the speaker among a plurality of images stored in conjunction with the user account of the electronic device (200). For example, the internet search unit (945) may acquire at least one image containing a representation of the speaker by searching for the speaker through the internet. For example, the image acquisition unit (930) may perform operation 340 of FIG. 3.
[0092] For example, the image generation unit (950) may provide a prompt and at least one image containing a representation of the speaker to a second trained model (e.g., the second trained model (600) of FIG. 6). For example, the image generation unit (950) may acquire an image for an audio signal generated by the second trained model using the prompt and the at least one image containing a representation of the speaker. For example, the image generation unit (950) may perform operation 350 of FIG. 3.
[0093] For example, the video generation unit (955) may provide an audio signal and an image for the audio signal to a third trained model (e.g., the third trained model (700) of FIG. 7). For example, the video generation unit (955) may obtain a video for the audio signal generated by the third trained model using the audio signal and the image for the audio signal. For example, the video generation unit (955) may perform the operations of FIG. 7.
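The order in which the units of FIG. 9 hand data to one another can be sketched as a small pipeline. This is only an illustration of the data flow from operation 310 to operation 350 as stated in [0088] to [0092]; noise removal, event detection, and video generation are left out for brevity. Each unit is modeled as a plain callable with a hypothetical signature, and the lambdas are placeholders rather than real analysis or generation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pipeline:
    """Hypothetical wiring of the units of FIG. 9 as plain callables."""
    identify_speakers: Callable[[bytes], List[str]]      # audio signal extraction unit (910), operation 310
    identify_context: Callable[[bytes], str]             # context identification unit (915), operation 320
    generate_prompt: Callable[[List[str], str], str]     # prompt generation unit (925), operation 330
    acquire_images: Callable[[List[str]], List[bytes]]   # image acquisition unit (930), operation 340
    generate_image: Callable[[str, List[bytes]], bytes]  # image generation unit (950), operation 350

    def run(self, audio: bytes) -> bytes:
        """Run operations 310 to 350 in order and return the generated image."""
        speakers = self.identify_speakers(audio)
        context = self.identify_context(audio)
        prompt = self.generate_prompt(speakers, context)
        references = self.acquire_images(speakers)
        return self.generate_image(prompt, references)


# Placeholder units so the sketch runs end to end.
pipeline = Pipeline(
    identify_speakers=lambda audio: ["Jane"],
    identify_context=lambda audio: "Jane is singing.",
    generate_prompt=lambda speakers, context: f"Make a picture of {', '.join(speakers)}. {context}",
    acquire_images=lambda speakers: [name.encode("utf-8") for name in speakers],
    generate_image=lambda prompt, refs: prompt.encode("utf-8") + b"|" + b",".join(refs),
)
result = pipeline.run(b"pcm")
```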
[0094] FIG. 10 is a block diagram of an electronic device in a network environment according to various embodiments.
[0095] Referring to FIG. 10, in a network environment (1000), an electronic device (1001) may communicate with an electronic device (1002) through a first network (1098) (e.g., a short-range wireless communication network) or with at least one of an electronic device (1004) or a server (1008) through a second network (1099) (e.g., a long-range wireless communication network). According to one embodiment, the electronic device (1001) may communicate with the electronic device (1004) through a server (1008). According to one embodiment, the electronic device (1001) may include a processor (1020), memory (1030), input module (1050), sound output module (1055), display module (1060), audio module (1070), sensor module (1076), interface (1077), connection terminal (1078), haptic module (1079), camera module (1080), power management module (1088), battery (1089), communication module (1090), subscriber identification module (1096), or antenna module (1097). In some embodiments, at least one of these components (e.g., connection terminal (1078)) may be omitted from the electronic device (1001), or one or more other components may be added. In some embodiments, some of these components (e.g., sensor module (1076), camera module (1080), or antenna module (1097)) may be integrated into a single component (e.g., display module (1060)).
[0096] The processor (1020) can, for example, execute software (e.g., program (1040)) to control at least one other component (e.g., hardware or software component) of the electronic device (1001) connected to the processor (1020) and can perform various data processing or operations. According to one embodiment, as at least part of the data processing or operations, the processor (1020) can store commands or data received from other components (e.g., sensor module (1076) or communication module (1090)) in volatile memory (1032), process the commands or data stored in volatile memory (1032), and store the resulting data in non-volatile memory (1034). According to one embodiment, the processor (1020) may include a main processor (1021) (e.g., a central processing unit or an application processor) or an auxiliary processor (1023) that can operate independently or together with it (e.g., a graphics processing unit, a neural processing unit (NPU), an image signal processor, a sensor hub processor, or a communication processor). For example, if the electronic device (1001) includes a main processor (1021) and an auxiliary processor (1023), the auxiliary processor (1023) may be configured to use lower power than the main processor (1021) or to be specialized for a specified function. The auxiliary processor (1023) may be implemented separately from the main processor (1021) or as part thereof.
[0097] The auxiliary processor (1023) may control at least some of the functions or states associated with at least one component of the electronic device (1001) (e.g., display module (1060), sensor module (1076), or communication module (1090)) on behalf of the main processor (1021) while the main processor (1021) is in an inactive (e.g., sleep) state, or together with the main processor (1021) while the main processor (1021) is in an active (e.g., application execution) state. According to one embodiment, the auxiliary processor (1023) (e.g., image signal processor or communication processor) may be implemented as part of another functionally related component (e.g., camera module (1080) or communication module (1090)). According to one embodiment, the auxiliary processor (1023) (e.g., neural processing unit) may include a hardware structure specialized for processing an artificial intelligence model. The artificial intelligence model may be generated through machine learning. Such learning may be performed, for example, on the electronic device (1001) itself where the artificial intelligence model is executed, or through a separate server (e.g., server (1008)). The learning algorithm may include, for example, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning, but is not limited to the examples described above. The artificial intelligence model may include a plurality of artificial neural network layers. An artificial neural network may be a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), a deep Q-network, or a combination of two or more of the above, but is not limited to the examples described above. The artificial intelligence model may additionally or alternatively include a software structure other than the hardware structure.
[0098] The memory (1030) can store various data used by at least one component of the electronic device (1001) (e.g., processor (1020) or sensor module (1076)). The data may include, for example, input data or output data for software (e.g., program (1040)) and related commands. The memory (1030) may include volatile memory (1032) or non-volatile memory (1034).
[0099] The program (1040) may be stored as software in memory (1030) and may include, for example, an operating system (1042), middleware (1044), or an application (1046).
[0100] The input module (1050) can receive commands or data to be used for a component of the electronic device (1001) (e.g., processor (1020)) from outside the electronic device (1001) (e.g., user). The input module (1050) may include, for example, a microphone, a mouse, a keyboard, a key (e.g., a button), or a digital pen (e.g., a stylus pen).
[0101] The sound output module (1055) can output a sound signal to the outside of the electronic device (1001). The sound output module (1055) may include, for example, a speaker or a receiver. The speaker may be used for general purposes, such as multimedia playback or recording playback. The receiver may be used to receive incoming calls. According to one embodiment, the receiver may be implemented separately from the speaker or as part thereof.
[0102] The display module (1060) can visually provide information to the outside (e.g., a user) of the electronic device (1001). The display module (1060) may include, for example, a display, a holographic device, or a projector, and a control circuit for controlling the corresponding device. According to one embodiment, the display module (1060) may include a touch sensor configured to detect a touch, or a pressure sensor configured to measure the intensity of the force generated by the touch.
[0103] The audio module (1070) can convert sound into an electrical signal or, conversely, convert an electrical signal into sound. According to one embodiment, the audio module (1070) can acquire sound through the input module (1050) or output sound through the sound output module (1055) or an external electronic device (e.g., electronic device (1002)) (e.g., speaker or headphones) connected directly or wirelessly to the electronic device (1001).
[0104] The sensor module (1076) can detect the operating state of the electronic device (1001) (e.g., power or temperature) or the external environmental state (e.g., user state) and generate an electrical signal or data value corresponding to the detected state. According to one embodiment, the sensor module (1076) may include, for example, a gesture sensor, a gyro sensor, a barometric pressure sensor, a magnetic sensor, an acceleration sensor, a grip sensor, a proximity sensor, a color sensor, an IR (infrared) sensor, a biosensor, a temperature sensor, a humidity sensor, or an illuminance sensor.
[0105] The interface (1077) may support one or more specified protocols that can be used for the electronic device (1001) to be connected directly or wirelessly to an external electronic device (e.g., electronic device (1002)). According to one embodiment, the interface (1077) may include, for example, a high definition multimedia interface (HDMI), a universal serial bus (USB) interface, an SD card interface, or an audio interface.
[0106] The connection terminal (1078) may include a connector through which the electronic device (1001) can be physically connected to an external electronic device (e.g., electronic device (1002)). According to one embodiment, the connection terminal (1078) may include, for example, an HDMI connector, a USB connector, an SD card connector, or an audio connector (e.g., a headphone connector).
[0107] The haptic module (1079) can convert an electrical signal into a mechanical stimulus (e.g., vibration or movement) or an electrical stimulus that the user can perceive through tactile or kinesthetic senses. According to one embodiment, the haptic module (1079) may include, for example, a motor, a piezoelectric element, or an electric stimulation device.
[0108] The camera module (1080) can capture still images and video. According to one embodiment, the camera module (1080) may include one or more lenses, image sensors, image signal processors, or flashes.
[0109] The power management module (1088) can manage power supplied to the electronic device (1001). According to one embodiment, the power management module (1088) can be implemented, for example, as at least part of a power management integrated circuit (PMIC).
[0110] The battery (1089) can supply power to at least one component of the electronic device (1001). According to one embodiment, the battery (1089) may include, for example, a non-rechargeable primary battery, a rechargeable secondary battery, or a fuel cell.
[0111] The communication module (1090) can support the establishment of a direct (e.g., wired) communication channel or a wireless communication channel between the electronic device (1001) and an external electronic device (e.g., electronic device (1002), electronic device (1004), or server (1008)), and the performance of communication through the established communication channel. The communication module (1090) may include one or more communication processors that operate independently of the processor (1020) (e.g., application processor) and support direct (e.g., wired) communication or wireless communication. According to one embodiment, the communication module (1090) may include a wireless communication module (1092) (e.g., cellular communication module, short-range wireless communication module, or GNSS (global navigation satellite system) communication module) or a wired communication module (1094) (e.g., LAN (local area network) communication module, or power line communication module). The corresponding communication module among these communication modules can communicate with an external electronic device (1004) through a first network (1098) (e.g., a short-range communication network such as Bluetooth, Wi-Fi (wireless fidelity) Direct, or IrDA (infrared data association)) or a second network (1099) (e.g., a legacy cellular network, a 5G network, a next-generation communication network, the Internet, or a computer network (e.g., a LAN or WAN)). These various types of communication modules may be integrated into a single component (e.g., a single chip) or implemented as multiple separate components (e.g., multiple chips). The wireless communication module (1092) can identify or authenticate the electronic device (1001) within a communication network such as the first network (1098) or the second network (1099) using subscriber information (e.g., international mobile subscriber identity (IMSI)) stored in the subscriber identification module (1096).
[0112] The wireless communication module (1092) can support 5G networks and next-generation communication technologies following 4G networks, for example, new radio (NR) access technology. The NR access technology can support high-speed transmission of high-capacity data (enhanced mobile broadband (eMBB)), minimization of terminal power and connection of multiple terminals (massive machine type communications (mMTC)), or high reliability and low latency (ultra-reliable and low-latency communications (URLLC)). The wireless communication module (1092) can support a high-frequency band (e.g., mmWave band), for example, to achieve a high data transmission rate. The wireless communication module (1092) can support various technologies for securing performance in the high-frequency band, such as beamforming, massive MIMO (multiple-input and multiple-output), full-dimensional MIMO (FD-MIMO), array antenna, analog beamforming, or large-scale antenna. The wireless communication module (1092) can support various requirements specified for the electronic device (1001), an external electronic device (e.g., electronic device (1004)), or a network system (e.g., second network (1099)). According to one embodiment, the wireless communication module (1092) can support a peak data rate (e.g., 20 Gbps or more) for realizing eMBB, loss coverage (e.g., 164 dB or less) for realizing mMTC, or U-plane latency (e.g., 0.5 ms or less for each of downlink (DL) and uplink (UL), or 1 ms or less for a round trip) for realizing URLLC.
[0113] The antenna module (1097) can transmit a signal or power to, or receive a signal or power from, the outside (e.g., an external electronic device). According to one embodiment, the antenna module (1097) may include an antenna comprising a radiator made of a conductor or a conductive pattern formed on a substrate (e.g., a PCB). According to one embodiment, the antenna module (1097) may include a plurality of antennas (e.g., an array antenna). In this case, at least one antenna suitable for a communication method used in a communication network, such as the first network (1098) or the second network (1099), may be selected from the plurality of antennas, for example, by the communication module (1090). A signal or power may be transmitted or received between the communication module (1090) and an external electronic device through the selected at least one antenna. According to some embodiments, in addition to the radiator, other components (e.g., a radio frequency integrated circuit (RFIC)) may be additionally formed as part of the antenna module (1097).
[0114] According to various embodiments, the antenna module (1097) may form a mmWave antenna module. According to one embodiment, the mmWave antenna module may include a printed circuit board, an RFIC disposed on or adjacent to a first surface (e.g., bottom surface) of the printed circuit board and capable of supporting a specified high frequency band (e.g., mmWave band), and a plurality of antennas (e.g., array antennas) disposed on or adjacent to a second surface (e.g., top surface or side surface) of the printed circuit board and capable of transmitting or receiving a signal of the specified high frequency band.
[0115] At least some of the above components can be connected to each other via a communication method between peripheral devices (e.g., bus, GPIO (general purpose input and output), SPI (serial peripheral interface), or MIPI (mobile industry processor interface)) and exchange signals (e.g., commands or data) with each other.
[0116] According to one embodiment, commands or data may be transmitted or received between the electronic device (1001) and an external electronic device (1004) through a server (1008) connected to a second network (1099). Each of the external electronic devices (1002 or 1004) may be a device of the same type as, or a different type from, the electronic device (1001). According to one embodiment, all or part of the operations performed on the electronic device (1001) may be performed on one or more of the external electronic devices (1002, 1004, or 1008). For example, if the electronic device (1001) needs to perform a function or service automatically or in response to a request from a user or another device, the electronic device (1001) may, instead of or in addition to performing the function or service itself, request one or more external electronic devices to perform at least part of the function or service. The one or more external electronic devices that receive the request may execute at least part of the requested function or service, or an additional function or service related to the request, and transmit the result of the execution to the electronic device (1001). The electronic device (1001) may provide the result, as is or after additional processing, as at least part of the response to the request. For this purpose, for example, cloud computing, distributed computing, mobile edge computing (MEC), or client-server computing technology may be used. The electronic device (1001) may provide ultra-low latency services using, for example, distributed computing or mobile edge computing. In another embodiment, the external electronic device (1004) may include an Internet of Things (IoT) device. The server (1008) may be an intelligent server using machine learning and / or neural networks. According to one embodiment, the external electronic device (1004) or the server (1008) may be included within the second network (1099). The electronic device (1001) may be applied to intelligent services (e.g., smart home, smart city, smart car, or healthcare) based on 5G communication technology and IoT-related technology.
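The offloading behavior described in paragraph [0116] can be illustrated with a minimal sketch. It is not the disclosed implementation: `Task`, `perform`, and the `local` / `remote` callables are hypothetical names, the remote executor is simulated by a plain function rather than a network call, and the fallback to local execution on a connection error is an added assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    """A unit of work (function or service) the device needs to perform."""
    name: str
    payload: str


def perform(task: Task,
            local: Callable[[Task], str],
            remote: Optional[Callable[[Task], str]] = None,
            postprocess: Optional[Callable[[str], str]] = None) -> str:
    """Request an external executor when one is given, otherwise run locally.

    Assumption: if the remote call raises ConnectionError, the device
    falls back to performing the task itself.
    """
    if remote is not None:
        try:
            result = remote(task)
        except ConnectionError:
            result = local(task)
    else:
        result = local(task)
    # The result is provided as is, or after additional processing.
    return postprocess(result) if postprocess is not None else result


def on_device(task: Task) -> str:
    return f"local:{task.payload}"


def on_server(task: Task) -> str:
    return f"server:{task.payload}"


def unreachable_server(task: Task) -> str:
    raise ConnectionError("server not reachable")
```

Calling `perform(Task("caption", "x"), on_device, on_server, str.upper)` delegates to the simulated server and post-processes its result, while passing `unreachable_server` exercises the assumed local fallback.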
[0117] The electronic device according to the various embodiments disclosed in this document may be of various forms. The electronic device may include, for example, a portable communication device (e.g., a smartphone), a computer device, a portable multimedia device, a portable medical device, a camera, a wearable device, or a consumer electronics device. The electronic device according to the embodiments of this document is not limited to the devices described above.
[0118] The various embodiments of this document and the terms used therein are not intended to limit the technical features described in this document to specific embodiments, and should be understood to include various modifications, equivalents, or substitutions of the embodiments. In connection with the description of the drawings, similar reference numerals may be used for similar or related components. The singular form of a noun corresponding to an item may include one or more of the items unless the relevant context clearly indicates otherwise. In this document, phrases such as "A or B," "at least one of A and B," "at least one of A or B," "A, B or C," "at least one of A, B and C," and "at least one of A, B, or C" may each include any one of the items listed together in the corresponding phrase, or all possible combinations thereof. Terms such as "1st" and "2nd," or "first" and "second," may be used simply to distinguish a component from other components and do not limit the components in any other aspect (e.g., importance or order). Where a (e.g., first) component is referred to as "coupled" or "connected" to another (e.g., second) component, with or without the terms "functionally" or "communicatively," it means that the component may be connected to the other component directly (e.g., via a wire), wirelessly, or through a third component.
[0119] The term “module” as used in the various embodiments of this document may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit, for example. A module may be a component formed integrally, or a minimum unit of said component or a part thereof that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).
[0120] Various embodiments of the present document may be implemented as software (e.g., program (1040)) comprising one or more instructions stored in a storage medium (e.g., internal memory (1036) or external memory (1038)) readable by a machine (e.g., electronic device (1001)). For example, a processor (e.g., processor (1020)) of the machine (e.g., electronic device (1001)) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one called instruction. The one or more instructions may include code generated by a compiler or code that can be executed by an interpreter. The storage medium readable by the machine may be provided in the form of a non-transitory storage medium. Here, "non-transitory" simply means that the storage medium is a tangible device and does not contain a signal (e.g., electromagnetic waves), and this term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily.
[0121] According to one embodiment, the method according to the various embodiments disclosed herein may be provided by being included in a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or distributed online (e.g., download or upload) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product may be temporarily stored or temporarily created on a device-readable storage medium, such as the memory of a manufacturer's server, an application store's server, or a relay server.
[0122] According to various embodiments, each component (e.g., module or program) of the components described above may include a single entity or multiple entities, and some of the multiple entities may be separately disposed in other components. According to various embodiments, one or more of the aforementioned components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the multiple components in the same or a similar manner as they were performed by the corresponding component among the multiple components prior to integration. According to various embodiments, operations performed by the module, program, or other components may be executed sequentially, in parallel, iteratively, or heuristically, or one or more of the operations may be executed in a different order or omitted, or one or more other operations may be added.
[0123] The technical problems to be solved in this disclosure are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art to which this disclosure pertains.
[0124] FIG. 11 illustrates an example of a generative artificial intelligence system according to one embodiment.
[0125] Referring to FIG. 11, the AI system (1100) may include an input / output interface (1110), an AI framework (1120), a generative AI model (1130), and / or a knowledge repository (1190).
[0126] The input / output interface (1110) can receive input. The input may include user input and / or data acquired or generated by an electronic device (e.g., the electronic device (200) or electronic device (1001) described above). The data may include images, videos, and / or sensor data generated by at least one processor of the electronic device (e.g., at least one processor (210) or processor (1020)). Examples include illuminance data around the electronic device acquired from a sensor or sensor hub (e.g., auxiliary processor (1023)), attitude data (or orientation data) of the electronic device, the temperature inside the electronic device (e.g., display (230)), the temperature of the at least one processor (210), size information of the display area of the display (230), and / or images acquired through an image sensor of the electronic device (e.g., included in a camera module (1080)). The user input may include natural language, touch data obtained through a touch circuit included in the display panel (e.g., used to identify input from a finger and / or stylus), an image displayed (and / or to be displayed) on the display panel, and / or video. As a non-limiting example, the user input may be received by the input / output interface (1110) along with context information. The context information may be described as additional information obtained in relation to the user input. The context information may be related to the state at the time the user input is received (e.g., the state of the electronic device and / or the state of the surroundings of the electronic device (e.g., user state)). For example, the context information may include information about one or more software applications executed within the electronic device at the time the user input is received. For example, the context information may include information about the location of the electronic device (or the location of the user of the electronic device) when the user input is received. For example, the user input may be integrated with the context information. For example, the user input integrated with the context information may be received as the input by the input / output interface (1110).
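One way to picture a user input that arrives integrated with its accompanying information is a small record type. The sketch below is illustrative only: `ContextInfo`, `AiInput`, and `to_payload` are hypothetical names, and the field set is limited to the two examples given in paragraph [0126] (running applications and device location).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ContextInfo:
    """State captured at the moment the user input is received."""
    running_apps: List[str] = field(default_factory=list)
    location: Optional[str] = None


@dataclass
class AiInput:
    """A user input bundled with the information obtained alongside it."""
    user_input: str
    context: ContextInfo

    def to_payload(self) -> Dict[str, Any]:
        """Flatten the input into one dictionary for the I/O interface."""
        payload: Dict[str, Any] = {
            "user_input": self.user_input,
            "running_apps": list(self.context.running_apps),
        }
        # Location is optional, so it is only included when known.
        if self.context.location is not None:
            payload["location"] = self.context.location
        return payload


example = AiInput("make a picture of this recording",
                  ContextInfo(running_apps=["voice_recorder"], location="Seoul"))
```

Bundling the two parts into one payload means downstream components receive a single input, whether or not optional fields such as the location are available.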
[0127] The input / output interface (1110) may transmit (or provide) an output. The output may include a result (or result information) generated or obtained by the AI system (1100) based on at least part of the input. The format of the output may vary. For example, the output may include natural language. For example, the output may include content (e.g., media content and / or multimedia content). For example, the output may include an action related to the user of the electronic device. For example, the output may have a format according to the user settings of the electronic device.
[0128] The AI framework (1120) can be used to obtain information (or data) about the input from the input / output interface (1110) and to control one or more components related to the AI system (1100) using the obtained information.
[0129] For example, a prompt design component (1121) within an AI framework (1120) can generate or obtain prompts for a generative AI model (1130) (e.g., including a large language model (LLM) or a large multimodal model (LMM)) using the acquired information. For example, the prompt design component (1121) may be described as an AI component that uses a learning algorithm and / or a neural network to provide prompts that are enhanced over time. For example, the prompt design component (1121) can generate or obtain prompts by accessing a knowledge component (e.g., a knowledge repository (1190)) containing user preference data, a prompt library, and / or prompt examples using the acquired information. The generated prompts may be provided to the generative AI model (1130) (e.g., including an LLM or LMM).
[0130] For example, an API / plugin management component (1122) within the AI framework (1120) may be used to support communication for additional information requested (or induced) in relation to the prompt provided (or to be provided) to the generative AI model (1130). For example, the API / plugin management component (1122) may be used to create or establish a channel for communication with various data sources (e.g., knowledge repository (1190)). For example, the API / plugin management component (1122) may support access to at least some of the data sources. For example, the API / plugin management component (1122) may be used to request another component (e.g., application / service component (1180)) that performs feedback (or response) according to the prompt. As a non-limiting example, information obtained (or generated) through the API / plugin management component (1122) may be provided to the prompt design component (1121) for generating a prompt. As a non-limiting example, information obtained (or generated) through the API / plugin management component (1122) may be provided to the generative AI model (1130).
[0131] For example, an improvement component (1123) within the AI framework (1120) can at least partially tune (or adjust) (or change) the result (e.g., content) obtained (or output) from the generative AI model (1130). For example, the improvement component (1123) can determine or verify whether the content obtained from the generative AI model (1130) is related to the input. For example, the improvement component (1123) can determine or verify whether the content obtained from the generative AI model (1130) contains biased content. For example, the improvement component (1123) can determine or verify whether the content obtained from the generative AI model (1130) contains harmful content. For example, the improvement component (1123) can support or assist in performing additional processing to improve the content obtained from the generative AI model (1130). For example, the improvement component (1123) may support providing a hint to the user to improve the content.
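The verification role of the improvement component can be sketched as a set of named checks applied to generated content. This is a deliberately simplified illustration: `is_related`, `make_blocklist_check`, and `review` are hypothetical names, and word overlap and a term blocklist are naive stand-ins chosen for this example, not techniques stated in the disclosure.

```python
from typing import Callable, Dict, Iterable, List

# A check receives (input_text, content) and returns True when the content passes.
Check = Callable[[str, str], bool]


def is_related(input_text: str, content: str) -> bool:
    """Naive relevance test: the content shares at least one word with the input."""
    return bool(set(input_text.lower().split()) & set(content.lower().split()))


def make_blocklist_check(blocked_terms: Iterable[str]) -> Check:
    """Build a check that fails when any blocked term appears in the content."""
    lowered = [term.lower() for term in blocked_terms]

    def check(_input_text: str, content: str) -> bool:
        text = content.lower()
        return not any(term in text for term in lowered)

    return check


def review(input_text: str, content: str, checks: Dict[str, Check]) -> List[str]:
    """Return the names of all checks that the content fails."""
    return [name for name, check in checks.items()
            if not check(input_text, content)]


checks = {"related": is_related, "harmful": make_blocklist_check(["gore"])}
```

An empty list from `review` means the content passed every configured check; a non-empty list could drive further processing or a hint to the user.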
[0132] A generative AI model (1130) can be described as an artificial intelligence neural network that generates feedback in response to a prompt. For example, the feedback may include additional data and / or information that is different from, but related to, the prompt. For example, the feedback may include new content relative to the prompt. For example, the generative AI model (1130) may include a model that generates images and / or a model that generates language. For example, the model that generates images may include a generative adversarial network (GAN) and / or a variational autoencoder (VAE). For example, the model that generates images may include a diffusion-based generative model (e.g., a transformer VAE). For example, the model that generates language may include GPT-3 and / or GPT-4. For example, the generative AI model (1130) may include an LMM that generates the feedback by recognizing text, images, and / or speech.
[0133] As an example without limitation, the AI framework (1120) and / or generative AI model (1130) may be included within an AI module (e.g., including a processing circuit) within the electronic device (200). For example, the AI module may be operatively coupled with at least one processor of the electronic device (200) (e.g., at least one processor (210) or processor (1020)). For example, the AI module may be operatively coupled with a display driving circuit of the electronic device. For example, the AI module may be operatively coupled with a sensor hub of the electronic device for one or more sensors within the electronic device.
[0134] The technical problems to be solved in this disclosure are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art to which this disclosure pertains.
[0135] The electronic device described above (e.g., the electronic device (200) of FIG. 2) may include at least one processor (e.g., at least one processor (210) of FIG. 2) comprising a processing circuit, a microphone (e.g., the microphone (230) of FIG. 2), and a memory (e.g., the memory (220) of FIG. 2) comprising one or more storage media configured to store one or more programs configured to be executed individually or collectively by the at least one processor. The instructions may cause the electronic device to acquire an audio signal using the microphone when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to identify the context indicated by the audio signal based on an event that generates an image for the audio signal (e.g., the image for the audio signal (615) of FIG. 6) when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to identify the speaker of at least one voice signal included in the audio signal based on the event when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to identify at least one image containing the speaker's representation among images stored in the electronic device and images registered in association with the user's account of the electronic device when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to generate a prompt (e.g., prompt (425) of FIG. 4) comprising first information about the speaker (e.g., first information (405) of FIG. 4) and second information about the context (e.g., second information (410) of FIG. 4) when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to provide the prompt and the at least one image to a trained model (e.g., second trained model (600) of FIG. 6) when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to acquire the image for the audio signal generated by the trained model using the prompt and the at least one image when executed individually or collectively by the at least one processor.
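The sequence of operations in paragraph [0135] can be summarized in a short sketch. It is a hedged illustration, not the claimed implementation: every name (`GalleryImage`, `select_speaker_images`, `build_prompt`, `image_for_audio`) is hypothetical, the context and speaker identifiers and the trained model are passed in as plain callables, and matching images by a list of person names is an assumed form of metadata.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class GalleryImage:
    path: str
    people: Sequence[str]  # assumed metadata: names of people shown in the image


def select_speaker_images(speaker: str,
                          device_images: Sequence[GalleryImage],
                          account_images: Sequence[GalleryImage]) -> List[str]:
    """Pick images showing the speaker from on-device and account-linked images."""
    candidates = list(device_images) + list(account_images)
    return [img.path for img in candidates if speaker in img.people]


def build_prompt(speaker: str, context: str) -> str:
    """Combine first information (speaker) and second information (context)."""
    return f"Speaker: {speaker}. Context: {context}."


def image_for_audio(audio: bytes,
                    identify_context: Callable[[bytes], str],
                    identify_speaker: Callable[[bytes], str],
                    device_images: Sequence[GalleryImage],
                    account_images: Sequence[GalleryImage],
                    model: Callable[[str, List[str]], Any]) -> Any:
    """Run the steps in order: context, speaker, reference images, prompt, model."""
    context = identify_context(audio)
    speaker = identify_speaker(audio)
    references = select_speaker_images(speaker, device_images, account_images)
    prompt = build_prompt(speaker, context)
    return model(prompt, references)
```

Passing the identifiers and the model as callables keeps the sketch independent of any particular speech, speaker-recognition, or image-generation library.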
[0136] For example, the electronic device may further include a GPS (global positioning system) receiver. The instructions may cause the electronic device to identify, based on the event and using the GPS receiver, the location of the electronic device at the time the audio signal is acquired, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to generate the prompt, which includes the first information, the second information, and third information regarding the location, when executed individually or collectively by the at least one processor.
[0137] For example, the instructions may cause the electronic device to identify the time and / or date at which the audio signal was acquired based on the event, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to generate the prompt, which includes the first information, the second information, and third information regarding the time and / or date, when executed individually or collectively by the at least one processor.
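Paragraphs [0136] and [0137] extend the prompt with optional third information. A minimal sketch of such a prompt builder follows; the function name `build_extended_prompt`, the field labels, and the timestamp format are assumptions made for illustration and do not appear in the disclosure.

```python
from datetime import datetime
from typing import Optional


def build_extended_prompt(first_info: str,
                          second_info: str,
                          location: Optional[str] = None,
                          acquired_at: Optional[datetime] = None) -> str:
    """Assemble a prompt from speaker, context, and optional third information."""
    parts = [f"Speaker: {first_info}", f"Context: {second_info}"]
    # Third information is appended only when it was actually identified.
    if location is not None:
        parts.append(f"Location: {location}")
    if acquired_at is not None:
        parts.append(f"Time: {acquired_at.strftime('%Y-%m-%d %H:%M')}")
    return "; ".join(parts)


prompt = build_extended_prompt("Alice", "birthday party", "Seoul",
                               datetime(2025, 10, 23, 18, 30))
```

Keeping the third information optional lets the same builder serve devices without a GPS receiver or recordings without a reliable timestamp.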
[0138] For example, the instructions may cause the electronic device to obtain another audio signal by performing noise cancellation on the audio signal based on the event, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to identify the context indicated by the other audio signal, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to identify the speaker of at least one voice signal included in the other audio signal, when executed individually or collectively by the at least one processor.
[0139] For example, the instructions may cause the electronic device to acquire at least one other image containing the speaker's expression based on searching for the speaker via the Internet when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to further provide the at least one other image to the trained model when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to acquire the image for the audio signal generated by the trained model using the prompt, the at least one image, and the at least one other image when executed individually or collectively by the at least one processor.
[0140] For example, the instructions may cause the electronic device to extract another audio signal indicating a background sound included in the audio signal based on the event, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to obtain third information regarding a location indicated by the background sound using the other audio signal, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to generate the prompt including the first information, the second information, and the third information, when executed individually or collectively by the at least one processor.
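One way to picture how a background sound could yield third information about a location is a lookup from detected sound labels to location hints. The sketch below assumes that some separate classifier has already produced labels for the background sound; the label names, the `SOUND_TO_LOCATION` table, and the majority-vote rule are hypothetical and not taken from the disclosure.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical mapping from background-sound labels to location hints.
SOUND_TO_LOCATION: Dict[str, str] = {
    "waves": "beach",
    "seagulls": "beach",
    "train_announcement": "train station",
    "cutlery": "restaurant",
}


def infer_location(sound_labels: List[str]) -> Optional[str]:
    """Return the location hint supported by the most detected labels, if any."""
    votes = Counter(
        SOUND_TO_LOCATION[label]
        for label in sound_labels
        if label in SOUND_TO_LOCATION
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Two labels vote for "beach" and one for "restaurant".
third_information = infer_location(["waves", "cutlery", "seagulls"])
print(third_information)
```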
[0141] For example, the above instructions may cause the electronic device to provide the audio signal and the image for the audio signal to another trained model based on an event that generates a video for the audio signal when executed individually or collectively by the at least one processor. The above instructions may cause the electronic device to obtain the video for the audio signal generated by the other trained model using the audio signal and the image for the audio signal when executed individually or collectively by the at least one processor.
[0142] For example, the video for the audio signal may include an expression of the speaker uttering the at least one voice signal.
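The two-stage flow, in which the generated image and the original audio signal are handed to another trained model only when a video is requested, might be sketched as follows. The `Event` enumeration, the `produce_output` function, and the stub model are hypothetical placeholders; the disclosure does not define a programming interface for either model.

```python
from enum import Enum
from typing import Callable


class Event(Enum):
    GENERATE_IMAGE = "image"
    GENERATE_VIDEO = "video"


def produce_output(
    event: Event,
    audio: bytes,
    image: str,
    video_model: Callable[[bytes, str], str],
) -> str:
    """Return the image as-is, or pass audio and image to the video model."""
    if event is Event.GENERATE_VIDEO:
        return video_model(audio, image)
    return image


# Stub standing in for the other trained model (assumption for illustration).
def stub_video_model(audio: bytes, image: str) -> str:
    return f"video(from={image}, audio_bytes={len(audio)})"


print(produce_output(Event.GENERATE_VIDEO, b"\x00\x01\x02", "image_001.png", stub_video_model))
```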
[0143] For example, the instructions may cause the electronic device to extract the at least one voice signal from the audio signal based on the event when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to acquire feature information of the at least one voice signal when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to identify the speaker by comparing the feature information with the feature information of each of one or more voice signals stored in the electronic device when executed individually or collectively by the at least one processor.
[0144] For example, the above feature information may include a feature vector, an embedding vector, and / or an acoustic feature.
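Comparing the feature information of an extracted voice signal with stored feature information can be illustrated with embedding vectors and cosine similarity. This is a sketch under assumptions: the disclosure does not name a similarity measure, and the `identify_speaker` function, the 0.8 threshold, and the three-dimensional example vectors are hypothetical.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(
    query: List[float],
    enrolled: Dict[str, List[float]],
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the enrolled speaker most similar to the query, or None below threshold."""
    best_name: Optional[str] = None
    best_score = threshold
    for name, embedding in enrolled.items():
        score = cosine_similarity(query, embedding)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Embeddings of voice signals previously stored on the device (made-up values).
enrolled = {"alice": [0.9, 0.1, 0.0], "bob": [0.1, 0.9, 0.2]}
print(identify_speaker([0.8, 0.2, 0.1], enrolled))
print(identify_speaker([0.0, 0.0, 1.0], enrolled))
```

Returning `None` when no stored voice is similar enough keeps an unknown speaker from being mislabeled as the nearest enrolled one.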
[0145] For example, the image for the audio signal may include the speaker's expression according to the context.
[0146] The above-described method may be performed within an electronic device including a microphone. The method may include an operation of acquiring an audio signal using the microphone. The method may include an operation of identifying a context indicated by the audio signal based on an event of generating an image for the audio signal. The method may include an operation of identifying a speaker of at least one voice signal included in the audio signal based on the event. The method may include an operation of identifying at least one image containing a representation of the speaker among images stored in the electronic device and images registered in association with the user's account of the electronic device. The method may include an operation of generating a prompt including first information about the speaker and second information about the context. The method may include an operation of providing the prompt and the at least one image to a trained model. The method may include an operation of acquiring the image for the audio signal generated by the trained model using the prompt and the at least one image.
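The sequence of operations in this method can be sketched as a small pipeline whose stages are injected as callables. Every name below (`Pipeline`, its fields, and the stub lambdas) is hypothetical; the stubs merely stand in for the context identifier, the speaker identifier, the image lookup, and the trained model, none of which the disclosure defines at code level.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Pipeline:
    identify_context: Callable[[bytes], str]             # audio -> context
    identify_speaker: Callable[[bytes], str]             # audio -> speaker
    find_images: Callable[[str], List[str]]              # speaker -> reference images
    generate_image: Callable[[str, Sequence[str]], str]  # prompt, images -> image

    def run(self, audio: bytes) -> str:
        """Run the operations in the order given in the method."""
        context = self.identify_context(audio)
        speaker = self.identify_speaker(audio)
        reference_images = self.find_images(speaker)
        prompt = f"Speaker: {speaker}; Context: {context}"
        return self.generate_image(prompt, reference_images)


# Stubs standing in for the real components (assumptions for illustration).
pipeline = Pipeline(
    identify_context=lambda audio: "birthday party",
    identify_speaker=lambda audio: "Alice",
    find_images=lambda speaker: [f"gallery/{speaker.lower()}_01.jpg"],
    generate_image=lambda prompt, images: f"image({prompt} | refs={len(images)})",
)
result = pipeline.run(b"\x00\x01")
print(result)
```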
[0147] For example, the electronic device may further include a GPS (global positioning system) receiver. The method may include an operation of identifying the location of the electronic device at the time the audio signal is acquired using the GPS receiver based on the event. The method may include an operation of generating the prompt including the first information, the second information, and the third information regarding the location.
[0148] For example, the method may include an operation of identifying the time and / or date at which the audio signal was acquired based on the event. The method may include an operation of generating the prompt, which includes the first information, the second information, and third information regarding the time and / or date.
[0149] For example, the above method may include an operation of obtaining another audio signal by performing noise cancellation on the audio signal based on the event. The above method may include an operation of identifying the context indicated by the other audio signal. The above method may include an operation of identifying the speaker of at least one voice signal included in the other audio signal.
[0150] For example, the above method may include an operation of acquiring at least one other image containing an expression of the speaker based on searching for the speaker via the Internet. The above method may include an operation of further providing the at least one other image to the trained model. The above method may include an operation of acquiring the image for the audio signal generated by the trained model using the prompt, the at least one image, and the at least one other image.
[0151] For example, the above method may include an operation of extracting another audio signal from the audio signal that indicates a background sound included in the audio signal based on the event. The above method may include an operation of obtaining third information about a location indicated by the background sound using the other audio signal. The above method may include an operation of generating the prompt including the first information, the second information, and the third information.
[0152] For example, the above method may include an operation of providing the audio signal and the image for the audio signal to another trained model based on an event of generating a video for the audio signal. The above method may include an operation of acquiring the video for the audio signal generated by the other trained model using the audio signal and the image for the audio signal.
[0153] For example, the video for the audio signal may include an expression of the speaker uttering the at least one voice signal.
[0154] For example, the above method may include an operation of extracting at least one voice signal from the audio signal based on the event. The above method may include an operation of obtaining feature information of the at least one voice signal. The above method may include an operation of identifying the speaker by comparing the feature information with the feature information of each of one or more voice signals stored in the electronic device.
[0155] For example, the above feature information may include a feature vector, an embedding vector, and / or an acoustic feature.
[0156] For example, the image for the audio signal may include the speaker's expression according to the context.
[0157] The above-described non-transient computer-readable storage medium may store one or more programs. The one or more programs may include instructions that cause the electronic device to acquire an audio signal using the microphone when executed by the electronic device including the microphone. The one or more programs may include instructions that cause the electronic device to identify the context indicated by the audio signal based on an event that generates an image for the audio signal when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to identify the speaker of at least one voice signal included in the audio signal based on the event when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to identify at least one image containing the speaker's representation among the images stored in the electronic device and the images registered in association with the user's account of the electronic device when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to generate a prompt including first information about the speaker and second information about the context when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to provide the prompt and the at least one image to a trained model when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to acquire the image for the audio signal generated by the trained model using the prompt and the at least one image when executed by the electronic device.
[0158] For example, the electronic device may further include a global positioning system (GPS) receiver. The one or more programs may include instructions that cause the electronic device to identify the location of the electronic device at the time the audio signal is acquired, using the GPS receiver based on the event, when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to generate the prompt, which includes the first information, the second information, and the third information regarding the location, when executed by the electronic device.
[0159] For example, the one or more programs may include instructions that cause the electronic device to identify the time and / or date at which the audio signal was acquired based on the event when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to generate the prompt, which includes the first information, the second information, and third information regarding the time and / or date, when executed by the electronic device.
[0160] For example, the one or more programs may include instructions that cause the electronic device to obtain another audio signal by performing noise cancellation on the audio signal based on the event when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to identify the context indicated by the other audio signal when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to identify the speaker of at least one voice signal included in the other audio signal when executed by the electronic device.
[0161] For example, the one or more programs may include instructions that cause the electronic device to acquire at least one other image containing an expression of the speaker based on searching for the speaker via the Internet when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to further provide the at least one other image to the trained model when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to acquire the image for the audio signal generated by the trained model using the prompt, the at least one image, and the at least one other image when executed by the electronic device.
[0162] For example, the one or more programs may include instructions that cause the electronic device to extract another audio signal indicating a background sound included in the audio signal from the audio signal based on the event when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to obtain third information about a location indicated by the background sound using the other audio signal when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to generate the prompt including the first information, the second information, and the third information when executed by the electronic device.
[0163] For example, the one or more programs may include instructions that cause the electronic device to provide the audio signal and the image for the audio signal to another trained model based on an event that generates a video for the audio signal when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to acquire the video for the audio signal generated by the other trained model using the audio signal and the image for the audio signal when executed by the electronic device.
[0164] For example, the video for the audio signal may include an expression of the speaker uttering the at least one voice signal.
[0165] For example, the one or more programs may include instructions that cause the electronic device to extract the at least one voice signal from the audio signal based on the event when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to obtain feature information of the at least one voice signal when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to identify the speaker by comparing the feature information with the feature information of each of the one or more voice signals stored in the electronic device when executed by the electronic device.
[0166] For example, the above feature information may include a feature vector, an embedding vector, and / or an acoustic feature.
[0167] For example, the image for the audio signal may include the speaker's expression according to the context.
[0168] The electronic device described above may include at least one processor comprising a processing circuit, a microphone, and a memory comprising one or more storage media for storing one or more programs configured to be executed individually or collectively by the at least one processor. The instructions may cause the electronic device to acquire an audio signal using the microphone when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to identify the context indicated by the audio signal based on an event that generates a video for the audio signal when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to identify the speaker of at least one voice signal included in the audio signal based on the event when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to identify at least one image containing the speaker's representation among images stored in the electronic device and images registered in association with the user's account of the electronic device, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to generate a prompt including first information about the speaker and second information about the context, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to provide the prompt and the at least one image to a trained model, when executed individually or collectively by the at least one processor. The instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to acquire the video for the audio signal generated by the trained model using the prompt and the at least one image.
[0169] For example, the electronic device may further include a GPS (global positioning system) receiver. The instructions may cause the electronic device to identify the location of the electronic device at the time the audio signal is acquired using the GPS receiver, based on the event, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to generate the prompt, which includes the first information, the second information, and the third information regarding the location, when executed individually or collectively by the at least one processor.
[0170] For example, the instructions may cause the electronic device to identify the time and / or date at which the audio signal was acquired based on the event, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to generate the prompt, which includes the first information, the second information, and third information regarding the time and / or date, when executed individually or collectively by the at least one processor.
[0171] For example, the instructions may cause the electronic device to obtain another audio signal by performing noise cancellation on the audio signal based on the event, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to identify the context indicated by the other audio signal, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to identify the speaker of at least one voice signal included in the other audio signal, when executed individually or collectively by the at least one processor.
[0172] For example, the instructions may cause the electronic device to acquire at least one other image containing the speaker's expression based on searching for the speaker via the Internet, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to further provide the at least one other image to the trained model when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to acquire the video for the audio signal generated by the trained model using the prompt, the at least one image, and the at least one other image when executed individually or collectively by the at least one processor.
[0173] For example, the instructions may cause the electronic device to extract another audio signal indicating a background sound included in the audio signal based on the event, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to obtain third information regarding a location indicated by the background sound using the other audio signal, when executed individually or collectively by the at least one processor. The instructions may cause the electronic device to generate the prompt including the first information, the second information, and the third information, when executed individually or collectively by the at least one processor.
[0174] For example, the video for the audio signal may include an expression of the speaker uttering the at least one voice signal.
[0175] The above-described method may be performed within an electronic device including a microphone. The method may include an operation of acquiring an audio signal using the microphone. The method may include an operation of identifying a context indicated by the audio signal based on an event that generates a video for the audio signal. The method may include an operation of identifying a speaker of at least one voice signal included in the audio signal based on the event. The method may include an operation of identifying at least one image containing a representation of the speaker among images stored in the electronic device and images registered in association with the user's account of the electronic device. The method may include an operation of generating a prompt including first information about the speaker and second information about the context. The method may include an operation of providing the prompt and the at least one image to a trained model. The method may include an operation of acquiring the video for the audio signal generated by the trained model using the prompt and the at least one image.
[0176] For example, the electronic device may further include a GPS (global positioning system) receiver. The method may include an operation of identifying the location of the electronic device at the time the audio signal is acquired using the GPS receiver based on the event. The method may include an operation of generating the prompt including the first information, the second information, and the third information regarding the location.
[0177] For example, the method may include an operation of identifying the time and / or date at which the audio signal was acquired based on the event. The method may include an operation of generating the prompt, which includes the first information, the second information, and third information regarding the time and / or date.
[0178] For example, the above method may include an operation of obtaining another audio signal by performing noise cancellation on the audio signal based on the event. The above method may include an operation of identifying the context indicated by the other audio signal. The above method may include an operation of identifying the speaker of at least one voice signal included in the other audio signal.
[0179] For example, the above method may include an operation of acquiring at least one other image containing an expression of the speaker based on searching for the speaker via the Internet. The above method may include an operation of further providing the at least one other image to the trained model. The above method may include an operation of acquiring the video for the audio signal generated by the trained model using the prompt, the at least one image, and the at least one other image.
[0180] For example, the above method may include an operation of extracting another audio signal from the audio signal that indicates a background sound included in the audio signal based on the event. The above method may include an operation of obtaining third information about a location indicated by the background sound using the other audio signal. The above method may include an operation of generating the prompt including the first information, the second information, and the third information.
[0181] For example, the video for the audio signal may include an expression of the speaker uttering the at least one voice signal.
[0182] The above-described non-transient computer-readable storage medium may store one or more programs. The one or more programs may include instructions that cause the electronic device to acquire an audio signal using the microphone when executed by the electronic device including the microphone. The one or more programs may include instructions that cause the electronic device to identify the context indicated by the audio signal based on an event that generates a video for the audio signal when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to identify the speaker of at least one voice signal included in the audio signal based on the event when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to identify at least one image containing the speaker's representation among the images stored in the electronic device and the images registered in association with the user's account of the electronic device when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to generate a prompt including first information about the speaker and second information about the context when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to provide the prompt and the at least one image to a trained model when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to acquire the video for the audio signal generated by the trained model using the prompt and the at least one image when executed by the electronic device.
[0183] For example, the electronic device may further include a global positioning system (GPS) receiver. The one or more programs may include instructions that cause the electronic device to identify the location of the electronic device at the time the audio signal is acquired, using the GPS receiver based on the event, when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to generate the prompt, which includes the first information, the second information, and the third information regarding the location, when executed by the electronic device.
[0184] For example, the one or more programs may include instructions that cause the electronic device to identify the time and / or date at which the audio signal was acquired based on the event when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to generate the prompt, which includes the first information, the second information, and third information regarding the time and / or date, when executed by the electronic device.
[0185] For example, the one or more programs may include instructions that cause the electronic device to obtain another audio signal by performing noise cancellation on the audio signal based on the event when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to identify the context indicated by the other audio signal when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to identify the speaker of at least one voice signal included in the other audio signal when executed by the electronic device.
[0186] For example, the one or more programs may include instructions that cause the electronic device to acquire at least one other image containing an expression of the speaker based on searching for the speaker via the Internet when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to further provide the at least one other image to the trained model when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to acquire the video for the audio signal generated by the trained model using the prompt, the at least one image, and the at least one other image when executed by the electronic device.
[0187] For example, the one or more programs may include instructions that cause the electronic device to extract another audio signal indicating a background sound included in the audio signal from the audio signal based on the event when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to obtain third information about a location indicated by the background sound using the other audio signal when executed by the electronic device. The one or more programs may include instructions that cause the electronic device to generate the prompt including the first information, the second information, and the third information when executed by the electronic device.
[0188] For example, the video for the audio signal may include an expression of the speaker uttering the at least one voice signal.
[0189] The effects obtainable from the present disclosure are not limited to those mentioned above, and other unmentioned effects will be clearly understood by those skilled in the art to which the present disclosure belongs.
Claims
1. An electronic device comprising: a memory storing instructions and including one or more storage media; a microphone; and at least one processor comprising a processing circuit, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: acquire an audio signal using the microphone; identify, based on an event that generates an image for the audio signal, a context indicated by the audio signal; identify, based on the event, a speaker of at least one voice signal included in the audio signal; identify at least one image containing a representation of the speaker among images stored in the electronic device and images registered in association with a user's account of the electronic device; generate a prompt including first information about the speaker and second information about the context; provide the prompt and the at least one image to a trained model; and acquire the image for the audio signal generated by the trained model using the prompt and the at least one image.
2. The electronic device of claim 1, further comprising a GPS (global positioning system) receiver, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: identify, based on the event and using the GPS receiver, a location of the electronic device at the time the audio signal is acquired; and generate the prompt including the first information, the second information, and third information regarding the location.
3. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: identify, based on the event, a time and / or date at which the audio signal was acquired; and generate the prompt including the first information, the second information, and third information regarding the time and / or date.
4. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: obtain, based on the event, another audio signal by performing noise cancellation on the audio signal; identify the context indicated by the other audio signal; and identify the speaker of at least one voice signal included in the other audio signal.
5. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: acquire at least one other image including an expression of the speaker based on searching for the speaker via the Internet; further provide the at least one other image to the trained model; and acquire the image for the audio signal generated by the trained model using the prompt, the at least one image, and the at least one other image.
6. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: extract, based on the event, another audio signal indicating a background sound included in the audio signal from the audio signal; obtain, using the other audio signal, third information about a location indicated by the background sound; and generate the prompt including the first information, the second information, and the third information.
7. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: provide, based on an event that generates a video for the audio signal, the audio signal and the image for the audio signal to another trained model; and acquire the video for the audio signal generated by the other trained model using the audio signal and the image for the audio signal.
8. The electronic device of claim 7, wherein the video for the audio signal includes a representation of the speaker uttering the at least one voice signal.
9. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: extract, based on the event, the at least one voice signal from the audio signal; acquire feature information of the at least one voice signal; and identify the speaker by comparing the feature information with feature information of each of one or more voice signals stored in the electronic device.
10. The electronic device of claim 9, wherein the feature information includes feature vectors, embedding vectors, and/or acoustic features.
11. The electronic device of claim 1, wherein the image for the audio signal includes a representation of the speaker according to the context.
12. An electronic device comprising: a memory storing instructions and including one or more storage media; a microphone; and at least one processor comprising a processing circuit, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: acquire an audio signal using the microphone; identify a context indicated by the audio signal, based on an event for generating a video for the audio signal; identify, based on the event, a speaker of at least one voice signal included in the audio signal; identify at least one image containing a representation of the speaker among images stored in the electronic device and images registered in association with a user account of the electronic device; generate a prompt including first information about the speaker and second information about the context; provide the prompt and the at least one image to a trained model; and acquire the video for the audio signal generated by the trained model using the prompt and the at least one image.
13. The electronic device of claim 12, further comprising a global positioning system (GPS) receiver, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: identify, based on the event and using the GPS receiver, a location of the electronic device at the time the audio signal is acquired; and generate the prompt including the first information, the second information, and third information about the location.
14. A method executed in an electronic device including a microphone, the method comprising: acquiring an audio signal using the microphone; identifying a context indicated by the audio signal, based on an event for generating an image for the audio signal; identifying, based on the event, a speaker of at least one voice signal included in the audio signal; identifying at least one image containing a representation of the speaker among images stored in the electronic device and images registered in association with a user account of the electronic device; generating a prompt including first information about the speaker and second information about the context; providing the prompt and the at least one image to a trained model; and acquiring the image for the audio signal generated by the trained model using the prompt and the at least one image.
15. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs including instructions that, when executed by an electronic device having a microphone, cause the electronic device to: acquire an audio signal using the microphone; identify a context indicated by the audio signal, based on an event for generating an image for the audio signal; identify, based on the event, a speaker of at least one voice signal included in the audio signal; identify at least one image containing a representation of the speaker among images stored in the electronic device and images registered in association with a user account of the electronic device; generate a prompt including first information about the speaker and second information about the context; provide the prompt and the at least one image to a trained model; and acquire the image for the audio signal generated by the trained model using the prompt and the at least one image.
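For illustration only, and not as part of the claims, the speaker identification recited in claims 9 and 10 and the prompt generation recited in claims 6 and 13 might be sketched as follows. The sketch assumes that the feature information is an embedding vector and that the comparison is a cosine similarity against a threshold; the disclosure does not fix either choice. The names `identify_speaker`, `build_prompt`, the `enrolled` dictionary, the 0.8 threshold, and the prompt layout are all hypothetical.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(
    features: List[float],
    enrolled: Dict[str, List[float]],
    threshold: float = 0.8,  # assumed cut-off, not taken from the disclosure
) -> Optional[str]:
    """Return the stored speaker whose features best match, or None if no match
    reaches the threshold."""
    best_name: Optional[str] = None
    best_score = threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(features, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def build_prompt(speaker: str, context: str, location: Optional[str] = None) -> str:
    """Combine first (speaker), second (context), and optional third (location)
    information into one text prompt. The layout is an assumed example."""
    parts = [f"speaker: {speaker}", f"context: {context}"]
    if location is not None:
        parts.append(f"location: {location}")
    return "; ".join(parts)


# Hypothetical stored voice features for two speakers.
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
speaker = identify_speaker([0.9, 0.1], enrolled)
if speaker is not None:
    print(build_prompt(speaker, "birthday party", "cafe"))
```

In this sketch an utterance that matches no stored voice closely enough yields `None`, so no prompt is built; how a device would handle an unidentified speaker is left open here.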