Electronic device for acquiring target modality on basis of at least one modality, and control method therefor

The electronic device integrates cross-modality dependency information and neural networks to efficiently process and restore biometric data, addressing computational overhead and functional limitations.

WO2026121605A1 · PCT designated stage · Publication Date: 2026-06-11 · SAMSUNG ELECTRONICS CO LTD


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SAMSUNG ELECTRONICS CO LTD
Filing Date: 2025-11-07
Publication Date: 2026-06-11


IPC: G06N3/08; H04N7/14

AI Tagging

Application Domain

Two-way working systems; Neural learning methods




Abstract

This electronic device comprises: a memory for storing cross-modality dependency information including information related to a correlation between modalities, a neural network model, and instructions; and at least one processor including processing circuitry, wherein the at least one processor individually or collectively executes the instructions so that the electronic device can: acquire a first modality of a first type from among a plurality of modality types; identify, on the basis of a context, information corresponding to the first type and a second type among the cross-modality dependency information; and acquire a second modality of the second type by inputting, into the neural network model, the first modality and the identified information.


Description

Electronic device for acquiring a target modality based on at least one modality and a method for controlling the same

[0001] The present disclosure relates to an electronic device and a method for controlling the same, for example, to an electronic device and a method for controlling the same that acquire a target modality based on at least one modality.

[0002] With the advancement of electronic technology, electronic devices offering various functions are being developed. In particular, electronic devices can collect diverse biometric information from users. For example, electronic devices can collect various biometric data such as facial features, voice, fingerprints, heart rate, and bioacoustic signals. This biometric information varies depending on the user's physical and emotional state and is also referred to as modality.

[0003] However, conventional electronic devices process each modality independently, which can lead to increased computational overhead in resource-constrained devices.

[0004] In addition, conventional electronic devices often primarily utilize biometric information such as facial features, making a camera essential; without a camera, functions related to biometric information could be limited.

[0005] According to one embodiment of the present disclosure, an electronic device may include a memory that stores cross-modality dependency information including information on correlations between modalities, a neural network model, and instructions, and at least one processor including processing circuitry, wherein the at least one processor, individually or collectively, executes the instructions so that the electronic device acquires a first modality of a first type among a plurality of modality types, identifies, based on context, information corresponding to the first type and a second type among the cross-modality dependency information, and inputs the first modality and the identified information into the neural network model to acquire a second modality of the second type.

[0006] The at least one processor may be configured, individually or collectively, to cause the electronic device to acquire the first modality of the first type and a third modality of the second type among the plurality of modality types, and, if a part of the third modality is identified as damaged or as requiring restoration, to identify information corresponding to the first type and the second type among the cross-modality dependency information, input the first modality and the identified information into the neural network model to acquire the second modality of the second type corresponding to the third modality, and restore the part of the third modality based on the second modality.

[0007] The at least one processor may be configured, individually or collectively, to identify information corresponding to the first type and the second type among the cross-modality dependency information based on an application running on the electronic device.

[0008] The device further includes a communication interface including a communication circuit and a display, wherein the at least one processor may be configured such that, when the electronic device executes a video call application, the processor obtains the first modality of a voice type among the plurality of modality types from another electronic device through the communication interface, identifies information corresponding to the voice type and the video type among the cross-modality dependency information based on the video call application, inputs the first modality and the identified information into the neural network model to obtain the second modality of the video type, and displays a screen corresponding to the second modality through the display.

[0009] The at least one processor may be configured, individually or collectively, to cause the electronic device to update the second modality based on the state of the communication channel with the other electronic device, and to display a screen corresponding to the updated second modality through the display.

[0010] The device may further include a microphone and a communication interface including a communication circuit, and the at least one processor may be configured such that, when the electronic device executes a video call application, the electronic device acquires the first modality of a voice type among the plurality of modality types through the microphone, identifies information corresponding to the voice type and a video type among the cross-modality dependency information based on the video call application, inputs the first modality and the identified information into the neural network model to acquire the second modality of the video type, and controls the communication interface to transmit the second modality to another electronic device.

[0011] The at least one processor may be configured, individually or collectively, to cause the electronic device to acquire the first modality of the first type and a third modality of a third type among the plurality of modality types, identify information corresponding to the first type and the third type among the cross-modality dependency information, input one of the first modality and the third modality, together with the identified information, into the neural network model, and verify the other of the first modality and the third modality based on the information acquired from the neural network model.
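
The verification described in this paragraph could be sketched as a comparison between a model prediction and the modality actually acquired. The text does not say how the comparison is made, so the cosine-similarity measure, the `threshold` value, and the function names below are all assumptions chosen for illustration; `predicted` stands for whatever the neural network model outputs for one modality, and `observed` for the other modality as acquired.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity; returns 0.0 when either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify_modality(predicted, observed, threshold=0.9):
    # Hypothetical rule: accept the observed modality when it is close
    # enough to the features predicted from the other modality.
    return cosine_similarity(predicted, observed) >= threshold

consistent = verify_modality([1.0, 0.0], [2.0, 0.0])    # same direction
inconsistent = verify_modality([1.0, 0.0], [0.0, 1.0])  # orthogonal
```

A real system would compare learned embeddings rather than two-element lists; the structure of the check is the only point of the sketch.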

[0012] The at least one processor may be configured, individually or collectively, to cause the electronic device to update the second modality based on the user's health condition corresponding to the first modality.

[0013] The at least one processor may be configured, individually or collectively, to cause the electronic device to acquire the first modality of the first type among the plurality of modality types, encode the first modality, identify, based on the context, information corresponding to the first type and the second type among the cross-modality dependency information, input the encoded first modality and the identified information into the neural network model to acquire output data, and decode the output data to acquire the second modality.
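
The encode, infer, decode sequence in this paragraph can be illustrated with a minimal sketch. The disclosure does not specify the encoder, the model, or the decoder, so every function below is a toy placeholder: `encode` assumes 16-bit PCM voice samples, `model` is a trivial element-wise scaling standing in for the neural network, and `decode` assumes 8-bit output intensities.

```python
def encode(samples):
    # Assumed encoder: scale 16-bit PCM samples into [-1, 1).
    return [s / 32768 for s in samples]

def model(latent, dependency_info):
    # Placeholder for the neural network model: element-wise scaling
    # of the encoded input by the identified dependency information.
    return [z * d for z, d in zip(latent, dependency_info)]

def decode(output):
    # Assumed decoder: map values in [-1, 1] to 8-bit intensities.
    return [round((o + 1) * 127.5) for o in output]

def acquire_second_modality(first_modality, dependency_info):
    encoded = encode(first_modality)           # encode the first modality
    output = model(encoded, dependency_info)   # acquire output data
    return decode(output)                      # decode into the second modality

result = acquire_second_modality([16384, -16384], [1.0, 1.0])
```

The three stages are kept as separate functions so that each could be swapped for a trained component without changing the calling code.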

[0014] The above cross-modality dependency information can be obtained based on at least two types of sample modalities among the plurality of modality types.

[0015] According to one embodiment of the present disclosure, a method of operating an electronic device may include the steps of: acquiring a first modality of a first type among a plurality of modality types; identifying, based on context, information corresponding to the first type and a second type among cross-modality dependency information including information on correlations between modalities; and inputting the first modality and the identified information into a neural network model to acquire a second modality of the second type.

[0016] The step of acquiring the first modality involves acquiring the first modality of the first type and the third modality of the second type among the plurality of modality types, and the step of identifying involves identifying information corresponding to the first type and the second type among the cross-modality dependency information when it is identified that a part of the third modality is damaged or requires restoration, and the step of acquiring the second modality involves inputting the first modality and the identified information into the neural network model to acquire the second modality of the second type corresponding to the third modality, and the control method may further include the step of restoring a part of the third modality based on the second modality.

[0017] The above identifying step can identify information corresponding to the first type and the second type among the cross-modality dependency information based on an application running on the electronic device.

[0018] The step of acquiring the first modality involves acquiring the first modality of a voice type among the plurality of modality types from another electronic device when a video call application is executed, the step of identifying involves identifying information corresponding to the voice type and the video type among the cross-modality dependency information based on the video call application, and the step of acquiring the second modality involves inputting the first modality and the identified information into the neural network model to acquire the second modality of the video type, and the control method may further include the step of displaying a screen corresponding to the second modality.

[0019] The method further includes a step of updating the second modality based on the state of the communication channel with the other electronic device, and the displaying step may display a screen corresponding to the updated second modality.

[0020] The step of acquiring the first modality involves acquiring the first modality of a voice type among the plurality of modality types through a microphone included in the electronic device when a video call application is executed, the step of identifying involves identifying information corresponding to the voice type and the video type among the cross-modality dependency information based on the video call application, and the step of acquiring the second modality involves inputting the first modality and the identified information into the neural network model to acquire the second modality of the video type, and the control method may further include the step of transmitting the second modality to another electronic device.

[0021] The step of acquiring the first modality involves acquiring the first modality of the first type and the third modality of the third type among the plurality of modality types, and the control method may further include the step of identifying information corresponding to the first type and the third type among the cross-modality dependency information, and the step of inputting one of the first modality and the third modality and the identified information into the neural network model and verifying the other of the first modality and the third modality based on the acquired information.

[0022] The method may further include a step of updating the second modality based on the user's health condition corresponding to the first modality.

[0023] The above control method further includes the step of encoding the first modality, and the step of acquiring the second modality may input the encoded first modality and the identified information into the neural network model to acquire output data, and decode the output data to acquire the second modality.

[0024] The above cross-modality dependency information can be obtained based on at least two types of sample modalities among the plurality of modality types.

[0025] The above and other aspects, features, and advantages of specific embodiments of the present disclosure will become more apparent from the following detailed description, which is taken into account in conjunction with the accompanying drawings. In the drawings:

[0026] FIG. 1 is a block diagram showing the configuration of an electronic device according to various embodiments of the present disclosure.

[0027] FIG. 2 is a block diagram showing the detailed configuration of an electronic device according to various embodiments of the present disclosure.

[0028] FIG. 3 is a drawing for illustrating a neural network model according to various embodiments of the present disclosure.

[0029] FIG. 4 is a drawing for explaining operation according to an incomplete modality according to various embodiments of the present disclosure.

[0030] FIG. 5 is a drawing for explaining a method for processing modalities according to various embodiments of the present disclosure.

[0031] FIG. 6 is a diagram illustrating cross-modality dependency information and a learning method for a neural network model according to various embodiments of the present disclosure.

[0032] FIG. 7 is a diagram illustrating neural network operations according to various embodiments of the present disclosure.

[0033] FIG. 8 is a drawing for explaining operation due to differences in specifications between devices according to various embodiments of the present disclosure.

[0034] FIG. 9 is a drawing for illustrating a method of providing emojis according to various embodiments of the present disclosure.

[0035] FIGS. 10 and 11 are drawings for explaining operations according to transmission errors according to various embodiments of the present disclosure.

[0036] FIG. 12 is a diagram illustrating a verification operation between modalities according to various embodiments of the present disclosure.

[0037] FIG. 13 is a drawing for explaining the effects according to various embodiments of the present disclosure.

[0038] FIG. 14 is a flowchart illustrating a method of operation of an electronic device according to various embodiments of the present disclosure.

[0039] Various embodiments of the present disclosure may be modified in various ways. Accordingly, various embodiments are illustrated in the drawings and described in detail in the detailed description. However, the present disclosure is not limited to specific embodiments and should be understood to include all modifications, equivalents, and substitutions that do not depart from the spirit and scope of the invention. Furthermore, specific descriptions of known functions or configurations that may unnecessarily obscure the gist of the present disclosure may be omitted.

[0040] One aspect of the present disclosure is to provide an electronic device and a control method thereof that resolve nonlinearity and complexity between modalities and acquire a target modality based on at least one modality.

[0041] The present disclosure will be described in more detail below with reference to the attached drawings.

[0042] The terms used in the various embodiments of this disclosure have been selected to be as widely used and general as possible, taking into account their functions within this disclosure; however, these terms may vary depending on the intent of those skilled in the art, case law, the emergence of new technologies, etc. Additionally, in specific cases, terms have been selected at the applicant's discretion, and in such cases, their meanings will be described in detail in the relevant description section of this disclosure. Therefore, terms used in this disclosure should be defined not merely by their names, but based on their meanings and the overall content of this disclosure.

[0043] In this specification, expressions such as “have,” “may have,” “include,” or “may include” indicate the presence of such features (e.g., numerical values, functions, operations, or components such as parts) and do not exclude the presence of additional features.

[0044] The expression "at least one of A and/or B" should be understood as representing "A", "B", or "A and B".

[0045] Expressions such as "first" or "second" used in this specification may modify various components regardless of order and/or importance, and are used only to distinguish one component from another and do not limit said components.

[0046] The singular expression includes the plural expression unless the context clearly indicates otherwise. In this application, terms such as "comprising" or "consisting of" are intended to specify the existence of the features, numbers, steps, actions, components, parts, or combinations thereof described in the specification, and should be understood as not precluding the existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.

[0047] In this specification, the term "user" may refer to a person using an electronic device or a device using an electronic device (e.g., an artificial intelligence electronic device).

[0048] Various embodiments of the present disclosure will be described in more detail below with reference to the attached drawings.

[0049] FIG. 1 is a block diagram showing the configuration of an electronic device (100) according to various embodiments of the present disclosure.

[0050] The electronic device (100) can handle modalities. For example, the electronic device (100) may include devices such as a computer main body, a set-top box (STB), a server, an AI speaker, a TV, a desktop PC, a laptop, a smartphone, a tablet PC, smart glasses, a smart watch, etc. However, it is not limited thereto, and the electronic device (100) may be any device capable of handling modalities.

[0051] According to FIG. 1, the electronic device (100) includes a memory (110) and a processor (120) (e.g., including processing circuitry).

[0052] Memory (110) may refer to hardware that stores information, such as data, in an electrical or magnetic form so that a processor (120), etc., can access it. To this end, memory (110) may be implemented as at least one piece of hardware among non-volatile memory, volatile memory, flash memory, hard disk drive (HDD) or solid-state drive (SSD), RAM, ROM, etc.

[0053] At least one instruction required for the operation of an electronic device (100) or a processor (120) may be stored in the memory (110). The instruction is a unit of code that directs the operation of the electronic device (100) or the processor (120), and may be written in machine language, which is a language that a computer can understand. Alternatively, a plurality of instructions that perform a specific task of the electronic device (100) or the processor (120) may be stored in the memory (110) as an instruction set.

[0054] Data, which is information in bit or byte units capable of representing characters, numbers, images, etc., may be stored in the memory (110). For example, cross-modality dependency information, neural network models, etc. may be stored in the memory (110). Cross-modality dependency information may include information regarding correlations between modalities, and may be obtained based on sample modalities of at least two types among a plurality of modality types. For instance, cross-modality dependency information may include information regarding correlations between a user's face and the user's voice, and may further include information regarding correlations between the user's face and the user's electrocardiogram (ECG). However, it is not limited thereto, and cross-modality dependency information may further include information regarding correlations between any other kinds of biometric information, as well as correlations among modalities of more than two types. The neural network model may include a model trained to output a new modality. For example, the neural network model may be a model trained to output a modality of a target type when the first modality and the information, among the cross-modality dependency information, corresponding to the type of the first modality and the target type are input.
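
One way to picture the stored cross-modality dependency information is as a lookup table keyed by the combination of modality types it relates. The sketch below is purely illustrative: the `DependencyStore` class, the use of an unordered `frozenset` key, and the numeric vectors are assumptions, since the disclosure does not specify a storage format.

```python
from dataclasses import dataclass, field

@dataclass
class DependencyStore:
    """Hypothetical container for cross-modality dependency information."""
    # Maps an unordered combination of modality types to a correlation descriptor.
    entries: dict = field(default_factory=dict)

    def put(self, types, info):
        self.entries[frozenset(types)] = info

    def get(self, *types):
        # Returns None when no correlation is stored for this combination.
        return self.entries.get(frozenset(types))

store = DependencyStore()
store.put(("face", "voice"), [0.8, 0.1, 0.3])         # placeholder values
store.put(("face", "ecg"), [0.2, 0.7, 0.5])
store.put(("face", "voice", "ecg"), [0.4, 0.4, 0.6])  # more than two types
```

Keying on an unordered set means the same entry is found whether the lookup starts from the face or from the voice; a directional design would use ordered tuples instead.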

[0055] The memory (110) is accessed by the processor (120), and the processor (120) may perform read / write / modify / delete / update, etc. on instructions, instruction sets, or data.

[0056] The processor (120) includes various processing circuits and controls the overall operation of the electronic device (100). For example, the processor (120) may be connected to each component of the electronic device (100) to control the overall operation of the electronic device (100). For example, the processor (120) may be connected to a component such as a memory (110) to control the operation of the electronic device (100).

[0057] For example, the processor (120) may include one or more processors among a CPU, a GPU (Graphics Processing Unit), an APU (Accelerated Processing Unit), a MIC (Many Integrated Core), an NPU (Neural Processing Unit), a hardware accelerator, or a machine learning accelerator. One or more processors (120) may control one or any combination of other components of the electronic device (100) and may perform operations or data processing related to communication. One or more processors (120) may execute one or more programs or instructions stored in memory. For example, one or more processors (120) may perform a method according to one embodiment of the present disclosure by executing one or more instructions stored in memory.

[0058] When a method according to one embodiment of the present disclosure includes a plurality of operations, the plurality of operations may be performed by a single processor or by a plurality of processors. For example, when a first operation, a second operation, and a third operation are performed by a method according to one embodiment, the first operation, the second operation, and the third operation may all be performed by a first processor, or the first operation and the second operation may be performed by a first processor (e.g., a general-purpose processor) and the third operation may be performed by a second processor (e.g., an artificial intelligence dedicated processor).

[0059] One or more processors (120) may be implemented as a single-core processor including one core, or as one or more multicore processors including multiple cores (e.g., homogeneous multicore or heterogeneous multicore). When one or more processors (120) are implemented as multicore processors, each of the multiple cores included in the multicore processor may include internal processor memory such as cache memory or on-chip memory, and a common cache shared by multiple cores may be included in the multicore processor. Additionally, each of the multiple cores included in the multicore processor (or some of the multiple cores) may independently read and execute program instructions for implementing a method according to one embodiment of the present disclosure, or all (or some) of the multiple cores may be linked together to read and execute program instructions for implementing a method according to one embodiment of the present disclosure.

[0060] When a method according to one embodiment of the present disclosure includes a plurality of operations, the plurality of operations may be performed by one of the plurality of cores included in a multi-core processor, or may be performed by a plurality of cores. For example, when a first operation, a second operation, and a third operation are performed by a method according to one embodiment, the first operation, the second operation, and the third operation may all be performed by a first core included in a multi-core processor, or the first operation and the second operation may be performed by a first core included in a multi-core processor and the third operation may be performed by a second core included in a multi-core processor.

[0061] In the embodiments of the present disclosure, one or more processors (120) may be, for example, a system-on-chip (SoC) in which one or more processors and other electronic components are integrated, a single-core processor, a multi-core processor, or a core included in a single-core processor or a multi-core processor, wherein the core may be implemented as a CPU, GPU, APU, MIC, NPU, hardware accelerator, or machine learning accelerator, but the embodiments of the present disclosure are not limited thereto. However, for convenience of explanation, the operation of the electronic device (100) is described below using the expression "processor (120)."

[0062] The processor (120) can acquire a first modality of a first type among a plurality of modality types. For example, the processor (120) can acquire a first modality of a first type through a camera, microphone, sensor, etc. included in the electronic device (100). The processor (120) may also receive a first modality of a first type from another electronic device. The modality may be various biometric information, such as a user's face, voice, fingerprint, heart rate, bioacoustic signal, etc. When the processor (120) acquires a first modality of a first type through a camera, microphone, sensor, etc. included in the electronic device (100), it may acquire a first modality related to the user of the electronic device (100). When the processor (120) receives a first modality of a first type from another electronic device, it may acquire a first modality related to another user of the other electronic device.

[0063] The processor (120) can identify information corresponding to a first type and a second type among cross-modality dependency information stored in memory (110) based on context. For example, the cross-modality dependency information may include information regarding the correlation between the user's face and the user's voice, information regarding the correlation between the user's face and the user's electrocardiogram, information regarding the correlation between the user's face and the user's fine motion, information regarding the correlation between the user's steps and the user's fine motion, information regarding the correlation between the user's fingerprint and the vein structure of the user's palm, etc. For instance, if the processor (120) acquires the user's voice as a first modality of a first type and identifies the user's face as a second type, it can identify information regarding the correlation between the user's face and the user's voice among the cross-modality dependency information.

[0064] However, it is not limited thereto, and the processor (120) may identify information corresponding to the first type and the second type among the cross-modality dependency information based on the currently running application. The processor (120) may also identify information corresponding to the first type and the second type among the cross-modality dependency information based on the user's location. The processor (120) may identify the second type among a plurality of modality types based on the first type, and identify information corresponding to the first type and the second type among the cross-modality dependency information.
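
The context-based identification described above can be sketched as two lookups: first pick a target (second) type from the context, then fetch the dependency entry for the pair. The table contents, the application name `video_call`, and both function names are hypothetical; the dependency store is assumed here to be a plain dict keyed by a `frozenset` of type names.

```python
# Hypothetical mapping from (running application, first type) to target type.
APP_TARGETS = {
    ("video_call", "voice"): "video",
    ("video_call", "video"): "voice",
}
# Hypothetical fallback used when only the first type is known.
DEFAULT_TARGETS = {"voice": "face"}

def select_target_type(first_type, app=None):
    # Prefer the application context; otherwise fall back on the first type.
    if app is not None and (app, first_type) in APP_TARGETS:
        return APP_TARGETS[(app, first_type)]
    return DEFAULT_TARGETS.get(first_type)

def identify_dependency(store, first_type, app=None):
    # Returns (target type, dependency information), or (None, None)
    # when no target type can be determined from the context.
    target = select_target_type(first_type, app)
    if target is None:
        return None, None
    return target, store.get(frozenset((first_type, target)))

store = {frozenset(("voice", "video")): "voice-video correlation"}
selected = identify_dependency(store, "voice", app="video_call")
```

Other context signals mentioned in the text, such as the user's location, could be added as further keys in the same way.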

[0065] The processor (120) can obtain a second modality of the second type by inputting the first modality and the identified information into a neural network model. In the example described above, the processor (120) can obtain the user's face as the second modality of the second type by inputting the user's voice, which is the first modality of the first type, and the information regarding the correlation between the user's face and the user's voice among the cross-modality dependency information into the neural network model.

[0066] By using the cross-modality dependency information in addition to the first modality to acquire the second modality, the processor (120) can reduce the load caused by non-linearity and complexity between modalities, enabling on-device modality acquisition.
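
As a rough illustration of feeding both inputs to a model, the sketch below concatenates the first modality with the identified dependency information and applies a single linear layer. This is an assumption made for clarity, not the architecture of the disclosure: `toy_model`, the feature values, and the hand-picked untrained `weights` are all placeholders.

```python
def toy_model(first_modality, dependency_info, weights):
    """Stand-in for the neural network model: one linear layer over the
    concatenation of the first modality features and the dependency vector."""
    x = list(first_modality) + list(dependency_info)
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

voice_features = [1.0, 2.0]    # first modality (placeholder features)
face_voice_info = [0.5]        # identified dependency information
weights = [[1.0, 0.0, 2.0],    # untrained, hand-picked for illustration
           [0.0, 1.0, -2.0]]

face_features = toy_model(voice_features, face_voice_info, weights)
unconditioned = toy_model(voice_features, [0.0], weights)
```

The two calls differ only in the dependency vector, which shows how the identified information conditions the output for the same first modality.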

[0067] The processor (120) obtains a first modality of a first type and a third modality of a second type among a plurality of modality types, and if it is identified that a part of the third modality is damaged or needs to be restored, it identifies information corresponding to the first type and the second type among cross-modality dependency information, inputs the first modality and the identified information into a neural network model to obtain a second modality of a second type corresponding to the third modality, and can restore a part of the third modality based on the second modality.

[0068] For example, when a user is in a video call with another user, the processor (120) may obtain a first modality of video type and a third modality of voice type from another electronic device used by the other user. While providing the video call function, the processor (120) may identify that the received data is damaged or requires restoration. For example, the processor (120) may identify that a part of the third modality of voice type is damaged or requires restoration. In this case, the processor (120) may identify information regarding the correlation between video and voice among the cross-modality dependency information, and input the first modality and the identified information into a neural network model to obtain a second modality of voice type corresponding to the third modality. Here, the second modality is voice-type information generated from the first modality of video type using the cross-modality dependency information, and the processor (120) may restore a part of the third modality based on the second modality. That is, the processor (120) can retain the undamaged remainder of the third modality and restore only the damaged portion of the third modality based on the second modality.
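
The selective restoration in this paragraph, keeping intact data and filling only the damaged portion, could look like the following sketch. It assumes the received (third) modality and the generated (second) modality are split into aligned frames and that a per-frame damage flag is available; the function name and frame representation are hypothetical.

```python
def restore_damaged(third_modality, damaged, second_modality):
    """Keep intact frames of the received (third) modality and fill only the
    damaged positions from the generated (second) modality."""
    if not (len(third_modality) == len(damaged) == len(second_modality)):
        raise ValueError("sequences must be frame-aligned")
    return [gen if bad else recv
            for recv, bad, gen in zip(third_modality, damaged, second_modality)]

received = ["a0", None, "a2"]       # frame 1 arrived damaged
flags = [False, True, False]
generated = ["g0", "g1", "g2"]      # voice generated from the video
restored = restore_damaged(received, flags, generated)
```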

[0069] If the processor (120) is unable to read a part of the third modality, it may identify that a part of the third modality is damaged. Alternatively, the processor (120) may identify that a part of the third modality requires restoration based on an error detection method, such as an error correction code. However, it is not limited thereto, and the processor (120) may identify that a part of the third modality is damaged or requires restoration in any number of different ways.
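One way such damage identification might look is sketched below. The sketch assumes that the sender attaches a CRC-32 checksum to each packet; CRC-32 is used here only as a familiar error-detecting code and is not a method named by the disclosure. The function name is_damaged and the packet contents are hypothetical.

```python
import zlib
from typing import Optional


def is_damaged(payload: Optional[bytes], expected_crc: int) -> bool:
    """Treat a packet as damaged if it cannot be read at all (None)
    or if its CRC-32 checksum differs from the value sent with it."""
    if payload is None:
        return True
    return zlib.crc32(payload) != expected_crc


# Assumed example: the checksum is computed by the sender before transmission.
packet = b"voice-frame-0001"
crc = zlib.crc32(packet)
corrupted = b"voice-frame-0O01"  # one byte altered in transit
```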

[0070] In the above description, the processor (120) is described as acquiring both a first modality and a third modality, but it is not limited thereto. For example, the processor (120) may acquire one of the first modality and the third modality, and if a part of the acquired modality is identified as damaged or as requiring restoration, it may acquire the other of the first modality and the third modality.

[0071] The processor (120) can identify information corresponding to a first type and a second type among cross-modality dependency information based on an application running on the electronic device (100). For example, the electronic device (100) further includes a communication interface and a display, and when a video call application is executed, the processor (120) can obtain a first modality of a voice type among a plurality of modality types from another electronic device through the communication interface, identify a video type as a second type based on the video call application, identify information corresponding to the voice type and the video type among cross-modality dependency information, input the first modality and the identified information into a neural network model to obtain a second modality of a video type, and display a screen corresponding to the second modality through the display. That is, when the processor (120) receives only voice data without video data from another electronic device that is the target of the video call application, it can provide a video call function without receiving video data by generating video from the voice.

[0072] The processor (120) may update a second modality based on the state of a communication channel with another electronic device and display a screen corresponding to the updated second modality through a display. Accordingly, the processor (120) may provide a screen corresponding to the second modality that reflects the state of the communication channel. The processor (120) may update a first modality based on the state of a communication channel with another electronic device, input the updated first modality and identified information into a neural network model to obtain a second modality of an image type, and display a screen corresponding to the second modality through a display.

[0073] However, it is not limited thereto, and the processor (120) may perform an operation to acquire a second modality of video type even if a video call application is executed and a first modality of voice type and a second modality of video type are received from another electronic device through a communication interface. For example, the processor (120) may acquire a second modality of video type from a first modality of voice type based on at least one of the state of a communication channel with another electronic device or the performance of another electronic device, and may display a screen corresponding to the second modality through a display. In this case, the processor (120) may request the other electronic device to stop the transmission of the second modality of video type.

[0074] The electronic device (100) further includes a microphone and a communication interface (e.g., including a communication circuit), and when a video call application is executed, the processor (120) may acquire a first modality of a voice type among a plurality of modality types through the microphone, identify information corresponding to the voice type and the video type among cross-modality dependency information based on the video call application, input the first modality and the identified information into a neural network model to acquire a second modality of a video type, and control the communication interface to transmit the second modality to another electronic device.

[0075] For example, if the electronic device (100) does not include a camera, the processor (120) may control a communication interface to obtain a first modality of voice type through a microphone, obtain a second modality of image type from the first modality, and transmit the second modality to another electronic device.

[0076] The processor (120) can acquire a first modality of a first type and a third modality of a third type among a plurality of modality types, identify information corresponding to the first type and the third type among cross-modality dependency information, and input one of the first modality and the third modality and the identified information into a neural network model to verify the other of the first modality and the third modality based on the acquired information. For example, the processor (120) can identify whether the face of another user of another electronic device and the voice of another user correspond during a video call.

[0077] The processor (120) can update a second modality based on the user's health condition corresponding to the first modality. For example, cross-modality dependency information may be generated when the user's health condition is good, but if the user's health condition subsequently deteriorates, it may need to be corrected, and the processor (120) can obtain a second modality that reflects the user's health condition by updating the second modality based on the user's health condition corresponding to the first modality. The processor (120) may also update the first modality based on the user's health condition and obtain a second modality based on the updated first modality.

[0078] A processor (120) can obtain a first modality of a first type among a plurality of modality types, encode the first modality, identify a second type based on context, identify information corresponding to the first type and the second type among cross-modality dependency information, input the encoded first modality and the identified information into a neural network model to obtain output data, and decode the output data to obtain a second modality.
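The sequence of operations in this paragraph (encode, identify the second type from context, look up the dependency information, run the model, decode) can be sketched as below. Every component is a toy stand-in chosen only to make the data flow runnable: the dependency store, the context map, and the functions encode, model, decode, and acquire_target are all hypothetical and are not the models of the disclosure.

```python
from typing import Dict, List, Tuple

Vector = List[float]

# Hypothetical dependency store keyed by (first type, second type).
DEPENDENCY_INFO: Dict[Tuple[str, str], Vector] = {
    ("voice", "video"): [0.5, 2.0],
}

# Hypothetical mapping from a context to the target (second) type.
CONTEXT_TO_TYPE: Dict[str, str] = {"video_call": "video"}


def encode(modality: Vector) -> Vector:
    """Toy encoder: center the signal around zero."""
    mean = sum(modality) / len(modality)
    return [x - mean for x in modality]


def model(encoded: Vector, dependency: Vector) -> Vector:
    """Toy stand-in for the neural network: an affine map whose
    parameters come from the identified dependency information."""
    scale, bias = dependency
    return [scale * x + bias for x in encoded]


def decode(output: Vector) -> Vector:
    """Toy decoder: round the output to a fixed precision."""
    return [round(x, 3) for x in output]


def acquire_target(first: Vector, first_type: str, context: str) -> Vector:
    second_type = CONTEXT_TO_TYPE[context]                    # identify second type
    dependency = DEPENDENCY_INFO[(first_type, second_type)]   # identify information
    return decode(model(encode(first), dependency))


second = acquire_target([1.0, 2.0, 3.0], "voice", "video_call")
```

The point of the sketch is the ordering of the steps: the dependency information is selected by the pair of types before the model runs, and the model receives both the encoded first modality and that information.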

[0079] Although it has been described as obtaining another modality from one modality, it is not limited thereto. For example, the processor (120) may obtain a target modality from at least two types of modalities.

[0080] The artificial intelligence-related functions according to the present disclosure can be operated through a processor (120) and a memory (110).

[0081] The processor (120) may be composed of one or more processors. The one or more processors may be a general-purpose processor such as a CPU, AP, DSP, etc., a graphics-dedicated processor such as a GPU, VPU (Vision Processing Unit), and / or an artificial intelligence-dedicated processor such as an NPU.

[0082] One or more processors control the processing of input data according to predefined (e.g., specified) operation rules or artificial intelligence models stored in memory (110). If one or more processors are dedicated artificial intelligence processors, the dedicated artificial intelligence processors may be designed with a hardware structure specialized for processing a specific artificial intelligence model. The predefined operation rules or artificial intelligence models are characterized by being created through learning.

[0083] Here, being created through learning means, for example, that a basic artificial intelligence model is trained using a number of training data by a learning algorithm, thereby creating a predefined rule of action or an artificial intelligence model configured to perform a desired characteristic (or objective). Such learning may be performed on the device itself where the artificial intelligence according to the present disclosure is executed, or it may be performed through a separate server and / or system. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.

[0084] An artificial intelligence model can be composed of multiple neural network layers. Each of the multiple neural network layers has multiple weight values and performs neural network operations through calculations between the results of previous layers and the multiple weights. The multiple weights possessed by the multiple neural network layers can be optimized based on the learning results of the artificial intelligence model. For example, the multiple weights can be updated during the learning process so that the loss or cost values obtained by the artificial intelligence model are reduced or minimized.
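As a generic illustration of weights being updated so that a loss value decreases, the following sketch trains a single weight by gradient descent on a mean squared error. The one-weight model, the learning rate, and the data are assumptions chosen for brevity; the disclosure does not specify a particular optimizer or loss.

```python
from typing import List, Tuple


def train_linear(
    xs: List[float], ys: List[float], lr: float = 0.05, steps: int = 200
) -> Tuple[float, List[float]]:
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    losses = []
    n = len(xs)
    for _ in range(steps):
        preds = [w * x for x in xs]
        # Loss is recorded before the weight update of this step.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        w -= lr * grad  # move the weight against the gradient
        losses.append(loss)
    return w, losses


# Assumed training data generated by y = 2 * x.
w, losses = train_linear([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```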

[0085] Artificial neural networks may include, but are not limited to, Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Bidirectional Recurrent Deep Neural Networks (BRDNNs), Generative Adversarial Networks (GANs), or Deep Q-Networks.

[0086] FIG. 2 is a block diagram showing the detailed configuration of an electronic device (100) according to various embodiments of the present disclosure. The electronic device (100) may include a memory (110) and a processor (120) (e.g., including a processing circuit). Additionally, according to FIG. 2, the electronic device (100) may further include a communication interface (130) (e.g., including a communication circuit), a display (140), a microphone (150), a user interface (160) (e.g., including a circuit), a camera (170), a sensor (180), and a speaker (190). Detailed descriptions of the components shown in FIG. 2 that overlap with the components shown in FIG. 1 may not be repeated.

[0087] The communication interface (130) is configured to include various communication circuits and to communicate with various types of external devices according to various types of communication methods. For example, an electronic device (100) can communicate with other electronic devices through the communication interface (130).

[0088] The communication interface (130) may include a Wi-Fi module, a Bluetooth module, an infrared communication module, and a wireless communication module, etc. Here, each communication module may be implemented in the form of at least one hardware chip.

[0089] The Wi-Fi module and Bluetooth module perform communication using the Wi-Fi and Bluetooth methods, respectively. When using the Wi-Fi or Bluetooth module, various connection information, such as the SSID and session key, is transmitted and received first; after establishing a communication connection using this information, various data can be transmitted and received. The infrared communication module performs communication according to infrared communication (IrDA, Infrared Data Association) technology, which wirelessly transmits data over short distances using infrared rays that lie between visible light and millimeter waves.

[0090] In addition to the communication method described above, the wireless communication module may include at least one communication chip that performs communication according to various wireless communication standards such as Zigbee, 3G (3rd Generation), 3GPP (3rd Generation Partnership Project), LTE (Long Term Evolution), LTE-A (LTE Advanced), 4G (4th Generation), and 5G (5th Generation).

[0091] The communication interface (130) may include wired communication interfaces such as HDMI, DP, Thunderbolt, USB, RGB, D-SUB, DVI, etc.

[0092] The communication interface (130) may include at least one of a LAN (Local Area Network) module, an Ethernet module, or a wired communication module that performs communication using a pair cable, a coaxial cable, or a fiber optic cable.

[0093] The display (140) is configured to display an image and can be implemented as various types of displays such as LCD (Liquid Crystal Display), OLED (Organic Light Emitting Diodes) display, PDP (Plasma Display Panel), etc., but is not limited thereto. The display (140) may also include a driving circuit, a backlight unit, etc., which can be implemented in forms such as a-si TFT, LTPS (low temperature poly silicon) TFT, OTFT (organic TFT), etc. The display (140) can be implemented as a touch screen combined with a touch sensor, a flexible display, a 3D display, etc.

[0094] The microphone (150) is configured to receive sound input and convert it into an audio signal. The microphone (150) is electrically connected to the processor (120) and can receive sound under the control of the processor (120).

[0095] For example, the microphone (150) may be formed as an integrated unit that is integrated into the upper side, front direction, side direction, etc. of the electronic device (100). The microphone (150) may also be provided in a remote control, etc., separate from the electronic device (100). In this case, the remote control may receive sound through the microphone (150) and provide the received sound to the electronic device (100).

[0096] The microphone (150) may include various configurations such as a microphone that collects analog sound, an amplifier circuit that amplifies the collected sound, an A / D conversion circuit that samples the amplified sound and converts it into a digital signal, and a filter circuit that removes noise components from the converted digital signal.

[0097] Meanwhile, the microphone (150) may be implemented in the form of a sound sensor, and any configuration capable of collecting sound is acceptable.

[0098] The user interface (160) may include various circuits and be implemented as a button, touchpad, mouse, or keyboard, or as a touch screen capable of performing both display functions and operation input functions. The button may be any of various types of buttons, such as a mechanical button, touchpad, or wheel, formed in any area of the exterior of the main body of the electronic device (100), such as the front, side, or back.

[0099] The camera (170) is configured to capture still images or video. The camera (170) can capture a still image at a specific point in time, but can also capture a series of still images.

[0100] The camera (170) includes a lens, a shutter, an aperture, a solid-state image sensor, an AFE (Analog Front End), and a TG (Timing Generator). The shutter controls the time when light reflected from a subject enters the camera (170), and the aperture controls the amount of light incident on the lens by mechanically increasing or decreasing the size of the opening through which light enters. When light reflected from a subject accumulates as photocharge, the solid-state image sensor outputs an image based on the photocharge as an electrical signal. The TG outputs a timing signal for reading out pixel data from the solid-state image sensor, and the AFE samples and digitizes the electrical signal output from the solid-state image sensor.

[0101] The sensor (180) includes various circuits and is configured to acquire biometric information related to the user of the electronic device (100), and may include a temperature sensor, a PPG (PhotoPlethysmoGraphy) sensor, etc.

[0102] A temperature sensor may be a sensor that measures the temperature of a body or part. The temperature sensor may be implemented in a contact or non-contact manner, and the measured temperature value may be provided to a memory (110) or a processor (120). The processor (120) may modify the skin temperature or body temperature or identify the situation based on the temperature value measured by the temperature sensor.

[0103] A PPG sensor may be a sensor for measuring changes in blood flow in blood vessels near the skin. A processor (120) may obtain breathing information of a user based on the PPG sensor. The user’s heart rate increases while inhaling and decreases while exhaling, and the processor (120) may obtain breathing information from data obtained from the PPG sensor based on a relationship between these breathing and heart rate called respiratory sinus arrhythmia.

[0104] The sensor (180) may include a configuration for acquiring posture information of the electronic device (100). For example, the sensor (180) may further include at least one of a gyroscope sensor, an accelerometer sensor, or a magnetometer sensor. The processor (120) may acquire motion information of the user based on the posture information of the electronic device (100) acquired through the sensor (180).

[0105] A gyro sensor is a sensor for detecting the rotation angle of an electronic device (100), and can measure changes in the orientation of an object by utilizing the property of always maintaining a constant direction initially set with high accuracy regardless of the rotation of the Earth. A gyro sensor is also called a gyroscope and can be implemented in a mechanical manner or an optical manner using light.

[0106] A gyroscope sensor can measure angular velocity. Angular velocity can be, for example, the angle of rotation per unit of time, and the measurement principle of a gyroscope sensor is as follows. For instance, the angular velocity is 0 degrees / sec in a horizontal state (stationary state). If an object tilts by 50 degrees while moving for 10 seconds, the average angular velocity over those 10 seconds is 5 degrees / sec. If the tilt angle of 50 degrees is maintained while stationary, the angular velocity becomes 0 degrees / sec. Through this process, the angular velocity changes from 0 to 5 to 0, and the angle increases from 0 degrees to 50 degrees. To calculate the angle from the angular velocity, integration must be performed over the entire time. Since the gyroscope sensor measures angular velocity in this manner, the tilt angle can be calculated by integrating this angular velocity over the entire time. However, errors occur in the gyroscope sensor due to the influence of temperature, and as these errors accumulate during the integration process, the final value may drift. Accordingly, the electronic device (100) may further be equipped with a temperature sensor and can compensate for the error of the gyro sensor using the temperature sensor.
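The worked example in this paragraph can be reproduced numerically. The sketch below assumes a 10 Hz sampling rate and simple rectangular integration, and it models the sensor error as a constant 0.1 degrees / sec bias purely to show how an error accumulates; none of these values come from the disclosure.

```python
from typing import Iterable


def integrate_angle(rates_dps: Iterable[float], dt: float) -> float:
    """Rectangular integration of angular velocity samples (degrees / sec)
    taken every dt seconds, returning the accumulated angle in degrees."""
    angle = 0.0
    for rate in rates_dps:
        angle += rate * dt
    return angle


# 10 seconds of rotation at 5 degrees / sec, then 2 seconds at rest (10 Hz).
samples = [5.0] * 100 + [0.0] * 20
angle = integrate_angle(samples, dt=0.1)

# The same motion measured with an assumed constant bias of 0.1 degrees / sec.
biased_angle = integrate_angle([r + 0.1 for r in samples], dt=0.1)
```

Over the 12 seconds of samples, the assumed bias alone adds about 1.2 degrees to the integrated angle, which is the kind of drift that the temperature compensation mentioned above is meant to reduce.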

[0107] An acceleration sensor is a sensor that measures the acceleration or the intensity of an impact of an electronic device (100), and is also called an accelerometer. An acceleration sensor detects dynamic forces such as acceleration, vibration, and impact, and can be implemented as an inertial type, a gyro type, a silicon semiconductor type, etc., depending on the detection method. An acceleration sensor can also sense the degree of tilt of an electronic device (100) using gravitational acceleration, and can typically be configured as a 2-axis or 3-axis sensor.

[0108] A magnetometer sensor is generally a sensor that measures the strength and direction of the Earth's magnetic field; however, in a broader sense, it also includes sensors that measure the magnetization strength of an object and is also referred to as a magnetometer. Magnetometer sensors can be implemented by suspending a magnet horizontally within a magnetic field and measuring the direction of its movement, or by rotating a coil within the field and measuring the induced electromotive force generated in the coil to measure the strength of the magnetic field.

[0109] A geomagnetic sensor, which measures the strength of the Earth's magnetic field as a type of magnetometer, can generally be implemented as a fluxgate-type geomagnetic sensor that detects geomagnetism using a fluxgate. A fluxgate-type geomagnetic sensor is a device that uses a high-permeability material such as permalloy as a magnetic core and applies an excitation field through a driving coil wound around the core. By measuring the second harmonic component proportional to the external magnetic field generated according to the magnetic saturation and nonlinear magnetic characteristics of the core, the magnitude and direction of the external magnetic field can be measured. By measuring the magnitude and direction of the external magnetic field, the current azimuth angle is detected, and accordingly, the degree of rotation can be measured. The geomagnetic sensor can be composed of a 2-axis or 3-axis fluxgate. A 2-axis fluxgate sensor, i.e., a 2-axis sensor, may be a sensor composed of mutually orthogonal X-axis fluxgates and Y-axis fluxgates, and a 3-axis fluxgate, i.e., a 3-axis sensor, may be a sensor in which a Z-axis fluxgate is added to the X-axis and Y-axis fluxgates.

[0110] By using the geomagnetic sensor and acceleration sensor as described above, attitude information of the electronic device (100) can be obtained. For example, the attitude information of the electronic device (100) can be expressed as pitch angle, roll angle, and azimuth angle.

[0111] The azimuth (yaw angle) can be, for example, an angle that changes in the left and right directions on a horizontal plane, and when the azimuth is calculated, it is possible to know which direction the electronic device (100) is facing. For example, if a geomagnetic sensor is used, the azimuth can be measured using the following formula.

[0112] ψ=arctan(sinψ / cosψ)

[0113] Here, ψ can be an azimuth, and cosψ and sinψ can be X-axis and Y-axis fluxgate output values.
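A minimal numerical sketch of the azimuth calculation follows. It uses the two-argument arctangent (atan2) instead of the arctangent of the ratio, which is a substitution made here so that a zero X-axis output does not cause a division by zero and so that the quadrant is resolved. The function name azimuth_deg, the normalization of the outputs, and the axis and sign conventions are assumptions.

```python
import math


def azimuth_deg(x_out: float, y_out: float) -> float:
    """Azimuth in [0, 360) degrees from the X-axis (cosine) and
    Y-axis (sine) fluxgate output values."""
    return math.degrees(math.atan2(y_out, x_out)) % 360.0


heading_east = azimuth_deg(0.0, 1.0)    # X output 0, Y output 1
heading_back = azimuth_deg(-1.0, 0.0)   # X output -1, Y output 0
```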

[0114] The roll angle may be an angle at which the horizontal plane tilts to the left or right, and by calculating the roll angle, the left or right tilt of the electronic device (100) can be determined. The pitch angle may be an angle at which the horizontal plane tilts up or down, and by calculating the pitch angle, the tilt angle at which the electronic device (100) tilts upward or downward can be determined. For example, using an acceleration sensor, the roll angle and pitch angle can be measured through the following formula.

[0115] φ=arcsin(ay / g)

[0116] θ=arcsin(ax / g)

[0117] Here, g represents the acceleration due to gravity, φ represents the roll angle, θ represents the pitch angle, ax represents the X-axis acceleration sensor output value, and ay represents the Y-axis acceleration sensor output value.
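The two formulas above can be evaluated as in the following sketch. It assumes that the accelerometer outputs are expressed in m/s² and that the device is at rest, so that gravity is the only acceleration measured; the clamping step is an added safeguard against measurement noise pushing the ratio slightly outside the domain of arcsin. The names roll_pitch_deg and G are hypothetical.

```python
import math
from typing import Tuple

G = 9.80665  # standard gravitational acceleration in m/s^2


def _clamp(value: float) -> float:
    """Limit a ratio to [-1, 1] so that asin stays defined under noise."""
    return max(-1.0, min(1.0, value))


def roll_pitch_deg(ax: float, ay: float, g: float = G) -> Tuple[float, float]:
    """Return (roll, pitch) in degrees using roll = arcsin(ay / g)
    and pitch = arcsin(ax / g)."""
    roll = math.degrees(math.asin(_clamp(ay / g)))
    pitch = math.degrees(math.asin(_clamp(ax / g)))
    return roll, pitch


# Assumed reading: half of g along the X axis, nothing along the Y axis.
roll, pitch = roll_pitch_deg(ax=G / 2, ay=0.0)
```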

[0118] For convenience of explanation, the sensor (180) has been described above as including at least one of a gyroscope sensor, an accelerometer sensor, a magnetometer sensor, or a sound sensor. However, it is not limited thereto, and the sensor (180) may be any sensor capable of acquiring posture information of the electronic device (100). The processor (120) can detect the user's motion based on the posture information of the electronic device (100).

[0119] The speaker (190) is a component that outputs various audio data processed by the processor (120), as well as various notification sounds or voice messages.

[0120] As described above, the electronic device (100) can acquire a target modality by utilizing not only a modality but also cross-modality dependency information, thereby reducing the modality processing load while improving performance.

[0121] The operation of the electronic device (100) will be described in more detail below through FIGS. 3 to 13. FIGS. 3 to 13 describe individual embodiments for convenience of explanation. However, the individual embodiments of FIGS. 3 to 13 may be implemented in any combination.

[0122] FIG. 3 is a drawing for illustrating a neural network model according to various embodiments of the present disclosure.

[0123] The processor (120) can acquire a first modality of a first type. For example, as shown in FIG. 3, if the electronic device (100) is a smartphone, the processor (120) can acquire the user's voice (waveform) as the first modality of the first type through a microphone (150) provided in the electronic device (100).

[0124] The processor (120) can identify a second type among multiple modality types based on context. For example, when a video call application is executed, the processor (120) can identify video as the second type among multiple modality types. The processor (120) can identify information corresponding to the voice of the first type and the video of the second type among cross-modality dependency information (DB), and input the user's voice and the identified information into a neural network model (ML models) to obtain a second modality (310) of the video type.

[0125] The processor (120) can obtain a video stream (330) based on a second modality (310) of the image type and a real face image (320) of a person included in the second modality (310) of the image type.

[0126] It has been described that the processor (120) identifies a second type among multiple modality types based on context. The context may include information about a first modality of a first type currently obtained by the processor (120). For example, if the processor (120) has obtained a first modality of a first type, it may identify one of multiple types paired with the first type as a second type. For instance, if the first type is voice and the cross-modality dependency information includes voice-video dependency information and voice-fingerprint dependency information, the processor (120) may not identify a heart rate or bioacoustic signal as a second type, but may identify a video or fingerprint as a second type.
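The pairing rule in this example can be sketched as a lookup over the keys of the dependency information. The sketch assumes that the dependency information is stored as a dictionary keyed by (first type, second type) pairs; that layout, the function name candidate_second_types, and the placeholder values are hypothetical.

```python
from typing import Dict, List, Tuple


def candidate_second_types(
    first_type: str, dependency_info: Dict[Tuple[str, str], str]
) -> List[str]:
    """Return the types that are paired with first_type in the
    cross-modality dependency information, in sorted order."""
    return sorted(dst for (src, dst) in dependency_info if src == first_type)


# Assumed contents: only voice-video and voice-fingerprint pairs involve voice.
dependency_info = {
    ("voice", "video"): "voice-video dependency",
    ("voice", "fingerprint"): "voice-fingerprint dependency",
    ("heart_rate", "bioacoustic"): "heart-rate-bioacoustic dependency",
}
candidates = candidate_second_types("voice", dependency_info)
```

With these contents, heart rate and bioacoustic signal are not returned for a voice input because no voice pair with those types is stored.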

[0127] FIG. 4 is a drawing for explaining operation according to an incomplete modality according to various embodiments of the present disclosure.

[0128] The processor (120) can acquire a first modality of a first type among a plurality of modality types. For example, as illustrated in FIG. 4, when a video call application is executed, the processor (120) can acquire a first modality (410) of a video type.

[0129] The processor (120) can identify that there is an error in the first modality (410) of the image type. For example, the processor (120) can identify that there is an error in the first modality (410) of the image type based on the resolution, capacity, etc. of the first modality (410) of the image type.

[0130] If the processor (120) identifies that there is an error in the first modality (410) of the image type, it can identify another type of second modality that can be obtained. For example, the processor (120) can identify a voice type (420-1), an image type (420-2), a fine motion (420-3), etc., as another type of second modality that can be obtained.

[0131] The processor (120) identifies information corresponding to the image type and the other obtainable type among the cross-modality dependency information, and inputs the second modality and the identified information into a neural network model to re-obtain the first modality (430) of the image type.

[0132] The processor (120) may provide the user with the reacquired first modality (430) of the image type. Alternatively, the processor (120) may remove errors from the initially acquired first modality (410) of the image type based on the reacquired first modality (430) of the image type and provide the first modality (410) of the image type with the errors removed.

[0133] Through this action, incomplete or erroneous modalities can be corrected.

[0134] FIG. 5 is a drawing for explaining a method for processing modalities according to various embodiments of the present disclosure.

[0135] The processor (120) can obtain system parameters through the system API (510) of the electronic device (100).

[0136] The processor (120) can identify a context through a context identification module (520) and identify a second type corresponding to the context through a target modality selection module (530-1). Additionally, the processor (120) can obtain update information through a tracking module (530-2) for updating cross-modality dependency information and update cross-modality dependency information (540).

[0137] The processor (120) can obtain a first modality (Mk) of a first type among a plurality of modality types, and can obtain an encoded first modality (V) by encoding the first modality (Mk) through an encoding model (Menc, 560). This first modality (Mk) may also be obtained from the system parameters through an interface module (570).

[0138] The processor (120) can identify, through the correlation model (Mcorr, 550), a type corresponding to the encoded first modality (V), and identify information corresponding to the identification result (f) among the cross-modality dependency information.

[0139] The processor (120) can obtain, through the modality generation model (Mdec, 580), a second modality (ML) of a second type based on the first modality (Mk) and the identified information among the cross-modality dependency information.

[0140] FIG. 6 is a diagram illustrating cross-modality dependency information and a learning method for a neural network model according to various embodiments of the present disclosure.

[0141] When multiple types of modalities are obtained, the processor (120) can update cross-modality dependency information based on multiple types of modalities.

[0142] For example, when voice (610-1) and video (610-2) are acquired as shown at the top of FIG. 6, the processor (120) can acquire audio features and video features from voice (610-1) and video (610-2), respectively, and acquire information on feature correlation between modalities based on the audio features and video features. The processor (120) can add or update information on feature correlation between modalities to cross-modality dependency information.

[0143] The voice (610-1) and video (610-2) may be modalities acquired during the same time interval.
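One simple way to quantify a correlation between features of two modalities acquired over the same time interval is sketched below. The disclosure does not state which correlation measure is used; the Pearson coefficient, the single scalar feature per frame, and the names pearson, audio_energy, and mouth_opening are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation coefficient of two equally long feature series."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


# Assumed per-frame features taken over the same time interval.
audio_energy = [0.1, 0.5, 0.9, 0.4]    # from the voice (610-1)
mouth_opening = [0.2, 1.0, 1.8, 0.8]   # from the video (610-2)

# Add or update the entry for the voice-video pair.
dependency_info: Dict[Tuple[str, str], float] = {}
dependency_info[("voice", "video")] = pearson(audio_energy, mouth_opening)
```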

[0144] The processor (120) can train the encoding model (Menc), correlation model (Mcorr), and modality generation model (Mdec) of FIG. 5 so that the value of an equation such as the one shown at the bottom of FIG. 6 is minimized. For example, the neural network model may be implemented in a form including the encoding model (Menc), correlation model (Mcorr), and modality generation model (Mdec). However, it is not limited thereto, and the neural network model may be implemented in various other forms.

[0145] FIG. 7 is a diagram illustrating neural network operations according to various embodiments of the present disclosure.

[0146] The processor (120) can obtain various information (710) from the system API. For example, as illustrated in FIG. 7, the processor (120) can obtain information regarding the user's health status, the usage status of the electronic device (100), etc. from the system API. Additionally, the processor (120) can obtain a first modality of a first type among a plurality of modality types.

[0147] The processor (120) can identify a second type among multiple modality types based on context and identify information corresponding to the first type and the second type among cross-modality dependency information (720). For example, the processor (120) can acquire audio features and video features and acquire differences between features (730), as shown in FIG. 7.

[0148] The processor (120) can obtain a second modality of a second type by inputting the first modality and the differences between features into the neural network model (740).

[0149] The processor (120) may encode the user's health status, etc., and input the encoded information (750) into a neural network operation, or evaluate a time delay (760) and input the time delay (760) into a neural network operation.

[0150] FIG. 8 is a drawing for explaining operation due to differences in specifications between devices according to various embodiments of the present disclosure.

[0151] A smartphone (810) can perform a video call with a smartwatch (820). At this time, the smartwatch (820) acquires voice (820-2) through a microphone and acquires the user's micro-motion (820-3) through a sensor, but may not be able to acquire video (820-1) because it is not equipped with a camera.

[0152] The smartphone (810) can receive at least one of voice (820-2) or micro-motion (820-3) from the smartwatch (820). The smartphone (810) can identify that video is required based on the context in which it is performing a video call and at least one of voice (820-2) or micro-motion (820-3) is received from the smartwatch (820).

[0153] The smartphone (810) can identify, among the cross-modality dependency information, information corresponding to the type of the modality received from the smartwatch (820) and the video type, and input the modality received from the smartwatch (820) and the identified information into a neural network model to obtain a modality (830) of the video type.

[0154] The smartphone (810) can provide a video type modality (830).
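One way to picture how the smartphone (810) decides that video must be generated is a set difference between the modality types a context requires and the types actually received. The sketch below assumes a hypothetical `REQUIRED_BY_CONTEXT` table and type names; the disclosure does not specify how the context is represented.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical mapping from a usage context to the modality types it needs.
REQUIRED_BY_CONTEXT: Dict[str, FrozenSet[str]] = {
    "video_call": frozenset({"voice", "video"}),
    "voice_call": frozenset({"voice"}),
}

def missing_types(context: str, received: Set[str]) -> Set[str]:
    """Return the modality types the context needs but the peer did not send."""
    return set(REQUIRED_BY_CONTEXT[context]) - received

# A smartwatch without a camera sends voice and micro-motion only.
received_from_watch = {"voice", "micro_motion"}
to_generate = missing_types("video_call", received_from_watch)
```

In this sketch, `to_generate` contains only the video type, which corresponds to the modality (830) that the smartphone obtains through the neural network model.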

[0155] FIG. 9 is a drawing for illustrating a method of providing emojis according to various embodiments of the present disclosure.

[0156] The processor (120) can acquire a first modality of a voice type among a plurality of modality types, identify an emoji type as a second type based on a user command, identify information corresponding to the first type and the second type among cross-modality dependency information, and input the first modality and the identified information into a neural network model to acquire a second modality (910) of an emoji type. An emoji may be a special character in which the picture itself consists of a single character, allowing emotions to be expressed on a character-by-character basis.

[0157] For example, the processor (120) may add information regarding the correlation between the user's voice type modality and the user's image type modality, as well as information regarding the correlation between the user's voice type modality and the emoji type modality, to the cross-modality dependency information. Through this operation, the electronic device (100) can protect the user's personal information.

[0158] FIGS. 10 and 11 are drawings for explaining operations in response to transmission errors according to various embodiments of the present disclosure.

[0159] The electronic device (100) can perform video calls with other electronic devices. However, due to a problem with the communication channel, the video and the audio may not both be transmitted, and only the audio may be transmitted.

[0160] For example, the processor (120) can perform a video call with another electronic device and receive video and voice, as shown in FIG. 10. The processor (120) can update cross-modality dependency information based on information about the correlation between video and voice.

[0161] Afterwards, if only voice is received from another electronic device due to a problem with the communication channel, the processor (120) may identify information corresponding to voice and video among cross-modality dependency information, and input the voice and identified information into a neural network model to obtain a video type modality (1010).

[0162] The processor (120) can receive sensor data and video data as illustrated in FIG. 11. When missing data (1110) among the video data is identified, the processor (120) may identify information corresponding to the sensor data and video data among cross-modality dependency information, and input the sensor data and the identified information into a neural network model to recover the video data (1120).
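The recovery of missing video data (1110) can be sketched as a fill operation: positions that arrived intact are kept, and only the missing positions are replaced by the model's estimate. The representation of a frame as a single number, the use of `None` for missing data, and the `restore_missing` helper are assumptions made for illustration.

```python
from typing import List, Optional

def restore_missing(frames: List[Optional[float]],
                    estimated: List[float]) -> List[float]:
    """Fill missing entries (None) with the estimate at the same index.

    Entries that were actually received are returned unchanged.
    """
    if len(frames) != len(estimated):
        raise ValueError("received and estimated data must have the same length")
    return [est if frame is None else frame
            for frame, est in zip(frames, estimated)]

received_video = [1.0, None, 3.0]    # the second frame was lost in transit
estimated_video = [1.1, 2.0, 2.9]    # hypothetical output derived from sensor data
restored_video = restore_missing(received_video, estimated_video)
```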

[0163] FIG. 12 is a diagram illustrating a verification operation between modalities according to various embodiments of the present disclosure.

[0164] The processor (120) can acquire a first modality of voice type (real voice) and a first modality of video type (real video), as shown in FIG. 12.

[0165] The processor (120) can identify information corresponding to voice and video among cross-modality dependency information, and input a first modality of voice type (real voice) and the identified information into a neural network model to obtain a second modality of video type (estimated video). Additionally, the processor (120) can identify information corresponding to voice and video among cross-modality dependency information, and input a first modality of video type (real video) and the identified information into a neural network model to obtain a second modality of voice type (estimated voice).

[0166] The processor (120) can compare a first modality of voice type (real voice, 1210) and a second modality of voice type (estimated voice) obtained through a neural network model, and compare a first modality of video type (real video, 1220) and a second modality of video type (estimated video) obtained through a neural network model.

[0167] The processor (120) can detect whether tampering has occurred through the comparison result. Through this operation, fraud caused by deepfakes, etc. can be prevented and / or reduced.
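The comparison between a real modality and its estimate can be sketched with a similarity score and a threshold. The disclosure does not state which comparison is used, so the cosine similarity, the threshold of 0.8, and the function names below are assumptions.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def looks_tampered(real: Sequence[float], estimated: Sequence[float],
                   threshold: float = 0.8) -> bool:
    """Flag the real modality when it disagrees with the estimate
    derived from the other modality (assumed threshold)."""
    return cosine_similarity(real, estimated) < threshold

consistent = looks_tampered([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
mismatch = looks_tampered([1.0, 0.0], [0.0, 1.0])
```

Running the check in both directions (voice against estimated voice, video against estimated video), as paragraph [0166] describes, means a forged stream has to stay consistent with the other stream in both comparisons.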

[0168] FIG. 13 is a drawing for explaining the effects according to various embodiments of the present disclosure.

[0169] As illustrated in FIG. 13, the bit error rate and the likelihood of frame skipping can be reduced through various exemplary operations such as those described in the present disclosure.

[0170] FIG. 14 is a flowchart illustrating a method for operating or controlling an electronic device according to various embodiments of the present disclosure.

[0171] A first modality of a first type among a plurality of modality types is obtained (S1410). Based on context, information corresponding to the first type and a second type is identified among cross-modality dependency information including information on correlations between modalities (S1420). The first modality and the identified information are input into a neural network model to obtain a second modality of the second type (S1430).
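Steps S1410 to S1430 can be sketched as a lookup keyed by the pair of modality types, followed by a model call. The table layout, the `CONTEXT_TO_TARGET` mapping, and the `scale_model` stand-in are hypothetical; the disclosure does not prescribe a data structure for the cross-modality dependency information.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
DependencyTable = Dict[Tuple[str, str], Vector]

# Hypothetical mapping from context to the second (target) modality type.
CONTEXT_TO_TARGET: Dict[str, str] = {"video_call": "video", "messaging": "emoji"}

def acquire_target(first_type: str, first_modality: Vector, context: str,
                   dependency: DependencyTable,
                   model: Callable[[Vector, Vector], Vector]
                   ) -> Tuple[str, Vector]:
    second_type = CONTEXT_TO_TARGET[context]        # S1420: context selects the type
    info = dependency[(first_type, second_type)]    # S1420: look up the type pair
    return second_type, model(first_modality, info) # S1430: generate the modality

def scale_model(x: Vector, info: Vector) -> Vector:
    """Hypothetical stand-in for the neural network model."""
    return [a * w for a, w in zip(x, info)]

dependency: DependencyTable = {
    ("voice", "video"): [2.0, 0.5],
    ("voice", "emoji"): [1.0, 1.0],
}
voice = [1.0, 4.0]                                  # S1410: first modality
second_type, second_modality = acquire_target(
    "voice", voice, "video_call", dependency, scale_model
)
```

Only the entry for the identified type pair is passed to the model in this sketch, rather than the whole table.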

[0172] The step of acquiring a first modality (S1410) acquires a first modality of a first type and a third modality of a second type among a plurality of modality types, and the step of identifying (S1420) identifies information corresponding to the first type and the second type among cross-modality dependency information when it is identified that a part of the third modality is damaged or requires restoration, and the step of acquiring a second modality (S1430) inputs the first modality and the identified information into a neural network model to acquire a second modality of a second type corresponding to the third modality, and the control method may further include the step of restoring a part of the third modality based on the second modality.

[0173] The identification step (S1420) can identify information corresponding to the first type and the second type among cross-modality dependency information based on an application running on an electronic device.

[0174] The step of acquiring a first modality (S1410) involves acquiring a first modality of voice type among multiple modality types from another electronic device when a video call application is executed, and the step of identifying (S1420) involves identifying information corresponding to voice type and video type among cross-modality dependency information based on the video call application, and the step of acquiring a second modality (S1430) involves inputting the first modality and the identified information into a neural network model to acquire a second modality of video type, and the control method may further include the step of displaying a screen corresponding to the second modality.

[0175] The control method may further include a step of updating the second modality based on the state of a communication channel with the other electronic device, and the displaying step may display a screen corresponding to the updated second modality.

[0176] The step of acquiring a first modality (S1410) involves acquiring a first modality of a voice type among a plurality of modality types through a microphone included in an electronic device when a video call application is executed, and the step of identifying (S1420) involves identifying information corresponding to the voice type and video type among cross-modality dependency information based on the video call application, and the step of acquiring a second modality (S1430) involves inputting the first modality and the identified information into a neural network model to acquire a second modality of a video type, and the control method may further include the step of transmitting the second modality to another electronic device.

[0177] The step of acquiring a first modality (S1410) acquires a first modality of a first type and a third modality of a third type among a plurality of modality types, and the control method may further include a step of identifying information corresponding to the first type and the third type among cross-modality dependency information, and a step of inputting one of the first modality and the third modality and the identified information into a neural network model and verifying the other of the first modality and the third modality based on the acquired information.

[0178] The control method may further include a step of updating the second modality based on the user's health condition corresponding to the first modality.

[0179] The control method may further include a step of encoding the first modality, and the step of acquiring a second modality (S1430) may input the encoded first modality and the identified information into a neural network model to acquire output data, and decode the output data to acquire the second modality.
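The encode, infer, and decode sequence can be sketched as three chained functions. The fixed-point quantization used for `encode` and `decode` and the additive `offset_model` are placeholders chosen for brevity; in the disclosure the encoding is performed by a learned encoding model, which this sketch does not attempt to reproduce.

```python
from typing import List

def encode(samples: List[float], scale: int = 10) -> List[int]:
    """Placeholder encoder: quantize each sample to an integer code."""
    return [round(s * scale) for s in samples]

def decode(codes: List[int], scale: int = 10) -> List[float]:
    """Placeholder decoder: map integer codes back to sample values."""
    return [c / scale for c in codes]

def offset_model(encoded: List[int], info: List[int]) -> List[int]:
    """Hypothetical stand-in for the neural network model,
    operating entirely in the encoded domain."""
    return [e + i for e, i in zip(encoded, info)]

def acquire_second_modality(first_modality: List[float],
                            info: List[int]) -> List[float]:
    encoded = encode(first_modality)        # encode the first modality
    output = offset_model(encoded, info)    # obtain output data from the model
    return decode(output)                   # decode into the second modality

second_modality = acquire_second_modality([0.5, 1.5], [1, -1])
```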

[0180] Cross-modality dependency information can be obtained based on at least two types of sample modalities among a plurality of modality types.
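As one possible reading of how correlation information could be derived from sample modalities, the sketch below computes a Pearson correlation coefficient for each pair of sample series and stores it under the pair of type names. The choice of Pearson correlation, the scalar series, and the `build_dependency` helper are assumptions; the disclosure does not name a specific statistic.

```python
import itertools
import math
from typing import Dict, List, Tuple

def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def build_dependency(samples: Dict[str, List[float]]
                     ) -> Dict[Tuple[str, str], float]:
    """Correlate every pair of sample modality types (names sorted)."""
    return {
        (a, b): pearson(samples[a], samples[b])
        for a, b in itertools.combinations(sorted(samples), 2)
    }

samples = {
    "voice": [1.0, 2.0, 3.0, 4.0],
    "video": [2.0, 4.0, 6.0, 8.0],
    "motion": [4.0, 3.0, 2.0, 1.0],
}
dependency = build_dependency(samples)
```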

[0181] According to various embodiments of the present disclosure as described above, an electronic device can acquire a target modality by using not only an input modality but also cross-modality dependency information, thereby reducing the modality processing load while improving performance.

[0182] According to various embodiments of the present disclosure, the various embodiments described above may be implemented as software comprising instructions stored on a machine-readable (e.g., computer-readable) storage medium. The machine is a device capable of calling stored instructions from the storage medium and operating according to the called instructions, and may include an electronic device (e.g., electronic device (A)) according to the disclosed embodiments. When the instructions are executed by a processor, the processor may perform a function corresponding to the instructions directly or by using other components under the control of the processor. The instructions may include code generated or executed by a compiler or an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, a 'non-transitory' storage medium does not contain a signal and is tangible, and the term does not distinguish whether data is stored semi-permanently or temporarily on the storage medium.

[0183] According to one embodiment of the present disclosure, the method according to the various embodiments described above may be provided as part of a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read-only memory (CD-ROM)) or online through an application store (e.g., Play Store™). In the case of online distribution, at least a portion of the computer program product may be temporarily stored or temporarily created in a storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

[0184] According to one embodiment of the present disclosure, the various embodiments described above may be implemented in a recording medium readable by a computer or a similar device using software, hardware, or a combination thereof. In some cases, the various embodiments described herein may be implemented by the processor itself. According to a software implementation, embodiments such as the procedures and functions described herein may be implemented as separate software modules. Each of the software modules may perform one or more of the functions and operations described herein.

[0185] Computer instructions for performing processing operations of the device according to the various embodiments described above may be stored in a non-transitory computer-readable medium. When computer instructions stored in such a non-transitory computer-readable medium are executed by the processor of a specific device, they cause the specific device to perform processing operations in the device according to the various embodiments described above. A non-transitory computer-readable medium refers to a medium that stores data semi-permanently and is readable by a device. Specific examples of a non-transitory computer-readable medium may include CDs, DVDs, hard disks, Blu-ray discs, USBs, memory cards, ROMs, etc.

[0186] Each component (e.g., module or program) according to the various embodiments described above may be composed of a single entity or multiple entities, and some of the aforementioned sub-components may be omitted, or other sub-components may be further included in the various embodiments. Alternatively or additionally, some components (e.g., modules or programs) may be integrated into a single entity that performs the same or similar functions as those performed by each of the respective components prior to integration. The operations performed by the module, program, or other components according to the various embodiments may be executed sequentially, in parallel, iteratively, or heuristically, or at least some operations may be executed in a different order or omitted, or other operations may be added.

[0187] Although preferred embodiments of the present disclosure have been illustrated and described above, the present disclosure is not limited to the specific embodiments described above. It is understood that various modifications can be made by those skilled in the art without departing from the essence of the present disclosure as claimed in the claims, and such modifications should not be understood separately from the technical spirit or perspective of the present disclosure. Furthermore, it will be understood that any embodiment described herein may be used in combination with any other embodiment described herein.

Claims

1. An electronic device comprising: a memory storing cross-modality dependency information including information on correlations between modalities, a neural network model, and instructions; and at least one processor including processing circuitry, wherein the at least one processor is configured to execute the instructions, individually or collectively, to cause the electronic device to: obtain a first modality of a first type among a plurality of modality types; identify, based on context, information corresponding to the first type and a second type among the cross-modality dependency information; and input the first modality and the identified information into the neural network model to obtain a second modality of the second type.

2. The electronic device of claim 1, wherein the at least one processor is configured, individually or collectively, to cause the electronic device to: obtain the first modality of the first type and a third modality of the second type among the plurality of modality types; when a part of the third modality is identified as damaged or as requiring restoration, identify the information corresponding to the first type and the second type among the cross-modality dependency information; input the first modality and the identified information into the neural network model to obtain the second modality of the second type corresponding to the third modality; and restore the part of the third modality based on the second modality.

3. The electronic device of claim 1, wherein the at least one processor is configured, individually or collectively, to cause the electronic device to identify the information corresponding to the first type and the second type among the cross-modality dependency information based on an application running on the electronic device.

4. The electronic device of claim 3, further comprising: a communication interface including a communication circuit; and a display, wherein the at least one processor is configured, individually or collectively, to cause the electronic device to: when a video call application is executed, obtain the first modality of a voice type among the plurality of modality types from another electronic device through the communication interface; identify, based on the video call application, information corresponding to the voice type and a video type among the cross-modality dependency information; input the first modality and the identified information into the neural network model to obtain the second modality of the video type; and display a screen corresponding to the second modality through the display.

5. The electronic device of claim 4, wherein the at least one processor is configured, individually or collectively, to cause the electronic device to: update the second modality based on a state of a communication channel with the other electronic device; and display a screen corresponding to the updated second modality through the display.

6. The electronic device of claim 3, further comprising: a microphone; and a communication interface including a communication circuit, wherein the at least one processor is configured, individually or collectively, to cause the electronic device to: when a video call application is executed, obtain the first modality of a voice type among the plurality of modality types through the microphone; identify, based on the video call application, information corresponding to the voice type and a video type among the cross-modality dependency information; input the first modality and the identified information into the neural network model to obtain the second modality of the video type; and control the communication interface to transmit the second modality to another electronic device.

7. The electronic device of claim 1, wherein the at least one processor is configured, individually or collectively, to cause the electronic device to: obtain the first modality of the first type and a third modality of a third type among the plurality of modality types; identify information corresponding to the first type and the third type among the cross-modality dependency information; and verify the other of the first modality and the third modality based on information obtained by inputting one of the first modality and the third modality and the identified information into the neural network model.

8. The electronic device of claim 1, wherein the at least one processor is configured, individually or collectively, to cause the electronic device to update the second modality based on a user's health condition corresponding to the first modality.

9. The electronic device of claim 1, wherein the at least one processor is configured, individually or collectively, to cause the electronic device to: obtain the first modality of the first type among the plurality of modality types; encode the first modality; identify, based on the context, the information corresponding to the first type and the second type among the cross-modality dependency information; input the encoded first modality and the identified information into the neural network model to obtain output data; and obtain the second modality by decoding the output data.

10. The electronic device of claim 1, wherein the cross-modality dependency information is obtained based on sample modalities of at least two types among the plurality of modality types.

11. A method for controlling an electronic device, the method comprising: acquiring a first modality of a first type among a plurality of modality types; identifying, based on context, information corresponding to the first type and a second type among cross-modality dependency information including information on correlations between modalities; and inputting the first modality and the identified information into a neural network model to acquire a second modality of the second type.

12. The method of claim 11, wherein the acquiring of the first modality comprises acquiring the first modality of the first type and a third modality of the second type among the plurality of modality types, wherein the identifying comprises, when a part of the third modality is identified as damaged or as requiring restoration, identifying the information corresponding to the first type and the second type among the cross-modality dependency information, wherein the acquiring of the second modality comprises inputting the first modality and the identified information into the neural network model to acquire the second modality of the second type corresponding to the third modality, and wherein the method further comprises restoring the part of the third modality based on the second modality.

13. The method of claim 11, wherein the identifying comprises identifying the information corresponding to the first type and the second type among the cross-modality dependency information based on an application running on the electronic device.

14. The method of claim 13, wherein the acquiring of the first modality comprises, when a video call application is executed, acquiring the first modality of a voice type among the plurality of modality types from another electronic device, wherein the identifying comprises identifying, based on the video call application, information corresponding to the voice type and a video type among the cross-modality dependency information, wherein the acquiring of the second modality comprises inputting the first modality and the identified information into the neural network model to acquire the second modality of the video type, and wherein the method further comprises displaying a screen corresponding to the second modality.

15. The method of claim 14, further comprising updating the second modality based on a state of a communication channel with the other electronic device, wherein the displaying comprises displaying a screen corresponding to the updated second modality.