Electronic device, method, and non-transitory storage medium for performing interpretation by using artificial intelligence model

The electronic device uses AI models to automatically identify and interpret audio in mixed sound environments, addressing inefficiencies in conventional methods by reducing setup time and enhancing language translation accuracy.

WO2026121801A1 · PCT designated stage · Publication Date: 2026-06-11 · SAMSUNG ELECTRONICS CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
SAMSUNG ELECTRONICS CO LTD
Filing Date
2025-12-02
Publication Date
2026-06-11

AI Technical Summary

Technical Problem

Conventional interpretation technologies require significant effort and time for initial setup to identify the audio interpretation target and often translate or interpret all audio, regardless of language, leading to inefficiencies in identifying desired audio in mixed sound environments.

Method used

An electronic device with cameras, microphones, a processor, and memory, utilizing artificial intelligence models to identify and interpret audio based on user selection and context, enabling real-time translation of audio inputs and identifying additional interpretation targets.

Benefits of technology

Facilitates efficient and context-aware audio interpretation by automatically identifying and translating desired audio in mixed sound environments, reducing the need for manual setup and improving language interpretation accuracy.



Abstract

The present disclosure relates to an electronic device, a method, and a non-transitory storage medium for performing interpretation by using an artificial intelligence model. According to an embodiment, a head-wearable electronic device may comprise at least one camera, a display, two or more microphones, a speaker, at least one processor, and a memory for storing instructions. According to an embodiment, the instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to: acquire images regarding external environments of the electronic device captured in real time by the at least one camera; identify, as a first interpretation target, at least one first object selected by the user on a screen displayed on the display on the basis of the images; acquire first audio information in real time from the external environments through the two or more microphones; acquire, on the basis of first information for interpretation of the first audio information, first interpretation information obtained by interpreting the first audio of the first interpretation target included in the first audio information and context information related to the first interpretation information by using the artificial intelligence model; output the audio of the first interpretation information through the speaker or display the text of the first interpretation information through the display; acquire second audio information including multiple audio streams detected from the external environments in real time through the two or more microphones from a timepoint at which the first information is provided to the artificial intelligence model to a timepoint at which the first interpretation information is acquired; acquire second interpretation information by interpreting at least one audio among the multiple audio streams included in the second audio information by using the artificial intelligence model on the basis of second information

Description

Electronic device, method, and non-transitory storage medium for performing interpretation by using an artificial intelligence model

[0001] The present disclosure relates to an electronic device, method, and non-transitory storage medium for performing interpretation by using an artificial intelligence model.

[0002] With the advancement of digital technology, electronic devices are being provided in various forms, such as smartphones, tablet PCs, and PDAs. Electronic devices are also being developed in wearable forms to enhance portability and user accessibility. Electronic devices can be configured in various forms to be worn on parts of the user's body, and as technology advances, technologies are being developed to provide virtual spaces that correspond to the actual external environment (e.g., virtual reality, augmented reality, or mixed reality).

[0003] Meanwhile, the electronic device may utilize artificial intelligence (AI) models to provide various services. At least some of the various AI models for various services may be implemented as generative AI models. Depending on the implementation, the AI models may operate in a form where multiple AI models are connected.

[0004] With increasing interest in artificial intelligence models, the field is experiencing rapid growth, and various technologies for interpreting or translating different languages have advanced quickly. Depending on the input, there can be various types of AI models; for interpretation, a model can receive audio directly as input to enable interpretation, receive audio converted into text for translation, or accept both inputs. There is a type of AI called a multi-modal model that can receive such diverse forms of input.

[0005] The information described above may be provided as related art for the purpose of aiding understanding of the present disclosure. No claim or determination is made as to whether any of the foregoing may be applied as prior art related to the present disclosure.

[0006] In the external environment of an electronic device, various sounds (e.g., speech (voice), device sounds) may be mixed due to various situations, such as conversations between people or situations where sound is output through the device. In such situations where various sounds are mixed, even if a user detects various sounds through the electronic device and acquires audio, it may be difficult to identify the audio of the interpretation target desired by the user. Conventional interpretation technology can learn and provide audio information of the interpretation target, but this method requires the audio information for the target to be set initially, which requires a lot of effort and time for the initial setup, and even after the interpretation target is set, there is the inconvenience of having to repeat the same process for other targets.

[0007] Conventional interpretation technology translates or interprets all audio in the same language and outputs the result. In contrast, the present disclosure may provide an electronic device, method, and non-transitory storage medium for interpretation using an artificial intelligence model that translate or interpret audio determined, from the context of the conversation, to be participating in the same topic, even if that audio is not in the same language.

[0008] The present disclosure describes an electronic device wearable on the head comprising at least one camera, a display, two or more microphones, a speaker, at least one processor, and a memory for storing instructions.

[0009] According to one embodiment, the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to perform the operations described below.

[0010] According to one embodiment, the electronic device acquires images of the external environment of the electronic device captured in real time by the at least one camera.

[0011] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device identifies at least one first object selected by the user on a screen displayed on the display based on the images as a first interpretation target.

[0012] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device is configured to acquire first audio information in real time from the external environment through the two or more microphones.

[0013] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device is configured to obtain, based on first information for interpreting the first audio information, first interpretation information that interprets the first audio of the first interpretation target included in the first audio information using the artificial intelligence model, and context information related to the first interpretation information.

[0014] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device is configured to output the audio of the first interpretation information through the speaker or display the text of the first interpretation information through the display.

[0015] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device is configured to acquire second audio information comprising a plurality of audios detected in real time from the external environment through the two or more microphones from the time the first information is provided to the artificial intelligence model until the time the first interpretation information is acquired.

[0016] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device obtains second interpretation information in which at least one of the plurality of audios included in the second audio information is interpreted using the artificial intelligence model based on second information for interpreting the second audio information, and obtains information regarding an additional interpretation target identified by comparing the context of the plurality of audios with the context information using the artificial intelligence model.

[0017] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device is configured to output the audio of the second interpretation information through the speaker or display the text of the second interpretation information through the display.
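The additional-target step in [0016] compares the context of each detected audio with the context information from the first interpretation. The following is only a minimal sketch of that idea, assuming each audio has already been transcribed and approximating "context" by keyword overlap (Jaccard similarity). The disclosure leaves the comparison to the artificial intelligence model, so `AudioStream`, `keywords`, `jaccard`, `find_additional_targets`, and the 0.2 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AudioStream:
    speaker_id: str  # hypothetical label for the sound source
    transcript: str  # assumed to come from an upstream speech-recognition step


def keywords(text: str) -> set[str]:
    # Crude stand-in for the model's context representation:
    # lowercase words longer than three letters, punctuation stripped.
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def jaccard(a: set[str], b: set[str]) -> float:
    # Overlap between two keyword sets, 0.0 when either is empty.
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_additional_targets(
    context_info: str, streams: list[AudioStream], threshold: float = 0.2
) -> list[str]:
    # Keep the sources whose speech appears to share the first target's topic.
    ctx = keywords(context_info)
    return [
        s.speaker_id
        for s in streams
        if jaccard(ctx, keywords(s.transcript)) >= threshold
    ]


streams = [
    AudioStream("speaker_b", "The quarterly budget review meeting starts tomorrow"),
    AudioStream("tv", "Heavy rain expected across the coast tonight"),
]
targets = find_additional_targets("budget review meeting schedule", streams)
print(targets)
```

In this toy input, the second speaker shares three keywords with the context and is kept, while the television audio shares none and is ignored. A real system would replace the keyword overlap with whatever representation the model produces.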

[0018] According to one embodiment, a method of operation in a head-wearable electronic device includes acquiring images of the external environment of the electronic device captured in real time by at least one camera of the electronic device.

[0019] According to one embodiment, the method includes the operation of identifying at least one first object selected by the user on a screen displayed on a display of the electronic device based on the images as a first interpretation target.

[0020] According to one embodiment, the method includes the operation of acquiring first audio information in real time from the external environment through two or more microphones of the electronic device.

[0021] According to one embodiment, the method includes the operation of obtaining first interpretation information and context information related to the first interpretation information by using the artificial intelligence model based on first information for interpreting the first audio information, wherein the first audio of the first interpretation target included in the first audio information is interpreted.

[0022] According to one embodiment, the method includes the operation of outputting the audio of the first interpretation information through a speaker of the electronic device or displaying the text of the first interpretation information through the display.

[0023] According to one embodiment, the method includes the operation of acquiring second audio information comprising a plurality of audios detected in real time from the external environment through the two or more microphones from the time when the first information is provided to an artificial intelligence model until the time when the first interpretation information is acquired.

[0024] According to one embodiment, the method comprises the operation of obtaining second interpretation information by interpreting at least one of the plurality of audios included in the second audio information using the artificial intelligence model based on second information for interpreting the second audio information, and obtaining information about an additional interpretation target identified by comparing the context of the plurality of audios with the context information using the artificial intelligence model.

[0025] According to one embodiment, the method includes the operation of outputting the audio of the second interpretation information through the speaker or displaying the text of the second interpretation information through the display.
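The second audio information in [0023] is bounded by two timepoints: when the first information is handed to the artificial intelligence model and when the first interpretation information comes back. The sketch below illustrates only that windowing, assuming timestamped audio chunks and inclusive bounds; `AudioChunk`, `second_audio_window`, and the sample timestamps are hypothetical and not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class AudioChunk:
    timestamp: float  # capture time in seconds (hypothetical clock)
    source: str       # hypothetical label for the sound source


def second_audio_window(
    chunks: list[AudioChunk], t_request: float, t_result: float
) -> list[AudioChunk]:
    """Return chunks captured while the model was processing the first request."""
    return [c for c in chunks if t_request <= c.timestamp <= t_result]


captured = [
    AudioChunk(0.5, "speaker_a"),
    AudioChunk(1.2, "speaker_b"),
    AudioChunk(1.8, "tv"),
    AudioChunk(2.6, "speaker_a"),
]
# The first information was provided at t=1.0 s and the result arrived at t=2.0 s.
window = second_audio_window(captured, t_request=1.0, t_result=2.0)
print([c.source for c in window])
```

The chunks falling inside the window may come from several sources, which is why the later step has to decide which of them belong to the conversation being interpreted.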

[0026] According to one embodiment, in a non-transitory storage medium for storing one or more programs, the one or more programs include a command that, when executed by at least one processor of a head-wearable electronic device, causes the electronic device to execute an operation of acquiring images of the external environment of the electronic device captured in real time by at least one camera of the electronic device.

[0027] According to one embodiment, the one or more programs include a command that, when executed by at least one processor of a head-wearable electronic device, causes the electronic device to execute an operation of identifying at least one first object selected by the user as a first interpretation target on a screen displayed on the display of the electronic device based on the images.

[0028] According to one embodiment, the one or more programs include a command that, when executed by at least one processor of a head-wearable electronic device, causes the electronic device to perform an operation of acquiring first audio information in real time from the external environment through two or more microphones of the electronic device.

[0029] According to one embodiment, the one or more programs include a command that, when executed by at least one processor of a head-wearable electronic device, causes the electronic device to execute an operation of obtaining first interpretation information and context information related to the first interpretation information, using an artificial intelligence model based on first information for interpreting the first audio information, wherein the first audio of the first interpretation target included in the first audio information is interpreted.

[0030] According to one embodiment, the one or more programs include a command that, when executed by at least one processor of a head-wearable electronic device, causes the electronic device to perform an operation of outputting the audio of the first interpretation information through the speaker of the electronic device or displaying the text of the first interpretation information through the display.

[0031] According to one embodiment, the one or more programs include a command that, when executed by at least one processor of a head-wearable electronic device, causes the electronic device to execute an operation of acquiring second audio information including a plurality of audios detected in real time from the external environment through the two or more microphones from the time when the first information is provided to the artificial intelligence model until the time when the first interpretation information is acquired.

[0032] According to one embodiment, the one or more programs include a command that, when executed by at least one processor of a head-wearable electronic device, causes the electronic device to execute an operation of obtaining second interpretation information in which at least one of the plurality of audios included in the second audio information is interpreted using the artificial intelligence model based on second information for interpreting the second audio information, and obtaining information about an additional interpretation target identified by comparing the context of the plurality of audios with the context information using the artificial intelligence model.

[0033] According to one embodiment, the one or more programs include a command that, when executed by at least one processor of a head-wearable electronic device, causes the electronic device to perform an operation of outputting the audio of the second interpretation information through the speaker or displaying the text of the second interpretation information through the display.

[0034] FIG. 1 is a block diagram of an electronic device in a network environment according to various embodiments.

[0035] FIG. 2a is a perspective view showing the structure of an electronic device according to one embodiment.

[0036] FIG. 2b is a diagram showing the structure of a display and an eye-tracking camera of an electronic device according to one embodiment.

[0037] FIGS. 3a, 3b, and 3c are perspective views showing the structure of an electronic device according to one embodiment.

[0038] FIG. 4 is a block diagram showing an example of the configuration of a wearable electronic device according to one embodiment.

[0039] FIG. 5 is a block diagram illustrating an example of interpreting audio of a target to be interpreted using an artificial intelligence model in an electronic device according to one embodiment.

[0040] FIGS. 6a and 6b are drawings illustrating examples for designating an interpretation target in an electronic device according to one embodiment.

[0041] FIGS. 7a and 7b are drawings illustrating an example of interpreting audio of an interpretation target in an electronic device according to one embodiment.

[0042] FIGS. 8a, 8b, 8c, and 8d are drawings illustrating examples of interpreting audio of an interpretation target in an electronic device according to one embodiment.

[0043] FIGS. 9a and 9b are drawings illustrating examples of interpreting audio of an interpretation target in an electronic device according to one embodiment.

[0044] FIG. 10 is a diagram illustrating an example of interpreting audio of an interpretation target in an electronic device according to one embodiment.

[0045] FIG. 11 is a diagram illustrating an example of a method of operation in an electronic device according to one embodiment.

[0046] FIG. 12 is a diagram showing an example of a method of operation in an electronic device according to one embodiment.

[0047] FIGS. 13a and 13b are drawings illustrating an example of interpreting audio of an interpretation target using an artificial intelligence model in an electronic device according to one embodiment.

[0048] FIG. 14 is a diagram illustrating a generative artificial intelligence system according to one embodiment.

[0049] In relation to the description of the drawings, the same or similar reference numerals may be used for identical or similar components.

[0050] Hereinafter, embodiments of the present disclosure are described in detail with reference to the drawings so that those skilled in the art can easily implement them. However, the present disclosure may be embodied in various different forms and is not limited to the embodiments described herein. Furthermore, in the drawings and related descriptions, descriptions of well-known functions and configurations may be omitted for clarity and brevity. The term "user" as used in the embodiments of the present disclosure may refer to a person using an electronic device or a device using an electronic device (e.g., an artificial intelligence electronic device).

[0051] FIG. 1 is a block diagram of an electronic device (101) in a network environment (100) according to various embodiments. Referring to FIG. 1, in the network environment (100), the electronic device (101) may communicate with an electronic device (102) through a first network (198) (e.g., a short-range wireless communication network) or may communicate with at least one of an electronic device (104) or a server (108) through a second network (199) (e.g., a long-range wireless communication network). According to one embodiment, the electronic device (101) may communicate with the electronic device (104) through a server (108). According to one embodiment, the electronic device (101) may include a processor (120), memory (130), input module (150), sound output module (155), display module (160), audio module (170), sensor module (176), interface (177), connection terminal (178), haptic module (179), camera module (180), power management module (188), battery (189), communication module (190), subscriber identification module (196), or antenna module (197). In some embodiments, at least one of these components (e.g., connection terminal (178)) may be omitted from the electronic device (101), or one or more other components may be added. In some embodiments, some of these components (e.g., sensor module (176), camera module (180), or antenna module (197)) may be integrated into a single component (e.g., display module (160)).

[0052] The processor (120) can control at least one other component (e.g., hardware or software component) of the electronic device (101) connected to the processor (120) by executing software (e.g., program (140)), for example, and can perform various data processing or operations. According to one embodiment, as at least part of the data processing or operations, the processor (120) can store commands or data received from other components (e.g., sensor module (176) or communication module (190)) in volatile memory (132), process the commands or data stored in volatile memory (132), and store the resulting data in non-volatile memory (134). According to one embodiment, the processor (120) may include a main processor (121) (e.g., central processing unit or application processor) or an auxiliary processor (123) that can operate independently or together with it (e.g., graphics processing unit, neural processing unit (NPU), image signal processor, sensor hub processor, or communication processor). For example, if the electronic device (101) includes a main processor (121) and an auxiliary processor (123), the auxiliary processor (123) may be configured to use lower power than the main processor (121) or to be specialized for a designated function. The auxiliary processor (123) may be implemented separately from the main processor (121) or as part thereof.

[0053] The auxiliary processor (123) may control at least some of the functions or states associated with at least one component of the electronic device (101) (e.g., display module (160), sensor module (176), or communication module (190)) on behalf of the main processor (121) while the main processor (121) is in an inactive (e.g., sleep) state, or together with the main processor (121) while the main processor (121) is in an active (e.g., application execution) state. According to one embodiment, the auxiliary processor (123) (e.g., image signal processor or communication processor) may be implemented as part of another functionally related component (e.g., camera module (180) or communication module (190)). According to one embodiment, the auxiliary processor (123) (e.g., neural processing unit) may include a hardware structure specialized for processing an artificial intelligence model. The artificial intelligence model may be generated through machine learning. Such learning may be performed, for example, on the electronic device (101) itself where the artificial intelligence model is executed, or through a separate server (e.g., server (108)). The learning algorithm may include, for example, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning, but is not limited to the examples described above. The artificial intelligence model may include a plurality of artificial neural network layers. An artificial neural network may be a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), a deep Q-network, or a combination of two or more of the above, but is not limited to the examples described above. In addition to the hardware structure, the artificial intelligence model may additionally or alternatively include a software structure.
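The division of control described at the start of [0053] can be summarized in a few lines. This is a minimal sketch, assuming only the two main-processor states named in the paragraph; `MainState` and `controllers_for` are hypothetical names used for illustration, not part of the disclosure.

```python
from enum import Enum


class MainState(Enum):
    INACTIVE = "inactive"  # e.g., sleep
    ACTIVE = "active"      # e.g., application execution


def controllers_for(main_state: MainState) -> list[str]:
    """Which processors control a component (e.g., the display module)."""
    if main_state is MainState.INACTIVE:
        # The auxiliary processor acts on behalf of the main processor.
        return ["auxiliary"]
    # Both processors control the component together.
    return ["main", "auxiliary"]


asleep = controllers_for(MainState.INACTIVE)
awake = controllers_for(MainState.ACTIVE)
print(asleep, awake)
```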

[0054] The memory (130) can store various data used by at least one component of the electronic device (101) (e.g., processor (120) or sensor module (176)). The data may include, for example, input data or output data for software (e.g., program (140)) and related commands. The memory (130) may include volatile memory (132) or non-volatile memory (134).

[0055] The program (140) may be stored as software in memory (130) and may include, for example, an operating system (142), middleware (144), or an application (146).

[0056] The input module (150) can receive commands or data to be used for a component of the electronic device (101) (e.g., processor (120)) from outside the electronic device (101) (e.g., user). The input module (150) may include, for example, a microphone, a mouse, a keyboard, a key (e.g., a button), or a digital pen (e.g., a stylus pen).

[0057] The sound output module (155) can output a sound signal to the outside of the electronic device (101). The sound output module (155) may include, for example, a speaker or a receiver. The speaker may be used for general purposes, such as multimedia playback or recording playback. The receiver may be used to receive incoming calls. According to one embodiment, the receiver may be implemented separately from the speaker or as part thereof.

[0058] The display module (160) can visually provide information to the outside (e.g., a user) of the electronic device (101). The display module (160) may include, for example, a display, a holographic device, or a projector, and a control circuit for controlling the corresponding device. According to one embodiment, the display module (160) may include a touch sensor configured to detect a touch, or a pressure sensor configured to measure the intensity of the force generated by the touch.

[0059] The audio module (170) can convert sound into an electrical signal or, conversely, convert an electrical signal into sound. According to one embodiment, the audio module (170) can acquire sound through the input module (150) or output sound through the sound output module (155) or an external electronic device (e.g., electronic device (102)) (e.g., speaker or headphones) connected directly or wirelessly to the electronic device (101).

[0060] The sensor module (176) can detect the operating state of the electronic device (101) (e.g., power or temperature) or the external environmental state (e.g., user state) and generate an electrical signal or data value corresponding to the detected state. According to one embodiment, the sensor module (176) may include, for example, a gesture sensor, a gyroscope sensor, a barometric pressure sensor, a magnetic sensor, an accelerometer sensor, a grip sensor, a proximity sensor, a color sensor, an IR (infrared) sensor, a biosensor, a temperature sensor, a humidity sensor, or an illuminance sensor.

[0061] The interface (177) may support one or more specified protocols that can be used for the electronic device (101) to be connected directly or wirelessly to an external electronic device (e.g., electronic device (102)). According to one embodiment, the interface (177) may include, for example, a high definition multimedia interface (HDMI), a universal serial bus (USB) interface, an SD card interface, or an audio interface.

[0062] The connection terminal (178) may include a connector through which the electronic device (101) can be physically connected to an external electronic device (e.g., electronic device (102)). According to one embodiment, the connection terminal (178) may include, for example, an HDMI connector, a USB connector, an SD card connector, or an audio connector (e.g., a headphone connector).

[0063] The haptic module (179) can convert an electrical signal into a mechanical stimulus (e.g., vibration or movement) or an electrical stimulus that the user can perceive through tactile or kinesthetic senses. According to one embodiment, the haptic module (179) may include, for example, a motor, a piezoelectric element, or an electric stimulation device.

[0064] The camera module (180) can capture still images and video. According to one embodiment, the camera module (180) may include one or more lenses, image sensors, image signal processors, or flashes.

[0065] The power management module (188) can manage the power supplied to the electronic device (101). According to one embodiment, the power management module (188) can be implemented, for example, as at least part of a power management integrated circuit (PMIC).

[0066] The battery (189) can supply power to at least one component of the electronic device (101). According to one embodiment, the battery (189) may include, for example, a non-rechargeable primary battery, a rechargeable secondary battery, or a fuel cell.

[0067] The communication module (190) can support the establishment of a direct (e.g., wired) communication channel or a wireless communication channel between the electronic device (101) and an external electronic device (e.g., electronic device (102), electronic device (104), or server (108)), and the performance of communication through the established communication channel. The communication module (190) may include one or more communication processors that operate independently of the processor (120) (e.g., application processor) and support direct (e.g., wired) communication or wireless communication. According to one embodiment, the communication module (190) may include a wireless communication module (192) (e.g., cellular communication module, short-range wireless communication module, or GNSS (global navigation satellite system) communication module) or a wired communication module (194) (e.g., LAN (local area network) communication module, or power line communication module). The corresponding communication module among these communication modules can communicate with an external electronic device (104) through a first network (198) (e.g., a short-range communication network such as Bluetooth, WiFi (wireless fidelity) direct, or IrDA (infrared data association)) or a second network (199) (e.g., a legacy cellular network, a 5G network, a next-generation communication network, the Internet, or a computer network (e.g., a LAN or WAN)). These various types of communication modules may be integrated into a single component (e.g., a single chip) or implemented as multiple separate components (e.g., multiple chips). The wireless communication module (192) can identify or authenticate the electronic device (101) within a communication network such as the first network (198) or the second network (199) using subscriber information (e.g., International Mobile Subscriber Identity (IMSI)) stored in the subscriber identification module (196).

[0068] The wireless communication module (192) can support 5G networks, which follow 4G networks, and next-generation communication technologies, for example, new radio (NR) access technology. NR access technology can support high-speed transmission of high-capacity data (enhanced mobile broadband (eMBB)), minimization of terminal power and connection of multiple terminals (massive machine type communications (mMTC)), or high reliability and low latency (ultra-reliable and low-latency communications (URLLC)). The wireless communication module (192) can support a high-frequency band (e.g., mmWave band) to achieve a high data transmission rate, for example. The wireless communication module (192) can support various technologies for securing performance in the high-frequency band, such as beamforming, massive MIMO (multiple-input and multiple-output), full-dimensional MIMO (FD-MIMO), array antenna, analog beam-forming, or large-scale antenna. The wireless communication module (192) can support various requirements specified in the electronic device (101), external electronic device (e.g., electronic device (104)), or network system (e.g., second network (199)). According to one embodiment, the wireless communication module (192) can support a peak data rate (e.g., 20 Gbps or more) for realizing eMBB, loss coverage (e.g., 164 dB or less) for realizing mMTC, or U-plane latency (e.g., 0.5 ms or less for each of downlink (DL) and uplink (UL), or 1 ms or less round trip) for realizing URLLC.
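The example figures at the end of [0068] can be read as three independent checks. The sketch below encodes them as such, assuming a simple measurement record and approximating round-trip latency as the sum of the downlink and uplink latencies; `LinkMeasurement` and the `meets_*` helpers are hypothetical names, and the thresholds are the example values quoted in the paragraph.

```python
from dataclasses import dataclass


@dataclass
class LinkMeasurement:
    peak_rate_gbps: float    # peak data rate
    loss_coverage_db: float  # loss coverage
    dl_latency_ms: float     # downlink U-plane latency
    ul_latency_ms: float     # uplink U-plane latency


def meets_embb(m: LinkMeasurement) -> bool:
    return m.peak_rate_gbps >= 20.0


def meets_mmtc(m: LinkMeasurement) -> bool:
    return m.loss_coverage_db <= 164.0


def meets_urllc(m: LinkMeasurement) -> bool:
    each_way_ok = m.dl_latency_ms <= 0.5 and m.ul_latency_ms <= 0.5
    # Assumption: round trip is approximated as downlink plus uplink.
    round_trip_ok = m.dl_latency_ms + m.ul_latency_ms <= 1.0
    return each_way_ok or round_trip_ok


fast = LinkMeasurement(25.0, 160.0, 0.4, 0.5)
slow = LinkMeasurement(10.0, 170.0, 0.8, 0.9)
print(meets_embb(fast), meets_mmtc(fast), meets_urllc(fast))
print(meets_embb(slow), meets_mmtc(slow), meets_urllc(slow))
```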

[0069] An antenna module (197) can transmit a signal or power to, or receive a signal or power from, an external source (e.g., an external electronic device). According to one embodiment, the antenna module (197) may include an antenna comprising a radiator made of a conductor or a conductive pattern formed on a substrate (e.g., a PCB). According to one embodiment, the antenna module (197) may include a plurality of antennas (e.g., an array antenna). In this case, at least one antenna suitable for a communication method used in a communication network, such as a first network (198) or a second network (199), may be selected from the plurality of antennas, for example, by a communication module (190). A signal or power may be transmitted or received between the communication module (190) and an external electronic device through the selected at least one antenna. According to some embodiments, in addition to the radiator, other components (e.g., a radio frequency integrated circuit (RFIC)) may be additionally formed as part of the antenna module (197).

[0070] According to various embodiments, the antenna module (197) may form a mmWave antenna module. According to one embodiment, the mmWave antenna module may include a printed circuit board, an RFIC disposed on or adjacent to a first surface (e.g., bottom surface) of the printed circuit board and capable of supporting a specified high frequency band (e.g., mmWave band), and a plurality of antennas (e.g., array antennas) disposed on or adjacent to a second surface (e.g., top surface or side surface) of the printed circuit board and capable of transmitting or receiving a signal of the specified high frequency band.

[0071] At least some of the above components can be connected to each other via a communication method between peripheral devices (e.g., bus, GPIO (general purpose input and output), SPI (serial peripheral interface), or MIPI (mobile industry processor interface)) and exchange signals (e.g., commands or data) with each other.

[0072] According to one embodiment, commands or data may be transmitted or received between the electronic device (101) and an external electronic device (104) through a server (108) connected to a second network (199). Each of the external electronic devices (102, or 104) may be the same or different type of device as the electronic device (101). According to one embodiment, all or part of the operations performed on the electronic device (101) may be performed on one or more of the external electronic devices (102, 104, or 108). For example, if the electronic device (101) needs to perform a function or service automatically or in response to a request from a user or another device, the electronic device (101) may request one or more external electronic devices to perform at least part of the function or service instead of performing the function or service itself or additionally. One or more external electronic devices that receive the above request may execute at least part of the requested function or service, or additional function or service related to the request, and transmit the result of the execution to the electronic device (101). The electronic device (101) may provide the result as is or additionally processed as at least part of the response to the request. For this purpose, for example, cloud computing, distributed computing, mobile edge computing (MEC), or client-server computing technology may be used. The electronic device (101) may provide ultra-low latency services using, for example, distributed computing or mobile edge computing. In another embodiment, the external electronic device (104) may include an Internet of Things (IoT) device. The server (108) may be an intelligent server using machine learning and / or neural networks. 
According to one embodiment, the external electronic device (104) or the server (108) may be included within a second network (199). The electronic device (101) can be applied to intelligent services (e.g., smart home, smart city, smart car, or healthcare) based on 5G communication technology and IoT-related technology.
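The offloading flow of paragraph [0072] can be illustrated with a minimal sketch. It is not the disclosed implementation: the names `run_function`, `local_fn`, `remote_fn`, and `postprocess` are hypothetical, and falling back to local execution when the external device is unreachable is an added assumption that the paragraph does not state.

```python
from typing import Any, Callable, Optional


def run_function(
    task: Any,
    local_fn: Callable[[Any], Any],
    remote_fn: Optional[Callable[[Any], Any]] = None,
    postprocess: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Perform a function locally, or request an external device to perform it.

    All parameter names are hypothetical and chosen for illustration only.
    """
    if remote_fn is not None:
        try:
            # Request the external device (or server) to execute the task
            # and return the result of the execution.
            result = remote_fn(task)
        except ConnectionError:
            # Assumption: fall back to local execution if the request fails.
            result = local_fn(task)
    else:
        result = local_fn(task)
    # Provide the result as is, or additionally processed.
    return postprocess(result) if postprocess is not None else result


if __name__ == "__main__":
    # Local execution only.
    print(run_function("hello", str.upper))
    # Offloaded execution with additional processing of the returned result.
    print(run_function("hello", str.upper, remote_fn=lambda t: t[::-1], postprocess=str.title))
```

Passing the remote path as an optional callable keeps the sketch independent of any particular transport (cloud, MEC, or client-server).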

[0073] FIG. 2a is a perspective view showing the structure of an electronic device according to one embodiment. FIG. 2b is a drawing showing the structure of a display and an eye-tracking camera of an electronic device according to one embodiment.

[0074] Referring to FIGS. 2a and 2b, an electronic device (200) according to one embodiment may be an electronic device (101) of FIG. 1, an electronic device (102 or 104) communicating with the electronic device (101) of FIG. 1, or a device capable of providing services related to virtual reality technology that provides a virtual environment similar to the electronic device (101) of FIG. 1. Virtual reality (VR) technology, which is a technology that provides a virtual environment, may be developed into augmented reality (AR), mixed reality (MR), and / or extended reality (XR) that encompasses these.

[0075] The electronic device (200) may be a device configured to be wearable on a user's body, as shown in FIG. 2a (e.g., a head-mounted display (HMD) or a glasses-type AR glasses device). For example, the electronic device (200) may be configured to combine with an external electronic device, such as a mobile device, and may utilize components of the external electronic device (e.g., a display module, a camera module, an audio output module, or other components). Not limited thereto, the electronic device (200) may be implemented in various forms that can be worn on a user's body.

[0076] According to one embodiment, the electronic device (200) may configure a real space (e.g., virtual reality space, augmented reality space, or mixed reality space) that displays a real space corresponding to an actual external environment captured in the surrounding environment where the user is located (e.g., augmented reality image) or a virtual image provided (e.g., 2D or 3D image), and may control a display module (160) to display at least one virtual object corresponding to the user and / or at least one virtual object corresponding to an object for user interaction in the real space.

[0077] According to one embodiment, the electronic device (200) may include a processor (120), memory (130), display module (160), sensor module (176), camera module (180), charging module (e.g., battery (189) of FIG. 1) and communication module (190) as shown in FIG. 1. The electronic device (200) may further include an audio output device (155), an input module (150) as shown in FIG. 1, or other components as shown in FIG. 1. In addition, the electronic device (200) may be configured to include other components necessary to provide virtual reality functions, augmented reality functions, or mixed reality functions (e.g., services or methods).

[0078] According to one embodiment, the processor (120) is electrically connected to other components and can control other components. The processor (120) can perform various data processing or operations in accordance with the execution of various functions (e.g., operations, services, or programs) provided by the electronic device (200). The processor (120) can perform various data processing or operations to display at least one virtual object related to real objects included in an image captured in real space and / or a virtual object corresponding to a user (e.g., an avatar) in a virtual reality space. The processor (120) can perform various data processing or operations to express user interaction or movement of the virtual object displayed in the virtual reality space.

[0079] Again, with reference to FIG. 2a, an electronic device (200) according to one embodiment will be described. As described above, the electronic device (200) is not limited to a glasses-type (e.g., AR glasses) augmented reality device, and can be implemented as various devices capable of providing immersive content (e.g., content based on XR technology) to the user's eyes (e.g., AR head-mounted display type, 2D / 3D head-mounted display device or VR head-mounted display device).

[0080] According to one embodiment, a camera module of an electronic device (200) (e.g., camera module (180) of FIG. 1 or camera circuit) can capture still images and / or video. According to one embodiment, the camera module may be placed within a lens frame and around a first display (251) and a second display (252). According to one embodiment, the camera module may include one or more first cameras (211-1, 211-2), one or more second cameras (212-1, 212-2), and one or more third cameras (213). According to one embodiment, images acquired through one or more first cameras (211-1, 211-2) may be used for detecting hand gestures by a user, tracking the user's head, and / or spatial recognition. One or more first cameras (211-1, 211-2) may be a GS (global shutter) camera or an RS (rolling shutter) camera. One or more first cameras (211-1, 211-2) can perform simultaneous localization and mapping (SLAM) operations through depth imaging. One or more first cameras (211-1, 211-2) can perform spatial recognition and / or motion recognition for 3DoF (three degrees of freedom) and / or 6DoF (six degrees of freedom). According to one embodiment, the first cameras (211-1, 211-2) can periodically or non-periodically transmit information (e.g., trajectory information) related to the user's eyes (e.g., left eye and right eye) or the trajectory of the gaze (e.g., eye tracking) to a processor (e.g., processor (120) of FIG. 1). The first camera (211-1, 211-2) can be used to position the center of a virtual image projected onto an electronic device (200) (e.g., AR glasses) according to the direction in which the user's pupil gazes, and a GS camera can be primarily used to detect the pupil and track rapid pupil movements. The first camera (211-1, 211-2) can be configured for the left eye and the right eye, respectively, and the first camera (211-1, 211-2) configured for the left eye and the right eye, respectively, may have the same performance and specifications.

[0081] According to one embodiment, the electronic device (200) may use another camera (e.g., a third camera (213)) for hand detection and tracking and user gesture recognition. According to one embodiment, at least one of the first camera (211-1, 211-2) to the third camera (213) may be replaced with a sensor module (e.g., a LiDAR sensor). For example, the sensor module may include at least one of a vertical cavity surface emitting laser (VCSEL), an infrared sensor, and / or a photodiode.

[0082] According to one embodiment, images acquired through one or more second cameras (212-1, 212-2) may be used to detect and track the user's pupils. One or more second cameras (212-1, 212-2) may be eye tracking (ET) cameras, as shown in FIG. 2b. One or more second cameras (212-1, 212-2) may be GS cameras. One or more second cameras (212-1, 212-2) may correspond to the left eye and the right eye, respectively, and the performance of one or more second cameras (212-1, 212-2) may be substantially identical. One or more third cameras (213) may be relatively high-resolution cameras. One or more third cameras (213) may perform auto-focusing (AF) and optical image stabilization (OIS) functions. One or more third cameras (213) may be GS (global shutter) cameras or RS (rolling shutter) cameras. One or more third cameras (213) may be color cameras. One or more third cameras (213) may be high-resolution cameras, referred to as HR (high resolution) or PV (photo video). Color cameras equipped with AF functions and optical image stabilization (OIS) functions for obtaining high-quality images may be primarily used. The first camera (211-1, 211-2) or one or more fourth cameras (not shown) may be FT (face tracking) cameras and may be used to detect and track the user's facial expressions. A depth sensor may be used to determine the distance to an object, such as with TOF. TOF (time of flight) is a technology that measures the distance of an object using a signal (near-infrared, ultrasound, or laser). TOF technology involves a transmitter emitting a signal and a receiver measuring the signal, which can measure the flight time of the signal.
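The time-of-flight relation mentioned in paragraph [0082] can be illustrated as follows. This is a minimal sketch rather than the device's ranging method: it assumes the receiver reports a round-trip flight time, so the one-way distance is the propagation speed multiplied by half that time. The function name `tof_distance_m` and the speed of sound used for the ultrasound case are assumptions for illustration.

```python
# Round-trip time-of-flight ranging: distance = speed * time / 2.
SPEED_OF_LIGHT_M_S = 299_792_458.0  # near-infrared or laser signals (vacuum value)
SPEED_OF_SOUND_M_S = 343.0  # assumption: ultrasound in dry air at about 20 degrees C


def tof_distance_m(round_trip_s: float, speed_m_s: float = SPEED_OF_LIGHT_M_S) -> float:
    """Return the one-way distance for a measured round-trip flight time."""
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The signal travels to the object and back, so halve the total path.
    return speed_m_s * round_trip_s / 2.0


if __name__ == "__main__":
    # A 10 ns optical round trip corresponds to roughly 1.5 m.
    print(tof_distance_m(10e-9))
    # The same function covers an ultrasound signal with a different speed.
    print(tof_distance_m(0.01, SPEED_OF_SOUND_M_S))
```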

[0083] According to one embodiment, the electronic device (200) may include one or more light-emitting elements (214-1, 214-2) (illumination). The light-emitting elements (214-1, 214-2) are different from the light source described below, which irradiates light onto a screen output area of a display. According to one embodiment, the light-emitting elements (214-1, 214-2) may irradiate light to facilitate pupil detection in detecting and tracking a user's pupils through one or more second cameras (212-1, 212-2). According to one embodiment, the light-emitting elements (214-1, 214-2) may each include an LED. According to one embodiment, the light-emitting elements (214-1, 214-2) may irradiate light in the infrared region. According to various embodiments, the light-emitting elements (214-1, 214-2) may be attached around the frame of the augmented reality device (200). According to one embodiment, a light-emitting element (214-1, 214-2) is positioned around one or more first cameras (211-1, 211-2) and can assist gesture detection, head tracking, and spatial recognition by one or more first cameras (211-1, 211-2) when the augmented reality device (200) is used in a dark environment. According to one embodiment, a light-emitting element (214-1, 214-2) is positioned around one or more third cameras (213) and can assist image acquisition by one or more third cameras (213) when the augmented reality device (200) is used in a dark environment.

[0084] According to one embodiment, the electronic device (200) may include a battery (235-1, 235-2) (e.g., battery (189) of FIG. 1). The battery (235-1, 235-2) may store power to operate the remaining components of the augmented reality device (200).

[0085] According to one embodiment, a display module of an electronic device (200) (e.g., a display module (160) of FIG. 1) may include a first display (251), a second display (252), one or more input optical members (253-1, 253-2), one or more transparent members (290-1, 290-2), and one or more screen display portions (254-1, 254-2). According to one embodiment, the first display (251) and the second display (252) may be light output modules and may include, for example, a liquid crystal display (LCD), a digital mirror device (DMD), a liquid crystal on silicon (LCoS), an organic light emitting diode (OLED), or a micro light emitting diode (micro LED). According to one embodiment, if the first display (251) and the second display (252) are composed of a liquid crystal display, a digital mirror device, or a liquid crystal on silicon device, the augmented reality device (200) may include a light source that irradiates light onto the screen output area of the display. According to one embodiment, if the first display (251) and the second display (252) can generate light themselves, for example, if they are composed of an organic light-emitting diode or a micro LED, the augmented reality device (200) may provide a user with a good quality virtual image (e.g., an image of a virtual reality space) without including a separate light source. In one embodiment, if the display is implemented as an organic light-emitting diode or a micro LED, a light source is unnecessary, so the electronic device may be made lighter.

[0086] According to one embodiment, one or more transparent members (290-1, 290-2) included in the electronic device (200) may be positioned to face the user's eyes (e.g., left and right eyes) when the user wears the augmented reality device (200). The one or more transparent members (290-1, 290-2) may include at least one of a glass plate, a plastic plate, or a polymer. When the user wears the augmented reality device (e.g., the electronic device (200)), the user can view the external environment through the one or more transparent members (290-1, 290-2).

[0087] According to one embodiment, one or more input optical members (253-1, 253-2) included in the electronic device (200) can guide light generated from a first display (251) and a second display (252) to the user's eye. An image based on the light generated from the first display (251) and the second display (252) is formed on one or more screen display portions (254-1, 254-2) on one or more transparent members (290-1, 290-2), and the user can see the image formed on the one or more screen display portions (254-1, 254-2).

[0088] According to one embodiment, the electronic device (200) may include one or more optical waveguides (not shown). The optical waveguides may transmit light generated from the first display (251) and the second display (252) to the user's eye. The electronic device (200) may include one optical waveguide corresponding to the left eye and one optical waveguide corresponding to the right eye. According to one embodiment, the optical waveguides may include at least one of glass, plastic, or polymer. The optical waveguides may include a nano-pattern formed on an inner or outer surface, for example, a polygonal or curved grating structure. The optical waveguides may include a free-form prism, in which case the optical waveguides may provide incident light to the user through a reflective mirror. According to one embodiment, the optical waveguide includes at least one of a diffractive element (e.g., a diffractive optical element (DOE), a holographic optical element (HOE)) or a reflective element (e.g., a reflective mirror), and can guide display light emitted from a light source to the user's eye using at least one diffractive element or reflective element included in the optical waveguide. According to one embodiment, the diffractive element may include an input / output optical member. According to one embodiment, the reflective element may include a member that causes total internal reflection (TIR) (e.g., a total internal reflection optical element or a total internal reflection waveguide). For example, total internal reflection is a method of guiding light, which may mean creating an angle of incidence such that light (e.g., a virtual image) input through an input grating area is 100% reflected from one surface (e.g., a specific surface) of the waveguide and is transmitted 100% to an output grating area.
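The angle-of-incidence condition for total internal reflection described in paragraph [0088] follows from Snell's law: light inside a medium of refractive index n1 is totally reflected at the boundary with a medium of lower index n2 when the angle of incidence exceeds the critical angle arcsin(n2 / n1). The sketch below is illustrative only. The function names are hypothetical, and the refractive index of 1.5 (typical of glass) is an assumed value, not one taken from this disclosure.

```python
import math


def critical_angle_deg(n_waveguide: float, n_outside: float = 1.0) -> float:
    """Critical angle, in degrees from the surface normal, for total internal reflection."""
    if n_waveguide <= n_outside:
        raise ValueError("TIR requires the waveguide index to exceed the outside index")
    return math.degrees(math.asin(n_outside / n_waveguide))


def is_totally_reflected(incidence_deg: float, n_waveguide: float, n_outside: float = 1.0) -> bool:
    """Return True if light at this incidence angle stays inside the waveguide."""
    return incidence_deg > critical_angle_deg(n_waveguide, n_outside)


if __name__ == "__main__":
    n_glass = 1.5  # assumption: typical glass waveguide surrounded by air
    print(critical_angle_deg(n_glass))
    print(is_totally_reflected(45.0, n_glass))
    print(is_totally_reflected(30.0, n_glass))
```

An input grating that steers light beyond this angle keeps it confined until the output grating area releases it.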

[0089] In one embodiment, light emitted from a display (e.g., a first display (251) and a second display (252)) may be guided along a light path to a waveguide through an input optical member (e.g., an optical waveguide). Light traveling within the waveguide may be guided toward the user's eye through an output optical member. A screen display may be determined based on the light emitted toward the eye. The waveguide may include an input optical member, an output optical member, and / or an extension optical member (not shown).

[0090] According to one embodiment, the electronic device (200) may include one or more voice input devices (262-1, 262-2, 262-3) and one or more voice output devices (263-1, 263-2).

[0091] According to one embodiment, the electronic device (200) may include a first PCB (270-1) and a second PCB (270-2). The first PCB (270-1) and the second PCB (270-2) may transmit electrical signals to components included in the electronic device (200), such as a first camera (211-1, 211-2), a second camera (212-1, 212-2), a third camera (213), a display (251, 252), an audio module (e.g., the audio module (170) of FIG. 1), and a sensor module (e.g., the sensor module (176) of FIG. 1). According to one embodiment, the first PCB (270-1) and the second PCB (270-2) may be flexible printed circuit boards (FPCB). According to one embodiment, the first PCB (270-1) and the second PCB (270-2) may each include a first substrate, a second substrate, and an interposer disposed between the first substrate and the second substrate. The first PCB (270-1) and the second PCB (270-2) may be disposed in the temple portion of the glasses or in the center portion of the set. According to one embodiment, the electronic device (200) may further include a microphone, a speaker, an antenna, and a sensor (e.g., an accelerometer, a gyroscope, and / or a touch sensor).

[0092] With reference to FIG. 2b, the structure of a display and an eye-tracking camera according to one embodiment will be described. The electronic device (200) may include a display, a projection lens (225), an input optical member (253-1, 253-2), a display optical waveguide (256), an output optical member (257), an eye-tracking camera (212-1, 212-2), an eye-tracking optical waveguide (258), a first splitter (259-1), and / or a second splitter (259-2).

[0093] The display may be the first display (251) or the second display (252) shown in FIG. 2a. Light output from the display (251, 252) may be refracted by a projection lens (225) and converge into a smaller aperture area. Light refracted by the projection lens (225) may pass through an input optical member (253-1, 253-2) and be incident on a display optical waveguide (256), and may be output through an output optical member (257) after passing through the display optical waveguide (256). Light output from the output optical member (257) may be visible to the user's eye (201). In the following specification, the expression "displaying an object on the display" may mean that light output from the display (251, 252) is output through the output optical member (257), and the shape of the object is visible to the user's eye (201) by the light output through the output optical member (257). Additionally, the expression "control the display to display an object" may mean that the light output from the display (251, 252) is output through the output optical member (257), and the display (251, 252) is controlled so that the shape of the object is visible to the user's eye (201) by the light output through the output optical member (257).

[0094] Light (203) reflected from the user's eye (201) passes through the first splitter (259-1) and is incident on the eye-tracking optical waveguide (258), and can be output to the eye-tracking camera (212-1, 212-2) through the second splitter (259-2) after passing through the eye-tracking optical waveguide (258). According to one embodiment, the light (203) reflected from the user's eye (201) may be light output from the light-emitting elements (214-1, 214-2) of FIG. 2a and reflected from the user's eye (201).

[0095] If the electronic device (200) of FIG. 2a described above is in the form of AR glasses that do not display virtual objects, it may be smart glasses (e.g., Meta Ray-Ban smart glasses), and the smart glasses may include a camera for identifying external objects (e.g., including at least one of an RGB or IR camera), a camera (IR) for recognizing the wearer's gaze, a microphone, and a speaker.

[0096] If the electronic device (200) of FIG. 2a described above is in the form of AR glasses capable of displaying virtual objects, the smart glasses may include a camera for identifying external objects (e.g., including at least one of RGB and IR cameras), a camera (IR) for recognizing the wearer's gaze, a microphone, a speaker, and a display for displaying virtual objects (e.g., a display placed in front of both eyes or one eye).

[0097] FIG. 3a is a perspective view showing the structure of an electronic device according to one embodiment.

[0098] Referring to FIG. 3a, an electronic device (300) according to one embodiment (e.g., the electronic device (101) of FIG. 1 or the electronic device (200) of FIG. 2a and FIG. 2b) may be a wearable device such as a head-mounted device (HMD) that can be worn on a user's head to provide an image (e.g., a virtual reality space image) in front of the eyes. The configuration of the electronic device (300) of FIG. 3a, FIG. 3b and FIG. 3c may be all or partly the same as the configuration of the electronic device (200) of FIG. 2a and FIG. 2b.

[0099] According to one embodiment, the electronic device (300) may include a housing (310, 320, 330) that can form an exterior and provide a space in which components of the electronic device (300) can be placed.

[0100] According to one embodiment, the electronic device (300) may include a first housing (310) that can surround at least a portion of the user's head. According to one embodiment, the first housing (310) may include a first surface (300a) facing the outside of the electronic device (300) (e.g., in the +X direction).

[0101] According to one embodiment, the first housing (310) may surround at least a portion of the internal space (I). For example, the first housing (310) may include a second surface (300b) facing the internal space (I) of the electronic device (300) and a third surface (300c) opposite to the second surface (300b). According to one embodiment, the first housing (310) may be combined with a third housing (330) to form a closed curve shape surrounding the internal space (I).

[0102] According to one embodiment, the first housing (310) may accommodate at least some of the components of the electronic device (300). For example, a light output module, a circuit board, and a speaker module may be placed within the first housing (310).

[0103] According to one embodiment, the electronic device (300) may include a display member (340) corresponding to the user's left and right eyes. The display member (340) may be placed in the first housing (310). The configuration of the display member (340) of FIG. 3a may be all or partly the same as the configuration of the screen display portion (254-1, 254-2) of FIG. 2a.

[0104] According to one embodiment, the electronic device (300) may include a second housing (320) that can be placed on the user's face. According to one embodiment, the second housing (320) may include a fourth surface (300d) that can face at least partially the user's face. According to one embodiment, the fourth surface (300d) may be a surface facing the internal space (I) of the electronic device (300) (e.g., -X direction). According to one embodiment, the second housing (320) may be combined with the first housing (310).

[0105] According to one embodiment, the electronic device (300) may include a third housing (330) that can be seated on the back of the user's head. According to one embodiment, the third housing (330) may be combined with the first housing (310). According to one embodiment, the third housing (330) may accommodate at least some of the components of the electronic device (300). For example, a battery (e.g., the battery (235-1, 235-2) of FIG. 2a) may be placed within the third housing (330).

[0106] In order to enhance the user's overall user experience, usage environment, and usability of the head-mounted wearable electronic device (300), it may be necessary for the sensations felt and experienced by the user in VR (virtual reality), AR (augmented reality), and MR (mixed reality) spaces to be as similar as possible to the sensations of the real world.

[0107] FIGS. 3b and 3c are perspective views showing the structure of an electronic device according to one embodiment.

[0108] Referring to FIG. 3b and FIG. 3c, in one embodiment, camera modules (311, 312, 313, 314, 315, 316) and / or a depth sensor (317) for acquiring information related to the surrounding environment of an electronic device (300) (e.g., a wearable device) may be disposed on a first surface (310) of the housing.

[0109] In one embodiment, camera modules (311, 312) can acquire images related to the surrounding environment of a wearable electronic device.

[0110] In one embodiment, camera modules (313, 314, 315, 316) can acquire images while the electronic device (300) is worn by a user. Camera modules (313, 314, 315, 316) can be used for hand detection, tracking, and user gesture (e.g., hand movements) recognition. Camera modules (313, 314, 315, 316) can be used for 3DoF, 6DoF head tracking, position (space, environment) recognition, and / or movement recognition. In one embodiment, camera modules (311, 312) may be used for hand detection and tracking and user gestures.

[0111] In one embodiment, the depth sensor (317) may be configured to transmit a signal and receive a signal reflected from the subject, and may be used for determining the distance to the object, for example, by time of flight (TOF). Instead of, or in addition to, the depth sensor (317), camera modules (313, 314, 315, 316) may determine the distance to the object.

[0112] According to one embodiment, a face recognition camera module (325, 326) (e.g., FT (Face Tracking) camera) and / or a display (321) (and / or a lens) may be disposed on the second surface (320) of the housing.

[0113] In one embodiment, a face recognition camera module (325, 326) adjacent to the display may be used to recognize the user's face or to recognize and / or track both of the user's eyes. In one embodiment, the lens may serve to adjust the focus so that the screen output to the display (321) can be seen by the user's eyes, and may be composed of, for example, a Fresnel lens, a Pancake lens, or a multi-channel lens.

[0114] In one embodiment, the display (321) (and / or lens) may be disposed on a second surface (320) of the wearable electronic device (300). In one embodiment, the wearable electronic device (300) may not include camera modules (315, 316) among a plurality of camera modules (313, 314, 315, 316). Although not illustrated in FIGS. 3b and 3c, the electronic device (300) may further include at least one of the configurations illustrated in FIGS. 2a and 2b.

[0115] As described above, according to one embodiment, the electronic device (300) may have a form factor for being worn on a user's head. The electronic device (300) may further include a strap and / or a wearing member for being secured on a part of the user's body. The electronic device (300) may provide a user experience based on augmented reality, virtual reality, and / or mixed reality while being worn on the user's head.

[0116] When the electronic device (300) of FIGS. 3a, 3b, and 3c described above is in the form of a video see-through (VST) device, it may include a camera for external object identification (e.g., including at least one of RGB and IR), a camera for recognizing the wearer's gaze (IR), a microphone, a speaker, a display for displaying virtual objects (e.g., a display placed in front of both eyes), and an external display (e.g., an external display of VisionPro).

[0117] Hereinafter, the electronic device described in this disclosure may be an electronic device that a user can wear on their body (e.g., head), such as a head-mounted display (HMD) device, augmented reality (AR) glasses, and / or a VST device, as described with reference to FIGS. 2a, 2b, 3a, 3b, and 3c. Herein, the electronic device may be referred to as a wearable electronic device. The interpretation described in this disclosure may be referred to as “translation” and may be replaced by other terms having the meaning of transferring a language other than the user’s language into the user’s language. The interpretation target described in this disclosure may be a person who speaks a language different from the user or a device that outputs a different language. The interpretation described in this disclosure may mean transferring audio information, including the voice of a person who speaks a different language or sound output from a device that outputs a different language, into the form of audio or text in the language used by the user. In this disclosure, the audio information may include sign language as the other language of the interpretation target.

[0118] FIG. 4 is a block diagram showing an example of the configuration of a wearable electronic device according to one embodiment.

[0119] Referring to FIG. 4, a wearable device (401) according to one embodiment may include at least one of a processor (410), memory (415), display (420), camera (425), sensor (430), or communication circuit (435). The processor (410), memory (415), display (420), camera (425), sensor (430), and communication circuit (435) may be electrically and / or operably coupled with each other by an electronic component such as a communication bus (402). The type and / or number of hardware components included in the wearable device (401) are not limited to those shown in FIG. 4. For example, the wearable device (401) may include only some of the hardware components shown in FIG. 4. The elements within the memory (415) described below (e.g., layers and / or modules) may be in a logically separated state. The elements within the memory (415) may be included in a hardware component distinct from the memory (415). An operation performed by the processor (410) using each of the elements within the memory (415) is one embodiment, and the processor (410) may perform an operation different from the above operation through at least one of the elements within the memory (415).

[0120] A processor (410) of a wearable device (401) according to one embodiment may include a hardware component for processing data based on one or more instructions. The hardware component for processing data may include, for example, an arithmetic and logic unit (ALU), a field programmable gate array (FPGA), and / or a central processing unit (CPU). The number of processors (410) may be one or more. For example, the processor (410) may have the structure of a multi-core processor such as a dual core, a quad core, or a hexa core.

[0121] A memory (415) of a wearable device (401) according to one embodiment may include a hardware component for storing data and / or instructions that are input and / or output to a processor (410). The memory (415) may include, for example, volatile memory such as random-access memory (RAM) and / or non-volatile memory such as read-only memory (ROM). Volatile memory may include, for example, at least one of dynamic RAM (DRAM), static RAM (SRAM), cache RAM, and pseudo SRAM (PSRAM). Non-volatile memory may include, for example, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), flash memory, hard disk, compact disk, and embedded multimedia card (eMMC).

[0122] In one embodiment, a display (420) of a wearable device (401) can output visualized information to a user of the wearable device (401). For example, the display (420) can be controlled by a processor (410) including a circuit such as a GPU (graphic processing unit) to output visualized information to a user. The display (420) may include a flat panel display (FPD) and / or electronic paper. The FPD may include a liquid crystal display (LCD), a plasma display panel (PDP), and / or one or more light emitting diodes (LEDs). The LED may include an organic LED (OLED).

[0123] In one embodiment, the camera (425) of the wearable device (401) may include one or more light sensors (e.g., a CCD (charge-coupled device) sensor, a CMOS (complementary metal oxide semiconductor) sensor) that generate an electrical signal representing the color and / or brightness of light. The plurality of light sensors included in the camera (425) may be arranged in the form of a two-dimensional grid (i.e., a two-dimensional array). The camera (425) may acquire the electrical signals of each of the plurality of light sensors substantially simultaneously to generate two-dimensional frame data corresponding to the light reaching the light sensors of the two-dimensional grid. For example, photo data captured using the camera (425) may refer to a single item of two-dimensional frame data acquired from the camera (425). For example, video data captured using the camera (425) may refer to a sequence of multiple items of two-dimensional frame data acquired from the camera (425) according to a frame rate. The camera (425) may further include a flash light for outputting light in the direction in which the camera (425) receives light.

[0124] According to one embodiment, the wearable device (401) may include a plurality of cameras positioned toward different directions as an example of a camera (425). Among the plurality of cameras, the first camera may be referred to as a motion recognition camera (e.g., motion recognition camera (260-2, 260-3) of FIG. 2b), and the second camera may be referred to as an eye tracking camera (e.g., eye tracking camera (260-1) of FIG. 2b). The wearable device (401) may identify the position, shape, and / or gesture of a hand using an image acquired using the first camera. The wearable device (401) may identify the direction of gaze of a user wearing the wearable device (401) using an image acquired using the second camera. For example, the direction in which the first camera faces and the direction in which the second camera faces may be opposite.

[0125] According to one embodiment, a sensor (430) of a wearable device (401) can generate electrical information that can be processed by a processor (410) and / or memory (415) of the wearable device (401) from non-electronic information associated with the wearable device (401). The information may be referred to as sensor data. The sensor (430) may include a global positioning system (GPS) sensor for detecting the geographic location of the wearable device (401), an image sensor, an illuminance sensor and / or a time-of-flight (ToF) sensor, and an inertial measurement unit (IMU) for detecting physical motion of the wearable device (401).

[0126] In one embodiment, the communication circuit (435) of the wearable device (401) may include hardware components to support the transmission and / or reception of electrical signals between the wearable device (401) and an external electronic device. The communication circuit (435) may include, for example, at least one of a modem, an antenna, and an optic / electronic converter. The communication circuit (435) may support the transmission and / or reception of electrical signals based on various types of protocols such as Ethernet, LAN (local area network), WAN (wide area network), WiFi (wireless fidelity), Bluetooth, BLE (Bluetooth low energy), ZigBee, LTE (long term evolution), 5G NR (new radio) and / or 6G.

[0127] According to one embodiment, within the memory (415) of a wearable device (401), one or more instructions (or commands) representing computations and / or operations to be performed on data by the processor (410) of the wearable device (401) may be stored. A set of one or more instructions may be referred to as firmware, an operating system, a process, a routine, a sub-routine, and / or an application. For example, the wearable device (401) and / or the processor (410) may perform at least one of the operations of FIGS. 11 and 12 of the present disclosure when a set of a plurality of instructions distributed in the form of an operating system, firmware, a driver, and / or an application is executed. In the following, the statement that an application is installed in the wearable device (401) may mean that one or more instructions provided in the form of an application are stored in memory (415), and that the one or more instructions are stored in a format executable by the processor (410) (e.g., a file having an extension specified by the operating system of the wearable device (401)). For example, the application may include a program and / or library related to a service provided to the user.

[0128] Referring to FIG. 4, programs installed on a wearable device (401) may be classified into any one of different layers, including an application layer (440), a framework layer (450), and / or a hardware abstraction layer (HAL) (480), based on the target. For example, within the hardware abstraction layer (480), programs (e.g., modules, or drivers) designed to target the hardware of the wearable device (401) (e.g., a display (420), a camera (425), and / or a sensor (430)) may be classified. The framework layer (450) may be referred to as an XR framework layer in that it contains one or more programs for providing XR (extended reality) services. For example, FIG. 4 illustrates the layers as separated within memory (415), and the layers may be logically separated. However, embodiments are not limited thereto. According to an embodiment, the layers may be stored in a designated area within memory (415).

[0129] For example, within the framework layer (450), programs designed to target at least one of the hardware abstraction layer (480) and / or the application layer (440) (e.g., a location tracker (471), a spatial recognizer (472), a gesture tracker (473), an eye tracker (474), and / or a face tracker (475)) may be classified. Programs classified into the framework layer (450) may provide an application programming interface (API) that can be executed by other programs.

[0130] For example, within the application layer (440), programs designed to target a user controlling a wearable device (401) may be classified. Examples of programs classified into the application layer (440) include an XR (extended reality) system UI (user interface) (441) and / or an XR application (442), but embodiments are not limited thereto. For example, programs classified into the application layer (440) (e.g., software applications) may call an API (application programming interface) to cause the execution of functions supported by programs classified into the framework layer (450).

[0131] For example, the wearable device (401) may display one or more visual objects on the display (420) to perform interaction with a user for using a virtual space based on the execution of the XR system UI (441). A visual object may mean an object that can be deployed within the screen for the transmission of information and / or interaction, such as text, images, icons, videos, buttons, checkboxes, radio buttons, text boxes, sliders, and / or tables. A visual object may be referred to as a visual guide, a virtual object, a visual element, a UI element, a view object, and / or a view element. The wearable device (401) may provide the user with a service to control functions available within the virtual space based on the execution of the XR system UI (441).

[0132] Referring to FIG. 4, a lightweight renderer (443) and / or an XR plugin (444) are depicted within the XR system UI (441), but embodiments are not limited thereto. For example, the XR system UI (441) may cause the execution of functions supported by the lightweight renderer (443) and / or the XR plugin (444) included within the application layer (440).

[0133] For example, a wearable device (401) may acquire resources (e.g., APIs, system processes and / or libraries) used to define, create, and / or execute a rendering pipeline, which is permitted to be partially modified, based on the execution of a lightweight renderer (443). The lightweight renderer (443) may be referred to as a lightweight render pipeline in terms of defining a rendering pipeline, which is permitted to be partially modified. The lightweight renderer (443) may include a renderer built prior to the execution of a software application (e.g., a pre-built renderer). For example, the wearable device (401) may acquire resources (e.g., APIs, system processes and / or libraries) used to define, create, and / or execute the entire rendering pipeline based on the execution of an XR plugin (444). The XR plugin (444) may be referred to as an open XR native client in terms of defining (or setting) the entire rendering pipeline.

[0134] For example, the wearable device (401) may display a screen representing at least a portion of a virtual space on the display (420) based on the execution of the XR application (442). The XR plugin (444-1) included in the XR application (442) may be referenced by the XR plugin (444) of the XR system UI (441). Descriptions of the XR plugin (444-1) that overlap with descriptions of the XR plugin (444) may be omitted. The wearable device (401) may trigger the execution of a screen composition manager (451) based on the execution of the XR application (442).

[0135] According to one embodiment, a wearable device (401) may provide a virtual space service based on the execution of a screen composition manager (451). For example, the screen composition manager (451) may include a platform (e.g., an Android platform) for supporting the virtual space service. Based on the execution of the screen composition manager (451), the wearable device (401) may render, using data acquired through a sensor (430), a virtual object representing the user's posture and display the virtual object on a display. The screen composition manager (451) may be referred to as a composition presentation manager (CPM).

[0136] For example, the screen composition manager (451) may include a runtime service (452). In one example, the runtime service (452) may be referred to as an OpenXR runtime module. The wearable device (401) may provide at least one of a pose prediction function for the user, a frame timing function, and / or a spatial input function based on the execution of the runtime service (452). In one example, the wearable device (401) may perform rendering for a virtual space service for the user based on the execution of the runtime service (452). For example, an application (e.g., Unity or OpenXR native application) may be implemented based on the execution of the runtime service (452).

[0137] For example, the screen composition manager (451) may include a pass-through library (453). Based on the execution of the pass-through library (453), the wearable device (401) may, while displaying a screen representing virtual space on the display (420), display another screen representing real space acquired through a camera (425), overlaid on at least a portion of the screen representing virtual space.

[0138] For example, the screen composition manager (451) may include a renderer (e.g., the renderer (540-1) of FIG. 5). Through the screen composition manager (451), the wearable device (401) can use the renderer to render a screen to be displayed on a display by compositing virtual layers (or virtual nodes) rendered based on sensor data (e.g., sensing data obtained through a camera (425) or sensor (430)) with pass-through layers (or pass-through nodes) obtained through a pass-through library (453). The virtual layers may be referred to as virtual nodes and / or virtual surfaces. The wearable device (401) can render each of the virtual layers or render all of the virtual layers through the screen composition manager (451).

[0139] For example, the screen composition manager (451) may include an input manager (454). The wearable device (401) may identify acquired data (e.g., sensor data) by executing one or more programs included in the recognition service layer (470) based on the execution of the input manager (454). The wearable device (401) may initiate the execution of at least one of the functions of the wearable device (401) using the acquired data.

[0140] For example, the perception abstract layer (460) may be used for data exchange between the screen composition manager (451) and the recognition service layer (470). In terms of being used for data exchange between the screen composition manager (451) and the recognition service layer (470), the perception abstract layer (460) may be referred to as an interface. As an example, the perception abstract layer (460) may be referred to as OpenPX and / or PPAL (perception platform abstract layer). The perception abstract layer (460) may be used for a perception client and a perception service.

[0141] According to one embodiment, the recognition service layer (470) may include one or more programs for processing data obtained from a sensor (430) (or a camera (425)). The one or more programs may include at least one of a location tracker (471), a spatial recognizer (472), a gesture tracker (473), an eye tracker (474), and / or a face tracker (475). The type and / or number of the one or more programs included in the recognition service layer (470) are not limited to those shown in FIG. 4.

[0142] For example, the wearable device (401) can identify the posture of the wearable device (401) using the sensor (430) based on the execution of the location tracker (471). The wearable device (401) can identify the 6 degrees of freedom pose (6 DOF pose) of the wearable device (401) using data acquired using the camera (425) and the IMU based on the execution of the location tracker (471). The location tracker (471) may be referred to as a head tracking (HeT) module.

[0143] For example, the wearable device (401) may construct the surrounding environment of the wearable device (401) (or the user of the wearable device (401)) into a three-dimensional virtual space based on the execution of the spatial recognizer (472). The wearable device (401) may reconstruct the surrounding environment of the wearable device (401) in three dimensions using data acquired through the camera (425) based on the execution of the spatial recognizer (472). The wearable device (401) may identify at least one of a plane, an incline, or a staircase in the surrounding environment of the wearable device (401) reconstructed in three dimensions based on the execution of the spatial recognizer (472). The spatial recognizer (472) may be referred to as a scene understanding (SU) module.

[0144] For example, the wearable device (401) may be used to identify (or recognize) the pose and / or gesture of the user's hand of the wearable device (401) based on the execution of the gesture tracker (473). For example, the wearable device (401) may identify the pose and / or gesture of the user's hand using data acquired from the sensor (430) based on the execution of the gesture tracker (473). For example, the wearable device (401) may identify the pose and / or gesture of the user's hand based on data (or images) acquired using the camera (425) based on the execution of the gesture tracker (473). The gesture tracker (473) may be referred to as a hand tracking (HaT) module and / or a gesture tracking module.

[0145] For example, the wearable device (401) can identify (or track) the movement of the user's eyes of the wearable device (401) based on the execution of the eye tracker (474). For example, the wearable device (401) can identify the movement of the user's eyes using data obtained from at least one sensor based on the execution of the eye tracker (474). For example, the wearable device (401) can identify the movement of the user's eyes based on data obtained using a camera (425) (e.g., the eye tracking camera (260-1) of FIG. 2a and FIG. 2b) and / or an IR LED (infrared light emitting diode) based on the execution of the eye tracker (474). The eye tracker (474) may be referred to as an eye tracking (ET) module and / or a gaze tracking module.

[0146] For example, the recognition service layer (470) of the wearable device (401) may further include a face tracker (475) for tracking the user's face. For example, the wearable device (401) may identify (or track) the movement of the user's face and / or the user's facial expression based on the execution of the face tracker (475). The wearable device (401) may estimate the user's facial expression based on the movement of the user's face based on the execution of the face tracker (475). For example, the wearable device (401) may identify the movement of the user's face and / or the user's facial expression based on data (e.g., an image) acquired using a camera based on the execution of the face tracker (475).

[0147] FIG. 5 is a block diagram illustrating an example of interpreting audio of an interpretation target using an artificial intelligence model in an electronic device according to one embodiment. FIG. 6a and FIG. 6b are drawings illustrating an example of designating an interpretation target in an electronic device according to one embodiment.

[0148] Referring to FIGS. 4, 5, 6a, and 6b, an electronic device (401) according to one embodiment may be a wearable device that can be worn on a user's body (e.g., head), similar to the electronic device (101) of FIG. 1, the electronic device (200) of FIGS. 2a and 2b, or the electronic device (300) of FIGS. 3a, 3b, and 3c. The electronic device (401) may be a device that provides a screen based on images of the external environment captured in real time, or a device that uses a technology for presenting the external environment as is. Such technology may include, for example, virtual reality (VR), augmented reality (AR), mixed reality (MR), and / or extended reality (XR) that encompasses these. The electronic device (401) may be a device configured to be wearable on a user's body (e.g., a head-mounted display (HMD) or an AR glasses device) as illustrated in FIGS. 2a, 2b, 3a, 3b, and 3c. For example, the electronic device (401) may be configured to combine with an external electronic device, such as a mobile device (e.g., the electronic device (102 or 104) of FIG. 1), and may utilize components of the external electronic device (e.g., a display module, a camera module, an audio output module, or other components). Not limited thereto, the electronic device (401) may be implemented in various forms that can be worn on a user's body (e.g., the head).

[0149] According to one embodiment, the electronic device (401) may display, on a display (e.g., a display module (160) of FIG. 1 or a display (321) of FIG. 3), a screen based on images of the external environment surrounding the electronic device (401) captured using at least one camera included in a camera (425) (e.g., a camera module (180) of FIG. 1, a first camera (211-1, 211-2) of FIG. 2a, a third camera (213) of FIG. 3b, a camera (313, 314, 315, 316)). The electronic device (401) may display at least one virtual object on the screen based on information (e.g., content) related to applications currently running. According to one embodiment, when the electronic device (401) makes a real space of a real environment (e.g., an external environment of the electronic device (401)) visible to a user through a transparent member, at least one virtual object may be anchored and displayed on a screen corresponding to the real space.

[0150] An electronic device (401) according to one embodiment may include at least one processor (410), memory (415), display (420), camera (425), communication circuit (435), microphone circuit (520) including two or more microphones, and speaker (530). The display (420) may include a display disposed on an inner surface facing the wearer's eyes and a display disposed on an outer surface facing away from the wearer's eyes. Without being limited thereto, the electronic device (401) may be implemented identically or similarly to the electronic device (101) of FIG. 1, the electronic device (200) of FIG. 2a and FIG. 2b, or the electronic device (300) of FIG. 3a, FIG. 3b and FIG. 3c, and may further include other components of the electronic device (101) of FIG. 1, the electronic device (200) of FIG. 2a and FIG. 2b, or the electronic device (300) of FIG. 3. In addition, the electronic device (401) may be configured to include other components necessary for the method of operation of the present disclosure.

[0151] According to one embodiment, a processor (410) (e.g., processor (120) of FIG. 1) may acquire images of the external environment of an electronic device (401) in real time through at least one camera included in a camera (425), and may display a screen (610) of the external environment on a display (420) based on the acquired images. The screen (610) may be shown to the user through the display (420) using the camera (425) of the electronic device (401) worn by the user, or may be shown to the user's eyes through a transparent member (e.g., one or more transparent members (290-1, 290-2)).

[0152] According to one embodiment, when the electronic device (401) is, for example, a VR device, the processor (410) can control the display (420) using a camera (425) so that a real space (e.g., external environment) including a screen is visible through the display (420). According to one embodiment, when the electronic device (401) is, for example, an AR device, the processor (410) can control the display (420) (e.g., a transparent member, such as one or more transparent members (290-1, 290-2)) so that a real space including a screen is visible to the user's eyes.

[0153] According to one embodiment, the processor (410) may use an artificial intelligence model (510) to interpret (e.g., translate) audio generated in an external environment, such as conversations between people in an external environment (e.g., conversations in a language different from the user's language) or audio output from a device, and provide the interpreted result (e.g., voice information or text information) to the user. For example, voice information may be transmitted to an external electronic device and provided to other external users. According to one embodiment, the artificial intelligence model (510) may be included in an electronic device (401), and at least some of the components included in the artificial intelligence model may be included in an external server. For example, a module for separation and screening functions may be included within the electronic device, and the interpretation function may use an interpretation module included in the server.

[0154] According to one embodiment, the processor (410) can display the result of interpreting the wearer's voice (e.g., text information) on a display placed on an outer surface.

[0155] According to one embodiment, when the processor (410) receives a request (e.g., event or input) to interpret audio of people's conversation or audio of content output from a device, as an initial operation, it may designate at least one object (e.g., target object (620)) selected by user's gesture input, voice input, or gaze on a screen (e.g., current scene (610)) displayed on the display (420) as the primary interpretation target (hereinafter referred to as the first interpretation target).
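
As an illustration only, gaze-based designation of the first interpretation target can be sketched as a hit test of the gaze point against the bounding boxes of objects detected on the current screen (610). The disclosure does not specify how the selection is computed; the `DetectedObject` record, the pixel-coordinate boxes, and the smallest-box tie-break below are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DetectedObject:
    """Hypothetical record for an object detected on the current screen."""
    object_id: str
    # Bounding box in screen pixels: (left, top, right, bottom).
    box: tuple[float, float, float, float]


def box_area(obj: DetectedObject) -> float:
    """Area of the object's bounding box in square pixels."""
    left, top, right, bottom = obj.box
    return (right - left) * (bottom - top)


def select_first_target(
    objects: Sequence[DetectedObject],
    gaze_xy: tuple[float, float],
) -> Optional[DetectedObject]:
    """Return the object whose bounding box contains the gaze point.

    Assumption: if several boxes contain the point, the smallest box wins,
    since a tighter box is taken to be the more specific selection.
    """
    gx, gy = gaze_xy
    hits = [
        obj for obj in objects
        if obj.box[0] <= gx <= obj.box[2] and obj.box[1] <= gy <= obj.box[3]
    ]
    if not hits:
        return None  # no object under the gaze point on this screen
    return min(hits, key=box_area)
```

A `None` result corresponds to the case in which no target object is present on the current screen, for example before the user turns toward the speaker.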

[0156] According to one embodiment, if a target object (620) (e.g., a person or device) that speaks or outputs a language different from the user exists in the space of the external environment within the field of view of the direction the user is currently looking, the processor (410) may designate the target object (620) selected from the current screen (610) provided based on images captured in real time as the first interpretation target.

[0157] According to one embodiment, if there is no target object (620) (e.g., person or device) that speaks or outputs a language different from the user's in the space of the external environment within the field of view in the direction the user is currently looking, then, when the direction of the electronic device (401) is changed to the direction where the target object (620) is located (e.g., by the user turning their head), the processor (410) may designate the target object (620) as the first interpretation target in the changed screen (630) provided based on images captured in real time in the changed direction.

[0158] According to one embodiment, the processor (410) can change a designated first interpretation target (e.g., primary interpretation target) to another object selected through the user's voice input or gaze while performing an interpretation operation.

[0159] According to one embodiment, the processor (410) may continuously track the first interpretation target and collect data for interpretation (information about the screen, audio information and / or gaze information) at intervals. Here, the interval (e.g., the interval for interpreting audio information) may be the interval from the time when information for interpreting audio information (e.g., input information) is provided to the artificial intelligence model (510) to the time when the interpretation result (e.g., output information) is obtained.
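
The interval described above can be illustrated with a short sketch that keeps only the microphone data captured between the time the input information is provided to the artificial intelligence model (510) and the time the interpretation result is obtained. The `AudioChunk` record, the use of timestamps in seconds, and the half-open interval are assumptions for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AudioChunk:
    """Hypothetical timestamped chunk of microphone data."""
    timestamp: float  # capture time in seconds
    samples: bytes


def chunks_in_interval(
    chunks: Iterable[AudioChunk],
    submitted_at: float,
    result_at: float,
) -> list[AudioChunk]:
    """Keep chunks captured in the half-open interval [submitted_at, result_at).

    submitted_at: time the input information was provided to the model.
    result_at: time the interpretation result was obtained.
    """
    if result_at < submitted_at:
        raise ValueError("result time precedes submission time")
    return [c for c in chunks if submitted_at <= c.timestamp < result_at]
```

Using a half-open interval means a chunk stamped exactly at `result_at` falls into the next interval, so consecutive intervals do not share data.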

[0160] According to one embodiment, when the processor (410) receives an interpretation request, it may provide images captured by at least one camera at the time the interpretation request (e.g., event) is received (e.g., the start time of the interpretation operation), information about the first interpretation target, and information about the screen to the artificial intelligence model (510) as information to be input to the artificial intelligence model (510) (e.g., initial input information). Because no context information is present when the interpretation operation is started, it is difficult to determine which audio is the main audio for understanding the context; the processor (410) may therefore perform an operation to designate a main interpretation target in order to obtain context information. When the main interpretation target is designated, the processor (410) may provide initial input information to the artificial intelligence model (510) without audio information.

[0161] According to one embodiment, the processor (410) can acquire audio information (hereinafter referred to as the first audio information) in real time from an external environment (e.g., external space) through two or more microphones.

[0162] According to one embodiment, the processor (410) may provide first information for interpreting first audio information to the artificial intelligence model (510) as information input to the artificial intelligence model (510). The first information may include images, first audio information, and information about the first interpretation target. For example, the information input to the artificial intelligence model may be provided to the artificial intelligence model (510) as an input prompt of a specified format (e.g., [audio information, object list, scene-images, context-info]). The audio information may include audio information to be interpreted (e.g., first audio information) obtained through two or more microphones (e.g., directional microphones) included in the microphone circuit (520) of the electronic device (401). The audio information may additionally include audio of a part that was not interpreted in the previous cycle (e.g., first part). The object list may include information regarding a first interpretation target initially designated by the user (e.g., object information and location information corresponding to the first interpretation target). The object list may additionally include information regarding a second interpretation target additionally designated during the interpretation process (e.g., object information and location information corresponding to the second interpretation target). Here, the location information may be the relative position of each object with the electronic device (401) as the reference. The scene-images are used in initial situations where it is difficult to verify the context for interpretation and may be used to correct the positions of objects; they may include images captured in real time by at least one camera during the corresponding cycle. The context-info may include the context of the interpretation target's audio during the interpretation process. The context information may include all content of the audio information of the interpretation target or may include information in a simplified form that can be understood by the artificial intelligence model. According to one embodiment, the processor (410) may further include a priority for translation or interpretation of surrounding audio in the first information (e.g., input prompt). For example, since, given system performance, delay may increase if all audio is interpreted or translated, the processor (410) may assign a high priority to audio corresponding to an object that is the user's gaze target or that is within the user's field of view or the field of view (FOV) of the XR device's camera, and may perform the interpretation or translation of the high-priority audio first.
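
The input prompt format and the priority rule described above can be sketched as a simple data structure. This is a minimal sketch, not the disclosed implementation: the class and field names, the two-level `Priority` enum, and the choice to sort the object list by priority are all assumptions made for illustration. The disclosure only states that audio for objects in the gaze target, the field of view, or the camera FOV may receive a high priority.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical two-level priority; a lower value is interpreted first."""
    HIGH = 0
    LOW = 1


def assign_priority(is_gaze_target: bool, in_user_fov: bool, in_camera_fov: bool) -> Priority:
    """High priority if the object meets any of the three conditions."""
    if is_gaze_target or in_user_fov or in_camera_fov:
        return Priority.HIGH
    return Priority.LOW


@dataclass
class TargetObject:
    """Hypothetical object-list entry: object info plus location info."""
    object_id: str
    position: tuple[float, float, float]  # relative to the electronic device
    priority: Priority = Priority.LOW


@dataclass
class InputPrompt:
    """[audio information, object list, scene-images, context-info]."""
    audio: bytes
    object_list: list[TargetObject]
    scene_images: list[bytes] = field(default_factory=list)
    context_info: str = ""

    def as_list(self) -> list:
        # sorted() is stable, so objects of equal priority keep their order.
        ordered = sorted(self.object_list, key=lambda o: o.priority)
        return [self.audio, ordered, self.scene_images, self.context_info]
```

For example, an object flagged as the gaze target is moved ahead of an off-screen audio source in the list returned by `as_list()`, which mirrors interpreting the high-priority audio first.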

[0163] According to one embodiment, the processor (410) may obtain information regarding the interpretation of the first audio information using an artificial intelligence model (510) based on first information for interpreting the first audio information. In the current cycle, the first audio information, which is the first audio information to be input, is interpreted, and context information is either absent or insufficient to determine the context. The artificial intelligence model (510) may therefore identify the object of the first interpretation target based on the images included in the first information and interpret the audio of the first interpretation target (e.g., the first audio). For the same reason, the artificial intelligence model (510) may not perform the operation of identifying additional interpretation targets based on context information; it may exclude objects other than the object of the first interpretation target from the interpretation targets and may not interpret audio of the other objects. For example, even if the artificial intelligence model (510) interprets audio of objects other than the object of the first interpretation target, it may not include interpretation information for the other objects in the output information.
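
The exclusion described above for the cycle without usable context can be sketched as a filter over per-object interpretation results. The dictionary keyed by object identifier and the function name are hypothetical; the sketch covers only the case in which context information is absent or insufficient.

```python
def first_cycle_output(
    interpretations: dict[str, str],
    first_target_id: str,
) -> dict[str, str]:
    """Keep only the first interpretation target's result.

    interpretations maps a hypothetical object identifier to its interpreted
    text. Results for other objects are dropped even if they were produced.
    """
    return {
        object_id: text
        for object_id, text in interpretations.items()
        if object_id == first_target_id
    }
```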

[0164] According to one embodiment, the processor (410) may obtain first interpretation information that interprets the first audio of a first interpretation target included in the first audio information. The processor (410) may obtain context information identified based on the first audio and images while interpreting the first audio. The first interpretation information and context information may be included in information output from the artificial intelligence model (510) (hereinafter referred to as output information). For example, the output information may be output in a specified format (e.g., [object-list, translated-info or interpreted-info, context-info, audio]). The object-list includes object information regarding the interpretation target (e.g., the first interpretation target), and the interpretation information may include first interpretation information (e.g., voice information and / or text information) interpreted in the current cycle (e.g., the second cycle). The context information may include information about the context learned based on the first audio or the first interpretation information and images. The audio information may include information regarding an untranslated part (e.g., a first part) in the first audio or first interpretation information (e.g., audio of the untranslated part or information indicating a section of the untranslated part). Here, the first part in which interpretation is not completed in the first audio may be stored in memory (415) separately from the second part in which interpretation is completed in the voice information.
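
As a non-limiting sketch, the specified output format [object-list, translated-info or interpreted-info, context-info, audio] could be represented as a simple record type. All class and field names below are hypothetical and chosen only to mirror the four fields named above; the disclosure does not define a concrete data layout.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InterpretedInfo:
    text: Optional[str] = None     # text of the interpretation, if present
    voice: Optional[bytes] = None  # voice of the interpretation, if present


@dataclass
class UninterpretedPart:
    start_s: float  # start of the section whose interpretation is not completed
    end_s: float    # end of that section


@dataclass
class OutputInfo:
    object_list: list                  # identifiers of the interpretation targets
    interpreted_info: InterpretedInfo  # interpretation obtained in the current cycle
    context_info: dict                 # context learned from the audio and images
    audio: list = field(default_factory=list)  # uninterpreted parts, if any

    def as_record(self):
        # Same field order as the specified format.
        return [self.object_list, self.interpreted_info, self.context_info, self.audio]


out = OutputInfo(
    object_list=["first-target"],
    interpreted_info=InterpretedInfo(text="Hello"),
    context_info={"situation": "conversation"},
    audio=[UninterpretedPart(start_s=4.2, end_s=5.0)],
)
```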

[0165] According to one embodiment, the processor (410) may output audio (e.g., voice information) of the first interpretation information through a speaker (530) or display text of the first interpretation information through a display (420). For example, if the first interpretation information contains only audio information, the processor (410) may convert the audio information contained in the first interpretation information into text and display the converted (e.g., translated) text through the display (420) in an area adjacent to an object corresponding to the first interpretation target included in the current screen. For example, if the first interpretation information does not contain audio information and contains only text information, the processor (410) may convert the text information into audio and output the converted audio information through the speaker (530).
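
The conversion between output forms described above can be sketched as follows, as a non-limiting illustration. The function `complete_modalities` and the injected `speech_to_text` and `text_to_speech` callables are hypothetical; the stand-in converters merely encode and decode bytes and do not perform real speech processing.

```python
from typing import Callable, Optional, Tuple


def complete_modalities(
    text: Optional[str],
    voice: Optional[bytes],
    speech_to_text: Callable[[bytes], str],
    text_to_speech: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Fill in whichever form (text or voice) the interpretation information lacks."""
    if text is None and voice is None:
        raise ValueError("interpretation information is empty")
    if text is None:
        text = speech_to_text(voice)   # audio only: derive text for the display
    if voice is None:
        voice = text_to_speech(text)   # text only: derive audio for the speaker
    return text, voice


# Stand-in converters used only to make the sketch runnable.
def fake_stt(voice: bytes) -> str:
    return voice.decode("utf-8")


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


audio_only = complete_modalities(None, b"hello", fake_stt, fake_tts)
text_only = complete_modalities("thanks", None, fake_stt, fake_tts)
```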

[0166] According to one embodiment, the processor (410) can acquire second audio information in real time from an external environment through two or more microphones from the time when the first information is provided to the artificial intelligence model (510) until the time when the first interpretation information is acquired (e.g., during the period for interpreting the first audio information).

[0167] According to one embodiment, the processor (410) may provide second information to an artificial intelligence model (510) for interpreting second audio information at the start of the next cycle after acquiring first interpretation information. The second information may include the images, the second audio information, information about the first interpretation target, and the context information. The processor (410) may interpret a plurality of audios included in the second audio information using the artificial intelligence model (510), and acquire information about an additionally designated second interpretation target by comparing the similarity between the context of each of the plurality of audios and the context information included in the second information. According to one embodiment, the processor (410) may acquire information regarding the interpretation of the second audio of the second interpretation target among the plurality of audios included in the second audio information using the artificial intelligence model (510) as second interpretation information.

[0168] According to one embodiment, the processor (410) may output audio (e.g., voice information) of the second interpretation information through a speaker (530) or display text of the second interpretation information through a display (420). For example, if the second interpretation information contains only audio information, the processor (410) may convert the audio information contained in the second interpretation information into text and display the converted (e.g., translated) text through the display (420) in an area adjacent to an object corresponding to the second interpretation target included in the current screen. For example, if the second interpretation information does not contain audio information and contains only text information, the processor (410) may convert the text information into audio and output the converted audio information through the speaker (530).

[0169] According to one embodiment, if the processor (410) does not identify an additional interpretation target and the first audio of the first interpretation target is included in the second audio information, the processor (410) can use an artificial intelligence model (510) to obtain information that interprets the first audio included in the second audio information as second interpretation information.

[0170] According to one embodiment, the processor (410) may obtain context information of the second interpretation information identified based on the second interpretation information (or second audio) and images included in the second information while interpreting the second audio information. According to one embodiment, the processor (410) may update the context information stored in the memory (415) with the context information of the second interpretation information. Here, the updated context information may be included in the third information for interpreting the third audio information in the next cycle.

[0171] According to one embodiment, the processor (410) may acquire third audio information in real time from an external environment through two or more microphones from the time when the second information is provided to the artificial intelligence model (510) until the time when the second interpretation information is acquired (e.g., during the period for interpreting the second audio information). According to one embodiment, the processor (410) may perform the same operation as the operation for interpreting the second audio information described above to interpret the third audio information using the artificial intelligence model (510) during the next period after the time when the second interpretation information is acquired. The processor (410) may repeat the same operation as the operation for interpreting the second audio information described above until audio information is not acquired from designated interpretation targets for a designated time or until a request to end interpretation is input.
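
The repeated cycle described above can be sketched, in a non-limiting way, as a loop that carries the context information and interpretation targets from one cycle to the next. The function names, the dictionary keys, and the modeling of the "designated time" as a count of consecutive cycles without audio (`idle_limit`) are assumptions made for illustration.

```python
def run_interpretation(cycles, interpret, idle_limit=2):
    """Simulate the cycle loop.

    cycles: iterable of (audio_chunks, end_requested) pairs, one per cycle,
            where audio_chunks is the audio acquired during the previous cycle.
    interpret: stand-in for the artificial intelligence model.
    """
    context = {}                  # no context information at the first cycle
    targets = ["first-target"]    # target selected by the user
    results = []
    idle = 0
    for audio, end_requested in cycles:
        if end_requested:
            break                 # request to end interpretation
        if not audio:
            idle += 1
            if idle >= idle_limit:
                break             # no audio for the designated time
            continue
        idle = 0
        output = interpret(audio, targets, context)
        results.append(output["interpreted"])
        context = output["context"]   # updated context feeds the next cycle
        targets = output["targets"]   # may include additional targets
    return results, context


def fake_interpret(audio, targets, context):
    # Stand-in model: upper-cases the "audio" and counts utterances as context.
    seen = context.get("utterances", 0) + len(audio)
    return {
        "interpreted": [a.upper() for a in audio],
        "context": {"utterances": seen},
        "targets": targets,
    }


results, context = run_interpretation(
    [(["hola"], False), ([], False), (["gracias", "adios"], False),
     ([], False), ([], False), (["late"], False)],
    fake_interpret,
)
```

In this run the loop stops after two consecutive cycles without audio, so the final chunk is never interpreted.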

[0172] According to one embodiment, in addition to the first audio information and the second audio information, the processor (410) may obtain sign language from the interpretation target as the language to be interpreted. When the interpretation target is using sign language, the processor (410) may obtain a video by filming the interpretation target in real time using at least one camera, include the obtained video and information indicating that the language to be interpreted is sign language in the information to be input to the artificial intelligence model (510) (e.g., first information or second information), and provide it to the artificial intelligence model (510). According to one embodiment, the artificial intelligence model (510) may identify, in the information input according to the interpretation request, the information indicating that the language to be interpreted is sign language, analyze the video, and interpret the sign language performed by the interpretation target in the form of audio or text.

[0173] According to one embodiment, the processor (410) can execute an application for interpretation and display a user interface that provides interpretation-related functions on a screen showing an external environment without being adjacent to or overlapping with it.

[0174] FIGS. 7a and 7b are drawings illustrating an example of interpreting audio of an interpretation target in an electronic device according to one embodiment.

[0175] Referring to FIGS. 4, 5, 7a, and 7b, a processor (410) of an electronic device (401) according to one embodiment can select at least one object (711, 713) by user input (701, 703) (e.g., gesture input, voice input, or gaze) when context information is present at the start of an operation. For example, when the user inputs a gesture, such as drawing a circle, to select at least one object (711, 713), the processor (410) can detect the gesture input through at least one camera and display a circle on the screen (710) in response to the gesture input.

[0176] According to one embodiment, the processor (410) may provide information about at least one selected object (711, 713) and information about images or screens (710) captured by at least one camera at the time of operation start as initial input information to the artificial intelligence model (510). According to one embodiment, the artificial intelligence model (510) that receives the initial input information from the processor (410) may use a separation model (511) to analyze the information about the images or screens (710) included in the initial input information, classify objects corresponding to things (e.g., a device that outputs audio) or people included in the images or screens (710), and identify at least one object (711, 713) designated as an interpretation target among the classified objects. According to one embodiment, the artificial intelligence model (510) may generate, based on the images, an image to which are added a virtual object (e.g., a visual graphic object) (721, 723) indicating that at least one object (711, 713) is designated as at least one interpretation target (e.g., a first interpretation target or a main interpretation target) and a virtual object (e.g., a visual graphic object) (731, 733) indicating an interpretation target, and may provide output information including the generated image.

[0177] According to one embodiment, the processor (410) may display a screen (720) through the display (420) comprising a virtual object (e.g., a visual graphic object) (721, 723) indicating that at least one interpretation target (e.g., a first interpretation target or a primary interpretation target) is designated based on an image generated by an artificial intelligence model (510), and a virtual object (e.g., a visual graphic object) (731, 733) indicating the interpretation target.

[0178] According to one embodiment, when at least one object (711, 713) designated as an interpretation target is selected on the screen (710 or 720) with a designated gesture, the processor (410) may exclude the at least one object (711, 713) from the interpretation targets.

[0179] FIGS. 8a, 8b, 8c, and 8d are drawings illustrating examples of interpreting audio of an interpretation target in an electronic device according to one embodiment.

[0180] Referring to FIGS. 4, 5, 8a, 8b, 8c, and 8d, a processor (410) of an electronic device (401) according to one embodiment can acquire audio information (e.g., first audio information) including audio detected from an external environment through two or more microphones (e.g., audios of FIG. 8d (811, 812, 813, 814)) (e.g., voice of a conversation partner or sound output from a device) from the start of operation until input information is provided to an artificial intelligence model (510). The processor (410) can acquire audio information (e.g., second audio information) including detected audios (e.g., audios of FIG. 8d (811, 812, 813, 814)) (e.g., the voice of a conversation partner or sound output from a device) from the time input information is provided to the artificial intelligence model (510) until the time the next input information is provided (or until output information is received from the artificial intelligence model (510)) (e.g., second cycle). The processor (410) can acquire audio information (e.g., third audio information) during the next cycle (e.g., third cycle) following the previous cycle (e.g., second cycle). The processor (410) can continuously acquire audio information until there is a request to end interpretation, and can provide the audio information acquired during one cycle to the artificial intelligence model (510).
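
As a non-limiting sketch, acquiring audio continuously while handing off one cycle's worth of audio at a time could use a buffer that is swapped at each cycle boundary. The class `CycleBuffer` and its method names are hypothetical; the lock is included on the assumption that microphone capture and hand-off run on different threads.

```python
import threading


class CycleBuffer:
    """Collects audio frames and hands them off once per cycle."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = []

    def on_audio(self, frame):
        # Called whenever the microphones deliver a new frame.
        with self._lock:
            self._current.append(frame)

    def hand_off(self):
        # Called when input information is provided to the model:
        # return everything captured so far and start a fresh buffer.
        with self._lock:
            captured, self._current = self._current, []
        return captured


buf = CycleBuffer()
buf.on_audio("frame-a")
buf.on_audio("frame-b")
first_audio_information = buf.hand_off()   # captured before the first hand-off
buf.on_audio("frame-c")                    # arrives while the model is busy
second_audio_information = buf.hand_off()  # captured during the following cycle
```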

[0181] According to one embodiment, an artificial intelligence model (510) that receives audio information acquired from a processor (410) during a corresponding cycle classifies at least one audio included in the audio information (e.g., the audios of FIG. 8d (811, 812, 813, 814)) using a separation model (511), and can interpret at least one audio (e.g., the audios of FIG. 8d (811, 812, 813, 814)) into the user's language using an interpretation model (512). For example, the audio information may include audio generated around an electronic device in addition to audio detected from the interpretation target. According to one embodiment, the artificial intelligence model (510) can classify the detected audio (e.g., the audios of FIG. 8d (811, 812, 813, 814)) using acoustic localization information. The artificial intelligence model (510) can generate an audio map (810, 820) (e.g., a 2D or 3D map) as in FIG. 8a and FIG. 8b based on audios classified according to the electronic device (401) (e.g., audios of FIG. 8d (811, 812, 813, 814)).
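
The disclosure does not specify how the acoustic localization information is computed. As one non-limiting possibility, a bearing for each classified audio could be estimated from the time difference of arrival (TDOA) between two microphones under a far-field assumption, using sin(theta) = c * delay / spacing. The microphone spacing, the delays, and the function names below are assumptions for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at about 20 degrees Celsius


def bearing_from_tdoa(delay_s: float, mic_spacing_m: float) -> float:
    """Far-field bearing in degrees relative to the array's broadside direction."""
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against measurement noise
    return math.degrees(math.asin(ratio))


def build_audio_map(delays, mic_spacing_m=0.14):
    # Maps each classified audio to a bearing: a simple one-dimensional audio map.
    return {
        audio_id: bearing_from_tdoa(delay, mic_spacing_m)
        for audio_id, delay in delays.items()
    }


# Hypothetical inter-microphone delays (seconds) for three classified audios.
audio_map = build_audio_map({"811": 0.0, "812": 0.0002, "813": -0.0002})
```

With two microphones only a single angle can be recovered; a 2D or 3D map as described above would need additional microphones or the image-based correction described in the following paragraphs.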

[0182] According to one embodiment, an artificial intelligence model (510) can generate an object map (830) (e.g., a 2D or 3D map) based on the location of at least one object identified by analyzing information about input images or screens using a separation model (511). The object map (830) may include at least one object designated as an interpretation target (all classified objects) (e.g., the objects of FIG. 8c (831, 832, 833, 834)).

[0183] According to one embodiment, an artificial intelligence model (510) can correct an audio map (820) (e.g., direction or location of classified audio) based on information analyzed about images or screens using a separation model (511) (e.g., object map (830)).

[0184] According to one embodiment, an artificial intelligence model (510) can use a separation model (511) to match audios (e.g., audios of FIG. 8d (811, 812, 813, 814)) and objects (e.g., objects of FIG. 8c (831, 832, 833, 834)) based on an audio map (810, 820) and an object map (830), and separate at least one audio (811) corresponding to at least one object corresponding to a designated interpretation target. For example, the artificial intelligence model (510) can analyze the classified audios (e.g., audios of FIG. 8d (811, 812, 813, 814)) to separate audio of a language different from the user's language (e.g., an interpretable language). For example, the artificial intelligence model (510) may separate only at least one audio corresponding to at least one object corresponding to a designated interpretation target in an initial operation, and during the interpretation, by sufficiently learning contextual information related to the interpretation, it may separate audio of an interpretable language form (e.g., speech or sign language) and / or audio matching objects. For example, the artificial intelligence model (510) may analyze the classified objects to separate one or more objects corresponding to a person or a sound output device, and match one or more separated objects to one or more separated audios.
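
A non-limiting sketch of matching audios to objects from an audio map and an object map, and then separating the audio of the designated interpretation target, is shown below. Representing both maps as bearings in degrees, the nearest-bearing rule, and the `max_gap_deg` tolerance are assumptions; the bearing values are invented for illustration.

```python
def match_audio_to_objects(audio_map, object_map, max_gap_deg=15.0):
    """Pair each audio with the object whose bearing is closest, within a tolerance.

    Bearings are assumed to lie within +/-90 degrees, so wrap-around is ignored.
    """
    matches = {}
    for audio_id, audio_bearing in audio_map.items():
        best_id, best_gap = None, max_gap_deg
        for object_id, object_bearing in object_map.items():
            gap = abs(audio_bearing - object_bearing)
            if gap <= best_gap:
                best_id, best_gap = object_id, gap
        matches[audio_id] = best_id  # None when no object is close enough
    return matches


def separate_target_audio(matches, target_ids):
    # Keep only the audios matched to designated interpretation targets.
    return [audio_id for audio_id, obj in matches.items() if obj in target_ids]


audio_map = {"811": -28.0, "812": 5.0, "813": 31.0, "814": 80.0}
object_map = {"831": -30.0, "832": 2.0, "833": 33.0}
matches = match_audio_to_objects(audio_map, object_map)
target_audio = separate_target_audio(matches, {"831"})
```

Here audio 814 has no visible object within the tolerance, so it stays unmatched, and only audio 811 is separated for interpretation target 831.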

[0185] According to one embodiment, the separation model (511) of the artificial intelligence model (510) may provide only the separated audio (811) to the interpretation model (512) when only the audio (811) for the interpretation target (831) is separated (e.g., initial operation situation). The separation model (511) of the artificial intelligence model (510) may provide multiple separated audios to the interpretation model (512) when the audio of other objects is further separated in addition to the audio (811).

[0186] FIGS. 9a and 9b are drawings illustrating examples of interpreting audio of an interpretation target in an electronic device according to one embodiment.

[0187] Referring to FIGS. 4, 5, 9a, and 9b, an artificial intelligence model (510) that receives input information for interpreting audio information from a processor (410) of an electronic device (401) according to one embodiment can interpret at least one separated audio provided from a separation model (511) using an interpretation model (512). When the interpretation model (512) interprets the first audio information input in the first cycle, it can interpret only the first audio of the first interpretation target, because the object selected by the user (e.g., at least one object (711, 713) in FIGS. 7a and 7b) is designated as the first interpretation target. When the interpretation model (512) intends to interpret the second audio information input in the next third cycle, if it has sufficiently learned context information related to interpretation, it can interpret all of the multiple audios separated by the separation model (511).

[0188] According to one embodiment, the artificial intelligence model (510) may include multiple interpretation models (512), and may interpret each audio using as many interpretation models (512) as there are provided audios. According to one embodiment, while the artificial intelligence model (510) is performing interpretation using the interpretation model (512), if a part of the audio lacks information (e.g., a part where the utterance contained in the audio does not form a complete sentence or is in a language that is not supported), the artificial intelligence model (510) may leave that part uninterpreted and include it in the interpretation information as is in the form of audio, create a graphic object representing the uninterpreted part, and output the created graphic object together with the interpretation information.
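
As a non-limiting sketch, passing through the parts that lack information could look like the following. The `Segment` fields, the set of supported languages, and the stand-in `fake_interpret` function are hypothetical and exist only to make the control flow concrete.

```python
from dataclasses import dataclass

SUPPORTED_LANGUAGES = {"es", "fr"}  # hypothetical set of supported languages


@dataclass
class Segment:
    audio: bytes
    language: str
    is_complete_sentence: bool


def interpret_segments(segments, interpret):
    """Interpret what can be interpreted; keep the rest as raw audio."""
    interpreted, uninterpreted = [], []
    for seg in segments:
        if seg.language in SUPPORTED_LANGUAGES and seg.is_complete_sentence:
            interpreted.append(interpret(seg))
        else:
            uninterpreted.append(seg.audio)  # included as is, in the form of audio
    return interpreted, uninterpreted


def fake_interpret(seg: Segment) -> str:
    # Stand-in for an interpretation model.
    return "interpreted:" + seg.audio.decode("utf-8")


segments = [
    Segment(b"buenos dias", "es", True),
    Segment(b"quiero un", "es", False),   # sentence not finished yet
    Segment(b"xin chao", "vi", True),     # language not supported in this sketch
]
interpreted, uninterpreted = interpret_segments(segments, fake_interpret)
```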

[0189] According to one embodiment, an artificial intelligence model (510) can learn context based on at least one separated audio or the result of interpreting at least one separated audio and images using a screening model (513). The artificial intelligence model (510) can determine whether the previous context information, which was learned in a previous cycle and stored in memory (415) or included in the input information, is sufficiently learned to designate additional interpretation targets (e.g., a confidence score above a designated level). Here, previous context information may include context learned up to the previous cycle (e.g., information indicating conversation content and / or information indicating situation (e.g., "inside the cafe," "conversation," and / or "announcement")) and / or information indicating a confidence score for the previous context information. Here, the confidence score for the previous context information may be set to a value lower than the reference level in the initial operation, where the amount of previously learned context is small, and the score may increase as the understanding of the context improves with repeated interpretation. The artificial intelligence model (510) can check the confidence score in the context information included in the input information using a screening model (513), and if the checked confidence score is above the reference level, it can perform an action to designate an additional interpretation target.
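
The confidence gate described above can be sketched as follows, in a non-limiting way. The reference level of 0.7, the initial score of 0.1, and the update rule in `update_confidence` are assumptions; the disclosure states only that the score starts below the reference level and may increase as interpretation is repeated.

```python
REFERENCE_LEVEL = 0.7  # hypothetical reference level


def update_confidence(previous: float, gain: float = 0.2) -> float:
    # Hypothetical rule: each cycle closes a fixed fraction of the gap to 1.0.
    return min(1.0, previous + gain * (1.0 - previous))


def may_designate_additional_targets(context_info: dict) -> bool:
    # Additional targets are considered only once the context is sufficiently learned.
    return context_info.get("confidence_score", 0.0) >= REFERENCE_LEVEL


context = {"situation": "inside the cafe", "confidence_score": 0.1}
cycles_needed = 0
while not may_designate_additional_targets(context):
    context["confidence_score"] = update_confidence(context["confidence_score"])
    cycles_needed += 1
```

Under these assumed values the gate opens after five cycles, which illustrates why only the first interpretation target is interpreted during the initial operation.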

[0190] According to one embodiment, if the artificial intelligence model (510) is in a state where context information is sufficiently learned (e.g., the confirmed confidence score is above a reference level), it compares the similarity between the currently learned context and the context information based on the content of at least one audio for which interpretation is requested in the current cycle or at least one audio, and if the compared similarity value is greater than or equal to a specified value, it may designate an object corresponding to at least one audio as an additional interpretation target. Here, the object corresponding to at least one audio is an object different from the main interpretation target (e.g., the first interpretation target) and may be an object that speaks or outputs a language different from the user. According to one embodiment, the artificial intelligence model (510) may output the context information of the output information by including information about the context learned based on at least one audio and image in the current cycle (e.g., context learned in the current cycle (e.g., information indicating conversation content and / or information indicating situation (e.g., "inside the cafe," "conversation," and / or "announcement")) and / or a confidence score of the context information in the current cycle (e.g., confidence score)). The processor (410) may update the context information stored in memory (415) with the context information included in the output information. According to one embodiment, the artificial intelligence model (510) may generate and output the output information using a selection model (513). The output information may be output in a specified format (e.g., [object-list, translated-info or interpreted-info, context-info, audio]). 
The object-list may include information about the first interpretation target (e.g., main interpretation target) and / or additional interpretation targets (e.g., second interpretation target), and the interpretation information may include interpretation information interpreted in the current cycle (e.g., voice information and / or text information). The context information may include information about the context learned in the current cycle. The audio information may include information about the untranslated part (e.g., the first part) of the audio requested for interpretation in the current cycle (e.g., audio of the untranslated part or information indicating the section of the untranslated part).
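
The similarity comparison used to designate additional interpretation targets can be sketched as follows. This is a non-limiting illustration: the disclosure does not specify the similarity measure, so a bag-of-words cosine similarity over text is swapped in here as a stand-in, and the threshold of 0.3 and the example utterances are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts (stand-in measure)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def additional_targets(context_text, audio_texts, main_target, threshold=0.3):
    # An object other than the main target is designated when the content of
    # its audio is similar enough to the learned context.
    return [
        object_id
        for object_id, text in audio_texts.items()
        if object_id != main_target
        and cosine_similarity(context_text, text) >= threshold
    ]


context_text = "coffee order latte milk cafe"
audio_texts = {
    "911": "one latte please",
    "913": "oat milk or regular milk for the latte",
    "915": "the train leaves at noon",
}
selected = additional_targets(context_text, audio_texts, main_target="911")
```

In this example the object whose speech concerns the same order is designated as an additional interpretation target, while the unrelated utterance is not.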

[0191] According to one embodiment, the processor (410) may display text (931) of first interpretation information on a screen (901) through a display (420) based on output information (e.g., first output information) provided by an artificial intelligence model (510) when the initial operation situation (e.g., interpretation situation of the first cycle) is as illustrated in FIG. 9a. The processor (410) may display a graphic object (921) representing the first interpretation target (911) on the screen (901) in an area containing the designated first interpretation target (911). According to one embodiment, the processor (410) may output audio of the first interpretation information through a speaker (530).

[0192] According to one embodiment, the processor (410) may display text (933) of second interpretation information on the screen (903) through the display (420) based on output information provided by the artificial intelligence model (510) when an interpretation target is added, as illustrated in FIG. 9b. The processor (410) may display on the screen (903) a graphic object (921) representing the first interpretation target (911) in an area containing the designated first interpretation target (911), and a graphic object (923) representing the second interpretation target (913) in an area containing the second interpretation target (913). According to one embodiment, the processor (410) may output audio of the second interpretation information through the speaker (530).

[0193] FIG. 10 is a diagram illustrating an example of interpreting audio of an interpretation target in an electronic device according to one embodiment.

[0194] Referring to FIGS. 4, 5 and 10, a processor (410) of an electronic device (401) according to one embodiment can identify a first interpretation target (1011) and additionally designated interpretation targets (1015) based on output information provided from an artificial intelligence model (510). Based on the output information, the processor (410) can display an image on a current screen (1001) through a display (420), including graphic objects (e.g., virtual objects) (1021, 1023) generated by the artificial intelligence model (510), or display graphic objects (e.g., virtual objects) (1021, 1023) on a current screen (1001) through a display (420). Here, among the graphic objects, the first graphic object (1021) indicates that the objects (e.g., areas containing objects) correspond to the designated interpretation targets (1011, 1015), and the second graphic object (1023) indicates that the remaining objects, excluding the objects corresponding to the designated interpretation targets (1011, 1015), are objects excluded from the interpretation targets.

[0195] According to one embodiment, the artificial intelligence model (510) analyzes images through context similarity verification to exclude from interpretation targets objects among classified objects whose similarity value is smaller than a reference value, generates a graphic object (1023) indicating that the objects are excluded from interpretation targets (e.g., objects not designated as additional interpretation targets), and can include the generated graphic object (1023) in output information. According to one embodiment, the artificial intelligence model (510) generates a graphic object (1021) indicating that the objects (e.g., areas containing objects) correspond to the designated interpretation targets (1011, 1015), and can include the generated graphic object (1021) in output information. According to one embodiment, the artificial intelligence model (510) can generate and provide an image including the graphic object (1021) and the graphic object (1023) using a generative artificial intelligence model.
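
As a non-limiting sketch, the choice between the two graphic objects could be made per classified object as shown below. The function name, the reference value of 0.5, the similarity values, and the `"bystander"` identifier are hypothetical.

```python
def assign_graphic_objects(similarity_by_object, designated_targets, reference_value=0.5):
    """Label each classified object as designated or excluded."""
    overlay = {}
    for object_id, similarity in similarity_by_object.items():
        if object_id in designated_targets or similarity >= reference_value:
            overlay[object_id] = "designated"  # shown with graphic object (1021)
        else:
            overlay[object_id] = "excluded"    # shown with graphic object (1023)
    return overlay


overlay = assign_graphic_objects(
    {"1011": 1.0, "1015": 0.8, "bystander": 0.2},
    designated_targets={"1011"},
)
```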

[0196] Referring to FIGS. 4 and 5, according to one embodiment, the processor (410) may be a hardware component (function) or a software element (program) comprising at least one component provided in the electronic device (401), such as a hardware module or a software module (e.g., an application program). According to one embodiment, the processor (410) may include, for example, one or more combinations of hardware, software, or firmware. The processor (410) may be configured to omit at least some of the components or to include additional components for performing audio and image processing operations in addition to the components.

[0197] According to one embodiment, the memory (415) (e.g., the memory (130) of FIG. 1) may store applications. For example, the memory (415) may store applications (functions or programs) related to images (or image generation), applications related to interpretation, applications related to image or audio management, or applications related to generative AI. The memory (415) may store images captured through at least one camera included in an external electronic device or camera (425), audio information detected through two or more microphones, and interpretation information obtained by interpreting audio information using an artificial intelligence model (510). The memory (415) may store context information included in the output information provided by the artificial intelligence model (510), information about the image, and information about the interpretation target. The memory (415) may include a database that stores context information learned from the artificial intelligence model (510).

[0198] According to one embodiment, the memory (415) may store various data generated during the execution of the program (140), including a program used for functional operation (e.g., the program (140) of FIG. 1). For example, the memory (415) may include a program (140) area and a data area (not shown). The program (140) area may store related program information for operating the electronic device (401), such as an operating system (OS) that boots the electronic device (401) (e.g., the operating system (142) of FIG. 1). The data area (not shown) may store transmitted and / or received data and generated data according to various embodiments. Additionally, the memory (415) may be configured to include at least one storage medium among flash memory, hard disk, multimedia card micro type memory (e.g., secure digital (SD) or extreme digital (XD) memory), RAM, and ROM.

[0199] According to one embodiment, a display (420) (e.g., the display module (160) of FIG. 1, the displays (251, 252) of FIG. 2a and FIG. 2b, the display member (340) of FIG. 3a, or the display (321) of FIG. 3c) can display a screen of the external environment of an electronic device based on images captured through at least one camera, and can display at least one virtual object in a partial area of the screen. The display (420) can display on the screen information related to the interpretation target (e.g., a graphic object representing the interpretation target) and text of interpretation information interpreted using an artificial intelligence model (510) to interpret audio information, under the control of a processor (410). According to one embodiment, the display (420) can be implemented in the form of a touch screen. When the display (420) is implemented in the form of a touch screen together with an input module, it can display various information generated according to the user's touch operation. According to one embodiment, the display (420) may be composed of at least one of an LCD (liquid crystal display), a TFT-LCD (thin film transistor LCD), an OLED (organic light emitting diode), an LED, an AMOLED (active matrix organic LED), a flexible display, and a 3-dimensional display. Additionally, some of these displays may be configured to be transparent or light-transmitting so that the outside can be seen through them. This may be configured in the form of a transparent display including a TOLED (transparent OLED). According to one embodiment, in addition to the display (420), other display modules (e.g., an extended display or a flexible display) may be further included.

[0200] According to one embodiment, a camera (425) (e.g., camera module (180) of FIG. 1, camera (211-1, 211-2) of FIG. 2a, or camera (311, 312, 313, 314, 315, 316) of FIG. 3b) may include at least one camera and may capture images (e.g., 2D images or 3D images) of an external environment so that the actual external environment is displayed through a display in a real space (e.g., virtual reality space, augmented reality space, or mixed reality space) or on a screen corresponding to the real space (e.g., to display a screen). The configuration and operation of at least one camera included in the camera (425) may be the same or similar to the camera (211-1, 211-2) of FIG. 2a or the camera (311, 312, 313, 314, 315, 316) of FIG. 3b.

[0201] According to one embodiment, a communication circuit (435) (e.g., a communication module (190) of FIG. 1) can communicate with an external electronic device (e.g., an electronic device (102, 104) of FIG. 1, a server (108) of FIG. 1, or another user's electronic device). For example, the communication circuit (435) can receive at least one object displayed in a portion of a screen from an external electronic device and transmit notification information to the external electronic device. According to one embodiment, the communication circuit (435) may include a cellular module, a Wi-Fi (wireless-fidelity) module, a Bluetooth module, or a near field communication (NFC) module.

[0202] An electronic device according to one embodiment (e.g., electronic device (101) of FIG. 1, electronic device (200) of FIG. 2a and FIG. 2b, electronic device (300) of FIG. 3a, FIG. 3b and FIG. 3c, or electronic device (401) of FIG. 4) may implement a software module related to an interpretation service (e.g., program (140) of FIG. 1). The memory of the electronic device (e.g., memory (130) of FIG. 1 and / or memory (415) of FIG. 4) may store instructions to implement the software module. At least one processor (e.g., processor (120) of FIG. 1 and / or processor (410) of FIG. 4) can execute instructions stored in memory to implement a software module and can control hardware associated with the function of the software module (e.g., sensor module (176) of FIG. 1, camera module (180) of FIG. 1, communication module (190) of FIG. 1 and / or communication circuit (435) of FIG. 4, display module (160) of FIG. 1 and / or display (420) of FIG. 4).

[0203] A software module of an electronic device (101, 200, 300, 401) according to one embodiment may be configured to include a kernel (or HAL), a framework (e.g., middleware (144) of FIG. 1), and an application (e.g., application (146) of FIG. 1). At least some of the software modules may be preloaded onto the electronic device (101, 200, 300, 401) or downloadable from a server (e.g., server (108)).

[0204] According to one embodiment, the kernel may include, for example, a system resource manager or a device driver, but is not limited thereto and may include other modules. The system resource manager may perform control, allocation, or reclamation of system resources. The device driver may include, for example, a display driver, a camera driver, a Bluetooth driver, a shared memory driver, a USB driver, a keypad driver, a Wi-Fi driver, an audio driver, or an IPC (inter-process communication) driver.

[0205] According to one embodiment, the framework may provide various functions to an application through an application programming interface (API) (not shown) to provide functions commonly required by the application or to enable the application to efficiently use limited system resources within the electronic device (101, 200, 300, 401). The framework may include modules that form combinations of various functions of the components. The framework may provide modules specialized for each type of operating system to provide differentiated functions. The framework may dynamically delete some existing components or add new components.

[0206] According to one embodiment, the application may be configured to include an application (e.g., a module, a manager, or a program) for displaying an image of the external environment in real space. The application may include an application received from an external electronic device (e.g., a server (108) or an electronic device (102, 104)). According to one embodiment, the application may include a preloaded application or a third-party application downloadable from a server. The components of the software module and the names of the components according to the illustrated embodiments may vary depending on the type of operating system. According to one embodiment, at least a portion of the software module may be implemented as software, firmware, hardware, or a combination of at least two of these. At least a portion of the software module may be implemented (e.g., executed) by a processor (e.g., AP). At least a portion of the software module may include, for example, a module, a program, a routine, a set of instructions, or a process for performing at least one function.

[0207] As such, in one embodiment, the main components of an electronic device (101, 200, 300, 401) have been described through the electronic device of FIG. 1, FIG. 2a, FIG. 2b, FIG. 3a, FIG. 3b, FIG. 3c, and FIG. 4. However, in various embodiments, the components illustrated in FIG. 1, FIG. 2a, FIG. 2b, FIG. 3a to FIG. 3c, and FIG. 4 are not all essential components, and the electronic device (101, 200, 300, 401) may be implemented by more components than those illustrated, or by fewer components. Additionally, the positions of the main components of the electronic devices (101, 200, 300, 401) described above through FIG. 1, FIG. 2a, FIG. 2b, FIG. 3a, FIG. 3b, FIG. 3c and FIG. 4 may be changed according to various embodiments.

[0208] According to one embodiment, a head-wearable electronic device (e.g., electronic device (101) of FIG. 1, electronic device (200) of FIG. 2a and FIG. 2b, electronic device (300) of FIG. 3a, FIG. 3b and FIG. 3c, or electronic device (401) of FIG. 4) may include at least one camera (e.g., camera module (180) of FIG. 1, at least one camera included in the first camera (211-1, 211-2) of FIG. 2a, third camera (213), camera (313, 314, 315, 316) of FIG. 3b, or camera (425) of FIG. 4), a display (e.g., first display (251) and second display (252) of FIG. 2a and FIG. 2b, display member (340) of FIG. 3a, display (321) of FIG. 3c, or display (420) of FIG. 4), two or more microphones (e.g., input module (150) of FIG. 1 or microphone (520) of FIG. 5), a speaker (e.g., acoustic output device (155) of FIG. 1 or speaker (530) of FIG. 5), at least one processor (e.g., processor (120) of FIG. 1 or processor (410) of FIG. 4), and a memory for storing instructions (e.g., memory (130) of FIG. 1 or memory (415) of FIG. 4).

[0209] According to one embodiment, the instructions, when executed individually or collectively by the at least one processor, may cause the electronic device to perform the operations described below.

[0210] According to one embodiment, the electronic device may acquire images of the external environment of the electronic device captured in real time by the at least one camera.

[0211] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be able to identify at least one first object selected by the user on a screen displayed on the display based on the images as a first interpretation target.

[0212] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be enabled to acquire first audio information in real time from the external environment through the two or more microphones.

[0213] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be enabled to obtain first interpretation information and context information related to the first interpretation information by using an artificial intelligence model (e.g., the artificial intelligence model (510) of FIG. 5) based on first information for interpreting the first audio information.

[0214] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be configured to output the audio of the first interpretation information through the speaker or display the text of the first interpretation information through the display.

[0215] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be enabled to acquire second audio information including a plurality of audios detected in real time from the external environment through the two or more microphones from the time the first information is provided to the artificial intelligence model until the time the first interpretation information is acquired.

[0216] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be enabled to obtain second interpretation information in which at least one of the plurality of audios included in the second audio information is interpreted using the artificial intelligence model based on second information for interpreting the second audio information, and to obtain information regarding an additional interpretation target that the artificial intelligence model identifies by comparing the similarity between the context of the plurality of audios and the context information.

[0217] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be configured to output the audio of the second interpretation information through the speaker or display the text of the second interpretation information through the display.

[0218] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may identify a second interpretation target based on information regarding the additional interpretation target, and use the artificial intelligence model to obtain interpretation information of the second audio of the second interpretation target among a plurality of audios included in the second audio information as the second interpretation information. According to one embodiment, the second interpretation target may be at least one object confirmed to have a similarity value equal to or greater than a specified value by comparing the similarity by the artificial intelligence model.

[0219] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be configured to use the artificial intelligence model to obtain the interpretation information of the first audio of the first interpretation target among the plurality of audios included in the second audio information as the second interpretation information when information regarding the additional interpretation target is not obtained.

[0220] According to one embodiment, when acquiring the second interpretation information, context information related to the second interpretation information can be acquired using the artificial intelligence model, and the context information related to the first interpretation information stored in the memory can be updated with the context information related to the second interpretation information.

[0221] According to one embodiment, the updated context information may be included in the third information when providing the third information for interpretation of the third audio information to the artificial intelligence model.
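
The context update described in paragraphs [0220] and [0221] can be sketched as follows. This is only an illustrative sketch: the class `ContextStore`, the function `build_next_information`, and the dictionary keys are hypothetical names that do not appear in the disclosure, and the context is assumed to be a plain string.

```python
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    """Hypothetical holder for the context information kept in memory."""
    context: str = ""
    history: list = field(default_factory=list)

    def update(self, new_context: str) -> None:
        # Replace the stored context with the context returned for the
        # latest cycle, keeping the previous value only for inspection.
        if self.context:
            self.history.append(self.context)
        self.context = new_context


def build_next_information(audio_information, store: ContextStore) -> dict:
    # The updated context is included in the information provided to the
    # model for the next cycle (the "third information" in the text).
    return {"audio_information": audio_information,
            "context_info": store.context}


store = ContextStore()
store.update("conversation on a bus")                       # first cycle
store.update("conversation on a bus; next stop announced")  # second cycle
third_information = build_next_information(["third-audio"], store)
print(third_information["context_info"])
```

Keeping a single current context and overwriting it each cycle mirrors the "updated with" wording; whether earlier contexts are retained is an assumption of this sketch.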

[0222] According to one embodiment, the first information may include the images, the first audio information, and information about the first interpretation target.

[0223] According to one embodiment, the second information may include images of the external environment captured in real time by the at least one camera after the first interpretation information is acquired, the second audio information, information about the first interpretation target, and the context information.

[0224] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may, based on the first interpretation information indicating that a first part of the first audio was not interpreted, add the first part to the second information provided to the artificial intelligence model so that the first part is interpreted.
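
A minimal sketch of carrying an uninterpreted part over to the next cycle, as paragraph [0224] describes, might look like the following. The function name `add_uninterpreted_part` and the keys `uninterpreted_audio` and `audio_information` are assumptions made for illustration, and audio is represented as a list of chunk labels rather than real samples.

```python
def add_uninterpreted_part(next_information: dict, output_information: dict) -> dict:
    """Return a copy of next_information with any leftover audio placed first.

    Both dictionaries use hypothetical keys; audio is modelled as a list
    of chunk labels for the sake of the example.
    """
    leftover = output_information.get("uninterpreted_audio") or []
    merged = dict(next_information)
    # The leftover part precedes the newly captured audio so that the
    # original speaking order is preserved.
    merged["audio_information"] = list(leftover) + list(
        next_information.get("audio_information", []))
    return merged


first_output = {"interpretation": "Hello, where are",
                "uninterpreted_audio": ["chunk-3"]}
second_information = add_uninterpreted_part(
    {"audio_information": ["chunk-4", "chunk-5"]}, first_output)
print(second_information["audio_information"])
```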

[0225] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may acquire a first graphic object, generated by the artificial intelligence model, representing an additional interpretation target, and display the first graphic object through the display on at least one object on the screen corresponding to the additional interpretation target.

[0226] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may acquire a second graphic object indicating that at least one third object generated by the artificial intelligence model is excluded from interpretation, and display the second graphic object on the at least one third object on the screen through the display.

[0227] According to one embodiment, the at least one third object may be an object corresponding to at least one audio among a plurality of audios included in the second audio information, wherein the similarity value is less than the specified value.

[0228] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may select the at least one first object corresponding to the first interpretation target on the screen based on the user's gesture input, voice input, or the user's gaze using the at least one camera.

[0229] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may, after identifying the first interpretation target, provide the images and information about the at least one first object to the artificial intelligence model without audio information.

[0230] FIG. 11 is a diagram illustrating an example of a method of operation in an electronic device according to one embodiment.

[0231] In the following embodiments, each operation may be performed sequentially, but is not necessarily performed sequentially. For example, the order of each operation may be changed, and at least two operations may be performed in parallel.

[0232] Referring to FIGS. 11 and 12, in operation 1101, an electronic device according to one embodiment (e.g., electronic device (101) of FIG. 1, electronic device (200) of FIG. 2a and FIG. 2b, electronic device (300) of FIG. 3a, FIG. 3b and FIG. 3c, or electronic device (401) of FIG. 4) can acquire images of the external environment of the electronic device captured in real time through at least one camera (e.g., camera module (180) of FIG. 1, at least one camera included in the first camera (211-1, 211-2) of FIG. 2a, third camera (213), camera (313, 314, 315, 316) of FIG. 3b, or camera (425) of FIG. 4), based on receiving a request for interpretation (e.g., an event or input) for interpreting the content of people's conversation or audio output from a device. The electronic device can display a screen on a display (e.g., the first display (251) and second display (252) of FIG. 2a and FIG. 2b, the display member (340) of FIG. 3a, the display (321) of FIG. 3c, or the display (420) of FIG. 4) based on the images.

[0233] In operation 1103, the electronic device may identify (e.g., designate, confirm, or set) at least one object selected by the user on a screen displayed on the display as the first interpretation target. As an initial operation for interpretation, the electronic device may designate at least one object (e.g., target object (620)) selected by the user's gesture input, voice input, or gaze on a screen (e.g., current scene (610)) displayed on the display (420) as the primary interpretation target (hereinafter referred to as the first interpretation target). If there is a target object (e.g., person or device) that speaks or outputs a language different from the user in the space of the external environment within the field of view of the direction the user is currently looking, the electronic device may designate the target object selected on the current screen provided based on images captured in real time as the first interpretation target (e.g., primary interpretation target). If there is no target object (e.g., person or device) that speaks or outputs a language different from the user in the space of the external environment within the field of view of the direction the user is currently looking, the electronic device may change its direction toward the direction where the target object is located (e.g., by the user turning their head) and select the target object as the first interpretation target from the changed screen provided based on images captured in real-time from the changed direction. After the initial operation, the electronic device may change the designated first interpretation target (e.g., primary interpretation target) to another object selected through the user's gesture input, voice input, or gaze while performing the interpretation operation. 
In operation 1103, when the electronic device is starting the interpretation operation and no context information exists yet, the first interpretation target may be designated in order to obtain context information; in this case, the electronic device may provide initial input information to the artificial intelligence model (510) without audio information. The electronic device may use the artificial intelligence model (510) to analyze the images, obtain a graphic object representing the first interpretation target, and display that graphic object on the screen.

[0234] In operation 1105, the electronic device may acquire first audio information in real time from an external environment through two or more microphones (e.g., the input module (150) of FIG. 1, or at least one microphone included in the microphone circuit (520) of FIG. 5). Even after acquiring the first audio information, the electronic device may continuously detect audio through at least one microphone and may provide audio information including the detected audio to an artificial intelligence model at the start of a cycle. Data for interpretation (information about the screen (e.g., images), audio information and / or gaze information) may be collected at each cycle by continuously tracking the first interpretation target. Here, the cycle may be the period from the time when information for interpreting audio information (e.g., input information) is provided to the artificial intelligence model (510) to the time when the interpretation result (e.g., output information) is acquired.
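
One way to picture the per-cycle collection of audio described above is a buffer that keeps filling while the model is busy and is emptied at the start of the next cycle. The sketch below assumes this buffering approach; the class `CycleAudioBuffer` and its methods are hypothetical and are not named in the disclosure.

```python
from collections import deque


class CycleAudioBuffer:
    """Hypothetical buffer for audio detected while a cycle is in progress."""

    def __init__(self):
        self._chunks = deque()

    def on_audio(self, chunk):
        # Called whenever the microphones deliver a new chunk, including
        # while the model is still working on the previous cycle.
        self._chunks.append(chunk)

    def drain(self):
        # Called at the start of a cycle: hand over everything collected
        # so far and start collecting for the following cycle.
        chunks = list(self._chunks)
        self._chunks.clear()
        return chunks


buffer = CycleAudioBuffer()
buffer.on_audio("speech-a")
buffer.on_audio("announcement-b")
cycle_audio = buffer.drain()       # audio information for the next cycle
buffer.on_audio("speech-c")        # already belongs to the cycle after that
print(cycle_audio, buffer.drain())
```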

[0235] In operation 1107, the electronic device may provide first information for interpreting first audio information to an artificial intelligence model. Here, the first information may include images, first audio information, and information about a first interpretation target. For example, information input to the artificial intelligence model may be provided to the artificial intelligence model (510) as an input prompt of a specified format (e.g., [audio information, object list, scene-images, context-info]). The audio information may include audio information to be interpreted (e.g., first audio information) obtained through two or more microphones (520) in the electronic device. The audio information may additionally include audio of a part that was not interpreted in the previous cycle (e.g., first part). The object list may include information about a first interpretation target initially designated by the user (e.g., object information and location information corresponding to the first interpretation target). The object list may additionally include information regarding a second interpretation target designated during the interpretation process (e.g., object information and location information corresponding to the second interpretation target). Here, the location information may be the relative position between each object with respect to an electronic device. Scene-images are used in initial situations where it is difficult to verify the context for interpretation and may be used to correct the positions of objects; they may include images captured in real-time by at least one camera during the corresponding cycle. Context-info may include the context of the audio during the interpretation process. Context-info may include all the content of the audio information or include information in a concise form that can be understood by an artificial intelligence model.
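
The input prompt format named above ([audio information, object list, scene-images, context-info]) could be assembled along the following lines. This is a sketch under stated assumptions: the disclosure does not specify a serialization, so a JSON-compatible dictionary is used here, and `TargetObject`, `build_input_prompt`, and the file-name style references are hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TargetObject:
    """Hypothetical entry of the object list."""
    object_id: str
    label: str
    position: tuple  # position relative to the electronic device (x, y, z)


def build_input_prompt(audio_ref, objects, scene_image_refs, context_info=""):
    # Field order follows the format given in the text:
    # [audio information, object list, scene-images, context-info]
    return {
        "audio information": audio_ref,
        "object list": [asdict(obj) for obj in objects],
        "scene-images": list(scene_image_refs),
        "context-info": context_info,
    }


first_target = TargetObject("obj-1", "person", (0.5, 0.0, 1.2))
prompt = build_input_prompt("audio-cycle-1.wav", [first_target],
                            ["frame-001.png", "frame-002.png"])
print(json.dumps(prompt, indent=2))
```

In the first cycle the context field is empty because no context information has been obtained yet; later cycles would pass the stored context instead.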

[0236] In operation 1109, the electronic device may obtain first interpretation information by interpreting the first audio of the first interpretation target included in the first audio information using the artificial intelligence model based on the first information. The electronic device may obtain confirmed context information based on the first audio and images. Here, the first information may include information about images including captured images, the first audio information, and information about the first interpretation target. The first interpretation information and the context information may be included in information output from the artificial intelligence model (e.g., output information).

[0237] In operation 1111, the electronic device may output the audio of the first interpretation information through a speaker (e.g., the acoustic output module (155) of FIG. 1, the speaker (530) of FIG. 5) and / or display the text of the first interpretation information through a display.

[0238] In operation 1113, the electronic device can check whether there is a request to end interpretation. If the check reveals that there is a request to end interpretation, the electronic device ends the operation as is, and if there is no request to end interpretation, the electronic device can perform operation 1115 to continue performing the interpretation operation of the next cycle.

[0239] In operation 1115, the electronic device performs the interpretation operation of the next cycle and may perform an operation to identify additional interpretation targets. Afterwards, the electronic device may again check in operation 1113 whether there is a request to end interpretation.
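
The control flow of operations 1107 to 1115 can be sketched as a simple loop. The sketch is illustrative only: `run_interpretation` is a hypothetical name, the `interpret` callable stands in for the artificial intelligence model, and `end_requested` stands in for the check of operation 1113.

```python
def run_interpretation(cycle_inputs, interpret, end_requested):
    """Run interpretation cycles until an end request is seen.

    cycle_inputs  : iterable of per-cycle input information
    interpret     : stand-in for the model; returns a dict with the
                    keys "interpretation" and "context" (assumed shape)
    end_requested : callable returning True when the user asks to stop
    """
    outputs = []
    context = ""
    for information in cycle_inputs:
        result = interpret(information, context)   # operations 1107-1109 / 1115
        outputs.append(result["interpretation"])   # operation 1111 (output)
        context = result["context"]                # carried into the next cycle
        if end_requested():                        # operation 1113
            break
    return outputs


def fake_interpret(information, context):
    # Toy stand-in: "interpreting" just upper-cases the input text.
    return {"interpretation": information.upper(),
            "context": context + information}


flags = iter([False, True])  # the end request arrives after the second cycle
outputs = run_interpretation(["a", "b", "c"], fake_interpret, lambda: next(flags))
print(outputs)
```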

[0240] FIG. 12 is a diagram illustrating an example of an operation method in an electronic device according to one embodiment. FIG. 13a and FIG. 13b are diagrams illustrating an example of interpreting audio of a target to be interpreted using an artificial intelligence model in an electronic device according to one embodiment.

[0241] In the following embodiments, each operation may be performed sequentially, but is not necessarily performed sequentially. For example, the order of each operation may be changed, and at least two operations may be performed in parallel. FIG. 12 illustrates specific operations for performing the interpretation operation of the next cycle in operation 1115 of FIG. 11 and identifying an additional interpretation target.

[0242] Referring to FIGS. 12, 13a, and 13b, in operation 1201, an electronic device according to one embodiment (e.g., electronic device (101) of FIG. 1, electronic device (200) of FIGS. 2a and 2b, electronic device (300) of FIGS. 3a, 3b, and 3c, or electronic device (401) of FIG. 4) can, in order to perform interpretation of the next cycle, acquire audio information (hereinafter referred to as second audio information) including a plurality of audios detected in real time from the external environment through two or more microphones while performing interpretation of the previous cycle. For example, as shown in FIG. 13a, in a situation where a user wearing the electronic device (401) is having a conversation with a designated first interpretation target in a space (e.g., inside a bus), the electronic device can, while interpreting the audio of a first object (1311) corresponding to the first interpretation target, detect multiple audios occurring within the space (e.g., audios occurring inside the bus) through the two or more microphones and obtain second audio information including the detected audios. The second audio information may include a second audio (e.g., an announcement inside the bus) of a second object (1313) (e.g., a speaker) included in a screen (1201).

[0243] In operation 1203, the electronic device may provide second information to an artificial intelligence model for interpreting second audio information at the start of the next cycle after acquiring first interpretation information. Here, the second information may include images, second audio information, information about the first interpretation target, and context information.

[0244] In operation 1205, the electronic device may obtain output information including second interpretation information, which interprets at least one of a plurality of audios included in the second audio information based on the second information using an artificial intelligence model. In operation 1205, when the electronic device obtains the second interpretation information, it may obtain information regarding additional interpretation targets identified by comparing the similarity between the context of the plurality of audios and the previous context information by the artificial intelligence model. Information regarding additional interpretation targets may be included in the object list of the output information. The output information provided by the artificial intelligence model may include information regarding the first interpretation target and context information regarding the second interpretation information in the object list. If there is a part of the audio information that is not interpreted in at least one audio, the output information may include the audio for the uninterpreted part. If there is a newly created image or graphic object related to the images, the output information may include the created image or graphic object. The artificial intelligence model may compare the similarity between the context of the plurality of audios and the previous context information and add audio with a similarity value greater than or equal to a specified value as an interpretation target. The electronic device can update previously acquired context information stored in memory (e.g., memory (130) of FIG. 1, memory (415) of FIG. 4) as context information of the second interpretation information included in the output information. 
The updated context information stored in memory can be added to the third information when the third information, for interpreting the third audio information of the following cycle, is provided to the artificial intelligence model. For example, as illustrated in FIG. 13a, the artificial intelligence model can compare the context (e.g., bus announcement) of the second audio (e.g., bus announcement output from the speaker) of the second object (1313) (e.g., speaker) included in the screen (1201) with the previous context information (e.g., context information obtained during the conversation with the first object (1311) (e.g., inside the bus)), determine that there is a high degree of similarity, and add the second object (1313) as an interpretation target.
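
The thresholding step of operation 1205 can be illustrated with a toy similarity measure. In the disclosure the comparison is performed by the artificial intelligence model and no metric is specified; the word-overlap (Jaccard) score below is a stand-in chosen only to keep the sketch self-contained, and the names `jaccard`, `select_additional_targets`, and the object identifiers are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a stand-in for the model's comparison."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def select_additional_targets(audio_contexts: dict, previous_context: str,
                              threshold: float) -> list:
    # Keep every object whose audio context is at least as similar to the
    # previous context as the specified value.
    return [object_id for object_id, context in audio_contexts.items()
            if jaccard(context, previous_context) >= threshold]


previous_context = "bus ride to seoul station"
audio_contexts = {
    "speaker": "bus announcement next stop seoul station",
    "passenger": "phone call about dinner plans",
}
additional = select_additional_targets(audio_contexts, previous_context, 0.3)
print(additional)
```

Objects that fall below the threshold are simply left out of the result here; in the text they correspond to the objects excluded from interpretation.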

[0245] In operation 1207, when the electronic device acquires the second interpretation information, it can check whether there is an additional interpretation target added by the artificial intelligence model. The electronic device can check whether there is an additional interpretation target by checking whether the output information includes information about the additional interpretation target. As a result of the check, if there is an additional interpretation target, the electronic device performs operation 1209, and if there is no additional interpretation target, the electronic device can perform operation 1211.

[0246] In operation 1209, the electronic device identifies an added second interpretation target based on information about an additional interpretation target, and can obtain information that interprets the second audio of the second interpretation target among a plurality of audios included in the second audio information using an artificial intelligence model as second interpretation information. After performing operation 1209, the electronic device can perform operation 1213.

[0247] In operation 1211, the electronic device confirms that there is no additional interpretation target based on the fact that information regarding an additional interpretation target is not confirmed, and can obtain information that interprets the first audio of the first interpretation target among the multiple audios included in the second audio information using an artificial intelligence model as the second interpretation information. After performing operation 1211, the electronic device can perform operation 1213.
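
The branch across operations 1207, 1209, and 1211 can be sketched as follows. The shape of the output information (the keys `additional_targets` and `interpretations`) and the function name `choose_second_interpretation` are assumptions made for illustration, and only the first additional target is considered for simplicity.

```python
def choose_second_interpretation(output_information: dict, first_target: str):
    """Pick whose audio is used as the second interpretation information."""
    additional = output_information.get("additional_targets") or []
    interpretations = output_information["interpretations"]
    if additional:                 # operation 1207: an added target exists
        target = additional[0]     # operation 1209: use the second target
    else:
        target = first_target      # operation 1211: keep the first target
    return target, interpretations[target]


with_extra = {
    "additional_targets": ["speaker"],
    "interpretations": {"person": "Nice to meet you.",
                        "speaker": "The next stop is Seoul Station."},
}
without_extra = {"interpretations": {"person": "Nice to meet you."}}

print(choose_second_interpretation(with_extra, "person"))
print(choose_second_interpretation(without_extra, "person"))
```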

[0248] In operation 1213, the electronic device may output the audio of the second interpretation information through a speaker and / or display the text of the second interpretation information through a display. For example, as illustrated in FIG. 13b, the electronic device may display text information (1320) of the second interpretation information (e.g., "The next stop is Seoul Station."), which interprets the audio of the second object (1313) (e.g., a bus announcement), on a screen (1301) adjacent to the second object (1313). For example, the electronic device may output the audio of the second interpretation information (e.g., "The next stop is Seoul Station.") through a speaker (e.g., the sound output module (155) of FIG. 1, the speaker (530) of FIG. 5). According to one embodiment, if the output information includes a second graphic object representing the added interpretation target, the electronic device may display the second graphic object on the screen on the second object corresponding to the second interpretation target.

[0249] According to one embodiment, the artificial intelligence model may analyze the images and, through context similarity verification, exclude from the interpretation targets those classified objects whose similarity value is smaller than a reference value. The artificial intelligence model may generate a third graphic object indicating the objects excluded from the interpretation targets (e.g., objects not designated as additional interpretation targets) and provide the third graphic object by including it in the output information. According to one embodiment, when the third graphic object is included in the output information, the electronic device can display the third graphic object on the excluded objects on the screen.

[0250] FIG. 14 is a diagram illustrating a generative artificial intelligence system according to one embodiment.

[0251] Referring to FIG. 14, in a generative artificial intelligence system (1400) according to one embodiment, a user query / response interface (1410) (e.g., an input module (150) or a display module (160) of FIG. 1, or a display (230) of FIG. 2a and FIG. 2b, or a display (420) of FIG. 4) may receive user input, and the user input may include user location information, etc. Additionally, user input may be in a mixed form of natural language, images, sounds, and context information. Additionally, user input may be in a non-natural language form, such as selecting a menu. The user query / response interface (1410) may output results of the generative artificial intelligence system to the user. The output may be in a natural language form or a specific content form, and may also be provided in a form such as an action requested by the user.

[0252] An artificial intelligence framework (1440) (e.g., the processor (120) of FIG. 1 or the processor (410) of FIG. 4) can receive input from a user and coordinate and control each component necessary to perform the user's intent based on the user's query.

[0253] User input received from the user query / response interface (1410) can be transmitted to a prompt design component (1441) (e.g., the processor (120) of FIG. 1 or the processor (410) of FIG. 4). The prompt design component (1441) can be used to generate prompts suitable for inputting user input into a large language model (LLM), a large vision model (LVM), or a large multimodal model (LMM). The prompt design component (1441) may be an artificial intelligence component that uses machine learning algorithms or neural networks to develop better prompts over time. The prompt design component (1441) can generate prompts by accessing a knowledge component containing user preference data, a prompt library, and prompt examples based on user input, and can transmit the generated prompts to the LLM, LVM, or LMM.

[0254] An API / Plug-in management component (1442) (e.g., the processor (120) of FIG. 1 or the processor (410) of FIG. 4) can obtain information from external sources when additional information is requested while user input is being passed as input to a generative model (e.g., the artificial intelligence (AI) model (510) of FIG. 5 or a cloud artificial intelligence (AI) model). The API / Plug-in management component (1442) establishes a channel to communicate with the outside of the artificial intelligence framework (1440) via an API, and through the established channel, it can access various data sources (e.g., a knowledge store (1420)) (e.g., the memory (130) of FIG. 1 or the memory (415) of FIG. 4). Additionally, when the user input calls for an application or service to perform a final action rather than to produce an intermediate result, the API / Plug-in management component (1442) may send a request via the API to the application / service component (1430) (e.g., the processor (120) of FIG. 1 or the processor (410) of FIG. 4). Information obtained from the outside may be used to generate a prompt in the prompt design component (1441) along with user input, or it may be passed as input to a generative artificial intelligence model (1460) (e.g., the artificial intelligence model (510) of FIG. 5 or a cloud artificial intelligence model).

[0255] An output modification component (which may also be referred to as a refiner component) (1443) (e.g., the processor (120) of FIG. 1 or the processor (410) of FIG. 4) can refine the output of the generative artificial intelligence model (1460) (e.g., the artificial intelligence model (510) of FIG. 5 or a cloud artificial intelligence model). For example, the output modification component (1443) can verify whether the content generated through the LLM, LVM, and / or LMM is irrelevant, contains biased content, or contains harmful content. Additionally, the output modification component (1443) can determine the extent to which the output matches the desired result and, if additional processing is required, carry out that processing. Furthermore, the output modification component (1443) can compose and provide hints to the user to help avoid unwanted output.
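The checks described in paragraph [0255] might look like the following minimal sketch. It assumes simple keyword matching as a toy stand-in for relevance and harmful-content checks; the `Review` class, the `review_output` function, and the keyword sets are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Review:
    relevant: bool                # output shares at least one expected keyword
    flagged_terms: list[str]      # blocklisted words found in the output
    needs_more_processing: bool   # True when the output should be refined again


def review_output(output: str, expected_keywords: set[str], blocklist: set[str]) -> Review:
    """Toy refiner check: keyword overlap for relevance, blocklist for harmful content."""
    # Normalize words by stripping simple punctuation and lowercasing.
    words = {w.strip(".,!?").lower() for w in output.split()}
    relevant = bool(words & expected_keywords)
    flagged = sorted(words & blocklist)
    return Review(relevant, flagged, needs_more_processing=(not relevant) or bool(flagged))


ok = review_output("The train leaves at noon.", {"train"}, {"idiot"})
bad = review_output("You idiot!", {"train"}, {"idiot"})
```

A real refiner would more likely rely on a classifier or a second model pass; the point here is only the decision structure (verify, score the match, decide whether further processing is required).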

[0256] A generative AI model (1460) (e.g., the AI model (510) of FIG. 5 or a cloud AI model) can generally refer to an artificial intelligence neural network that generates new forms of data based on user input information. A generative AI model (1460) may include a model that generates images and / or a model that generates language. Models that generate images include, but are not limited to, GANs (generative adversarial networks) and VAEs (variational autoencoders), and examples include diffusion-based generative models that use VAEs and transformer structures. Models that generate language are models trained to output the most statistically appropriate output value based on input values, and examples include models such as CHAT-GPT 3 and CHAT-GPT 4. Additionally, there are LMMs (large multimodal models) that can recognize various forms of data input, such as text, images, and voice, and generate new data corresponding to them.

[0257] In one embodiment, the artificial intelligence framework (1440) and / or generative artificial intelligence model (1460) may be included within an artificial intelligence module (e.g., including a processing circuit) within the electronic device. For example, the artificial intelligence module may be operatively coupled with at least one processor of the electronic device (e.g., at least one processor (120) of FIG. 1 or processor (410) of FIG. 4). For example, the artificial intelligence module may be operatively coupled with a sensor hub of the electronic device for one or more sensors within the electronic device.

[0258] According to one embodiment, a method of operation in a head-wearable electronic device (e.g., electronic device (101) of FIG. 1, electronic device (200) of FIG. 2a and FIG. 2b, electronic device (300) of FIG. 3a, FIG. 3b and FIG. 3c, or electronic device (401) of FIG. 4) may include the operation of acquiring images of the external environment of the electronic device captured in real time by at least one camera of the electronic device (e.g., camera module (180) of FIG. 1, first camera (211-1, 211-2) of FIG. 2a, third camera (213), camera (313, 314, 315, 316) of FIG. 3b, or camera (425) of FIG. 4).

[0259] According to one embodiment, the method may include the operation of identifying at least one first object selected by the user as a first interpretation target on a screen displayed on a display of the electronic device (e.g., the first display (251) of FIG. 2a and 2b, the second display (252), the display member (340) of FIG. 3a, the display (321) of FIG. 3c, or the display (420) of FIG. 4) based on the images.

[0260] According to one embodiment, the method may include the operation of acquiring first audio information in real time from the external environment through two or more microphones of the electronic device (e.g., input module (150) of FIG. 1, microphone circuit (520) of FIG. 5).

[0261] According to one embodiment, the method may include the operation of obtaining first interpretation information and context information related to the first interpretation information by using an artificial intelligence model (e.g., the artificial intelligence model (510) of FIG. 5) based on first information for interpreting the first audio information.

[0262] According to one embodiment, the method may include the operation of outputting the audio of the first interpretation information through a speaker of the electronic device (e.g., the sound output module (155) of FIG. 1, the speaker (530) of FIG. 5) or displaying the text of the first interpretation information through the display.

[0263] According to one embodiment, the method may include the operation of acquiring second audio information comprising a plurality of audios detected in real time from the external environment through the two or more microphones from the time when the first information is provided to the artificial intelligence model until the time when the first interpretation information is acquired.
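One way to picture the capture window of paragraph [0263] is a buffer that is opened when the first information is provided to the model and closed when the first interpretation information arrives. The sketch below is an assumption-laden illustration: `AudioFrame`, `InferenceWindowBuffer`, the timestamp clock, and the `source_id` label are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AudioFrame:
    timestamp: float  # seconds on a hypothetical capture clock
    source_id: str    # hypothetical label for a separated sound source
    samples: bytes


@dataclass
class InferenceWindowBuffer:
    """Collects frames captured between request submission and result arrival."""
    started_at: Optional[float] = None
    frames: list[AudioFrame] = field(default_factory=list)

    def start(self, now: float) -> None:
        # Called when the first information is provided to the model.
        self.started_at = now
        self.frames = []

    def push(self, frame: AudioFrame) -> None:
        # Keep only frames captured while a request is in flight.
        if self.started_at is not None and frame.timestamp >= self.started_at:
            self.frames.append(frame)

    def finish(self, now: float) -> list[AudioFrame]:
        # Called when the first interpretation information is acquired.
        window = [f for f in self.frames if f.timestamp <= now]
        self.started_at = None
        self.frames = []
        return window


buf = InferenceWindowBuffer()
buf.push(AudioFrame(0.5, "speaker_a", b""))   # ignored: no request in flight
buf.start(now=1.0)
buf.push(AudioFrame(1.2, "speaker_a", b"x"))
buf.push(AudioFrame(1.4, "speaker_b", b"y"))
second_audio = buf.finish(now=2.0)
```

The frames returned by `finish` play the role of the second audio information, which may contain audio from several sources.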

[0264] According to one embodiment, the method may include the operation of obtaining second interpretation information by interpreting at least one of the plurality of audios included in the second audio information using the artificial intelligence model based on second information for interpreting the second audio information, and obtaining information about an additional interpretation target identified by comparing the context of the plurality of audios with the context information using the artificial intelligence model.

[0265] According to one embodiment, the method may include the operation of outputting the audio of the second interpretation information through the speaker or displaying the text of the second interpretation information through the display.

[0266] According to one embodiment, the method may further include an operation of identifying a second interpretation target based on information about the additional interpretation target.

[0267] According to one embodiment, the second interpretation information may be information obtained by interpreting, using the artificial intelligence model, the second audio of the second interpretation target among the plurality of audios included in the second audio information, and the second interpretation target may be at least one object whose similarity value, obtained through the comparison performed by the artificial intelligence model, is confirmed to be greater than or equal to a specified value.

[0268] According to one embodiment, the operation of acquiring the second interpretation information may include, when information regarding the additional interpretation target is not acquired, acquiring the interpretation information of the first audio of the first interpretation target among a plurality of audios included in the second audio information using the artificial intelligence model as the second interpretation information.
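The selection rule of paragraphs [0267] and [0268] can be sketched as a threshold test with a fallback. This is a minimal sketch under stated assumptions: the similarity scores are taken as already produced by the artificial intelligence model, and the function name, the object labels, and the threshold of 0.7 are hypothetical.

```python
def select_second_targets(
    similarity_by_object: dict[str, float],
    first_target: str,
    threshold: float,
) -> list[str]:
    """Return additional targets at or above the threshold, else fall back to the first target."""
    additional = [
        obj
        for obj, score in similarity_by_object.items()
        if obj != first_target and score >= threshold
    ]
    # When no additional interpretation target is identified, keep interpreting
    # the first interpretation target's audio.
    return additional if additional else [first_target]


with_extra = select_second_targets(
    {"speaker_a": 1.0, "speaker_b": 0.82, "tv": 0.10}, "speaker_a", threshold=0.7
)
fallback = select_second_targets({"speaker_a": 1.0, "tv": 0.10}, "speaker_a", threshold=0.7)
```

Whether the first target's audio is also interpreted alongside an additional target is left open here; the sketch only shows the threshold comparison and the fallback.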

[0269] According to one embodiment, the method may include, when acquiring the second interpretation information, acquiring context information related to the second interpretation information using the artificial intelligence model, and updating the context information related to the first interpretation information stored in the memory with the context information related to the second interpretation information. The updated context information may be included in third information when the third information, which is for interpreting third audio information, is provided to the artificial intelligence model.
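The context update of paragraph [0269] could be sketched as below. The sketch assumes that "updating" means replacing the stored context with the newer one (merging would be an equally plausible reading); `ContextMemory` and `build_third_information` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ContextMemory:
    """Hypothetical holder for the context information kept in memory."""
    context: str = ""

    def update(self, new_context: str) -> None:
        # Assumption: the newer context replaces the stored one.
        self.context = new_context


def build_third_information(memory: ContextMemory, third_audio: bytes) -> dict:
    """Bundle the next audio with whatever context is currently stored."""
    return {"audio": third_audio, "context": memory.context}


memory = ContextMemory(context="ordering food at a restaurant")
memory.update("asking the waiter about allergens")
third_information = build_third_information(memory, b"\x00\x01")
```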

[0270] According to one embodiment, the first information may include the images, the first audio information, and information about the first interpretation target.

[0271] According to one embodiment, the second information may include images of the external environment captured in real time by the at least one camera after the first interpretation information is acquired, the second audio information, information about the first interpretation target, and the context information.

[0272] According to one embodiment, the method may further include the operation of, based on the first interpretation information including a first part of the first audio that has not been interpreted, adding the first part to the second information provided to the artificial intelligence model so that the first part is interpreted.
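Paragraphs [0270] to [0272] describe what the first and second information contain. A minimal data-structure sketch, assuming hypothetical field names and byte-string placeholders for images and audio, might be:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InterpretationRequest:
    """Hypothetical container for the information provided to the model."""
    images: list[bytes]
    audio: bytes
    first_target: str
    context: Optional[str] = None  # absent in the first information
    carried_over: list[str] = field(default_factory=list)  # parts not yet interpreted


def build_first_information(images: list[bytes], audio: bytes, first_target: str) -> InterpretationRequest:
    # First information: images, first audio information, first interpretation target.
    return InterpretationRequest(images, audio, first_target)


def build_second_information(
    images: list[bytes],
    audio: bytes,
    first_target: str,
    context: str,
    uninterpreted_parts: list[str],
) -> InterpretationRequest:
    # Second information: newer images, second audio information, first target,
    # context information, plus any part of the first audio left uninterpreted.
    return InterpretationRequest(images, audio, first_target, context, list(uninterpreted_parts))


first = build_first_information([b"img0"], b"audio0", "speaker_a")
second = build_second_information(
    [b"img1"], b"audio1", "speaker_a", "restaurant order", ["...and the dessert menu"]
)
```

Carrying the uninterpreted part forward in the same request lets the model interpret it together with the newer audio rather than dropping it.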

[0273] According to one embodiment, the method may further include the operation of acquiring a first graphic object, generated by the artificial intelligence model, that represents the additional interpretation target, and the operation of displaying, through the display, the first graphic object on at least one object corresponding to the additional interpretation target on the screen.

[0274] According to one embodiment, the method may further include the operation of acquiring a second graphic object, generated by the artificial intelligence model, indicating that at least one third object is excluded from interpretation; and the operation of displaying, through the display, the second graphic object on the at least one third object on the screen.

[0275] According to one embodiment, the at least one third object may be an object corresponding to at least one audio among a plurality of audios included in the second audio information, wherein the similarity value is less than the specified value.

[0276] According to one embodiment, the operation of identifying a first interpretation target may include the operation of selecting the at least one first object corresponding to the first interpretation target on the screen based on the user's gesture input, voice input, or the user's gaze using the at least one camera.

[0277] According to one embodiment, the method may further include the operation of identifying the first interpretation target and then providing information about the images and the at least one first object to the artificial intelligence model without audio information.

[0278] According to one embodiment, in a non-transitory storage medium storing one or more programs, the one or more programs may include a command that, when executed by at least one processor (e.g., processor (120) of FIG. 1 or processor (410) of FIG. 4) of a head-wearable electronic device (e.g., electronic device (101) of FIG. 1, electronic device (200) of FIG. 2a and 2b, electronic device (300) of FIG. 3a, 3b and 3c or electronic device (401) of FIG. 4), causes the electronic device to execute an operation of acquiring images of the external environment of the electronic device captured in real time by at least one camera of the electronic device (e.g., camera module (180) of FIG. 1, first camera (211-1, 211-2) of FIG. 2a, third camera (213), camera (313, 314, 315, 316) of FIG. 3b or camera (425) of FIG. 4).

[0279] According to one embodiment, the one or more programs may include a command that, when executed by at least one processor of a head-wearable electronic device, causes the electronic device to execute an operation of identifying at least one first object selected by the user as a first interpretation target on a screen displayed on the display of the electronic device (e.g., the first display (251) of FIG. 2a and 2b, the second display (252), the display member (340) of FIG. 3a, the display (321) of FIG. 3c, or the display (420) of FIG. 4) based on the images.

[0280] According to one embodiment, the one or more programs may include a command that, when executed by at least one processor of a head-wearable electronic device, causes the electronic device to perform an operation of acquiring first audio information in real time from the external environment through two or more microphones of the electronic device (e.g., input module (150) of FIG. 1, microphone circuit (520) of FIG. 5).

[0281] According to one embodiment, the one or more programs may include a command to cause the electronic device to execute, when executed by at least one processor of a head-wearable electronic device, an operation to obtain first interpretation information and context information related to the first interpretation information by using an artificial intelligence model (e.g., the artificial intelligence model (510) of FIG. 5) based on first information for interpreting the first audio information.

[0282] According to one embodiment, the one or more programs may include a command that, when executed by at least one processor of a head-wearable electronic device, causes the electronic device to perform an operation of outputting the audio of the first interpretation information through the speaker of the electronic device (e.g., the sound output module (155) of FIG. 1, the speaker (530) of FIG. 5) or displaying the text of the first interpretation information through the display.

[0283] According to one embodiment, the one or more programs may include a command to cause the electronic device to execute, when executed by at least one processor of a head-wearable electronic device, an operation to acquire second audio information including a plurality of audios detected in real time from the external environment through the two or more microphones from the time when the first information is provided to the artificial intelligence model until the time when the first interpretation information is acquired.

[0284] According to one embodiment, the one or more programs may include a command to, when executed by at least one processor of a head-wearable electronic device, cause the electronic device to obtain second interpretation information by interpreting at least one of the plurality of audios included in the second audio information using the artificial intelligence model based on second information for interpreting the second audio information, and to obtain information about an additional interpretation target identified by comparing the context of the plurality of audios with the context information using the artificial intelligence model.

[0285] According to one embodiment, the one or more programs may include a command that causes the electronic device to perform an operation of outputting the audio of the second interpretation information through the speaker or displaying the text of the second interpretation information through the display when executed by at least one processor of the head-wearable electronic device.

[0286] The present disclosure enables interpretation that focuses on the speech of the target desired by the user, and can improve the accuracy of interpretation for the designated target. Furthermore, the present disclosure allows interpretation results to be obtained for targets that participate in the same conversation and speak in the same context, even if they were not designated for interpretation. In addition, various effects that can be identified directly or indirectly through the present disclosure may be provided. The effects obtainable from the present disclosure are not limited to those mentioned above, and other unmentioned effects will be clearly understood by those skilled in the art to which the present disclosure pertains from the description below.

[0287] Furthermore, the embodiments disclosed in this disclosure are presented for the purpose of explaining and understanding the disclosed technical content and are not intended to limit the scope of the technology described in this disclosure. Accordingly, the scope of this disclosure should be interpreted to include all modifications or various other embodiments based on the technical concept of this disclosure.

[0288] The electronic device according to the various embodiments disclosed in this document may be of various forms. The electronic device may include, for example, a portable communication device (e.g., a smartphone), a computer device, a portable multimedia device, a portable medical device, a camera, a wearable device, or a consumer electronics device. The electronic device according to the embodiments of this document is not limited to the devices described above.

[0289] The various embodiments of this document and the terms used therein are not intended to limit the technical features described in this document to specific embodiments, and should be understood to include various modifications, equivalents, or substitutions of said embodiments. In connection with the description of the drawings, similar reference numerals may be used for similar or related components. The singular form of a noun corresponding to an item may include one or more of said items unless the relevant context clearly indicates otherwise. In this document, phrases such as "A or B," "at least one of A and B," "at least one of A or B," "A, B or C," "at least one of A, B and C," and "at least one of A, B, or C" may each include any one of the items listed together in the corresponding phrase, or all possible combinations thereof. Terms such as "first" and "second," or "1st" and "2nd," may be used simply to distinguish a component from other components and do not limit the components in any other aspect (e.g., importance or order). Where a (e.g., first) component is referred to as "coupled" or "connected" to another (e.g., second) component, with or without the terms "functionally" or "communicatively," it means that the component may be connected to the other component directly (e.g., via a wire), wirelessly, or through a third component.

[0290] The term “module” as used in the various embodiments of this document may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit, for example. A module may be a component formed integrally, or a minimum unit of said component or a part thereof that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).

[0291] Various embodiments of the present document may be implemented as software (e.g., program (140)) comprising one or more instructions stored in a storage medium (e.g., internal memory (136) or external memory (138)) readable by a machine (e.g., electronic device (101)). For example, a processor (e.g., processor (120)) of the machine (e.g., electronic device (101)) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one called instruction. The one or more instructions may include code generated by a compiler or code that can be executed by an interpreter. The storage medium readable by the machine may be provided in the form of a non-transitory storage medium. Here, "non-transitory" simply means that the storage medium is a tangible device and does not contain a signal (e.g., electromagnetic waves), and the term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily.

[0292] According to one embodiment, the method according to the various embodiments disclosed herein may be provided by being included in a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or may be distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product may be temporarily stored or temporarily created on a device-readable storage medium, such as the memory of a manufacturer's server, an application store's server, or a relay server.

[0293] According to various embodiments, each component (e.g., module or program) of the components described above may include a single entity or multiple entities, and some of the multiple entities may be separated and placed in other components. According to various embodiments, one or more of the aforementioned components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the multiple components in the same or similar manner as those performed by the corresponding component among the multiple components prior to integration. According to various embodiments, operations performed by the module, program, or other components may be executed sequentially, in parallel, iteratively, or heuristically, or one or more of the operations may be executed in a different order or omitted, or one or more other operations may be added.

Claims

1. A head-wearable electronic device (101, 200, 300, 401) comprising: at least one camera (180, 211-1, 211-2, 213, 313, 314, 315, 316, 425); a display (160, 251, 252, 340, 321, 422); two or more microphones (150, 520); a speaker (155, 530); at least one processor (120, 410); and memory (130, 415) storing instructions, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: acquire images of an external environment of the electronic device captured in real time by the at least one camera; identify, based on the images, at least one first object selected by a user on a screen displayed on the display as a first interpretation target; acquire first audio information in real time from the external environment through the two or more microphones; obtain, by using an artificial intelligence model based on first information for interpreting the first audio information, first interpretation information in which first audio of the first interpretation target included in the first audio information is interpreted, and context information related to the first interpretation information; output audio of the first interpretation information through the speaker or display text of the first interpretation information through the display; acquire second audio information including a plurality of audios detected in real time from the external environment through the two or more microphones from a time when the first information is provided to the artificial intelligence model until a time when the first interpretation information is acquired; obtain, based on second information for interpreting the second audio information, second interpretation information by interpreting at least one of the plurality of audios included in the second audio information using the artificial intelligence model, and obtain information regarding an additional interpretation target identified by comparing the context of the plurality of audios with the context information using the artificial intelligence model; and output audio of the second interpretation information through the speaker or display text of the second interpretation information through the display.

2. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: identify a second interpretation target based on the information regarding the additional interpretation target; and obtain, as the second interpretation information, interpretation information of second audio of the second interpretation target among the plurality of audios included in the second audio information using the artificial intelligence model, wherein the second interpretation target is at least one object confirmed by the artificial intelligence model to have a similarity value greater than or equal to a specified value.

3. The electronic device of claim 1 or 2, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: when the information regarding the additional interpretation target is not obtained, obtain, as the second interpretation information, interpretation information of the first audio of the first interpretation target among the plurality of audios included in the second audio information using the artificial intelligence model.

4. The electronic device of any one of claims 1 to 3, wherein, when the second interpretation information is acquired, context information related to the second interpretation information is acquired using the artificial intelligence model, and the context information related to the first interpretation information stored in the memory is updated with the context information related to the second interpretation information, wherein the updated context information is included in third information when the third information, which is for interpreting third audio information, is provided to the artificial intelligence model, wherein the first information includes the images, the first audio information, and information about the first interpretation target, and wherein the second information includes images of the external environment captured in real time by the at least one camera after the first interpretation information is acquired, the second audio information, the information about the first interpretation target, and the context information.

5. The electronic device of any one of claims 1 to 4, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: based on the first interpretation information including a first part of the first audio that has not been interpreted, add the first part to the second information provided to the artificial intelligence model so that the first part is interpreted.

6. The electronic device of any one of claims 1 to 5, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: acquire a first graphic object, generated by the artificial intelligence model, representing the additional interpretation target; display, through the display, the first graphic object on at least one object corresponding to the additional interpretation target on the screen; acquire a second graphic object, generated by the artificial intelligence model, indicating that at least one third object is excluded from interpretation targets; and display, through the display, the second graphic object on the at least one third object on the screen, wherein the at least one third object is an object corresponding to at least one audio, among the plurality of audios included in the second audio information, for which the similarity value is less than the specified value.

7. The electronic device of any one of claims 1 to 6, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: select the at least one first object corresponding to the first interpretation target on the screen based on the user's gesture input, voice input, or the user's gaze using the at least one camera.

8. The electronic device of any one of claims 1 to 7, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: after identifying the first interpretation target, provide information about the images and the at least one first object to the artificial intelligence model without audio information.

9. A method of operation in a head-wearable electronic device (101, 200, 300, 401), the method comprising: an operation of acquiring images of an external environment of the electronic device captured in real time by at least one camera (180, 211-1, 211-2, 213, 313, 314, 315, 316, 425) of the electronic device; an operation of identifying, based on the images, at least one first object selected by a user on a screen displayed on a display (160, 251, 252, 340, 321, 420) of the electronic device as a first interpretation target; an operation of acquiring first audio information in real time from the external environment through two or more microphones (150, 520) of the electronic device; an operation of obtaining, by using an artificial intelligence model (510) based on first information for interpreting the first audio information, first interpretation information in which first audio of the first interpretation target included in the first audio information is interpreted, and context information related to the first interpretation information; an operation of outputting audio of the first interpretation information through a speaker (155, 530) of the electronic device or displaying text of the first interpretation information through the display; an operation of acquiring second audio information including a plurality of audios detected in real time from the external environment through the two or more microphones from a time when the first information is provided to the artificial intelligence model until a time when the first interpretation information is acquired; an operation of obtaining, based on second information for interpreting the second audio information, second interpretation information in which at least one of the plurality of audios included in the second audio information is interpreted using the artificial intelligence model, and obtaining information regarding an additional interpretation target identified by comparing, using the artificial intelligence model, the similarity between the context of the plurality of audios and the context information; and an operation of outputting audio of the second interpretation information through the speaker or displaying text of the second interpretation information through the display.

10. The method of claim 9, further comprising an operation of identifying a second interpretation target based on the information regarding the additional interpretation target, wherein the second interpretation information is information obtained by interpreting, using the artificial intelligence model, second audio of the second interpretation target among the plurality of audios included in the second audio information, and wherein the second interpretation target is at least one object whose similarity value, compared by the artificial intelligence model, is confirmed to be greater than or equal to a specified value.

11. The method of claim 9 or 10, wherein the operation of obtaining the second interpretation information comprises an operation of, when the information regarding the additional interpretation target is not obtained, obtaining, as the second interpretation information, interpretation information of the first audio of the first interpretation target among the plurality of audios included in the second audio information using the artificial intelligence model.

12. The method of any one of claims 9 to 11, further comprising: an operation of acquiring, when acquiring the second interpretation information, context information related to the second interpretation information using the artificial intelligence model; and an operation of updating the context information related to the first interpretation information stored in a memory with the context information related to the second interpretation information, wherein the updated context information is included in third information when the third information, which is for interpreting third audio information, is provided to the artificial intelligence model, wherein the first information includes the images, the first audio information, and information about the first interpretation target, and wherein the second information includes images of the external environment captured in real time by the at least one camera after the first interpretation information is acquired, the second audio information, the information about the first interpretation target, and the context information.

13. The method of any one of claims 9 to 12, further comprising: an operation of adding, based on the first interpretation information including a first part of the first audio that is not interpreted, the first part to the second information provided to the artificial intelligence model so that the first part is interpreted; an operation of acquiring a first graphic object that is generated by the artificial intelligence model and represents the additional interpretation target; an operation of displaying, through the display, the first graphic object on the screen on at least one object corresponding to the additional interpretation target; an operation of acquiring a second graphic object that is generated by the artificial intelligence model and indicates that at least one third object is excluded from the interpretation target; and an operation of displaying, through the display, the second graphic object on the screen on the at least one third object, wherein the at least one third object is an object corresponding to at least one audio, among the plurality of audios included in the second audio information, for which the similarity value is less than the specified value.

14. The method of any one of claims 9 to 13, wherein the operation of identifying the first interpretation target comprises an operation of selecting, on the screen, the at least one first object corresponding to the first interpretation target by using a gesture input of the user, a voice input of the user, or a gaze of the user detected using the at least one camera, and wherein the method further comprises, after identifying the first interpretation target, an operation of providing information about the images and the at least one first object to the artificial intelligence model without audio information.

15. A non-transitory storage medium storing one or more programs, wherein the one or more programs comprise instructions that, when executed by at least one processor of a head-wearable electronic device (101, 200, 300, 401), cause the electronic device to execute: an operation of acquiring images of an external environment of the electronic device captured in real time by at least one camera (180, 211-1, 211-2, 213, 313, 314, 315, 316, 425) of the electronic device; an operation of identifying, as a first interpretation target, at least one first object selected by a user on a screen displayed on a display (160, 251, 252, 340, 321, 420) of the electronic device based on the images; an operation of acquiring first audio information in real time from the external environment through two or more microphones (150, 520) of the electronic device; an operation of obtaining, based on first information for interpreting the first audio information, first interpretation information and context information related to the first interpretation information by using an artificial intelligence model (510) to interpret first audio of the first interpretation target included in the first audio information; an operation of outputting audio of the first interpretation information through a speaker (155, 530) of the electronic device or displaying text of the first interpretation information through the display; an operation of acquiring second audio information including a plurality of audios detected in real time from the external environment through the two or more microphones from a time when the first information is provided to the artificial intelligence model until a time when the first interpretation information is acquired; an operation of obtaining, based on second information for interpreting the second audio information, second interpretation information in which at least one of the plurality of audios included in the second audio information is interpreted using the artificial intelligence model, and obtaining information regarding an additional interpretation target identified by comparing, using the artificial intelligence model, a similarity between a context of the plurality of audios and the context information; and an operation of outputting audio of the second interpretation information through the speaker or displaying text of the second interpretation information through the display.
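
The selection logic recited in claims 10, 11, and 13 can be illustrated with a minimal Python sketch. It is not the implementation disclosed in the application. It assumes that the artificial intelligence model has already produced one similarity value per detected audio source, and it reads claims 10 and 11 as alternatives: interpret the additional targets when any exist, otherwise keep interpreting the first target. The names `AudioSource`, `Selection`, `select_targets`, and the default threshold of 0.7 are hypothetical and do not appear in the claims.

```python
from dataclasses import dataclass
from typing import List, NamedTuple


@dataclass(frozen=True)
class AudioSource:
    """One audio among the plurality of audios in the second audio information."""
    object_id: str      # hypothetical identifier of the on-screen object
    similarity: float   # hypothetical similarity value from the AI model


class Selection(NamedTuple):
    to_interpret: List[str]  # objects whose audio is interpreted next
    additional: List[str]    # additional interpretation targets (claim 10)
    excluded: List[str]      # third objects excluded from interpretation (claim 13)


def select_targets(sources: List[AudioSource],
                   first_target_id: str,
                   threshold: float = 0.7) -> Selection:
    """Split detected sources by comparing each similarity value to a threshold."""
    others = [s for s in sources if s.object_id != first_target_id]
    # Claim 10: similarity greater than or equal to the specified value.
    additional = [s.object_id for s in others if s.similarity >= threshold]
    # Claim 13: similarity less than the specified value.
    excluded = [s.object_id for s in others if s.similarity < threshold]
    # Claim 11: with no additional target, fall back to the first target.
    to_interpret = additional if additional else [first_target_id]
    return Selection(to_interpret, additional, excluded)


if __name__ == "__main__":
    detected = [
        AudioSource("speaker_a", 1.0),
        AudioSource("speaker_b", 0.82),
        AudioSource("tv", 0.15),
    ]
    print(select_targets(detected, "speaker_a"))
```

In this sketch, the `additional` list would drive the first graphic object and the `excluded` list would drive the second graphic object described in claim 13.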