Electronic device, method, and recording medium for providing interpreting service
The described technology addresses the lack of real-time interpretation services in augmented reality by using wearable devices to detect and translate spoken language with user-defined aliases, enhancing user experience through efficient and accurate language translation.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: SAMSUNG ELECTRONICS CO LTD
- Filing Date: 2025-12-10
- Publication Date: 2026-06-18
Abstract
Description
Electronic device, method, and recording medium for providing interpretation services
[0001] In embodiments of the present disclosure, an electronic device, a method, and a recording medium for providing an interpreting service are provided.
[0002] With the development of digital technology, various types of electronic devices such as smartphones, tablet PCs (personal computers), and / or wearable electronic devices are widely used. To support and enhance the functionality of these electronic devices, the hardware and / or software parts of the devices are continuously being developed.
[0003] Recently, research and development on extended reality (XR) technologies, such as virtual reality (VR), augmented reality (AR), and / or mixed reality (MR), have been underway. These technologies are being utilized in various fields (e.g., entertainment, infotainment, smart home, and / or smart factory), and the hardware and / or software aspects of electronic devices supporting them are being continuously researched and developed.
[0004] For example, a wearable electronic device can provide various digital contents (e.g., virtual information or virtual images) by overlaying them onto the real world (e.g., the actual environment) through an application (or function) related to an AR service, either independently (e.g., in a standalone manner) or by linking at least two or more devices together (e.g., in a tethered manner), through the display of the wearable electronic device. For instance, AR environments are being implemented such as a tethered AR method, in which an electronic device (e.g., a smartphone) is connected to a wearable electronic device and virtual information generated by the electronic device is provided through the wearable electronic device's display, and a standalone AR method, in which the wearable electronic device independently generates virtual information without being connected to an electronic device and provides it through its display. As such, due to recent technological advancements in AR services, the number of users utilizing AR services is increasing, and user needs are also growing accordingly.
[0005] Furthermore, with the recent rapid advancement of big data and deep learning technologies, artificial intelligence (AI) is being implemented in electronic devices and is also being applied to intelligent personalized services that analyze specific data and integrate and utilize information from various fields tailored to the user. For example, users can control electronic devices through voice conversation and perform searches, queries, and responses regarding specific information using a knowledge base powered by deep learning. Recently, with the evolution of AI technology, generative AI is being implemented. Generative AI can refer to AI technology that creates new, similar content using existing content such as text, audio, and / or images. For instance, generative AI can refer to AI technology capable of generating content (e.g., text, audio, images, and / or videos) corresponding to a given input.
[0006] The information described above may be provided as related art for the purpose of aiding understanding of the present disclosure. No claim or determination is made as to whether any of the foregoing may be applied as prior art related to the present disclosure.
[0007] In one embodiment of the present disclosure, an electronic device for providing an instant interpreting service, a method of operation thereof, and a recording medium are provided.
[0008] In one embodiment of the present disclosure, an electronic device capable of supporting instant interpretation services by interoperating with an on-device and / or external device in an augmented reality (AR) environment using a wearable electronic device, a method of operation thereof, and a recording medium are provided.
[0009] In one embodiment of the present disclosure, an electronic device, a method of operation thereof, and a recording medium are provided that can specify a target speaker by a user describing (or explaining) the target speaker by voice command through a wearable electronic device, perform interpretation / translation on the target speaker's utterance (e.g., voice signal) through artificial intelligence (e.g., interpretation service engine) of an on-device and / or external device, and provide interpretation data through the wearable electronic device.
[0010] The technical problems intended to be solved in this document are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art to which this disclosure pertains from the description below.
[0011] According to an embodiment of the present disclosure, an electronic device supporting interpretation services is provided. The electronic device may include a display, a camera, a memory storing one or more computer programs, and one or more processors including processing circuitry that are communicatively coupled to the display and memory. One or more computer programs may include instructions (e.g., computer-executable instructions) that cause the electronic device to perform the following when executed individually or collectively by one or more processors.
[0012] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may execute an interpretation service based on the detection of a user's voice command. According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may acquire input data through the camera and the microphone. According to one embodiment, the input data may include image data acquired through the camera and voice data acquired through the microphone. According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may generate a prompt containing the input data to generate output data for the input data. According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may provide the prompt to an artificial intelligence on an on-device and / or external device. According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be configured to acquire output data in relation to the prompt. According to one embodiment, the output data may include text data and / or audio data of at least one target speaker, information with a predetermined masking set for the target speaker, source language information associated with the target speaker, translation information, and alias information. According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be configured to provide an interpretation service based on the output data.
[0013] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may receive input data. According to one embodiment, the input data may include image data comprising a point for a target speaker designated by a voice command or user gesture in which the user describes the target speaker, and an object of at least one speaker in real-world space corresponding to the user's field of view (FoV). According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may extract feature information based on the image data of the input data and feature information described by the user based on the voice command. According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may fuse the feature information based on the image data and the feature information described by the user based on the voice command to generate an alias. According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may perform masking of the target speaker based on the image data of the input data. According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may map and store mask information and alias information. According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may perform a translation on the target speaker's speech signal based on the target speaker's language information. According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may provide result data corresponding to the translation execution.
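By way of a non-limiting illustration, the fusing of feature information into an alias and the mapping of mask information to alias information may be sketched as follows. The names `Mask`, `SpeakerRegistry`, and `fuse_features`, the bounding-box form of the mask, and the hyphen-joined alias format are all assumptions introduced for this sketch and are not specified in the disclosure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Mask:
    """Hypothetical mask: a bounding box around the target speaker."""
    x: int
    y: int
    width: int
    height: int


@dataclass
class SpeakerRegistry:
    """Maps each alias to the mask and language information of a speaker."""
    entries: dict = field(default_factory=dict)

    def register(self, alias: str, mask: Mask, language: str) -> None:
        self.entries[alias] = {"mask": mask, "language": language}

    def lookup(self, alias: str) -> dict:
        return self.entries[alias]


def fuse_features(image_features: list[str], spoken_features: list[str]) -> str:
    """Fuse image-derived and user-described features into one alias.

    Features described by the user come first; case-insensitive duplicates
    are dropped while the original order is preserved.
    """
    seen = set()
    merged = []
    for feature in spoken_features + image_features:
        key = feature.lower()
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return "-".join(merged)


registry = SpeakerRegistry()
alias = fuse_features(["red jacket", "glasses"], ["Glasses", "tall"])
registry.register(alias, Mask(120, 40, 80, 200), language="fr")
```

In this sketch, a later utterance attributed to the masked region can be translated by looking up the stored language information through the alias.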
[0014] A method of operation of an electronic device according to an embodiment of the present disclosure may include an operation of executing an interpretation service based on the detection of a user's voice command. The method of operation may include an operation of acquiring input data through a camera and a microphone. According to one embodiment, the input data may include image data acquired through the camera and voice data acquired through the microphone. The method of operation may include an operation of generating a prompt containing the input data to generate output data for the input data. The method of operation may include an operation of providing the prompt to an artificial intelligence on an on-device and / or external device. The method of operation may include an operation of acquiring output data in relation to the prompt. According to one embodiment, the output data may include text data and / or audio data of at least one target speaker, information with a predetermined masking set for the target speaker, original language information associated with the target speaker, translation information, and alias information. The method of operation may include an operation of providing an interpretation service based on the output data.
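As a minimal, non-limiting sketch of the sequence of operations described above, the following example acquires input data, generates a prompt containing it, provides the prompt to an engine, and renders the output data. The field names, the dictionary-shaped prompt, and the `fake_engine` stand-in for the on-device and / or external artificial intelligence are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class InputData:
    image_data: bytes  # acquired through the camera
    voice_data: bytes  # acquired through the microphone


@dataclass
class OutputData:
    text: str             # text data of the target speaker
    mask: tuple           # masking information set for the target speaker
    source_language: str  # original language of the target speaker
    translation: str      # translation information
    alias: str            # alias information


def build_prompt(command: str, data: InputData) -> dict:
    """Generate a prompt that contains the input data."""
    return {"instruction": command, "image": data.image_data, "voice": data.voice_data}


def run_interpretation(command: str, data: InputData,
                       engine: Callable[[dict], OutputData]) -> str:
    """Provide the prompt to the engine and render the resulting output data."""
    prompt = build_prompt(command, data)
    output = engine(prompt)
    return f"[{output.alias} | {output.source_language}] {output.translation}"


def fake_engine(prompt: dict) -> OutputData:
    """Stand-in for the AI engine; returns canned output for the sketch."""
    return OutputData("Bonjour", (120, 40, 80, 200), "fr", "Hello", "glasses-tall")


caption = run_interpretation(
    "interpret the tall person with glasses",
    InputData(b"<frame>", b"<pcm>"),
    fake_engine,
)
```

Passing the engine as a callable keeps the sketch neutral as to whether the artificial intelligence runs on-device or on an external device.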
[0015] In order to solve the above problems, various embodiments of the present disclosure may include a computer-readable recording medium that records a program for executing the method on a processor.
[0016] According to one embodiment, one or more non-transitory computer-readable recording media are provided for storing computer-executable instructions that, when executed individually or collectively by one or more processors of an electronic device, cause the one or more processors to perform operations.
[0017] According to one embodiment, the operations may include, based on the detection of a user's voice command, an operation to execute an interpretation service; an operation to acquire input data through a camera and a microphone; an operation to generate a prompt containing said input data to generate output data for said input data; an operation to provide said prompt to an artificial intelligence on an on-device and / or external device; an operation to acquire output data in relation to said prompt; and an operation to provide an interpretation service based on said output data. According to one embodiment, the input data may include image data acquired through said camera and voice data acquired through said microphone. According to one embodiment, the output data may include text data and / or audio data of at least one target speaker, information with a predetermined masking set for said target speaker, original language information associated with said target speaker, translation information, and alias information.
[0018] Further scope of applicability of the present disclosure will become apparent from the following detailed description. However, since various changes and modifications within the spirit and scope of the present disclosure will be clearly understood by those skilled in the art, the detailed description and specific embodiments, such as preferred embodiments of the present disclosure, should be understood as being given merely as examples.
[0019] According to an electronic device, a method of operation thereof, and a recording medium according to one embodiment of the present disclosure, an instant interpreting service can be provided in an augmented reality (AR) environment using a wearable electronic device by interoperating with an on-device and / or external device. According to one embodiment, when providing an interpreting service, the source language of the target speaker can be automatically specified along with an alias of the target speaker, without the user specifying the source language of the target speaker and the target language to be translated. According to one embodiment, a combined function comprising automatic speech recognition (ASR), text-to-speech (TTS), and interpretation / translation can be processed in a single model (e.g., an artificial intelligence model or an interpreting service engine). According to one embodiment, a user can use the interpreting service immediately by setting the target speaker's language as an alias for the speaker. According to one embodiment, a new user experience (UX) / user interface (UI) for the interpreting / translation service can be provided to the user.
[0020] In addition, various effects that can be understood directly or indirectly through this document may be provided. The effects obtainable from this disclosure are not limited to those mentioned above, and other unmentioned effects will be clearly understood by those skilled in the art to which this disclosure pertains from the description below.
[0021] In relation to the description of the drawings, the same or similar reference numerals may be used for identical or similar components.
[0022] FIG. 1 is a block diagram of an electronic device in a network environment according to various embodiments.
[0023] FIG. 2 is a block diagram showing an integrated intelligence system according to one embodiment.
[0024] FIG. 3 is a block diagram illustrating a generative artificial intelligence system according to one embodiment.
[0025] FIG. 4 is a diagram schematically illustrating the configuration of an electronic device according to one embodiment of the present disclosure.
[0026] FIG. 5a is a drawing illustrating an example of a wearable electronic device according to one embodiment of the present disclosure.
[0027] FIG. 5b is a drawing illustrating an example of the internal structure of the wearable electronic device of FIG. 5a.
[0028] FIG. 6 is a diagram illustrating a network environment between an electronic device and a wearable electronic device according to one embodiment of the present disclosure.
[0029] FIG. 7 is a drawing for illustrating an example of operation between an electronic device and a wearable electronic device according to one embodiment of the present disclosure.
[0030] FIG. 8 is a drawing illustrating an example of an operation in which an interpretation service is provided according to one embodiment of the present disclosure.
[0031] FIG. 9 is a flowchart illustrating the operation method of an interpreting electronic device according to one embodiment of the present disclosure.
[0032] FIG. 10 is a flowchart illustrating the operation method of an interpreting electronic device according to one embodiment of the present disclosure.
[0033] FIG. 11 is a flowchart illustrating the operation method of an interpreting electronic device according to one embodiment of the present disclosure.
[0034] FIG. 12 is a flowchart illustrating a method of operation of an electronic device according to one embodiment of the present disclosure.
[0035] FIG. 13 is a drawing illustrating an example of an interface providing interpretation services in a wearable electronic device according to one embodiment of the present disclosure.
[0036] FIG. 14 is a drawing illustrating an example of an interface providing interpretation services in a wearable electronic device according to one embodiment of the present disclosure.
[0037] Hereinafter, embodiments of the present disclosure are described in detail with reference to the drawings so that those skilled in the art can easily practice them. However, the present disclosure may be embodied in various different forms and is not limited to the embodiments described herein. In relation to the description of the drawings, the same or similar reference numerals may be used for identical or similar components. Furthermore, in the drawings and related descriptions, descriptions of well-known functions and configurations may be omitted for clarity and brevity.
[0038] FIG. 1 is a block diagram of an electronic device (101) in a network environment (100) according to various embodiments.
[0039] Referring to FIG. 1, in a network environment (100), an electronic device (101) may communicate with an electronic device (102) through a first network (198) (e.g., a short-range wireless communication network) or with at least one of an electronic device (104) or a server (108) through a second network (199) (e.g., a long-range wireless communication network). According to one embodiment, the electronic device (101) may communicate with the electronic device (104) through a server (108). According to one embodiment, the electronic device (101) may include a processor (120), memory (130), input module (150), sound output module (155), display module (160), audio module (170), sensor module (176), interface (177), connection terminal (178), haptic module (179), camera module (180), power management module (188), battery (189), communication module (190), subscriber identification module (196), or antenna module (197). In some embodiments, at least one of these components (e.g., connection terminal (178)) may be omitted from the electronic device (101), or one or more other components may be added. In some embodiments, some of these components (e.g., sensor module (176), camera module (180), or antenna module (197)) may be integrated into a single component (e.g., display module (160)).
[0040] The processor (120) can control at least one other component (e.g., a hardware or software component) of the electronic device (101) connected to the processor (120) by executing software (e.g., a program (140)), and can perform various data processing or operations. According to one embodiment, as at least part of the data processing or operations, the processor (120) can store commands or data received from other components (e.g., a sensor module (176) or a communication module (190)) in a volatile memory (132), process the commands or data stored in the volatile memory (132), and store the resulting data in a non-volatile memory (134). According to one embodiment, the processor (120) may include a main processor (121) (e.g., a central processing unit (CPU) or an application processor (AP)) or an auxiliary processor (123) that can operate independently or together with it (e.g., a graphics processing unit (GPU), a neural processing unit (NPU), an image signal processor (ISP), a sensor hub processor, or a communication processor (CP)). For example, if the electronic device (101) includes a main processor (121) and an auxiliary processor (123), the auxiliary processor (123) may be configured to use lower power than the main processor (121) or to be specialized for a designated function. The auxiliary processor (123) may be implemented separately from the main processor (121) or as part thereof.
[0041] The auxiliary processor (123) may control at least some of the functions or states associated with at least one component of the electronic device (101) (e.g., display module (160), sensor module (176), or communication module (190)) on behalf of the main processor (121) while the main processor (121) is in an inactive (e.g., sleep) state, or together with the main processor (121) while the main processor (121) is in an active (e.g., application execution) state. According to one embodiment, the auxiliary processor (123) (e.g., image signal processor or communication processor) may be implemented as part of another functionally related component (e.g., camera module (180) or communication module (190)). According to one embodiment, the auxiliary processor (123) (e.g., neural processing unit) may include a hardware structure specialized for processing an artificial intelligence model. The artificial intelligence model may be generated through machine learning. Such learning may be performed, for example, on the electronic device (101) itself where the artificial intelligence model is executed, or through a separate server (e.g., server (108)). The learning algorithm may include, for example, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning, but is not limited to the examples described above. The artificial intelligence model may include a plurality of artificial neural network layers. An artificial neural network may be a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), a deep Q-network, or a combination of two or more of the above, but is not limited to the examples described above. The artificial intelligence model may, additionally or alternatively to the hardware structure, include a software structure.
[0042] The memory (130) can store various data used by at least one component of the electronic device (101) (e.g., processor (120) or sensor module (176)). The data may include, for example, input data or output data for software (e.g., program (140)) and related commands. The memory (130) may include volatile memory (132) or non-volatile memory (134).
[0043] The program (140) may be stored as software in memory (130) and may include, for example, an operating system (OS) (142), middleware (144), or an application (146).
[0044] The input module (150) can receive commands or data to be used for a component of the electronic device (101) (e.g., processor (120)) from outside the electronic device (101) (e.g., user). The input module (150) may include, for example, a microphone, a mouse, a keyboard, a key (e.g., a button), or a digital pen (e.g., a stylus pen).
[0045] The sound output module (155) can output a sound signal to the outside of the electronic device (101). The sound output module (155) may include, for example, a speaker or a receiver. The speaker may be used for general purposes, such as multimedia playback or recording playback. The receiver may be used to receive incoming calls. According to one embodiment, the receiver may be implemented separately from the speaker or as part thereof.
[0046] The display module (160) can visually provide information to the outside (e.g., a user) of the electronic device (101). The display module (160) may include, for example, a display, a holographic device, or a projector, and a control circuit for controlling the corresponding device. According to one embodiment, the display module (160) may include a touch sensor configured to detect a touch, or a pressure sensor configured to measure the intensity of the force generated by the touch.
[0047] The audio module (170) can convert sound into an electrical signal or, conversely, convert an electrical signal into sound. According to one embodiment, the audio module (170) can acquire sound through the input module (150) or output sound through the sound output module (155) or an external electronic device (e.g., electronic device (102)) (e.g., speaker or headphones) connected directly or wirelessly to the electronic device (101).
[0048] The sensor module (176) can detect the operating state of the electronic device (101) (e.g., power or temperature) or the external environmental state (e.g., user state) and generate an electrical signal or data value corresponding to the detected state. According to one embodiment, the sensor module (176) may include, for example, a gesture sensor, a gyroscope sensor, a barometric pressure sensor, a magnetic sensor, an accelerometer sensor, a grip sensor, a proximity sensor, a color sensor, an IR (infrared) sensor, a biosensor, a temperature sensor, a humidity sensor, or an illuminance sensor.
[0049] The interface (177) may support one or more specified protocols that can be used for the electronic device (101) to be connected directly or wirelessly to an external electronic device (e.g., electronic device (102)). According to one embodiment, the interface (177) may include, for example, a high definition multimedia interface (HDMI), a universal serial bus (USB) interface, a secure digital (SD) card interface, or an audio interface.
[0050] The connection terminal (178) may include a connector through which the electronic device (101) can be physically connected to an external electronic device (e.g., electronic device (102)). According to one embodiment, the connection terminal (178) may include, for example, an HDMI connector, a USB connector, an SD card connector, or an audio connector (e.g., a headphone connector).
[0051] The haptic module (179) can convert an electrical signal into a mechanical stimulus (e.g., vibration or movement) or an electrical stimulus that can be perceived by the user through tactile or kinesthetic senses. According to one embodiment, the haptic module (179) may include, for example, a motor, a piezoelectric element, or an electric stimulation device.
[0052] The camera module (180) can capture still images and video. According to one embodiment, the camera module (180) may include one or more lenses, image sensors, image signal processors, or flashes.
[0053] The power management module (188) can manage power supplied to the electronic device (101). According to one embodiment, the power management module (188) can be implemented, for example, as at least part of a power management integrated circuit (PMIC).
[0054] The battery (189) can supply power to at least one component of the electronic device (101). According to one embodiment, the battery (189) may include, for example, a non-rechargeable primary battery, a rechargeable secondary battery, or a fuel cell.
[0055] The communication module (190) may support the establishment of a direct (e.g., wired) communication channel or a wireless communication channel between the electronic device (101) and an external electronic device, and the performance of communication through the established communication channel. The communication module (190) may include one or more communication processors that operate independently of the processor (120) (e.g., application processor) and support direct (e.g., wired) communication or wireless communication. According to one embodiment, the communication module (190) may include a wireless communication module (192) (e.g., cellular communication module, short-range wireless communication module, or GNSS (global navigation satellite system) communication module) or a wired communication module (194) (e.g., LAN (local area network) communication module, or power line communication module). The corresponding communication module among these communication modules can communicate with an external electronic device (104) through a first network (198) (e.g., a short-range communication network such as Bluetooth, WiFi (wireless fidelity) direct, or IrDA (infrared data association)) or a second network (199) (e.g., a legacy cellular network, a 5G network, a next-generation communication network, the Internet, or a computer network (e.g., a LAN or a WAN (wide area network))). These various types of communication modules may be integrated into a single component (e.g., a single chip) or implemented as multiple separate components (e.g., multiple chips). The wireless communication module (192) can identify or authenticate the electronic device (101) within a communication network such as the first network (198) or the second network (199) using subscriber information (e.g., international mobile subscriber identity (IMSI)) stored in the subscriber identification module (196).
[0056] The wireless communication module (192) can support 5G networks and next-generation communication technologies following 4G networks, for example, new radio (NR) access technology. NR access technology can support high-speed transmission of high-capacity data (eMBB, enhanced mobile broadband), minimization of terminal power and connection of multiple terminals (mMTC, massive machine type communications), or high reliability and low latency (URLLC, ultra-reliable and low-latency communications). The wireless communication module (192) can support a high-frequency band (e.g., mmWave band), for example, to achieve a high data transmission rate. The wireless communication module (192) can support various technologies for securing performance in the high-frequency band, such as beamforming, massive MIMO (multiple-input and multiple-output), full-dimensional MIMO (FD-MIMO), array antenna, analog beamforming, or large-scale antenna. The wireless communication module (192) can support various requirements specified for the electronic device (101), an external electronic device (e.g., electronic device (104)), or a network system (e.g., second network (199)). According to one embodiment, the wireless communication module (192) may support a peak data rate (e.g., 20 Gbps or more) for eMBB realization, loss coverage (e.g., 164 dB or less) for mMTC realization, or U-plane latency (e.g., downlink (DL) and uplink (UL) each 0.5 ms or less, or round trip 1 ms or less) for URLLC realization.
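For illustration only, the example figures given above for eMBB, mMTC, and URLLC realization may be expressed as a simple check. The function name `meets_nr_targets` and its parameters are hypothetical, and the thresholds are merely the example values quoted in the preceding paragraph, not normative limits.

```python
def meets_nr_targets(peak_gbps: float, loss_db: float,
                     dl_ms: float, ul_ms: float) -> dict:
    """Compare measured values with the example targets quoted in the text."""
    return {
        # eMBB: peak data rate of 20 Gbps or more
        "eMBB": peak_gbps >= 20.0,
        # mMTC: loss coverage of 164 dB or less
        "mMTC": loss_db <= 164.0,
        # URLLC: DL and UL each 0.5 ms or less, or round trip 1 ms or less
        "URLLC": (dl_ms <= 0.5 and ul_ms <= 0.5) or (dl_ms + ul_ms <= 1.0),
    }


good = meets_nr_targets(peak_gbps=25.0, loss_db=160.0, dl_ms=0.4, ul_ms=0.4)
poor = meets_nr_targets(peak_gbps=10.0, loss_db=170.0, dl_ms=0.8, ul_ms=0.7)
```

Here the round-trip figure is approximated as the sum of the downlink and uplink latencies, which is an assumption of the sketch.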
[0057] The antenna module (197) can transmit a signal or power to the outside (e.g., an external electronic device) or receive a signal or power from the outside. According to one embodiment, the antenna module (197) may include an antenna comprising a radiator made of a conductor or a conductive pattern formed on a substrate (e.g., a PCB). According to one embodiment, the antenna module (197) may include a plurality of antennas (e.g., an array antenna). In this case, at least one antenna suitable for a communication method used in a communication network, such as the first network (198) or the second network (199), may be selected from the plurality of antennas, for example, by the communication module (190). A signal or power may be transmitted or received between the communication module (190) and an external electronic device through the selected at least one antenna. According to some embodiments, in addition to the radiator, other components (e.g., a radio frequency integrated circuit (RFIC)) may be additionally formed as part of the antenna module (197).
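The selection of at least one antenna suitable for a given communication method may be sketched, purely as an assumption-laden example, as a filter over a list of antennas. The `Antenna` type, the antenna names, and the method labels below are hypothetical and serve only to illustrate the selection step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Antenna:
    name: str
    methods: frozenset  # communication methods this antenna is suited to


def select_antennas(antennas, method):
    """Return every antenna suited to the given communication method."""
    selected = [a for a in antennas if method in a.methods]
    if not selected:
        raise ValueError(f"no antenna supports {method!r}")
    return selected


# Hypothetical plurality of antennas (e.g., an array antenna).
ANTENNAS = [
    Antenna("ant0", frozenset({"bluetooth", "wifi"})),
    Antenna("ant1", frozenset({"cellular"})),
    Antenna("ant2", frozenset({"cellular", "mmwave"})),
]

chosen = select_antennas(ANTENNAS, "cellular")
```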
[0058] According to various embodiments, the antenna module (197) may form a mmWave antenna module. According to one embodiment, the mmWave antenna module may include a printed circuit board, an RFIC disposed on or adjacent to a first surface (e.g., bottom surface) of the printed circuit board and capable of supporting a specified high frequency band (e.g., mmWave band), and a plurality of antennas (e.g., array antennas) disposed on or adjacent to a second surface (e.g., top surface or side surface) of the printed circuit board and capable of transmitting or receiving a signal of the specified high frequency band.
[0059] At least some of the above components can be connected to each other via a communication method between peripheral devices (e.g., bus, GPIO (general purpose input and output), SPI (serial peripheral interface), or MIPI (mobile industry processor interface)) and exchange signals (e.g., commands or data) with each other.
[0060] According to one embodiment, commands or data may be transmitted or received between the electronic device (101) and an external electronic device (104) through a server (108) connected to a second network (199). Each of the external electronic devices (102 or 104) may be the same type of device as the electronic device (101) or a different type of device. According to one embodiment, all or part of the operations performed on the electronic device (101) may be performed on one or more of the external electronic devices (102, 104, or 108). For example, if the electronic device (101) needs to perform a function or service automatically or in response to a request from a user or another device, the electronic device (101) may request one or more external electronic devices to perform at least part of the function or service instead of, or in addition to, performing the function or service itself. One or more external electronic devices that receive the request may execute at least part of the requested function or service, or an additional function or service related to the request, and transmit the result of the execution to the electronic device (101). The electronic device (101) may provide the result, as is or after additional processing, as at least part of the response to the request. For this purpose, for example, cloud computing, distributed computing, mobile edge computing (MEC), or client-server computing technology may be used. The electronic device (101) may provide ultra-low latency services using, for example, distributed computing or mobile edge computing. In one embodiment, the external electronic device (104) may include an Internet of Things (IoT) device. The server (108) may be an intelligent server using machine learning and / or neural networks.
According to one embodiment, the external electronic device (104) or the server (108) may be included within the second network (199). The electronic device (101) can be applied to intelligent services (e.g., smart home, smart city, smart car, or healthcare) based on 5G communication technology and IoT-related technology.
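As a non-limiting illustration, the delegation of a function or service to an external electronic device might be sketched in Python as follows. The function perform_function and its parameters are hypothetical names chosen for this sketch. External devices are modeled as plain callables, and only the "instead of performing the function itself" case is shown.

```python
from typing import Callable, Optional, Sequence


def perform_function(
    request: str,
    run_locally: Optional[Callable[[str], str]],
    external_devices: Sequence[Callable[[str], str]],
    post_process: Callable[[str], str] = lambda result: result,
) -> str:
    """Run the request locally when possible, otherwise delegate it.

    Each external device is tried in order; a device that raises
    ConnectionError is skipped. The result is returned as is, or after
    additional processing by post_process.
    """
    if run_locally is not None:
        return post_process(run_locally(request))
    for device in external_devices:
        try:
            result = device(request)
        except ConnectionError:
            continue  # this device is unreachable; try the next one
        return post_process(result)
    raise RuntimeError("no device could perform the requested function")
```

In this sketch the caller decides whether the result is provided unchanged or further processed by passing a post_process callable.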
[0061] FIG. 2 is a block diagram showing an integrated intelligence system according to one embodiment.
[0062] Referring to FIG. 2, an integrated intelligent system of one embodiment may include an electronic device (201) (e.g., the electronic device (101) of FIG. 1), an intelligent server (300), and a service server (399).
[0063] According to the illustrated embodiment, the electronic device (201) may include a communication interface (210), an input / output (I / O) interface (220), a processor (230), and / or memory (240). The listed components may be operatively or electrically connected to each other. For example, the electronic device (201) may include at least some of the components of the electronic device (101) of FIG. 1.
[0064] The communication interface (210) can be connected to an external device (e.g., the intelligent server (300) and / or the service server (399)) via a network (299) (e.g., any network including a cellular network and / or a WLAN (wireless local area network)) to transmit and receive data. For example, the communication interface (210) may correspond to the CP and / or communication circuit of FIG. 1 (e.g., a communication module (190)). The I / O interface (220) can receive user input, process received user input, and / or output results processed by the processor (230) using an input / output device (not shown) (e.g., a microphone, a speaker, and / or a display (e.g., a display module (160) of FIG. 1)).
[0065] The processor (230) may perform a specified operation by being operatively or electrically connected to the communication interface (210), the I / O interface (220), and / or the memory (240) (e.g., the memory (130) of FIG. 1). For example, the processor (230) may correspond to the processor (120) of FIG. 1. The processor (230) may perform a specified operation by executing a program (or one or more instructions) stored in the memory (240). For example, the processor (230) may receive a user's voice input (e.g., user speech) through the I / O interface (220) or from an external electronic device. The processor (230) may transmit the received voice input to the intelligent server (300) through the communication interface (210). For example, the processor (230) may include one or more processors.
[0066] The processor (230) may receive a result corresponding to a voice input from the intelligent server (300). For example, the processor (230) may receive a plan corresponding to the voice input and / or a result calculated using the plan from the intelligent server (300). For example, the plan may include, but is not limited to, information regarding a plurality of sequential operations to be executed by the electronic device (201) and / or other electronic device in relation to the voice input. The processor (230) may receive a request from the intelligent server (300) to obtain information (e.g., entities, slots, and / or parameters) necessary to generate a plan corresponding to the voice input. The processor (230) may transmit the necessary information to the intelligent server (300) in response to the request.
[0067] The processor (230) can output the result of executing a specified action according to a plan visually, tactilely, and / or audibly through the I / O interface (220). For example, the processor (230) can sequentially display the results of executing multiple actions on a display. For example, the processor (230) can display only some of the results of executing multiple actions (e.g., the result of executing one of the multiple actions or the last action) on a display.
[0068] The processor (230) can recognize voice input. For example, the processor (230) can execute an intelligent app (or voice recognition app) to process voice input in response to a specified voice input (e.g., Wake up!). The processor (230) can provide voice recognition services through the intelligent app. The processor (230) can transmit voice input to an intelligent server (300) through the intelligent app and receive a result corresponding to the voice input from the intelligent server (300).
[0069] An intelligent server (300) of one embodiment can receive a user's voice input from an electronic device (201) via a network (299). The intelligent server (300) can convert audio data corresponding to the received voice input into text data. Based on the text data, the intelligent server (300) can generate at least one plan for performing a task corresponding to the user's voice input. The intelligent server (300) can transmit the generated plan, or the result according to the generated plan, to the electronic device (201) via the network (299).
[0070] An intelligent server (300) of one embodiment may include a front end (310), a natural language platform (320), a capsule database (330), an execution engine (340), and / or an end user interface (350).
[0071] The front end (310) can receive, from the electronic device (201), the voice input that the electronic device (201) has received. The front end (310) can transmit a response corresponding to the voice input to the electronic device (201).
[0072] The natural language platform (320) may include an automatic speech recognition (ASR) module (321), a natural language understanding (NLU) module (323), a planner module (325), a natural language generator (NLG) module (327), and / or a text-to-speech (TTS) module (329).
[0073] The automatic speech recognition module (321) can convert voice input received from the electronic device (201) into text data.
[0074] The natural language understanding module (323) can identify the user's intent and / or parameters (e.g., entities and / or slots) based on the text data of the voice input. The user's intent corresponds to the voice input and may include information indicating an action (or function) that the user intends to perform using the device. A slot may be detailed information related to the user's intent. A slot may be obtained based on a domain corresponding to the utterance. A slot may be variable information required to perform an action. In one embodiment, the variable information constituting the slot may include a named entity.
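As a non-limiting illustration, the identification of an intent and slots from text data might be sketched in Python as follows. This is a toy rule-based stand-in rather than the natural language understanding module (323) itself. The class NluResult, the function understand, and the patterns, domains, and intent names are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NluResult:
    domain: str                                 # category (or service) the utterance belongs to
    intent: str                                 # action the user intends to perform
    slots: dict = field(default_factory=dict)   # variable information for the action


# Hypothetical rules: (domain, intent, pattern whose named groups become slots).
PATTERNS = [
    ("hotel", "reserve_hotel",
     re.compile(r"book a hotel in (?P<city>\w+)", re.IGNORECASE)),
    ("food", "order_food",
     re.compile(r"order (?P<dish>[\w ]+?) from (?P<restaurant>[\w ]+)", re.IGNORECASE)),
]


def understand(text: str) -> Optional[NluResult]:
    """Return the first matching intent with its slots, or None."""
    for domain, intent, pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return NluResult(domain, intent, match.groupdict())
    return None
```

Here the slot value "Seoul" extracted from "Please book a hotel in Seoul" plays the role of a named entity filling the hypothetical city slot.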
[0075] The planner module (325) can generate a plan using the intent and / or parameters determined by the natural language understanding module (323). For example, the planner module (325) can determine at least one domain required to perform a task based on the determined intent. The planner module (325) can determine multiple actions included in each of the at least one domain determined based on the intent. The domain may correspond to a category (or service) associated with an action (or function) that the user intends to execute using the device. The domain may be classified according to a service (e.g., an app) related to text. The domain may be related to the user's intent corresponding to the text. The domain may be classified according to, for example, the application that receives voice input and / or the type of service to be provided based on voice input, but is not limited thereto. In one example, the determination of the domain may be performed by another module (e.g., the natural language understanding module (323)). The planner module (325) can determine parameters required to execute a plurality of determined actions or result values output by the execution of a plurality of actions. The parameters and result values may be defined as concepts of a specified format (or class). For example, the plan may include a plurality of actions and / or a plurality of concepts determined by the user's intent. The planner module (325) can determine the relationship between the plurality of actions and / or a plurality of concepts in a stepwise (or hierarchical) manner. For example, the planner module (325) can identify the execution order of a plurality of actions (e.g., a plurality of actions determined based on the user's intent) based on a plurality of concepts (e.g., parameters required to execute a plurality of actions, and result values output by the execution of a plurality of actions). 
The planner module (325) can generate a plan that includes association information (e.g., an ontology) between the plurality of actions and the plurality of concepts. The planner module (325) can generate a plan using information (e.g., at least one capsule) stored in a capsule database (330) in which a set of relationships between concepts and actions is stored.
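As a non-limiting illustration, identifying the execution order of a plurality of actions from the concepts they require and output might be sketched in Python as a topological sort. The class Action, the function order_actions, and the example action and concept names are hypothetical. The sketch assumes that each concept is output by at most one action.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Action:
    name: str
    needs: frozenset      # concepts required as parameters
    produces: frozenset   # concepts output as result values


def order_actions(actions, available=frozenset()):
    """Order actions so that every needed concept is produced beforehand.

    `available` holds concepts already known (e.g., slots from the utterance).
    """
    producer = {}
    for action in actions:
        for concept in action.produces:
            producer[concept] = action.name

    graph = {}
    for action in actions:
        depends_on = set()
        for concept in action.needs:
            if concept in available:
                continue
            if concept not in producer:
                raise ValueError(f"no action produces concept {concept!r}")
            depends_on.add(producer[concept])
        graph[action.name] = depends_on

    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())
```

For a hotel reservation with the hypothetical actions find_hotels, pick_room, and pay, where each action consumes the concept output by the previous one, the returned order follows that chain regardless of the order in which the actions are listed.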
[0076] The planner module (325) can generate a plan based on an artificial intelligence (AI) system. For example, the artificial intelligence system may include one or more electronic devices and / or one or more processing circuits to execute a rule-based system, a neural network-based system (e.g., a feedforward neural network (FNN), a recurrent neural network (RNN)), or a combination of the above. The artificial intelligence system described above is exemplary, and the artificial intelligence system may be an artificial intelligence system based on any model based on machine learning. The planner module (325) may select a plan corresponding to a user request from a set of predefined plans, or generate a plan in real time in response to a user request.
[0077] The natural language generation module (327) can change the specified information into a text form. The information changed into a text form may be in the form of a natural language utterance.
[0078] The text-to-speech conversion module (329) can convert information in text form into information in speech form.
[0079] The capsule database (330) can store information regarding the relationships between multiple concepts and actions corresponding to multiple domains (e.g., applications). The capsule database (330) can store at least one capsule (e.g., capsule (331) and / or capsule (333)) in the form of a concept action network (CAN). For example, the capsule database (330) can store actions for processing a task corresponding to a user's voice input, and parameters required for the actions, in the form of a CAN. A capsule may include multiple action objects (or action information) and / or concept objects (or concept information) included in a plan. For example, capsules (331, 333) may be created per domain and stored in the capsule database (330), but are not limited thereto.
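As a non-limiting illustration, a per-domain store of actions and concepts might be sketched in Python as follows. The classes Capsule and CapsuleDatabase, their methods, and the in-memory dictionary layout are hypothetical simplifications and are not the concept action network format itself.

```python
from dataclasses import dataclass, field


@dataclass
class Capsule:
    domain: str
    concepts: set = field(default_factory=set)
    # action name -> (input concepts, output concepts)
    actions: dict = field(default_factory=dict)

    def add_action(self, name, inputs, outputs):
        """Register an action and the concepts it relates to."""
        self.actions[name] = (frozenset(inputs), frozenset(outputs))
        self.concepts.update(inputs, outputs)


class CapsuleDatabase:
    def __init__(self):
        self._by_domain = {}

    def store(self, capsule: Capsule) -> None:
        """Store one capsule per domain (a later capsule replaces an earlier one)."""
        self._by_domain[capsule.domain] = capsule

    def actions_producing(self, domain: str, concept: str) -> list:
        """List the actions in a domain whose output includes the concept."""
        capsule = self._by_domain[domain]
        return sorted(
            name for name, (_, outputs) in capsule.actions.items()
            if concept in outputs
        )
```

A planner could query such a store for the actions that output a given concept when assembling a plan for the corresponding domain.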
[0080] The execution engine (340) can produce a result using the generated plan. The end user interface (350) can transmit the produced result to the electronic device (201).
[0081] According to one embodiment, some functions (e.g., natural language platform (320)) or all functions of the intelligent server (300) may be implemented in the electronic device (201). For example, the electronic device (201) may execute one or more programs including a natural language platform (e.g., the natural language platform (320) of FIG. 2) separately from the intelligent server (300). For example, the electronic device (201) may directly perform at least some of the operations of the natural language platform (320) of the intelligent server (300) (e.g., automatic speech recognition module (321), natural language understanding module (323), planner module (325), natural language generation module (327), and / or text-to-speech module (329)).
[0082] In one embodiment, the service server (399) may provide a designated service (e.g., food ordering or hotel reservation) to the electronic device (201). The service server (399) may be a server operated by an operator different from the intelligent server (300). The service server (399) may communicate with the intelligent server (300) and / or the electronic device (201) through the network (299). The service server (399) may communicate with the intelligent server (300) through a separate connection (not shown). The service server (399) may provide the intelligent server (300) with information (e.g., operation information and / or concept information for providing the designated service) to generate a plan corresponding to the voice input received by the electronic device (201). The provided information may be stored in the capsule database (330). The service server (399) may provide the intelligent server (300) with result information according to the plan received from the electronic device (201).
[0083] FIG. 3 is a block diagram illustrating a generative artificial intelligence system according to one embodiment.
[0084] Referring to FIG. 3, a generative artificial intelligence system according to one embodiment (e.g., server (108) of FIG. 1 or intelligent server (300) of FIG. 2) may include a user interface (260), a database (265), an application and service component (270), an artificial intelligence (AI) framework (280), and a generative AI model (290). According to one embodiment, the generative artificial intelligence system may be included in an electronic device (e.g., electronic device (101, 201) of FIG. 1 or FIG. 2) (e.g., AI engine) and / or in an intelligent server (e.g., an external server (e.g., server (108) of FIG. 1)).
[0085] The user interface (260) can receive user queries.
[0086] The input received through the user interface (260) may include user input and / or data obtained or generated by an electronic device (e.g., the electronic device (101, 201) of FIG. 1 or FIG. 2).
[0087] The above data may include images, videos, and / or sensor data generated by at least one processor of the electronic device (101, 201) (e.g., processor (120, 230) of FIG. 1 or FIG. 2).
[0088] The sensor data may include, for example, illuminance data around the electronic device (101, 201) obtained from the sensor module (176) or sensor hub of FIG. 1, posture data (or orientation data) of the electronic device (101, 201), temperature inside the electronic device (101, 201) (e.g., temperature of the display module (160) of FIG. 1 and / or temperature of the processor (120, 230)), size information of the display area of the display module (160), and / or an image obtained through the image sensor of the electronic device (101, 201) (e.g., camera module (180) of FIG. 1).
[0089] The above user query may be in the form of natural language, touch data obtained through a touch circuit included in the display module (160) (e.g., used to identify input from a finger and / or stylus), an image, audio, and / or video. Additionally, context information may be transmitted along with the user query. The context information may include various side information related to the time when the user query is input into the generative artificial intelligence system. For example, the side information may include information such as application information currently being used by the user and / or location information of the user. As another example, the user query may also be a non-natural language input that does not generate natural language, such as a design request or modification. Additionally, a mixed form of the natural language, image, sound, and context information described above is also possible.
[0090] Additionally, the user interface (260) may output results of the generative artificial intelligence system to the user. The output may include results (or result information) generated or obtained by the generative artificial intelligence system based on at least part of the input. The output may be in the form of natural language or specific content, and may also be provided in the form of an action requested by the user. For example, the output may have a format according to the user settings of the electronic device (101, 201).
[0091] The AI framework (280) can receive a user query and coordinate and control each component necessary to perform the user's intent. This AI framework (280) may include a prompt design component (281), an APIs / Plugins Management component (283), and an output modification component (285).
[0092] User queries or actions entered in the user interface (260) can be transmitted to a prompt design component (281). The prompt design component (281) can be used to generate prompts suitable for input into a large language model (LLM), a large vision model (LVM), or a large multimodal model (LMM). The prompt design component (281) may be an AI component that uses machine learning algorithms or neural networks to develop better prompts over time. The prompt design component (281) can generate prompts by accessing a database (265) (e.g., a knowledge component) containing user preference data, a prompt library, and prompt examples, and can transmit them to the large language model (LLM), the large vision model (LVM), and / or the large multimodal model (LMM).
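As a non-limiting illustration, assembling a prompt from a user query, context information, user preference data, and prompt examples might be sketched in Python as follows. The function build_prompt, its parameters, and the plain-text prompt layout are hypothetical. An actual prompt design component may use a learned model rather than a fixed template.

```python
def build_prompt(query, context=None, preferences=None, examples=()):
    """Combine stored knowledge and the user query into one prompt string.

    context and preferences are dictionaries; examples is a sequence of
    (query, answer) pairs taken from a prompt library.
    """
    parts = []
    if preferences:
        parts.append(
            "User preferences: "
            + "; ".join(f"{key}={value}" for key, value in sorted(preferences.items()))
        )
    if context:
        parts.append(
            "Context: "
            + "; ".join(f"{key}={value}" for key, value in sorted(context.items()))
        )
    for example_query, example_answer in examples:
        parts.append(f"Example query: {example_query}\nExample answer: {example_answer}")
    parts.append(f"User query: {query}")
    return "\n".join(parts)
```

Placing the user query last in this sketch is a design choice so that preferences, context, and examples frame the query before the model reads it.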
[0093] The APIs / plugins management component (283) can communicate with external sources when additional information is requested while user input is transmitted as input to the generative AI model (290). The APIs / plugins management component (283) establishes a channel to communicate with the outside of the generative artificial intelligence system through an application programming interface (API), thereby enabling access to various data sources. For example, the APIs / plugins management component (283) can be used to request other components (e.g., the application and service component (270)) that perform feedback (or response) according to the prompt. The acquired information can be used by the prompt design component (281), together with the user input, to generate a prompt, or can be used as input to the generative AI model (290). Additionally, the APIs / plugins management component (283) can request an action through the API if the application or service needs to perform an action that ultimately executes the user query rather than produce an intermediate result. Information obtained from an external source can be transmitted as input to the generative AI model (290) along with user input.
[0094] The output modification component (285) can fine-tune (or adjust or change) the output produced by the generative AI model (290). For example, the output modification component (285) can determine the relevance (e.g., score) between the output (e.g., content) of the generative AI model (290) and the user input. For example, the output modification component (285) can verify whether the content generated through a large language model (LLM), a large vision model (LVM), or a large multimodal model (LMM) is relevant, contains biased information (e.g., selective information), or contains harmful information (e.g., violent content or profanity). Additionally, the output modification component (285) can determine the extent to which the output matches the desired result and, if additional processing is required, proceed with that process. Additionally, the output modification component (285) can configure and provide to the user a hint to avoid unwanted output.
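As a non-limiting illustration, scoring the relevance between a user input and a generated output and screening for harmful content might be sketched in Python as follows. Token overlap (Jaccard similarity) is used here only as a simple stand-in for a relevance score, and the names BLOCKLIST, relevance, and review_output, the placeholder blocklist entry, and the 0.2 threshold are hypothetical.

```python
import re

# Hypothetical placeholder; a real list of harmful terms would be maintained separately.
BLOCKLIST = {"darn"}


def _tokens(text):
    """Lowercase word tokens of a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def relevance(query, output):
    """Jaccard similarity between the query tokens and the output tokens."""
    query_tokens, output_tokens = _tokens(query), _tokens(output)
    if not query_tokens or not output_tokens:
        return 0.0
    return len(query_tokens & output_tokens) / len(query_tokens | output_tokens)


def review_output(query, output, min_relevance=0.2):
    """Reject harmful or irrelevant output; otherwise accept it with its score."""
    if _tokens(output) & BLOCKLIST:
        return {"accepted": False, "reason": "harmful"}
    score = relevance(query, output)
    if score < min_relevance:
        return {"accepted": False, "reason": "irrelevant", "score": score}
    return {"accepted": True, "score": score}
```

A rejected result in this sketch could trigger the additional processing described above, such as regenerating the output or presenting a hint to the user.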
[0095] A generative AI model (290) generally refers to an artificial intelligence neural network that generates new forms of data based on user input information. A generative AI model (290) may include an image generation model and / or a language generation model. An image generation model may include a generative adversarial network (GAN) and / or a variational autoencoder (VAE). An example of an image generation model is a diffusion-based generative model that uses the structures of a VAE and a Transformer. Additionally, a language generation model is a model trained to output the statistically most appropriate output value based on input values, and representative examples include models such as GPT-3 and GPT-4. Additionally, the generative AI model (290) may include a large multimodal model (LMM) capable of recognizing various forms of input data, such as text, images, and / or speech, and generating new data corresponding to them.
[0096] In one embodiment, the AI framework (280) and / or generative AI model (290) may be included within an AI module (e.g., including a processing circuit) within the electronic device (101, 201). For example, the AI module may be operatively coupled with at least one processor (120, 230) of the electronic device (101, 201). For example, the AI module may be operatively coupled with a sensor hub of the electronic device (101, 201) for one or more sensors within the electronic device (101, 201).
[0097] FIG. 4 is a diagram schematically illustrating the configuration of an electronic device according to one embodiment.
[0098] According to one embodiment, FIG. 4 may show a block diagram of an exemplary electronic device (400) (e.g., the electronic device (101, 201) of FIG. 1 or FIG. 2) (hereinafter referred to as the electronic device (400)) capable of performing the operations described in this document. In one embodiment of the present disclosure, the electronic device (400) may be referred to as an electronic device (e.g., a smartphone) that communicates with a wearable electronic device (e.g., the electronic device (101, 201) of FIG. 1 or FIG. 2 or the electronic device (501) of FIG. 5a and FIG. 5b described below) via a defined wireless communication (e.g., short-range wireless communication (e.g., Bluetooth, Wi-Fi)). In one embodiment of the present disclosure, the electronic device (400) may communicate with an external server (e.g., a generative artificial intelligence server, an intelligent server (300)) via a defined wireless communication (e.g., long-range wireless communication (e.g., LTE, 5G)).
[0099] Referring to FIG. 4, the electronic device (400) may be one of various forms of electronic devices, such as a notebook (490), smartphones (491) having various form factors (e.g., a bar-type smartphone (491-1), a foldable-type smartphone (491-2), or a slidable (or rollable)-type smartphone (491-3)), a tablet (492), a cellular phone (not shown), and other similar computing devices (not shown). The components, their relationships, and their functions illustrated in FIG. 4 are illustrative only and are not intended to limit the implementations described or claimed herein. The electronic device (400) may be referred to as a mobile device, a user device, a multifunction device, a portable device, or a server.
[0100] According to one embodiment, the electronic device (400) may include all or at least some of the components of the electronic device (101) as described in the description with reference to FIG. 1. For example, in various embodiments of this document, some of the illustrated components may be omitted or substituted. The electronic device (400) may include at least some of the components and / or functions of the electronic device (101) of FIG. 1. At least some of each component of the illustrated (or unillustrated) electronic device (400) may be operatively, functionally, and / or electrically connected.
[0101] The electronic device (400) may include at least one processor (410) (e.g., processor (120, 230) of FIG. 1 or FIG. 2) (hereinafter referred to as processor (410)), at least one memory (420) (e.g., memory (130, 240) of FIG. 1 or FIG. 2) (hereinafter referred to as memory (420)), at least one display (440) (e.g., display module (160) of FIG. 1) (hereinafter referred to as display (440)), at least one image sensor (450) (e.g., camera module (180) of FIG. 1) (hereinafter referred to as image sensor (450)), at least one communication circuit (460) (e.g., communication module (190) of FIG. 1 or communication interface (210) of FIG. 2) (hereinafter referred to as communication circuit (460)), and / or at least one sensor (470) (e.g., sensor module (176) of FIG. 1) (hereinafter referred to as sensor (470)). These components are merely exemplary. For example, the electronic device (400) may include other components (e.g., power management integrated circuitry (PMIC), audio processing circuit, antenna, rechargeable battery, or input / output interface). For example, some components may be omitted from the electronic device (400). For example, some components may be integrated into a single component.
[0102] The processor (410) can perform application layer processing functions required by the user of the electronic device (400). According to one embodiment, the processor (410) can provide control and commands for functions for various blocks of the electronic device (400). According to one embodiment, the processor (410) can perform operations or data processing regarding the control and / or communication of each component of the electronic device (400). For example, the processor (410) may include at least some of the configuration and / or functions of the processor (120) of FIG. 1. According to one embodiment, the processor (410) may be operatively connected to the components of the electronic device (400). According to one embodiment, the processor (410) may load commands or data received from other components of the electronic device (400) into memory (420), process commands or data stored in memory (420), and store result data.
[0103] The processor (410) may be implemented as one or more IC (integrated circuit (or circuitry)) chips and may perform various data processing operations. The processor (410) may include at least one electrical circuit and may individually and / or collectively process instructions (or programs, data, etc.) stored in memory (420). The processor (410) may include a processor assembly comprising one or more processing circuitries and / or executable program elements.
[0104] The processor (410) may include any processing circuit operative to control the performance and operation of one or more components of the electronic device (400) (e.g., memory (420), display (440), image sensor (450), communication circuit (460), and / or sensor (470)). For example, the processor (410) may be an application processor (AP). For example, the processor (410) may be a system semiconductor responsible for the computation and multimedia driving functions of the electronic device (400). The processor (410) may be implemented as a system on chip (SoC) (e.g., a single chip or a chipset). For example, the processor (410) may be implemented as multiple cores (or at least one core circuit), multiple chips, or multiple chipsets. For example, the processor (410) may include one or more processing circuits. For example, the processor (410) may include one or more processing circuits configured to perform the various functions of the present disclosure individually and / or collectively. As an example without limitation, at least a portion of the processor (410) may be included in a first chip of the electronic device (400), and at least another portion of the processor (410) may be included in a second chip of the electronic device (400) that is different from the first chip of the electronic device (400).
[0105] For example, the processor (410) may include a central processing unit (CPU) (411), a graphics processing unit (GPU) (412), a neural processing unit (NPU) (413), an image signal processor (ISP) (414), a display controller (415), a memory controller (416), a storage controller (417), a communication processor (CP) (418), and / or a sensor interface (419). These components of the processor (410) are merely exemplary. For example, the processor (410) may include other components. For example, some components of the processor (410) may be omitted from the processor (410). For example, some components of the processor (410) may be included as separate components of the electronic device (400) outside of the processor (410). For example, some components of the processor (410) (e.g., memory controller (416)) may be included within other components (e.g., at least part of memory (420), an interface (e.g. available for connection to at least one component of the electronic device (400)), a display (440) and / or an image sensor (450)).
[0106] The processor (410) can cause other components of the electronic device (400) to perform various operations by executing instructions stored in memory (420).
[0107] The CPU (411) (or central processing circuit) may be configured to control the components of the processor (410) based on the execution of instructions stored in memory (420) (e.g., volatile memory (421) and / or non-volatile memory (422)). The CPU (411) may decode user commands and perform arithmetic and logical operations, and / or data processing operations. For example, the CPU (411) may be responsible for the functions of memory, interpretation, operation, and control. The CPU (411) may execute all software of the electronic device (400) (e.g., application (146) of FIG. 1) on an operating system (OS) and control hardware devices.
[0108] The CPU (411) can store instructions or data in the volatile memory (421) of the memory (420) (e.g., the volatile memory (132) of FIG. 1) as at least part of the data processing or operation, process the instructions or data stored in the volatile memory (421), and store the result data in the non-volatile memory (422) of the memory (420) (e.g., the non-volatile memory (134) of FIG. 1).
[0109] The CPU (411) may include a single processor core or multiple processor cores. The CPU (411) may be a programmable processor capable of storing executable instructions (e.g., instructions capable of performing operations on the CPU (411)) and executing the instructions.
[0110] The CPU (411) may operate in a multi-domain environment. For example, the CPU (411) may operate in a multi-domain environment that includes a domain of a normal world (e.g., a non-secure world, a framework, or a non-secure environment) and a domain of a secure world (e.g., a security framework or a security environment). In one embodiment, the domain of the secure world may include one or more domains (e.g., a trusted OS, a TrustZone, and / or a virtualization framework).
[0111] The GPU (412) (or graphics processing circuit) may be configured to perform parallel operations (e.g., rendering). The GPU (412) may be responsible for graphics processing. The GPU (412) may receive instructions from the CPU (411) and perform graphics processing to represent the shape, position, color, shade, movement, and / or texture of objects (or things) on the display (440).
[0112] The NPU (413) (or neural processing circuit, or AI (artificial intelligence) chip) can be configured to execute computations (e.g., convolution computations) for artificial intelligence models. The NPU (413) can perform processing optimized for deep-learning algorithms of artificial intelligence. The NPU (413) is a processor optimized for deep-learning algorithm computations (e.g., artificial intelligence computations) and can process big data quickly and efficiently like a human neural network. For example, the NPU (413) can be primarily used for artificial intelligence computations. The NPU (413) can perform processing such as automatically adjusting the focus by recognizing objects, environments, and / or people in the background when capturing video through a camera, automatically switching the camera's shooting mode to food mode when taking food photos, and / or removing only unnecessary subjects from the captured results. The NPU (413) can perform processing to generate response content based on given information (e.g., natural language).
[0113] The ISP (414) (or image signal processing circuit) may be configured to process a raw image acquired through an image sensor (450) into a format suitable for components within the electronic device (400) or components of the processor (410). For example, the ISP (414) may be responsible for image processing and correction of images and videos. The ISP (414) may correct unprocessed data (e.g., raw data) transmitted from the image sensor (450) of a camera (e.g., camera module (180) of FIG. 1) to generate an image in a form more preferred by the user. The ISP (414) may perform post-processing such as adjusting the partial brightness of the image and emphasizing detailed parts. For example, the ISP (414) may independently undergo a process of image quality tuning and correction of the image acquired through the camera to generate a result preferred by the user.
[0114] The ISP (414) can support artificial intelligence-based image processing technology. The ISP (414) can support scene segmentation (e.g., image segmentation) technology that recognizes and / or classifies parts of the scene being captured in conjunction with the NPU (413). For example, the ISP (414) may include a function to process objects such as the sky, bushes, and / or skin by applying different parameters. The ISP (414) can detect and display a human face during video capture using artificial intelligence functions, or use the coordinates and information of the face to adjust the brightness, focus, and / or color of the image.
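The idea of applying different parameters to different segmented regions can be illustrated with a minimal sketch. It assumes a segmentation step has already assigned a label to each pixel. The function name `apply_class_gains`, the table `CLASS_GAIN`, and the gain values are hypothetical and are not taken from the disclosure; only the labels (sky, bushes, skin) come from the text above.

```python
# Hypothetical per-class tuning table; the labels come from the text,
# the numeric gains are illustrative assumptions only.
CLASS_GAIN = {"sky": 0.9, "bushes": 1.1, "skin": 1.05}


def apply_class_gains(pixels, labels, gains=CLASS_GAIN):
    """Scale each pixel intensity (0-255) by the gain of its segment label."""
    out = []
    for value, label in zip(pixels, labels):
        gain = gains.get(label, 1.0)  # unlabelled regions pass through unchanged
        out.append(min(255, round(value * gain)))  # clip to the 8-bit range
    return out
```

In an actual device this kind of per-region adjustment would run on dedicated image-processing hardware over full frames; the list-based form here only shows the lookup-and-apply structure.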
[0115] According to one embodiment, the electronic device (400) can support integrated machine learning processing by interacting with all processors such as a CPU (411), GPU (412), NPU (413), and ISP (414).
[0116] A display controller (415) (or display control circuit, or DPU (display processing unit)) may be configured to process an image obtained from a CPU (411), GPU (412), ISP (414), or memory (420) (e.g., volatile memory (421)) into a format suitable for display (440).
[0117] The memory controller (416) (or memory control circuit) may be configured to control reading data from the volatile memory (421) and writing data to the volatile memory (421).
[0118] The storage controller (417) (or storage control circuit) may be configured to control reading data from non-volatile memory (422) and writing data to non-volatile memory (422).
[0119] The CP (418) (or communication processing circuit) may be configured to process data obtained from a component of the processor (410) into a format suitable for transmitting to another electronic device via the communication circuit (460), or to process data obtained from another electronic device via the communication circuit (460) into a format suitable for processing by the component of the processor (410). For example, the communication circuit (460) may include one or more communication circuits.
[0120] The sensor interface (419) (or sensing data processing circuit, sensor hub) may be configured to process data regarding the state of the electronic device (400) and / or the state around the electronic device (400), obtained through the sensor (470), into a format suitable for the components of the processor (410).
[0121] According to one embodiment, the processor (410) may be operable in a normal mode (or normal world) and a secure mode (or secure world). According to one embodiment, the processor (410) may control (or process) overall operations related to supporting interpretation services based on processing circuits and / or executable program elements.
[0122] According to one embodiment, the detailed operation of the processor (410) (e.g., processor (120, 230) of FIG. 1 or FIG. 2) of the electronic device (400) (e.g., electronic device (101, 201) of FIG. 1 or FIG. 2) is described with reference to the drawings described below.
[0123] According to one embodiment, operations performed by the processor (410) may be implemented by executing instructions stored in a recording medium (or computer program product or storage medium). For example, the recording medium may include a non-transitory computer-readable recording medium that records a program for executing various operations performed by the processor (410).
[0124] The embodiments described in this disclosure may be implemented in a recording medium readable by a computer or similar device using software, hardware, or a combination thereof. According to a hardware implementation, the operations described in one embodiment may be implemented using at least one of application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and / or other electrical units for performing functions.
[0125] In one embodiment, a computer-readable recording medium (or computer program product) is provided that records a program to perform (or execute) various operations in an electronic device (400).
[0126] The above operations may include various operations related to supporting interpretation services.
[0127] The memory (420) includes at least some of the configuration and / or functions of the memory (130) of FIG. 1 and can store software (e.g., the program (140) of FIG. 1 and / or the application (146) of FIG. 1). The memory (420) may include one or more storage media (or one or more storage devices). For example, the memory (420) may include a memory assembly comprising one or more storage media. For example, the one or more storage media may include a hard drive, a permanent memory such as flash memory, read-only memory (ROM) (e.g., non-volatile memory (422)), a semi-permanent memory such as random access memory (RAM) (e.g., volatile memory (421)), any other suitable type of storage (or storage assembly), or any combination thereof.
[0128] The memory (420) may include a cache memory, which is one or more different types of memory used to temporarily store data for a function or feature of the electronic device (400). As a non-limiting example, the cache memory may be included within the processor (410).
[0129] The memory (420) can be fixedly embedded in the electronic device (400) or incorporated into one or more suitable types of components (e.g., a SIM (subscriber identity module) card and / or an SD (secure digital) card) that can be repeatedly inserted into and removed from the electronic device (400).
[0130] For example, memory (420) may store one or more software applications, such as operating system (OS) (or system) software applications, firmware software applications, driver software applications, plugin (e.g., add-in, add-on, and / or applet) software applications, and / or any other suitable software applications. For example, the one or more software applications may include instructions executable by the processor (410). For example, memory (420) may store instructions that can be called by an application programming interface (API). For example, memory (420) may store instructions within a library.
[0131] The memory (420) can store various data used by at least one component of the electronic device (400) (e.g., processor (410)). In one embodiment, the data may include software (e.g., program (140) of FIG. 1) (e.g., operating system (142), middleware (144), and / or application (146) of FIG. 1) and input data or output data for commands associated with the software.
[0132] The memory (420) may include a volatile memory (421) (e.g., the volatile memory (132) of FIG. 1) or a non-volatile memory (422) (e.g., the non-volatile memory (134) of FIG. 1). The memory (420) may store instructions or data received from the processor (410) in the volatile memory (421), and may store result data processed by the processor (410) from the instructions or data stored in the volatile memory (421) in the non-volatile memory (422).
[0133] In one embodiment, the data stored in the memory (420) may include content such as images and / or videos, mask information, alias information, and / or mapping information. In one embodiment, the data stored in the memory (420) may include various information related to interpretation services. The various information is not limited thereto and may include various information described with reference to the drawings described below.
[0134] In one embodiment, the data stored in the memory (420) may include various learning data and / or parameters obtained based on the user's learning through interaction with the user. In one embodiment, the data may include various schemas (or algorithms, models, networks, or functions) to support artificial intelligence-based operations.
[0135] In one embodiment, the fields in which artificial intelligence technology is applied may be diverse. For example, these fields may include linguistic understanding, visual understanding, reasoning / prediction, knowledge representation, and / or motion control. Linguistic understanding is a technology that recognizes and applies / processes human language / characters, and may include natural language processing, machine translation, dialogue systems, question answering, and / or speech recognition / synthesis. Visual understanding is a technology that recognizes and processes objects like human vision, and may include object recognition, object tracking, image search, person recognition, scene understanding, spatial understanding, and / or image enhancement. Reasoning / prediction is a technology that judges information to logically reason and predict, and may include knowledge / probability-based reasoning, optimization prediction, preference-based planning, and / or recommendation. Knowledge representation is a technology that automates the processing of human experience information into knowledge data, and may include knowledge construction (e.g., data generation / classification) and / or knowledge management (e.g., data utilization). Motion control is a technique for controlling the movement of an electronic device (400) and may include motion control and / or operation control (e.g., behavior control).
[0136] In one embodiment, a schema for supporting artificial intelligence-based operation in an electronic device (400) may include a neural network. In one embodiment, the neural network may include a neural network model based on at least one of an artificial neural network (ANN), a convolutional neural network (CNN), a region-based convolutional neural network (R-CNN), a region proposal network (RPN), a recurrent neural network (RNN), a stacking-based deep neural network (S-DNN), a state-space dynamic neural network (S-SDNN), a deconvolution network, a deep belief network (DBN), a restricted Boltzmann machine (RBM), a long short-term memory (LSTM) network, a classification network, a plain residual network, a dense network, a hierarchical pyramid network, and / or a fully convolutional network. According to one embodiment, the types of neural network models are not limited to the examples described above.
[0137] According to one embodiment, the memory (420) can store instructions that cause the electronic device (400) to perform an operation when executed individually and / or collectively by the processor (410).
[0138] According to one embodiment, the memory (420) can store instructions that cause the electronic device (400) to perform operations related to supporting interpretation services when executed individually and / or collectively by the processor (410).
[0139] Instructions can be stored as software (e.g., program (140) of FIG. 1) in memory (420) and can be executed by a processor (410). For example, instructions may include control commands such as arithmetic and logical operations, data movement, and / or input / output that can be recognized by the processor (410). According to one embodiment, the software may include various applications (e.g., application (146) of FIG. 1) that can provide various functions (or services) (e.g., camera (e.g., video recording) function, AI service function, conversational service function, routine function, call function, message function, messenger function, email function, social networking service (SNS) function, search function, content (or media) (e.g., image, video and / or music) playback function, game function, and / or wireless communication function) in the electronic device (400).
[0140] The display (440) may include a configuration identical or similar to the display module (160) of FIG. 1. The display (440) may display various images provided by the processor (410). Under the control of the processor (410), the display (440) may visually provide various screens related to an application being executed (e.g., the application (146) of FIG. 1) and its use (e.g., AR screen, contents screen, application execution screen, menu screen, and / or function execution screen).
[0141] According to one embodiment, the display (440) may have a screen size that changes according to the form factor of the electronic device (400) (e.g., a bar-type smartphone (491-1), a foldable-type smartphone (491-2), a slidable (or rollable)-type smartphone (491-3), a tablet (492)). For example, the display (440) may be configured to provide a first state having a first screen size and a second state having a second screen size larger than the first screen size.
[0142] According to one embodiment, the electronic device (400) may be a foldable electronic device (e.g., a foldable-type smartphone (491-2)), including a multi-foldable electronic device. For example, the electronic device (400) may be a foldable electronic device of various forms such as a G-type, Z-type, or e-type. According to one embodiment, if the electronic device (400) is a multi-foldable electronic device, it may include a first housing, a second housing, and a third housing. According to one embodiment, the electronic device (400) may display information differently depending on the state of the display (440) of the electronic device (400) (or the size of the displayed screen (or screen display area)) (e.g., flex state (or intermediate mode)), such as when the first housing is folded or when the first housing and the third housing are folded together.
[0143] The display (440) may be combined with a touch sensor, a pressure sensor capable of measuring the intensity of the touch, and / or a touch panel (e.g., a digitizer) that detects a magnetic field-based stylus pen. The display (440) may detect touch input, air gesture input, and / or hovering input (or proximity input) by measuring a change in a signal (e.g., voltage, light intensity, resistance, electromagnetic signal, and / or charge quantity) at a specific location on the display (440) based on the touch sensor, pressure sensor, and / or touch panel. For example, the display (440) may include a touchscreen that detects touch and / or proximity touch (or hovering) input using a part of the user's body (e.g., a finger) or an input device (e.g., a stylus pen).
[0144] The display (440) may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, an organic light-emitting diode (OLED) display, and / or an active matrix OLED (AMOLED) display, a micro electro mechanical systems (MEMS) display, or an electronic paper display. According to one embodiment, the display (440) may include a flexible display.
[0145] The communication circuit (460) can support the establishment of a designated wireless communication channel (e.g., short-range communication such as Bluetooth communication and / or BLE communication) and the performance of communication through the established wireless communication channel. For example, the communication circuit (460) can perform designated communication (e.g., Bluetooth communication and / or BLE communication) with an external device. The communication circuit (460) can support wireless communication with an external device using cellular wireless communication (e.g., 4G LTE, 5G NR) and / or short-range wireless communication (e.g., Wi-Fi).
[0146] For example, an electronic device (400) can communicate with an external electronic device (e.g., the wearable electronic device (501) of FIG. 5a and FIG. 5b) through a network using a communication circuit (460). According to one embodiment, the communication circuit (460) can receive data generated from the external electronic device from the external electronic device and can transmit data generated from the electronic device (400) to the external electronic device.
[0147] For example, an electronic device (400) can communicate with an external server (e.g., a generative artificial intelligence server, an intelligent server (300)) that provides artificial intelligence-based functions (e.g., a conversational service or an assistant service or an AI agent) through a network using a communication circuit (460). According to one embodiment, the communication circuit (460) can transmit data generated from the electronic device (400) to the external server and receive data transmitted from the external server. The communication circuit (460) may include at least some of the configuration and / or functions of the communication module (190) of FIG. 1.
[0148] In one embodiment, the electronic device (400) may implement an artificial intelligence-based function (e.g., an artificial intelligence-based function for interpretation services) in an AI engine (or module or model) (e.g., including a processing circuit) within the electronic device (400). For example, the AI engine may be operatively coupled with at least one processor of the electronic device (400) (e.g., processor (120) or processor (410)). For example, the AI engine may be operatively coupled with one or more sensors within the electronic device (400) (e.g., sensor module (176), sensor (470), or sensor interface (419)).
[0149] FIG. 5a is a drawing illustrating an example of a wearable electronic device according to one embodiment of the present disclosure.
[0150] FIG. 5b is a drawing illustrating an example of the internal structure of the wearable electronic device of FIG. 5a.
[0151] According to one embodiment, a wearable electronic device (501) (e.g., the electronic device (101, 201) of FIG. 1 or FIG. 2) may have the form of glasses (e.g., a glasses-type wearable electronic device) that is wearable on a part of a user's body (e.g., the head). For example, the housing of the wearable electronic device (501) may include a flexible material such as rubber and / or silicone that has a shape that adheres to a part of the user's head (e.g., a part of the face covering both eyes). For example, the housing of the wearable electronic device (501) may include a strap that can be twined around the user's head and / or temples that are attachable to the ears of the head.
[0152] According to one embodiment, FIGS. 5a and 5b illustrate an example in which the wearable electronic device (501) is in the form of glasses (e.g., a glasses-type display device or AR (augmented reality) glasses), but is not limited thereto. For example, the wearable electronic device (501) may include various types of devices that are worn (or attached) to a part of a user's body (e.g., face or head) to provide augmented reality (AR), mixed reality (MR), and / or virtual reality (VR) services. For example, the wearable electronic device (501) may be implemented in at least one of the forms of glasses, goggles, a helmet, or a hat, but is not limited thereto. The wearable electronic device (501) described below may be a device comprising at least some of the components included in the electronic device (101, 201, 400) as described above with reference to FIG. 1, FIG. 2 and / or FIG. 4. Even if not mentioned in the description below, the wearable electronic device (501) according to the present disclosure may be interpreted as including various components as described with reference to FIG. 1, FIG. 2 and / or FIG. 4.
[0153] According to one embodiment, a wearable electronic device (501) may be worn on a user's face to provide the user with images (e.g., real-world images, augmented reality images, mixed reality images, and / or virtual reality images). According to one embodiment, the wearable electronic device (501) may provide an AR service that superimposes virtual information (or virtual objects) onto at least a portion of real-world space (or world or environment). For example, the wearable electronic device (501) may provide virtual information to the user by superimposing it onto real-world space corresponding to the wearer's field of view (FoV).
[0154] Referring to FIGS. 5a and 5b, a wearable electronic device (501) (e.g., AR glasses) may include a display (550) and a frame (500). The frame (or housing) (500) may be configured to support the display (550) and accommodate a number of hardware components.
[0155] The wearable electronic device (501) can provide augmented reality (AR), virtual reality (VR), or mixed reality (MR) to a user wearing the wearable electronic device (501). For example, the wearable electronic device (501) can display a virtual reality image provided by an optical device (582, 584; see FIG. 5b) on a display (550) in response to a specified gesture of the user obtained through the motion recognition camera (560-2, 560-3) of FIG. 5b.
[0156] The display (550) may provide visual information to the user. For example, the display (550) may include a transparent or translucent lens. The display (550) may include a first display (550-1) and / or a second display (550-2). For example, the first display (550-1) may be placed on a glass for the left eye and the second display (550-2) may be placed on a glass for the right eye. The display (550) may provide the user with visual information transmitted from external light through the lens and other visual information distinct from said visual information. The lens may be formed based on at least one of a Fresnel lens, a pancake lens, or a multi-channel lens.
[0157] Referring to FIG. 5b, the display (550) may include a front surface (or first surface) (531) and a rear surface (or second surface) (532) opposite to the front surface (531). A display area may be formed on the rear surface (532) of the display (550). When a user wears the wearable electronic device (501), external light may be transmitted to the user by being incident on the front surface (531) and transmitted through the rear surface (532). As another example, the display (550) may display an augmented reality image combined with a virtual reality image provided by an optical device (582, 584) on a real image transmitted through external light, in the display area formed on the rear surface (532).
[0158] The display (550) may include a waveguide (533, 534) that diffracts light emitted from an optical device (582, 584) and transmits it to the user. The waveguide (533, 534) may be formed based on at least one of glass, plastic, or polymer. A nano pattern may be formed on the exterior or at least a portion of the interior of the waveguide (533, 534). The nano pattern may be formed based on a polygonal and / or curved grating structure. Light incident on one end of the waveguide (533, 534) may be propagated to the other end of the waveguide (533, 534) by the nano pattern. The waveguide (533, 534) may include a diffractive element (e.g., DOE (diffractive optical element), HOE (holographic optical element)) and / or a reflective element (e.g., a reflective mirror). For example, waveguides (533, 534) may be placed within a wearable electronic device (501) to guide a screen displayed by a display (550) to the user's eye. For example, the screen may be transmitted to the user's eye based on total internal reflection (TIR) occurring within the waveguides (533, 534).
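The total internal reflection condition mentioned above follows from Snell's law: light stays inside the waveguide when its internal angle of incidence exceeds the critical angle arcsin(n_outside / n_waveguide). The following minimal sketch computes that angle. The function names and the example refractive index of 1.5 are illustrative assumptions; the disclosure does not specify the waveguide material's index.

```python
import math


def critical_angle_deg(n_waveguide: float, n_outside: float = 1.0) -> float:
    """Smallest incidence angle (measured from the surface normal) at which
    light is totally internally reflected, derived from Snell's law."""
    if n_waveguide <= n_outside:
        raise ValueError("TIR requires the waveguide index to exceed the outside index")
    return math.degrees(math.asin(n_outside / n_waveguide))


def is_guided(incidence_deg: float, n_waveguide: float, n_outside: float = 1.0) -> bool:
    """True if a ray at this internal incidence angle stays inside the waveguide."""
    return incidence_deg > critical_angle_deg(n_waveguide, n_outside)


# Example with an assumed index of 1.5 (a typical value for glass) in air.
print(round(critical_angle_deg(1.5), 2))
```

Under these assumptions, a diffractive or reflective in-coupling element would need to steer the light to an internal angle steeper than this critical angle for it to propagate to the other end of the waveguide.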
[0159] A wearable electronic device (501) may be configured to identify and / or analyze subjects (or objects) in real-world images collected through a shooting camera (560-4). The wearable electronic device (501) may combine a virtual subject with a real subject among the subjects that is the target of augmented reality provision and display it on a display (550). The virtual subject may include at least one of text and / or images regarding information related to the real subject. The wearable electronic device (501) may analyze and / or identify real subjects based on a multi-camera such as a stereo camera. For analysis and / or identification, the wearable electronic device (501) may perform spatial recognition (e.g., SLAM, simultaneous localization and mapping) using a multi-camera and / or time-of-flight (ToF).
[0160] The frame (500) may be formed as a physical structure that allows the wearable electronic device (501) to be worn on the user's body. The frame (500) may be configured such that when the user wears the wearable electronic device (501), the first display (550-1) is positioned in front of the user's left eye and the second display (550-2) is positioned in front of the user's right eye.
[0161] Referring to FIG. 5a, the frame (500) may include a region (520) in which at least a portion of the frame (500) comes into contact with a part of the user's body when the user wears the wearable electronic device (501). For example, the region (520) of the frame (500) in contact with a part of the user's body may include a region in contact with a part of the user's nose, a part of the user's ear, and a part of the side of the user's face that the wearable electronic device (501) comes into contact with. According to one embodiment, the frame (500) may include a nose pad (510) that comes into contact with a part of the user's body. For example, when the wearable electronic device (501) is worn by the user, the nose pad (510) may come into contact with the nose. The frame (500) may include a first temple (504) and a second temple (505) that come into contact with other parts of the body.
[0162] The frame (500) may include a first rim (501) covering at least a portion of a first display (550-1), a second rim (502) covering at least a portion of a second display (550-2), a bridge (503) positioned between the first rim (501) and the second rim (502), a first pad (511) positioned along a portion of the edge of the first rim (501) from one end of the bridge (503), a second pad (512) positioned along a portion of the edge of the second rim (502) from the other end of the bridge (503), a first temple (504) extending from the first rim (501) and fixed to a portion of the wearer's ear, and a second temple (505) extending from the second rim (502) and fixed to a portion of the ear opposite to the ear.
[0163] The first pad (511) and the second pad (512) may come into contact with a part of the nose. The first temple (504) and the second temple (505) may come into contact with a part of the face and a part of the ear. The temples (504, 505) may be rotatably connected to the rims (501, 502) through the hinge units (506, 507) of FIG. 5b. The first temple (504) may be rotatably connected to the first rim (501) through the first hinge unit (506) positioned between the first rim (501) and the first temple (504). The second temple (505) may be rotatably connected to the second rim (502) through the second hinge unit (507) positioned between the second rim (502) and the second temple (505).
[0164] A wearable electronic device (501) can identify an external object (e.g., a user's fingertip) touching the frame (500) and / or a gesture performed by said external object by using a touch sensor, a grip sensor, and / or a proximity sensor formed on at least a portion of the surface of the frame (500).
[0165] The wearable electronic device (501) may include a number of hardware components to perform various functions. For example, the wearable electronic device (501) may include a battery module (570), an antenna module (575), an optical device (582, 584), a speaker (555-1, 555-2), a microphone (565-1, 565-2, 565-3), a light-emitting module (not shown), and / or a printed circuit board (PCB) (590).
[0166] The wearable electronic device (501) may include a microphone (565-1, 565-2, 565-3) configured to convert sound into an electrical signal (e.g., an audio signal) and output it.
[0167] Referring to FIG. 5b, the first microphone (565-1) may be placed on the bridge (503). The second microphone (565-2) may be placed on the second rim (502). The third microphone (565-3) may be placed on the first rim (501). The number and / or placement of the microphones are not limited to the embodiment of FIG. 5b. When the wearable electronic device (501) includes two or more microphones placed at different locations, the wearable electronic device (501) may be configured to identify the direction from which sound is coming (or the location of the sound source) using the plurality of microphones.
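One common way to identify the direction of a sound with two spaced microphones is to estimate the time difference of arrival (TDOA) and convert it to an angle. The sketch below illustrates that general technique only; the disclosure does not state which method is used. The function names, the speed-of-sound constant, and any spacing or sample-rate values passed in are assumptions for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def estimate_delay_samples(sig_a, sig_b, max_lag):
    """Return the lag (in samples) by which sig_b trails sig_a, chosen as the
    lag in [-max_lag, max_lag] that maximizes their cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(sig_a)):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle_deg(delay_samples, sample_rate_hz, mic_spacing_m):
    """Far-field angle from broadside for a two-microphone pair."""
    delay_s = delay_samples / sample_rate_hz
    # Clamp to the physically valid range before taking arcsin.
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * delay_s / mic_spacing_m))
    return math.degrees(math.asin(ratio))
```

A zero delay corresponds to a source directly in front of the pair, and a third microphone at a different location (as in FIG. 5b) would allow a second, independent angle estimate to help resolve the source location.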
[0168] Optical devices (582, 584) can project a virtual object onto a display (550). For example, optical devices (582, 584) may include a projector. Optical devices (582, 584) may be positioned adjacent to the display (550) or may be components of the display (550). According to one embodiment, a wearable electronic device (501) may include a first optical device (582) corresponding to a first display (550-1) and a second optical device (584) corresponding to a second display (550-2). For example, the first optical device (582) may be positioned at the edge of the first display (550-1). The second optical device (584) may be positioned at the edge of the second display (550-2). The first optical device (582) can transmit light to the first waveguide (533) placed on the first display (550-1). The second optical device (584) can transmit light to the second waveguide (534) placed on the second display (550-2).
[0169] The camera (560) may include a shooting camera (560-4), an eye tracking camera (ET CAM) (560-1), and / or a motion recognition camera (560-2, 560-3). The shooting camera (560-4), the eye tracking camera (560-1), and the motion recognition camera (560-2, 560-3) may be placed at different locations in the frame (500) and may perform different functions.
[0170] The eye-tracking camera (560-1) can output data indicating the position of the eyes or the gaze of a user wearing the wearable electronic device (501). For example, the wearable electronic device (501) can detect the pupils and track the gaze (e.g., the subject the pupils are directed toward) from images acquired through the eye-tracking camera (560-1). Based on the gaze tracked through the eye-tracking camera (560-1), the wearable electronic device (501) can identify the subject of interest (e.g., a real subject and / or a virtual subject) (or region of interest) that the user is looking at (e.g., the subject on which the user is focused). The wearable electronic device (501) can perform a function (e.g., gaze interaction) for interacting with the subject of interest (or region of interest). Based on the gaze tracked through the eye-tracking camera (560-1), the wearable electronic device (501) can also animate the eye movement of an avatar representing the user in a virtual space.
[0171] A wearable electronic device (501) can render an image (or screen) to be displayed on a display (550) based on the position of the user's eyes. For example, the visual quality of a first area related to the gaze within the image and the visual quality of a second area distinguished from the first area (e.g., resolution, brightness, saturation, grayscale, PPI (pixels per inch)) may differ from each other. The wearable electronic device (501) can obtain an image having the visual quality of the first area and the visual quality of the second area by using foveated rendering. For example, if the wearable electronic device (501) supports an iris recognition function, user authentication can be performed based on iris information obtained using an eye-tracking camera (560-1).
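The gaze-dependent difference in visual quality described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the function names, the normalized screen coordinates, the radii, and the linear falloff are all hypothetical and are not taken from the disclosure.

```python
import math


def foveation_scale(tile_center, gaze_point,
                    inner_radius=0.15, outer_radius=0.45, min_scale=0.25):
    """Return a resolution scale in [min_scale, 1.0] for one screen tile.

    Coordinates are normalized to [0, 1]. Tiles within inner_radius of the
    gaze point keep full quality (first area); tiles beyond outer_radius get
    min_scale (second area); quality falls off linearly in between.
    All thresholds are hypothetical values chosen for illustration.
    """
    dist = math.dist(tile_center, gaze_point)
    if dist <= inner_radius:
        return 1.0
    if dist >= outer_radius:
        return min_scale
    t = (dist - inner_radius) / (outer_radius - inner_radius)
    return 1.0 - t * (1.0 - min_scale)


def quality_map(gaze_point, cols=4, rows=4):
    """Build a rows x cols grid of resolution scales around the gaze point."""
    return [
        [foveation_scale(((c + 0.5) / cols, (r + 0.5) / rows), gaze_point)
         for c in range(cols)]
        for r in range(rows)
    ]
```

A renderer could use such a grid to pick a lower shading rate or resolution for tiles far from the gaze point, while the tile being looked at keeps full quality.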
[0172] FIG. 5b shows an example in which the eye tracking camera (560-1) is positioned toward the user's right eye, but the embodiment is not limited thereto, and the eye tracking camera (560-1) may be positioned solely toward the user's left eye or toward both eyes.
[0173] The eye tracking camera (560-1) can achieve more realistic augmented reality by tracking the gaze of a user wearing a wearable electronic device (501), thereby matching the user's gaze with the visual information provided to the display (550). For example, when the user looks straight ahead, the wearable electronic device (501) can naturally display environmental information (e.g., the real world) related to the user's front at the location where the user is situated on the display (550). The eye tracking camera (560-1) may be configured to capture an image of the user's pupil to determine the user's gaze. For example, the eye tracking camera (560-1) may receive a gaze detection light reflected from the user's pupil and track the user's gaze based on the position and movement of the received gaze detection light. In one embodiment, the eye tracking camera (560-1) may be positioned at locations corresponding to the user's left and right eyes. For example, the eye-tracking camera (560-1) may be positioned within the first rim (501) and / or the second rim (502) to face the direction in which the user wearing the wearable electronic device (501) is located.
[0174] A motion recognition camera (560-2, 560-3) may be used to recognize movements or gestures of the user's entire body or parts of the user's body, such as the user's torso, hands, or face. For example, a wearable electronic device (501) may recognize a gesture of a specific part of the body (e.g., fingertips) in an image acquired through the motion recognition camera (560-2, 560-3) and perform a designated function based on the recognition of the gesture. For example, the wearable electronic device (501) may display an indicator corresponding to the gesture on a display (550). The motion recognition camera (560-2, 560-3) may be used to perform spatial recognition functions using SLAM and / or depth maps for a 6-degrees-of-freedom pose (6 dof pose). The wearable electronic device (501) can perform gesture recognition and / or object tracking functions using motion recognition cameras (560-2, 560-3). In one embodiment, the motion recognition cameras (560-2, 560-3) may be placed on the first rim (501) and / or the second rim (502).
[0175] The camera (560-4) can capture a real image or background to be matched with a virtual subject (e.g., image and / or text) to implement augmented reality or mixed reality content. The camera (560-4) can be used to acquire high-resolution images based on HR (high resolution) or PV (photo video). The camera (560-4) can capture a specific object located at the position where the user is looking and provide the captured image of the specific object to the display (550). The display (550) can display a single image in which information regarding a real image or background including the image of the specific object acquired using the camera (560-4) and a virtual image provided through the optical device (582, 584) are superimposed.
[0176] The wearable electronic device (501) can compensate (or correct) depth information (e.g., the distance between the wearable electronic device (501) and an external object obtained through a depth sensor) using an image obtained through a shooting camera (560-4). The wearable electronic device (501) can perform an action of identifying an object (or subject) in an image obtained using a shooting camera (560-4). The wearable electronic device (501) can perform a function of focusing on a specific subject (e.g., auto focus) and / or an optical image stabilization (OIS) function (e.g., anti-shake function). The wearable electronic device (501) can perform a pass-through function to display an image obtained through a shooting camera (560-4) superimposed on at least a part of the screen while displaying a screen representing a virtual space on a display (550). In one embodiment, the shooting camera (560-4) may be placed on a bridge (503) located between the first rim (501) and the second rim (502).
[0177] The camera (560) included in the wearable electronic device (501) is not limited to the eye-tracking camera (560-1), motion recognition camera (560-2, 560-3), and shooting camera (560-4) as described above. For example, the wearable electronic device (501) can identify external objects included in the field of view by using a camera positioned toward the user's field of view (FoV). The identification of external objects by the wearable electronic device (501) can be performed based on a sensor for identifying the distance between the wearable electronic device (501) and the external object, such as a depth sensor and / or a time of flight (ToF) sensor. The camera positioned toward the FoV may support an autofocus function and / or optical image stabilization (OIS) function. For example, the wearable electronic device (501) may include a camera (e.g., a face tracking camera) positioned toward the face of a user to acquire an image including the face of a user wearing the wearable electronic device (501).
[0178] The wearable electronic device (501) may further include a light source (e.g., LED) that emits light toward a subject (e.g., user's eyes, face, and / or external objects within the FoV) being photographed using a camera (560). This light source may include an LED of infrared wavelength. The light source may be placed in at least one of the frame (500) and the hinge unit (506, 507).
[0179] The battery module (570) can supply power to the electronic components of the wearable electronic device (501). The battery module (570) may be placed within the first temple (504) and / or the second temple (505). For example, the battery module (570) may include a first battery (571) placed within the first temple (504) and a second battery (572) placed in the second temple (505).
[0180] The antenna module (575) can transmit a signal or power to the outside of the wearable electronic device (501) or receive a signal or power from the outside. In one embodiment, the antenna module (575) may be placed within the first temple (504) and / or the second temple (505). For example, the antenna module (575) may be placed close to one side of the first temple (504) and / or the second temple (505).
[0181] The speaker (555) can output an acoustic signal (e.g., an audio signal) to the outside of the wearable electronic device (501). In one embodiment, the speaker (555) may be placed within a first temple (504) and / or a second temple (505) to be positioned adjacent to the ear of a user wearing the wearable electronic device (501). For example, the speaker (555) may include a first speaker (555-1) positioned adjacent to the user's left ear by being placed within the first temple (504) and a second speaker (555-2) positioned adjacent to the user's right ear by being placed within the second temple (505).
[0182] According to one embodiment, the wearable electronic device (501) may include a light-emitting module (not shown). The light-emitting module (not shown) may include at least one light-emitting element. The light-emitting module may emit light of a color corresponding to a specific state or emit light with an action corresponding to a specific state in order to visually provide information regarding a specific state of the wearable electronic device (501) to the user. For example, if the wearable electronic device (501) requires charging, it may emit red light at a constant frequency. In one embodiment, the light-emitting module may be placed on the first rim (501) and / or the second rim (502).
[0183] According to one embodiment, the wearable electronic device (501) may include a printed circuit board (PCB) (590). The PCB (590) may be placed on a first temple (504) and / or a second temple (505). The wearable electronic device (501) may include an interposer placed between sub-PCBs on the PCB (590). A plurality of hardware may be placed on the PCB (590). The wearable electronic device (501) may include a flexible PCB (FPCB) for interconnecting a plurality of hardware.
[0184] According to one embodiment, the wearable electronic device (501) may include an inertial measurement unit (IMU) configured to include a sensor (e.g., acceleration sensor, gyroscope (or angular velocity sensor)) and a magnetometer for detecting the posture of the wearable electronic device (501) and / or the posture of a body part (e.g., head) of a user wearing the wearable electronic device (501).
[0185] The acceleration sensor may be configured to generate a first acceleration value corresponding to motion in the x-axis direction, a second acceleration value corresponding to motion in the y-axis direction, and a third acceleration value corresponding to motion in the z-axis direction. The gyroscope may be configured to generate a first angular velocity value corresponding to rotational motion with respect to the x-axis, a second angular velocity value corresponding to rotational motion with respect to the y-axis, and a third angular velocity value corresponding to rotational motion with respect to the z-axis. The wearable electronic device (501) may acquire (e.g., calculate) a tilt value (e.g., Euler angle (e.g., roll, pitch, yaw)) using data acquired through the sensor (e.g., data representing acceleration values of each axis and data representing the direction in which the Earth's magnetic force is directed). The wearable electronic device (501) may identify a user's motion or gesture performed to execute or stop a specific function of the wearable electronic device (501) based on data acquired through the IMU.
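The calculation of a tilt value (roll, pitch, yaw) from acceleration and magnetometer data can be sketched as follows. This is a minimal sketch, not the method of the disclosure: it assumes a device at rest (so the accelerometer measures only gravity), an x-forward / y-right / z-down axis convention, and a commonly used tilt-compensation formula. The function name and the axis convention are assumptions, and signs differ between conventions.

```python
import math


def tilt_from_imu(accel, mag):
    """Estimate (roll, pitch, yaw) in radians from one accelerometer sample
    and one magnetometer sample.

    accel: (ax, ay, az) acceleration along the x, y, z axes
    mag:   (mx, my, mz) magnetic field along the x, y, z axes
    Assumes the device is not accelerating, so accel points along gravity.
    """
    ax, ay, az = accel
    mx, my, mz = mag

    # Roll and pitch follow from the direction of gravity.
    roll = math.atan2(ay, az)
    pitch = math.atan2(-ax, math.hypot(ay, az))

    # Rotate the magnetic field vector back into the horizontal plane
    # (tilt compensation) before taking the heading.
    xh = (mx * math.cos(pitch)
          + my * math.sin(roll) * math.sin(pitch)
          + mz * math.cos(roll) * math.sin(pitch))
    yh = my * math.cos(roll) - mz * math.sin(roll)
    yaw = math.atan2(-yh, xh)

    return roll, pitch, yaw
```

In practice the gyroscope's angular velocity values would typically be fused with these estimates (for example with a complementary or Kalman filter) to reduce noise during motion; that fusion step is omitted here.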
[0186] FIG. 6 is a diagram illustrating a network environment between an electronic device and a wearable electronic device according to one embodiment of the present disclosure.
[0187] Referring to FIG. 6, a wearable electronic device (501) according to one embodiment may be connected to an electronic device (601) (e.g., the electronic device (400) of FIG. 4). The wearable electronic device (501) and the electronic device (601) may be connected wirelessly (e.g., paired). As a non-limiting example, the wearable electronic device (501) may be connected to the electronic device (601) via short-range wireless communication (600) such as Bluetooth, Bluetooth Low Energy (BLE), Wi-Fi, Wi-Fi Direct, or ultra-wide band (UWB). According to one embodiment, the electronic device (601) may include a portable device such as a smartphone, a tablet PC (personal computer), and / or a notebook. According to one embodiment, the wearable electronic device (501) may include AR glasses, smart glasses, or a head-mounted display (HMD).
[0188] A wearable electronic device (501) may directly generate relevant data (e.g., interpretation data or interpretation information) for interpretation services or acquire it from an external device (e.g., electronic device (601) or an external server (e.g., generative artificial intelligence server, intelligent server (300))) and provide it to a user through a display and / or speaker. For example, the wearable electronic device (501) may display virtual information (or digital content) (e.g., interpretation information) processed by the wearable electronic device (501) together with at least one target object in the real world (e.g., an object of interpretation) through a display. For example, the wearable electronic device (501) may receive virtual information processed by the electronic device (601) from the electronic device (601) and display the received virtual information together with at least one target object in the real world (e.g., an object of interpretation) through a display.
[0189] According to one embodiment, when a wearable electronic device (501) is connected to an electronic device (601) for communication, it may periodically transmit video information captured through a camera of the wearable electronic device (501), voice information input through a microphone (e.g., voice signal or audio signal), and user's gaze information (e.g., FoV) to the electronic device (601), and / or transmit to the electronic device (601) when a change in the state of the wearable electronic device (501) (e.g., change in position or direction) occurs. According to one embodiment, when a wearable electronic device (501) is connected to the electronic device (601), it may provide (e.g., transmit) various information such as video information, voice information, gaze information, device information, sensing information, function information, and / or location information to the electronic device (601).
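The two transmission triggers described above (periodic transmission, and transmission when the position or direction of the wearable electronic device changes) can be sketched as a small scheduling helper. This is a hypothetical illustration: the class names, the period, and the change thresholds are assumptions and do not appear in the disclosure.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceState:
    position: tuple        # (x, y, z) in metres
    direction_deg: float   # heading in degrees


class UplinkScheduler:
    """Decide when the wearable device should send data to the paired device."""

    def __init__(self, period_s=0.5, move_threshold_m=0.05, turn_threshold_deg=5.0):
        self.period_s = period_s
        self.move_threshold_m = move_threshold_m
        self.turn_threshold_deg = turn_threshold_deg
        self._last_sent_at = None
        self._last_state = None

    def _state_changed(self, state):
        moved = math.dist(state.position, self._last_state.position) >= self.move_threshold_m
        # Smallest angular difference, handling wrap-around at 360 degrees.
        delta = abs((state.direction_deg - self._last_state.direction_deg + 180.0) % 360.0 - 180.0)
        return moved or delta >= self.turn_threshold_deg

    def should_send(self, now_s, state):
        """Return the reason for sending now, or None if nothing should be sent."""
        if self._last_sent_at is None:
            reason = "initial"
        elif self._state_changed(state):
            reason = "state_change"
        elif now_s - self._last_sent_at >= self.period_s:
            reason = "periodic"
        else:
            reason = None
        if reason is not None:
            self._last_sent_at = now_s
            self._last_state = state
        return reason
```

The caller would attach the video, voice, and gaze information to the outgoing message whenever `should_send` returns a reason.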
[0190] According to one embodiment, the electronic device (601) and / or the wearable electronic device (501) may detect (or recognize) a defined interaction (e.g., a voice command or gesture input defined to execute an interpretation service) associated with triggering an operation of the present disclosure (e.g., execution of an interpretation service), and in response to the detection of the defined interaction, determine to execute an interpretation service. According to one embodiment, the electronic device (601) and / or the wearable electronic device (501) may acquire video information and voice information through a defined camera and a defined microphone in response to the execution of an interpretation service.
[0191] According to one embodiment, if the wearable electronic device (501) has the ability to independently process interpretation services, the wearable electronic device (501) can generate interpretation data using acquired image information and voice information, and provide the generated interpretation data to a user by outputting it through a display and / or speaker.
[0192] According to one embodiment, when a wearable electronic device (501) is connected to an electronic device (601) and the electronic device (601) has the capability to process interpretation services, the wearable electronic device (501) can transmit acquired image information and voice information to the electronic device (601). The electronic device (601) can generate interpretation data using the image information and voice information received from the wearable electronic device (501). The electronic device (601) can transmit the interpretation data to the wearable electronic device (501). The wearable electronic device (501) can receive interpretation data from the electronic device (601) and provide the received interpretation data to a user by outputting it through a display and / or speaker. The provision of interpretation services in the wearable electronic device (501) and / or electronic device (601) according to the present disclosure is described in detail with reference to the drawings described below.
[0193] According to one embodiment, as a non-limiting example, the role of the electronic device (601) in the example of FIG. 6 may be performed by an external server (e.g., an intelligent server (300) or a generative artificial intelligence server). For example, the wearable electronic device (501) may be connected (or linked) to an external server to provide interpretation services.
[0194] FIG. 7 is a drawing for illustrating an example of operation between an electronic device and a wearable electronic device according to one embodiment of the present disclosure.
[0195] In one embodiment, FIG. 7 may show a state in which a user wears a wearable electronic device (501) and looks at the real world transmitted through the display (or glass) of the wearable electronic device (501) within the user's field of view (FoV) (701). For example, FIG. 7 may show a state in which a user looks at an object (700) in the real space (or environment) (700) while wearing the wearable electronic device (501).
[0196] FIG. 8 is a drawing illustrating an example of an operation in which an interpretation service is provided according to one embodiment of the present disclosure.
[0197] According to an embodiment of the present disclosure, an interpretation service engine (or model or module) (800) for supporting an interpretation service may be included. For example, an interpretation service according to an embodiment of the present disclosure may be provided by a single interpretation service engine (800) (e.g., an audio-visual multimodal speaker point speech translation model). According to one embodiment, the interpretation service engine (800) may perform the functions of selecting a speaker based on a defined interaction by a user (e.g., voice input according to a speaker description and / or gesture input specifying (or pointing to) the speaker), identifying a voice signal associated with the selected speaker to interpret (and / or translate), and providing interpretation data (e.g., text data and / or audio data) according to the interpretation.
[0198] According to one embodiment, the interpretation service engine (800) may be implemented by a wearable electronic device such as AR glasses worn by a user (e.g., the wearable electronic device (501) of FIG. 5a through 7) (hereinafter referred to as the wearable electronic device (501)), an electronic device such as a smartphone (e.g., the electronic device (400) of FIG. 4 or the electronic device (601) of FIG. 6) (hereinafter referred to as the electronic device (601)) and / or an external server (e.g., the intelligent server (300) of FIG. 3 or a generative artificial intelligence server) (hereinafter referred to as the intelligent server (300)).
[0199] According to one embodiment, depending on the form in which the interpretation service engine (800) is implemented, the interpretation service may be provided based on a first service form, a second service form, a third service form, or a fourth service form.
[0200] For example, the first service form may be one in which an interpretation service engine (800) is included in a wearable electronic device (501) or an electronic device (601), and the wearable electronic device (501) or the electronic device (601) alone provides interpretation services. For example, the wearable electronic device (501) or the electronic device (601) may acquire input data (e.g., video data and voice data) through a camera and a microphone, perform interpretation on the input data through the interpretation service engine (800), and operate to provide the interpretation data to the user through a display and / or speaker.
[0201] For example, the second service form may be one in which an interpretation service engine (800) is included in the electronic device (601) and an interpretation service is provided through interoperability between the wearable electronic device (501) and the electronic device (601). For example, the wearable electronic device (501) may operate to acquire input data (e.g., video data and voice data) through a camera and a microphone and to transmit the acquired input data to the electronic device (601). For example, the electronic device (601) may operate to perform interpretation on the input data received from the wearable electronic device (501) through the interpretation service engine (800) and to transmit the interpretation data to the wearable electronic device (501). For example, the wearable electronic device (501) may operate to provide the interpretation data received from the electronic device (601) to the user through a display and / or speaker.
[0202] For example, the third service form may be one in which an interpretation service engine (800) is included in an intelligent server (300), and an interpretation service is provided through interoperability between a wearable electronic device (501) and an intelligent server (300). For example, the wearable electronic device (501) may operate to acquire input data (e.g., video data and voice data) through a camera and a microphone, and to transmit the acquired input data to the intelligent server (300). For example, the intelligent server (300) may operate to perform interpretation on the input data received from the wearable electronic device (501) through the interpretation service engine (800) and to transmit the interpretation data to the wearable electronic device (501). For example, the wearable electronic device (501) may operate to provide interpretation data received from the intelligent server (300) to the user through a display and / or speaker.
[0203] For example, the fourth service form may be one in which an interpretation service engine (800) is included in an intelligent server (300), and interpretation services are provided through interoperability between a wearable electronic device (501), an electronic device (601), and an intelligent server (300). For example, the wearable electronic device (501) may acquire input data (e.g., video data and voice data) through a camera and a microphone, and may operate to provide the acquired input data to the intelligent server (300) through the electronic device (601). For example, the electronic device (601) may receive input data from the wearable electronic device (501) and may operate to transmit the input data to the intelligent server (300). For example, the intelligent server (300) may perform interpretation on the input data received from the electronic device (601) through the interpretation service engine (800) and may operate to transmit the interpretation data to the electronic device (601). For example, the electronic device (601) may operate to receive interpretation data from the intelligent server (300) and transmit the interpretation data to the wearable electronic device (501). For example, the wearable electronic device (501) may operate to provide the interpretation data received from the electronic device (601) to the user through a display and / or speaker.
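The four service forms differ only in where the interpretation service engine (800) runs and which devices the data passes through. The following minimal sketch records that routing as a lookup table. The enum, the node labels, and the function names are hypothetical, and for the first service form only the variant in which the wearable electronic device works alone is shown.

```python
from enum import Enum


class ServiceForm(Enum):
    FIRST = 1    # engine on the wearable device itself (standalone)
    SECOND = 2   # engine on the paired electronic device
    THIRD = 3    # engine on the server, wearable device talks to it directly
    FOURTH = 4   # engine on the server, reached through the electronic device


# Round trip of the data: input travels to the engine host, and the
# interpretation data travels back to the wearable device for output.
ROUTES = {
    ServiceForm.FIRST: ["wearable"],
    ServiceForm.SECOND: ["wearable", "electronic_device", "wearable"],
    ServiceForm.THIRD: ["wearable", "server", "wearable"],
    ServiceForm.FOURTH: ["wearable", "electronic_device", "server",
                         "electronic_device", "wearable"],
}


def engine_host(form):
    """Return the node that runs the interpretation service engine."""
    route = ROUTES[form]
    return route[len(route) // 2]  # the midpoint of the round trip


def relay_nodes(form):
    """Return the nodes that only forward data without running the engine."""
    host = engine_host(form)
    return sorted({n for n in ROUTES[form] if n not in (host, "wearable")})
```

For example, `engine_host(ServiceForm.FOURTH)` is `"server"` and `relay_nodes(ServiceForm.FOURTH)` is `["electronic_device"]`, matching the relay role of the electronic device (601) in the fourth service form.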
[0204] According to one embodiment, FIG. 8 may illustrate an example in which an interpretation service engine (800) operates in all service forms of an interpretation service. For example, FIG. 8 may illustrate an example of an operation in which the interpretation service engine (800) generates interpretation data based on input from a wearable electronic device (501) and provides the interpretation data to a user through the wearable electronic device (501). For example, the speaker selection function and the interpretation function may be performed by a single interpretation service engine (800) (e.g., an audio-visual multimodal speaker point speech translation model) regardless of the service implementation form. In one embodiment, the interpretation service engine (800) may be referred to as a learning model.
[0205] As illustrated in FIG. 8, the interpretation service engine (800) may include a speaker selection engine (810) (e.g., speaker selection module), a reverse diffusion engine (820) (e.g., reverse diffusion module), a speech-to-speech (STS) engine (830) (e.g., STS translation module), a language generation engine (840) (e.g., fusion generative language module), and a mapping engine (850) (e.g., mapping module).
[0206] In one embodiment, the speaker selection engine (810) can identify a speaker (e.g., a target subject) by extracting a feature vector of voice data (e.g., a vector represented as noisy speech) and / or a feature vector of image data (e.g., an image vector pointed to by a user gesture (e.g., a hand gesture)). According to one embodiment, the speaker selection engine (810) can receive voice data and image data as input and perform visual-to-speech or speech-to-visual fusion. For example, the speaker selection engine (810) can select a speaker by fusing (or interacting) voice data and image data.
[0207] According to one embodiment, the speaker selection engine (810) can identify a speaker to be interpreted (e.g., a target subject) from two inputs: voice data (e.g., a voice command describing a speaker) and video data. According to one embodiment, the speaker selection engine (810) receives the input data, recognizes from the voice data the text that describes the target subject (or object) (e.g., a descriptive text), and recognizes an object (e.g., the target subject) corresponding to the voice data based on image analysis of the video data (e.g., scene analysis).
[0208] According to one embodiment, image information selected by the speaker selection engine (810) may be segmented in the wearable electronic device (501) or the selection result for the speaker may be transmitted to the STS engine (830). According to one embodiment, the language generation engine (840) may receive the segmented image or perform segmentation based on the selected speaker information to generate a translation language together with the received voice data (or voice information).
[0209] In one embodiment, the reverse diffusion engine (820) may operate in a first stage and a second stage. According to one embodiment, the reverse diffusion engine (820) may estimate (or extract) speech data (e.g., a speech signal (or audio signal) corresponding to at least a speaker in the real world) in the first stage. For example, the reverse diffusion engine (820) may estimate the speech of a target subject (e.g., a target speaker) using visual semantics (e.g., a recognized target subject part) extracted from image data. According to one embodiment, the reverse diffusion engine (820) may restore (e.g., noise removal and speech enhancement (or correction)) the estimated speech (e.g., a sound source) tailored to the target subject of the image data in the second stage using a diffusion-based model (e.g., a generative model).
[0210] In one embodiment, the STS engine (830) may represent a speech recognition translation (or interpretation) model. According to one embodiment, when the STS engine (830) receives a voice (e.g., source speech) restored from the reverse diffusion engine (820), it may generate interpretation data (e.g., target speech, target text) corresponding to another language (e.g., the language used by the target subject). According to one embodiment, the STS engine (830) may process automatic speech recognition (ASR) functions, text-to-speech (TTS) functions, and interpretation / translation functions in a single model. For example, the STS engine (830) may extract features of speech recognition or transcribe them into text, and combine them by predicting (or identifying or estimating) the language of the target subject (e.g., target speaker) through language identification (LID). According to one embodiment, the STS engine (830) can translate the first language of the target subject (e.g., a foreign language) into the user's second language (e.g., a native language) based on the predicted language and convert it into related interpretation data (e.g., text data, audio data).
[0211] In one embodiment, the language generation engine (840) can generate an alias by fusing feature information based on image data and feature information described by a user based on voice data. According to one embodiment, the language generation engine (840) can generate a text-based alias for a target subject (e.g., target speaker) using information described by a user in voice. According to one embodiment, the alias for the target subject may be directly specified by the user (e.g., specified by a voice command).
[0212] In one embodiment, the mapping engine (850) can perform mapping regarding aliases generated through the language generation engine (840). According to one embodiment, the mapping engine (850) can find a target subject (e.g., target speaker) described by a user in speech and perform masking. According to one embodiment, the mapping engine (850) can map the mask and the alias and store them in a designated memory. According to one embodiment, by mapping the mask and the alias, if the user subsequently speaks the corresponding alias for the same subject, the system can operate to identify the target subject using the mask information and immediately perform interpretation / translation.
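The mapping between a mask and an alias can be pictured as a small lookup structure. The sketch below is hypothetical: the class name, the use of a bounding box as the stored mask, and the whole-word matching of the alias in a later utterance are all assumptions made for illustration, not details of the mapping engine (850).

```python
import re


class AliasRegistry:
    """Store alias -> mask pairs and resolve an alias spoken later."""

    def __init__(self):
        self._entries = {}

    def register(self, alias, mask):
        """Map an alias (e.g., 'Speaker A') to the mask of the target subject."""
        self._entries[alias.strip().lower()] = mask

    def resolve(self, utterance):
        """Return (alias, mask) for the first stored alias found in the
        utterance as a whole phrase, or None if no alias is mentioned."""
        text = utterance.lower()
        for alias, mask in self._entries.items():
            if re.search(r"\b" + re.escape(alias) + r"\b", text):
                return alias, mask
        return None
```

Once an alias is registered, a later command that mentions it can be resolved to the stored mask without describing the subject again:

```python
registry = AliasRegistry()
registry.register("Speaker A", (120, 40, 80, 200))  # hypothetical bounding box
registry.resolve("Interpret Speaker A, please")
```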
[0213] Referring to FIG. 8, the user may specify at least one speaker based on at least a voice command (e.g., a natural language-based voice command related to describing a speaker) for selecting a speaker (or subject) for interpretation in the real world (or real space) (e.g., a real object) while wearing a wearable electronic device (501), a hand gesture for selecting a speaker (or subject) for interpretation in the real world (or real space), and / or an eye gaze for selecting a speaker (or subject) for interpretation in the real world (or real space).
[0214] For example, as exemplified in FIG. 8, the user may specify the subject (801b) to be interpreted (hereinafter referred to as the target subject (801b)) from among actual real-world subjects (801a, 801b) through voice command input (e.g., natural language speech) describing the target subject (801b). The wearable electronic device (501) may transmit image data (or image information) (801) (e.g., images and / or videos) obtained through a camera and the user's voice data (or voice information) (e.g., voice signals or audio signals) obtained through a microphone to the interpretation service engine (800).
[0215] For example, as exemplified in FIG. 8, the user can designate a target subject (801b) from among real subjects (801a, 801b) in the real world through gesture input using the user's hand (805). The wearable electronic device (501) can transmit image data (or image information) (801) (e.g., an image and / or video including the user's hand (805) designating the target subject (801b)) acquired through a camera to an interpretation service engine (800).
[0216] According to one embodiment, a wearable electronic device (501) can transmit voice data (803) (e.g., voice data containing noise) input from real subjects (801a, 801b) in the real world through a microphone to an interpretation service engine (800). For example, the voice data (803) may include noise voice data (803c) (e.g., voice data mixed with noise) in which first voice data (803a) corresponding to a first subject (801a) in the real world and second voice data (803b) corresponding to a second subject (801b) in the real world are mixed (or combined). In one embodiment, the first voice data (803a) of the first subject (801a) may correspond to noise, and the voice data to be extracted may be the second voice data (803b) of the target subject (801b) (or the second subject (801b)).
[0217] According to one embodiment, the interpretation service engine (800) can select a target subject (e.g., target speaker) through the speaker selection engine (810) in response to a first input of the wearable electronic device (501) (e.g., a predetermined voice command specifying speaker selection). For example, the user may utter a voice command describing the features of the target subject to specify the target subject (e.g., target speaker). For example, the user may utter a natural language-based voice command such as "person wearing orange pants." In response to the first input according to the voice command, the interpretation service engine (800) may identify (or select) a subject having the feature points requested according to the first input as the target subject for interpretation and perform interpretation / translation for the target subject.
[0218] According to one embodiment, a user may designate at least two speakers among multiple speakers in the real world as target subjects for interpretation (e.g., target speakers). For example, the user may utter a natural language-based voice command such as "Speaker A is an American wearing blue pants, Speaker B is a French person with blonde hair, and Speaker C is a German person wearing a dress." In response to a first input according to the voice command, the interpretation service engine (800) may identify (or select) multiple subjects corresponding to each feature point requested according to the first input as target subjects for interpretation, and perform interpretation / translation for each target subject. For example, the interpretation service engine (800) may perform interpretation / translation in a language corresponding to each of the multiple target subjects. According to one embodiment, target subjects may be designated not only by voice commands as in the example above, but also by pointing based on the user's gesture (e.g., hand gesture).
[0219] According to one embodiment, when the target subjects are multiple speakers who use different languages (e.g., a Spanish speaker and a Chinese speaker), the interpretation service engine (800) can use a reverse diffusion engine (820) to restore each speaker's voice from the mixed voice of the multiple speakers through audio-visual generative multimodal processing. For example, the interpretation service engine (800) can remove noise from an audio signal input from a noisy environment in the real world and generate / restore the voice signal of the target speaker.
[0220] According to one embodiment, the interpretation service engine (800) can generate interpretation data by using an STS engine (830) to perform a combined function of ASR, TTS, and interpretation / translation on a restored voice signal of a selected speaker. In one embodiment, the interpretation data may include text data (e.g., translated text) and / or audio data (e.g., translated audio) of at least one target subject (or target speaker).
[0221] According to one embodiment, the interpretation service engine (800) can use a language generation engine (840) to generate an alias corresponding to the description in the user's utterance. For example, the interpretation service engine (800) can generate and provide a user alias by fusing a visual representation and a speech representation. For example, if a user utters, "The blonde woman on the far left wearing a white T-shirt is Spanish," the language generation engine (840) of the interpretation service engine (800) can generate an alias based on feature points from the natural language of the user's utterance (e.g., "blonde friend," "friend who likes white T-shirts"). In other words, the alias can be generated from the feature points described by the user. According to one embodiment, the interpretation service engine (800) can store the generated alias in memory by mapping it to the mask information (e.g., face information or a face part image) of the target subject (e.g., target speaker) and the language information.
[0222] According to one embodiment, the interpretation service engine (800) can set the language of a specific speaker using only an alias, based on the mapping information stored in memory. For example, the user can start using the interpretation service immediately by speaking a previously registered alias (e.g., "blonde friend"), without again describing or designating a person they have met before.
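The alias mapping described above can be sketched as a small in-memory registry. This is a minimal illustration under stated assumptions: the names `SpeakerProfile` and `AliasRegistry`, the string-valued `mask_info` placeholder, and the case- and whitespace-insensitive matching are all hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SpeakerProfile:
    """Data the engine is described as storing for one target speaker."""
    alias: str       # e.g. "blonde friend"
    mask_info: str   # placeholder for face/mask data (hypothetical opaque ID)
    language: str    # language information of the speaker


class AliasRegistry:
    """Maps an alias to the mask information and language stored with it."""

    def __init__(self) -> None:
        self._profiles: Dict[str, SpeakerProfile] = {}

    @staticmethod
    def _normalize(alias: str) -> str:
        # Assumption: aliases match regardless of case and extra whitespace.
        return " ".join(alias.lower().split())

    def register(self, alias: str, mask_info: str, language: str) -> SpeakerProfile:
        profile = SpeakerProfile(alias, mask_info, language)
        self._profiles[self._normalize(alias)] = profile
        return profile

    def lookup(self, alias: str) -> Optional[SpeakerProfile]:
        # Returns None when the alias has not been registered.
        return self._profiles.get(self._normalize(alias))


registry = AliasRegistry()
registry.register("blonde friend", mask_info="mask-001", language="Spanish")
profile = registry.lookup("Blonde  Friend")
```

Once an alias is registered, a later lookup by that alias alone yields both the mask information and the language, which is what lets the user skip describing the speaker again.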
[0223] According to one embodiment, the interpretation service engine (800) may provide interpretation data to be output through a designated output device of the wearable electronic device (501). For example, the wearable electronic device (501) may acquire interpretation data from the interpretation service engine (800) and output the interpretation data as visual information (e.g., translated text) and / or auditory information (e.g., translated audio). According to one embodiment, if the interpretation data is text data, the wearable electronic device (501) may display virtual information (870) (e.g., digital content such as translated text) associated with a target subject (e.g., a real subject in the real world) through the display of the wearable electronic device (501). According to one embodiment, when the interpretation data is audio data, the wearable electronic device (501) can output an audio signal (e.g., translation audio or translation voice) in real time in response to the speech of a target subject (e.g., a real subject in the real world) through the speaker of the wearable electronic device (501).
[0224] According to one embodiment, when a wearable electronic device (501) provides interpretation data, it may set a predetermined mask (e.g., edge-based masking of the subject, color masking of the subject) for each designated target subject (e.g., target speaker) to provide ease of identification for the target subject and / or the speaker currently speaking among the target subjects, and may provide original language information, translation information and alias information associated with the target subject together.
[0225] An electronic device according to one embodiment of the present disclosure (e.g., the electronic device (101, 201, 400, 501, 601) of FIG. 1, FIG. 2, FIG. 4, or FIG. 5a to FIG. 7) may include a display, a camera, communication circuitry, at least one processor comprising processing circuitry, and a memory. In one embodiment, the memory may store instructions that, when executed individually and / or collectively by the at least one processor, cause the electronic device to perform operations.
[0226] According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may execute an interpretation service based on the detection of a user's voice command. According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may acquire input data through the camera and the microphone. According to one embodiment, the input data may include image data acquired through the camera and voice data acquired through the microphone. According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may generate a prompt containing the input data to generate output data for the input data. According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may provide the prompt to an on-device artificial intelligence and / or an artificial intelligence of an external device. According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may be configured to acquire output data in relation to the prompt. According to one embodiment, the output data may include text data and / or audio data of at least one target speaker, information in which a predetermined masking is set for the target speaker, source language information associated with the target speaker, translation information, and alias information. According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may be configured to provide an interpretation service based on the output data.
[0227] According to one embodiment, the voice command may include a voice command in which the user describes a target speaker and a predetermined wake-up command for initiating the operation of the interpretation service.
[0228] According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may identify a target speaker corresponding to the voice command based on image analysis of the image data.
[0229] According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may fuse feature information based on the image data with feature information described by the user based on the voice command to generate an alias for the target speaker.
[0230] According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may separate the voice data of the target speaker from the voice data of the input data and perform a translation of the voice data of the target speaker.
[0231] According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may separate mixed voice data from a plurality of target speakers by speaker and, when the plurality of target speakers use multiple languages, perform a translation corresponding to each speaker's language.
[0232] According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may provide the interpretation service by outputting the output data as at least one of visual information or auditory information through a designated output device of the electronic device.
[0233] According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may display virtual information associated with the target speaker through the display of the electronic device when the output data is text data.
[0234] According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may be configured to output an audio signal in real time in response to the utterance of the target speaker through the speaker of the electronic device, if the output data is audio data.
[0235] According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may provide the output data by setting and providing a masking for the target speaker and providing original language information, translation information, and alias information associated with the target speaker together.
[0236] According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the electronic device may select the target speaker based on hand gesture or eye gaze recognition.
[0237] According to one embodiment, the image data (e.g., video data) may be provided continuously via streaming while the interpretation service is being executed.
[0238] An interpretation electronic device according to one embodiment of the present disclosure (e.g., the electronic device (101, 201, 400, 501, 601) of FIG. 1, FIG. 2, FIG. 4, or FIG. 5a to FIG. 7) (e.g., interpretation service engine (800)) may include at least one processor comprising processing circuitry and a memory for storing instructions. In one embodiment, the memory may store instructions that, when executed individually and / or collectively by the at least one processor, cause the electronic device to perform operations.
[0239] According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the interpreting electronic device may receive input data. According to one embodiment, the input data may include a voice command in which the user describes a target speaker and / or a point for the target speaker designated by a user gesture, and image data comprising an object of at least one speaker in real-world space corresponding to the user's field of view (FoV). According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the interpreting electronic device may extract feature information based on the image data of the input data and feature information described by the user based on the voice command. According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the interpreting electronic device may fuse the feature information based on the image data and the feature information described by the user based on the voice command to generate an alias. According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the interpreting electronic device may perform masking of the target speaker based on the image data of the input data. According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the interpreting electronic device may map and store mask information and alias information. According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the interpreting electronic device may perform a translation on the target speaker's speech signal based on the target speaker's language information. According to one embodiment, when the instructions are executed individually and / or collectively by at least one processor, the interpretation electronic device may provide result data corresponding to the result of the translation.
[0240] Hereinafter, the operation method of a system for providing interpretation services of various embodiments (e.g., electronic device (101), electronic device (201), intelligent server (300), electronic device (400), wearable electronic device (501), interpretation service engine (800)) is described in detail. Hereinafter, the system for providing interpretation services may be referred to as an interpretation electronic device that can be distinguished as a first electronic device, a second electronic device, or a third electronic device depending on the implementation form of the interpretation service engine (800) for interpretation services. In one embodiment, the first electronic device may correspond to the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2, or the wearable electronic device (501) of FIG. 5a to FIG. 7 (hereinafter referred to as the wearable electronic device (501)) as described above. In one embodiment, the second electronic device may correspond to the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2, the electronic device (400) of FIG. 4, or the electronic device (601) of FIG. 6 (hereinafter referred to as the electronic device (601)) as described above. In one embodiment, the third electronic device may correspond to the server (108) of FIG. 1 as described above or the intelligent server (300) of FIG. 2 or FIG. 3 (hereinafter referred to as the intelligent server (300)).
[0241] According to one embodiment, the interpretation service may be provided based on the first service form, second service form, third service form, or fourth service form as described above, depending on the form in which the interpretation service engine (800) is implemented. For example, the first service form may be a form in which the interpretation service engine (800) is included in a wearable electronic device (501) or an electronic device (601), and the wearable electronic device (501) or the electronic device (601) provides the interpretation service independently. For example, the second service form may be a form in which the interpretation service engine (800) is included in an electronic device (601), and the interpretation service is provided through interoperability between the wearable electronic device (501) and the electronic device (601). For example, the third service form may be a form in which the interpretation service engine (800) is included in the intelligent server (300), and the interpretation service is provided through interoperability between the wearable electronic device (501) and the intelligent server (300). For example, the fourth service form may be a form in which the interpretation service engine (800) is included in the intelligent server (300), and the interpretation service is provided through interoperability between the wearable electronic device (501), the electronic device (601), and the intelligent server (300).
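The four service forms above can be summarized as a lookup keyed on where the interpretation service engine (800) is hosted and which devices cooperate. The following is a minimal sketch; the names `ServiceForm` and `classify_service_form` and the string labels for the devices are hypothetical, and combinations not listed in the paragraph above are rejected rather than guessed.

```python
from enum import Enum
from typing import Iterable


class ServiceForm(Enum):
    FIRST = 1   # engine on the wearable or the electronic device, standalone
    SECOND = 2  # engine on the electronic device, working with the wearable
    THIRD = 3   # engine on the server, working with the wearable
    FOURTH = 4  # engine on the server, working with wearable and device


# Hypothetical labels for wearable (501), electronic device (601), server (300).
WEARABLE, DEVICE, SERVER = "wearable", "device", "server"

# (engine host, cooperating devices) -> service form, as listed in the text.
_FORMS = {
    (WEARABLE, frozenset({WEARABLE})): ServiceForm.FIRST,
    (DEVICE, frozenset({DEVICE})): ServiceForm.FIRST,
    (DEVICE, frozenset({WEARABLE, DEVICE})): ServiceForm.SECOND,
    (SERVER, frozenset({WEARABLE, SERVER})): ServiceForm.THIRD,
    (SERVER, frozenset({WEARABLE, DEVICE, SERVER})): ServiceForm.FOURTH,
}


def classify_service_form(engine_host: str, participants: Iterable[str]) -> ServiceForm:
    """Return the service form for a deployment, or raise if it is not listed."""
    key = (engine_host, frozenset(participants))
    if key not in _FORMS:
        raise ValueError(f"combination not described: {engine_host}, {sorted(key[1])}")
    return _FORMS[key]


form = classify_service_form(SERVER, {WEARABLE, DEVICE, SERVER})
```

Here `form` is the fourth service form, because the engine is hosted on the server and all three devices take part.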
[0242] Operations performed by an interpretation electronic device according to various embodiments (e.g., a first electronic device (hereinafter referred to as a wearable electronic device (501)), a second electronic device (hereinafter referred to as an electronic device (601)), or a third electronic device (hereinafter referred to as an intelligent server (300))) may be executed by the artificial intelligence of the interpretation electronic device (e.g., an interpretation service engine (800)). According to one embodiment, the interpretation service engine (800) may be implemented by at least one processor comprising various processing circuitry and / or executable program elements of the interpretation electronic device. According to one embodiment, operations performed by the artificial intelligence (e.g., an interpretation service engine (800)) of the interpretation electronic device may be stored as instructions in the memory of the interpretation electronic device and may be performed (or executed) individually and / or collectively by the processor of the interpretation electronic device.
[0243] An interpretation service according to an embodiment of the present disclosure may be performed solely by a wearable electronic device (501) or an electronic device (601) (e.g., a first service form), performed through mutual interaction between the wearable electronic device (501) and the electronic device (601) (e.g., a second service form), performed through mutual interaction between the wearable electronic device (501) and the intelligent server (300) (e.g., a third service form), or performed through mutual interaction between the wearable electronic device (501), the electronic device (601), and the intelligent server (300) (e.g., a fourth service form). An interpretation service according to an embodiment of the present disclosure may be a service in which a target speaker (e.g., a target subject) is designated by a user via voice input and / or gesture input, and the voice input of the target speaker is translated by a designated electronic device and provided in real time as visual information and / or auditory information (e.g., an instant interpreting service).
[0244] FIG. 9 is a flowchart illustrating the operation method of an interpretation electronic device according to one embodiment of the present disclosure.
[0245] According to one embodiment, FIG. 9 may illustrate an example of a method for providing interpretation services through artificial intelligence (or a model, module, or device) (e.g., an interpretation service engine (800)) in an interpretation electronic device that supports interpretation services according to one embodiment. According to one embodiment, the operation of the interpretation electronic device according to the embodiment of FIG. 9 may correspond to the operation of artificial intelligence (e.g., an interpretation service engine (800)) implemented in the device according to, for example, a first service form to a fourth service form.
[0246] According to one embodiment, as illustrated in FIG. 9, the operation method performed by the interpretation electronic device (e.g., interpretation service engine (800)) may include generating and providing interpretation data for an interpretation service through artificial intelligence based on input data transmitted (or acquired) from a camera (or image sensor) (e.g., camera module (180) of FIG. 1 or image sensor (450) of FIG. 4) and a microphone (e.g., input module (150) of FIG. 1) of the interpretation electronic device (e.g., wearable electronic device (501) or electronic device (601)) in which artificial intelligence (e.g., interpretation service engine (800)) is implemented.
[0247] According to one embodiment, as illustrated in FIG. 9, the operation method performed by the interpretation electronic device may include an operation in which the interpretation electronic device (e.g., electronic device (400) or intelligent server (300)) in which artificial intelligence (e.g., interpretation service engine (800)) is implemented generates interpretation data for an interpretation service through the artificial intelligence, based on input data transmitted (or acquired) from the camera (or image sensor) (e.g., camera module (180) of FIG. 1, image sensor (450) of FIG. 4, or camera (560) of FIG. 5a and FIG. 5b) and the microphone (e.g., input module (150) of FIG. 1 or microphone (565-1, 565-2, 565-3) of FIG. 5a and FIG. 5b) of an external electronic device (e.g., wearable electronic device (501)), and provides the interpretation data to the wearable electronic device (501).
[0248] According to one embodiment, artificial intelligence may include generative artificial intelligence (generative AI). Generative artificial intelligence may represent an artificial intelligence technology that creates new content using existing content such as text, audio, and / or images. For example, generative artificial intelligence may represent an artificial intelligence technology that can generate content (e.g., text, audio, images, and / or videos) corresponding to an input based on a given input (e.g., a prompt or an instruction). According to one embodiment, an electronic device (101) may generate and provide content based on generative artificial intelligence (e.g., on-device AI). According to one embodiment, the electronic device (101) may request content generation from a server and receive and provide content generated based on the server's generative artificial intelligence from the server. According to one embodiment, the electronic device (101) may provide the generative artificial intelligence with a prompt (or instruction or generative AI prompt) requesting content creation (e.g., a question or instruction to be entered into the generative artificial intelligence).
[0249] A method of operation performed in an interpretation electronic device (e.g., interpretation service engine (800)) according to one embodiment of the present disclosure may be performed, for example, according to the flowchart illustrated in FIG. 9. The flowchart illustrated in FIG. 9 is an example according to one embodiment of the operation of the interpretation electronic device, and the order of at least some operations may be changed, performed in parallel, performed as independent operations, or at least some other operations may be performed complementarily to at least some operations. According to one embodiment of the present disclosure, operations 901 through 917 may be performed by at least one processor of the interpretation electronic device.
[0250] As illustrated in FIG. 9, the operation method performed by an interpretation electronic device (e.g., interpretation service engine (800)) according to one embodiment may include an operation of detecting an operation initiation trigger (operation 901), an operation of obtaining first input data related to the selection of a target speaker (operation 903), an operation of identifying a target speaker based on the analysis of the first input data (operation 905), an operation of generating an alias based on the first input data (operation 907), an operation of mapping the target speaker to the alias (operation 909), an operation of obtaining second input data (operation 911), an operation of performing translation based on the second input data (operation 913), an operation of generating interpretation data (operation 915), and an operation of outputting alias information and interpretation data (operation 917).
[0251] Referring to FIG. 9, in operation 901, the interpretation electronic device (or interpretation service engine (800)) can detect an operation start trigger.
[0252] In operation 903, the interpreting electronic device may acquire first input data related to the selection of a target speaker. In one embodiment, the first input data may include, for example, a voice command in which a user describes (or explains) a target speaker, and / or a point (e.g., coordinate information) for the target speaker designated by a user gesture (e.g., hand gesture or eye gaze). According to one embodiment, the first input data may include image data (or image information) comprising at least one speaker object (or subject) in real-world space corresponding to the user's field of view (FoV). In one embodiment, the image data may be provided continuously (or persistently) via streaming while the interpreting service is being executed.
[0253] In operation 905, the interpreting electronic device can identify a target speaker based on the analysis of the first input data. According to one embodiment, the interpreting electronic device can recognize an object (e.g., a target speaker or a target subject) corresponding to a voice command or a designated point based on image analysis (e.g., scene analysis) of the image data of the first input data.
[0254] In operation 907, the interpreting electronic device may generate an alias based on the first input data. According to one embodiment, the interpreting electronic device may generate an alias by fusing feature information based on image data and feature information described by a user based on a voice command. According to one embodiment, the interpreting electronic device may generate a text-based alias for a target subject (e.g., target speaker) using information described by a user in voice. According to one embodiment, the alias for the target subject may be directly assigned by the user (e.g., assigned by a voice command). According to one embodiment, a large language model (LLM) may be used to generate the alias.
[0255] In operation 909, the interpreting electronic device can map a target speaker to an alias. In one embodiment, the interpreting electronic device can perform mapping regarding a generated alias. According to one embodiment, the interpreting electronic device can locate a target subject (e.g., a target speaker) described by a user in speech and perform masking. According to one embodiment, the interpreting electronic device can map the mask and the alias and store them in a designated memory.
[0256] In operation 911, the interpreting electronic device may acquire second input data. In one embodiment, the second input data may include voice data (e.g., voice data containing noise) input from a real subject in the real world. According to one embodiment, the interpreting electronic device may acquire a segmented image (e.g., video data) together with the voice (e.g., voice data) corresponding to that image as an input (e.g., second input data) for interpretation (or translation).
[0257] In operation 913, the interpreting electronic device may perform translation based on second input data. According to one embodiment, the interpreting electronic device may perform translation on voice data of an identified target subject (or segmented image) (e.g., target speaker). According to one embodiment, the interpreting electronic device may perform translation into a language corresponding to each of a plurality of target subjects. According to one embodiment, if the target subject corresponds to multiple speakers who use different languages (e.g., a speaker who uses Spanish, a speaker who uses Chinese), the interpreting electronic device may separate (or restore) the mixed voice from the multiple speakers by speaker and perform translation corresponding to the language of each speaker. According to one embodiment, the interpreting electronic device may perform translation by performing a combined function of ASR, TTS, and interpretation / translation from the restored voice signal of the speaker.
[0258] In operation 915, the interpreting electronic device may generate interpreting data. According to one embodiment, the interpreting electronic device may generate interpreting data in response to the completion of translation. In one embodiment, the interpreting data may include text data (e.g., translated text) and / or audio data (e.g., translated audio) of at least one target subject (or target speaker).
[0259] In operation 917, the interpreting electronic device may output alias information and interpreting data. According to one embodiment, the interpreting electronic device may provide the interpreting data to be output through a designated output device of the wearable electronic device (501). For example, the wearable electronic device (501) may acquire the interpreting data from the interpreting electronic device and output the interpreting data as visual information (e.g., translated text) and / or auditory information (e.g., translated audio). According to one embodiment, if the interpreting data is text data, the wearable electronic device (501) may display virtual information (e.g., digital content such as translated text) associated with a target subject (e.g., a real subject in the real world) through the display of the wearable electronic device (501). According to one embodiment, when the interpretation data is audio data, the wearable electronic device (501) can output an audio signal (e.g., translation audio or translation voice) in real time in response to the speech of a target subject (e.g., a real subject in the real world) through the speaker of the wearable electronic device (501). According to one embodiment, when the interpretation electronic device provides interpretation data, it can set a predetermined mask (e.g., edge-based masking of the subject, color masking of the subject) for each designated target subject (e.g., target speaker) to provide ease of identification for the target subject and / or the speaker currently speaking among the target subjects, and can provide original language information, translation information, and alias information associated with the target subject.
[0260] FIG. 10 is a flowchart illustrating the operation method of an interpreting electronic device according to one embodiment of the present disclosure.
[0261] According to one embodiment, FIG. 10 may illustrate an example of a method for providing an interpretation service through artificial intelligence (or a model, module, or device) (e.g., an interpretation service engine (800)) (hereinafter referred to as an interpretation electronic device) in an interpretation electronic device that supports an interpretation service according to one embodiment.
[0262] A method of operation performed in an interpreting electronic device according to one embodiment of the present disclosure may be performed, for example, according to the flowchart illustrated in FIG. 10. The flowchart illustrated in FIG. 10 is an example according to one embodiment of the operation of the interpreting electronic device, and the order of at least some operations may be changed, performed in parallel, performed as independent operations, or at least some other operations may be performed complementarily to at least some operations. According to one embodiment of the present disclosure, operations 1001 to 1017 may be performed by at least one processor of the interpreting electronic device.
[0263] According to one embodiment, the operation described in FIG. 10 may be performed heuristically in combination with the operations described in FIG. 8 and FIG. 9, for example, replaced at least some of the operations described and performed heuristically in combination with at least some other operations, or performed heuristically as a detailed operation of at least some of the operations described.
[0264] As illustrated in FIG. 10, an operation method performed by an interpretation electronic device (e.g., interpretation service engine (800)) according to one embodiment may include: an operation of receiving input data (operation 1001); an operation of estimating alias information based on the input data (operation 1003); an operation of determining whether corresponding alias information is registered (operation 1005); if the alias information is not registered, an operation of extracting features based on the input data (operation 1007); an operation of generating an alias using the features (operation 1009); an operation of performing masking of a target speaker based on the input data (operation 1011); an operation of reading mask information and mapping it to the target speaker (operation 1013); an operation of mapping and storing the mask information and the alias information (operation 1015); an operation of performing interpretation on the voice signal of the target speaker based on the language information of the target speaker (operation 1017); if the alias information is registered, an operation of identifying mask information and language information corresponding to the alias information (operation 1019); an operation of identifying the target speaker based on the mask information (operation 1021); and an operation of performing interpretation on the target speaker's voice signal based on the target speaker's language information (operation 1017).
[0265] According to one embodiment, operations 1007 and 1011 may be performed in parallel or sequentially, and operations 1009 and 1013 may be performed in parallel or sequentially following each of operations 1007 and 1011.
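The ordering described in this paragraph can be illustrated with a minimal Python sketch. It assumes two independent chains, feature extraction followed by alias generation (operations 1007 and 1009) and masking followed by mask mapping (operations 1011 and 1013), and runs them on a thread pool. Every function name, data field, and return value below is a hypothetical placeholder and is not part of the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_features(input_data):
    # Operation 1007 (placeholder): collect descriptive features.
    return sorted(input_data["described_features"])


def generate_alias(features):
    # Operation 1009 (placeholder): build a text alias from the features.
    return "-".join(features)


def mask_target(input_data):
    # Operation 1011 (placeholder): return a region for the target speaker.
    return input_data["target_region"]


def map_mask(mask):
    # Operation 1013 (placeholder): attach the mask to a speaker record.
    return {"mask": mask}


def alias_chain(input_data):
    return generate_alias(extract_features(input_data))


def mask_chain(input_data):
    return map_mask(mask_target(input_data))


def register_target(input_data):
    # The two chains are independent, so they may run in parallel;
    # within each chain the order (1007 -> 1009, 1011 -> 1013) is kept.
    with ThreadPoolExecutor(max_workers=2) as pool:
        alias_future = pool.submit(alias_chain, input_data)
        mask_future = pool.submit(mask_chain, input_data)
        return alias_future.result(), mask_future.result()
```

Calling `alias_chain` and `mask_chain` one after the other would give the same result, which corresponds to the sequential alternative mentioned above.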
[0266] According to one embodiment, at least some of operations 1001 to 1017 may be substantially the same as the operations described above with reference to FIGS. 8 and 9, and detailed descriptions of those operations may be omitted.
[0267] Referring to FIG. 10, in operation 1001, an interpretation electronic device (or interpretation service engine (800)) may receive input data. In one embodiment, the input data may include a voice command in which a user describes (or explains) a target speaker, and / or a point (e.g., coordinate information) for the target speaker specified by a user gesture (e.g., hand gesture or eye gaze). According to one embodiment, the input data may include image data (or image information) comprising at least one speaker object (or subject) in real-world space corresponding to the user's field of view (FoV). In one embodiment, the image data may be provided continuously (or persistently) via streaming while the interpretation service is running. For example, the input data may include a segmented image (e.g., image data) and voice (e.g., voice data) corresponding to the image (e.g., image data).
[0268] In operation 1003, the interpreting electronic device can estimate alias information based on input data. According to one embodiment, the interpreting electronic device can estimate an alias by fusing feature information based on image data and feature information described by a user based on a voice command. According to one embodiment, the interpreting electronic device can estimate a text-based alias for a target subject (e.g., target speaker) using information described by a user in voice. According to one embodiment, the interpreting electronic device can estimate an alias based on an object (e.g., target speaker or target subject) recognized, through image analysis of image data (e.g., scene analysis), as corresponding to a voice command or a designated point. According to one embodiment, alias estimation may use a large language model (LLM).
[0269] In operation 1005, the interpreting electronic device can determine whether corresponding alias information is registered. According to one embodiment, the interpreting electronic device can identify whether alias information corresponding to the estimated alias information is registered by comparing the estimated alias information with alias information that is pre-stored (or registered) in the interpreting electronic device (e.g., memory, etc.).
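As a minimal sketch of the comparison in operation 1005, the Python below looks up an estimated alias in a stored registry. The normalization policy (lowercasing and dropping punctuation), the dictionary-based registry, and the function names are assumptions made for illustration; the disclosure does not specify how aliases are compared.

```python
import re


def normalize_alias(alias):
    # Lowercase and drop punctuation and extra whitespace so that minor
    # wording differences do not prevent a match (an assumed policy).
    return " ".join(re.findall(r"[a-z0-9]+", alias.lower()))


def find_registered_alias(estimated_alias, registry):
    # Operation 1005: return the stored record if the estimated alias
    # is already registered, otherwise None.
    return registry.get(normalize_alias(estimated_alias))
```

A returned record would correspond to the 'Yes' result of operation 1005, and `None` to the 'No' result.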
[0270] In operation 1005, if the interpreting electronic device does not have alias information registered (e.g., 'No' in operation 1005), in operation 1007, it can extract features based on input data. According to one embodiment, the interpreting electronic device can extract feature information based on image data and feature information described by a user based on voice commands.
[0271] In operation 1009, the interpreting electronic device can generate an alias utilizing features. According to one embodiment, the interpreting electronic device can generate an alias by fusing feature information based on image data and feature information described by a user based on voice commands. According to one embodiment, the interpreting electronic device can generate a text-based alias for a target subject (e.g., target speaker) using information described by a user in voice.
[0272] In operation 1011, the interpreting electronic device can perform masking of the target speaker based on input data. According to one embodiment, the interpreting electronic device can find a target subject (e.g., target speaker) described by a user in speech and mask that subject.
[0273] In operation 1013, the interpreting electronic device can read mask information and map it to a target speaker. According to one embodiment, the interpreting electronic device can map a mask to an alias.
[0274] In operation 1015, the interpreting electronic device can map the mask information and the alias information to each other and store them. According to one embodiment, the interpreting electronic device can store the mapped mask and alias in a designated memory as mapping information.
[0275] In operation 1017, the interpreting electronic device can perform interpretation (or translation) on the voice signal of a target speaker based on the language information of the target speaker. According to one embodiment, the interpreting electronic device can perform translation on voice data of an identified target subject (e.g., target speaker). According to one embodiment, the interpreting electronic device can separate (or restore) the voice signal of the target speaker from input data and perform translation on the voice signal. According to one embodiment, a segmented image (e.g., video data) and voice (e.g., voice data) corresponding to the image (e.g., video data) can be used together as input for interpretation (or translation). According to one embodiment, the interpreting electronic device can perform translation into a language corresponding to each of a plurality of target subjects. According to one embodiment, when the target subject corresponds to multiple speakers who use different languages (e.g., a Spanish speaker, a Chinese speaker), the interpretation electronic device can separate (or restore) the mixed voice from the multiple speakers by speaker and perform translation corresponding to the language of each speaker. According to one embodiment, the interpretation electronic device can perform translation by performing a combined function of ASR, TTS, and interpretation / translation from the restored voice signal of the speaker.
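The per-speaker handling described above can be sketched as follows. The sketch assumes that speaker separation and speech recognition have already produced (alias, text) pairs, and that a translation function is supplied by the caller; `translate_by_speaker`, its parameters, and the result fields are hypothetical names, not an interface defined in the disclosure.

```python
def translate_by_speaker(segments, speaker_languages, translate, target_language):
    # segments: (alias, text) pairs assumed to come from an upstream
    # speaker-separation and speech-recognition step.
    results = []
    for alias, text in segments:
        source_language = speaker_languages.get(alias)
        if source_language is None:
            # Speakers that were not designated as targets are skipped.
            continue
        results.append({
            "alias": alias,
            "source_language": source_language,
            "original": text,
            "translation": translate(text, source_language, target_language),
        })
    return results
```

Passing the translator in as a parameter keeps the sketch independent of any particular ASR, TTS, or translation engine.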
[0276] In operation 1005, if the interpreting electronic device has registered alias information (e.g., 'Yes' in operation 1005), in operation 1019, it can identify mask information and language information corresponding to the alias information. According to one embodiment, the interpreting electronic device can retrieve pre-stored mask information and language information corresponding to the registered alias information.
[0277] In operation 1021, the interpreting electronic device can identify a target speaker based on mask information. According to one embodiment, the interpreting electronic device can identify a target speaker corresponding to the mask information based on image analysis (e.g., scene analysis) of video data.
[0278] In operation 1023, the interpreting electronic device can perform interpretation (or translation) on the voice signal of a target speaker based on the language information of the target speaker. According to one embodiment, the interpreting electronic device can perform translation on the voice data of an identified target subject (e.g., target speaker). According to one embodiment, the interpreting electronic device can separate (or restore) the voice signal of the target speaker from the input data and perform translation on the corresponding voice signal. According to one embodiment, the interpreting electronic device can perform translation into a language corresponding to each of a plurality of target subjects. According to one embodiment, if the target subject corresponds to multiple speakers who use different languages (e.g., a speaker who uses Spanish, a speaker who uses Chinese), the interpreting electronic device can separate (or restore) the mixed voice from the multiple speakers by speaker and perform translation corresponding to the language of each speaker. According to one embodiment, the interpreting electronic device can perform translation by performing a combined function of ASR, TTS, and interpretation / translation from the restored voice signal of the speaker.
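The two branches that follow operation 1005 can be summarized in a short Python sketch. It assumes a dictionary keyed by alias and a caller-supplied `register_new` function standing in for operations 1007 to 1013; `SpeakerRecord`, its fields, and the returned status strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpeakerRecord:
    alias: str
    mask: tuple      # hypothetical bounding box (x, y, width, height)
    language: str    # hypothetical language code such as "es"


def resolve_target(estimated_alias, registry, register_new):
    # Operation 1005: check whether the alias is already registered.
    record = registry.get(estimated_alias)
    if record is None:
        # 'No' branch (operations 1007 to 1015): build and store a record.
        record = register_new(estimated_alias)
        registry[record.alias] = record
        return record, "registered"
    # 'Yes' branch (operations 1019 and 1021): reuse the stored mask
    # and language information.
    return record, "reused"
```

In either branch the returned record carries the language information that the subsequent interpretation step relies on.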
[0279] FIG. 11 is a flowchart illustrating the operation method of an interpreting electronic device according to one embodiment of the present disclosure.
[0280] According to one embodiment, FIG. 11 may illustrate an example of a method for providing interpretation services through artificial intelligence (e.g., interpretation service engine (800)) in an interpretation electronic device that supports interpretation services according to one embodiment. According to one embodiment, the method of operation of the interpretation electronic device according to the embodiment of FIG. 11 may correspond to the operation of, for example, a wearable electronic device (e.g., the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2, or the wearable electronic device (501) of FIG. 5a to FIG. 7) that operates independently according to a first service type (hereinafter referred to as the wearable electronic device (501)) or an electronic device (e.g., the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2, the electronic device (400) of FIG. 4, or the electronic device (601) of FIG. 6) (hereinafter referred to as the electronic device (601)).
[0281] A method of operation performed in an interpreting electronic device (e.g., a wearable electronic device (501) or an electronic device (601)) according to one embodiment of the present disclosure may be performed, for example, according to the flowchart illustrated in FIG. 11. The flowchart illustrated in FIG. 11 is an example according to one embodiment of the operation of the interpreting electronic device, and at least some of the operations may be changed, performed in parallel, performed as independent operations, or at least some other operations may be performed complementarily to at least some of the operations. According to one embodiment of the present disclosure, operations 1101 to 1111 may be performed by at least one processor of the interpreting electronic device.
[0282] According to one embodiment, the operations described in FIG. 11 may be performed heuristically in combination with the operations described in FIG. 8 to FIG. 10; for example, they may replace at least some of those operations and be performed heuristically in combination with at least some of the other operations, or may be performed heuristically as detailed operations of at least some of those operations.
[0283] As illustrated in FIG. 11, an operation method performed by an interpretation electronic device (e.g., a wearable electronic device (501) or an electronic device (400)) according to one embodiment may include an operation of detecting a user's voice command (operation 1101), an operation of executing an interpretation service (operation 1103), an operation of acquiring input data through a camera and a microphone (operation 1105), an operation of processing an interpretation service based on the input data (operation 1107), an operation of generating output data (operation 1109), and an operation of providing an interpretation service based on the output data (operation 1111).
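The sequence of operations 1101 to 1111 can be sketched in Python as a single function with its dependencies passed in. The wake word `"interpret"`, the function name `run_on_device`, and the three callables (`capture`, `engine`, `present`) are assumptions made for illustration only.

```python
def run_on_device(voice_command, capture, engine, present, wake_word="interpret"):
    # Operation 1101: detect a voice command containing the wake-up word.
    if wake_word not in voice_command.lower():
        return None
    # Operation 1103 is represented by proceeding past the wake-word check;
    # powering on components is outside this sketch.
    # Operation 1105: acquire image and voice data.
    image, audio = capture()
    # Operations 1107 and 1109: process the input and build output data.
    output = engine({"command": voice_command, "image": image, "audio": audio})
    # Operation 1111: hand the output to the display and/or speaker.
    present(output)
    return output
```

Because the camera, engine, and output device are injected, the same control flow can be exercised with stand-in functions.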
[0284] Referring to FIG. 11, in operation 1101, a processor of an interpretation electronic device (e.g., a wearable electronic device (501) or an electronic device (601)) can detect a voice command from a user. In one embodiment, the user's voice command may include, for example, a voice command in which the user describes (or explains) a target speaker. According to one embodiment, the user's voice command may include a defined wake-up command for initiating the operation of the interpretation service (e.g., wake-up).
[0285] In operation 1103, the processor may execute an interpretation service. According to one embodiment, the processor may identify the initiation of an interpretation service in response to the detection of a user's voice command. According to one embodiment, the processor may activate (e.g., turn on) related components (e.g., camera, microphone, display, and / or speaker) in accordance with the execution of the interpretation service.
[0286] In operation 1105, the processor may acquire input data through the camera and microphone of the interpretation electronic device. According to one embodiment, the processor may acquire image data through the camera and voice data through the microphone. According to one embodiment, a segmented image (e.g., image data) and voice (e.g., voice data) corresponding to the image (e.g., image data) may be provided together as input for interpretation (or translation). According to one embodiment, the processor may provide the input data, including the image data and the voice data, as a prompt to an artificial intelligence (e.g., interpretation service engine (800)).
[0287] In operation 1107, the processor can process an interpretation service based on input data. According to one embodiment, the processor can process an operation for an interpretation service as an operation corresponding to the artificial intelligence-based operations described with reference to FIGS. 9 and 10, for example.
[0288] In operation 1109, the processor may generate output data. In one embodiment, the output data may include text data (e.g., translated text) and / or audio data (e.g., translated audio) of at least one target subject (or target speaker). According to one embodiment, the processor may provide output data including, together with the interpretation data, information on which a masking (e.g., edge-based masking of the subject, color masking of the subject) is set for each specified target subject (e.g., target speaker), source language information associated with the target subject, translation information, and alias information.
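One possible shape for the output data described above is sketched below in Python. The field names, the dictionary used for mask settings, and the JSON serialization are assumptions; the disclosure lists the kinds of information carried (interpretation data, masking, source language, translation, and alias information) but not a concrete format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TargetOutput:
    alias: str
    source_language: str
    original_text: str
    translated_text: str
    # Hypothetical mask settings, e.g. {"type": "edge"} or {"type": "color"}.
    mask: dict = field(default_factory=dict)


def build_output(targets):
    # Serialize one entry per designated target speaker.
    return json.dumps({"targets": [asdict(t) for t in targets]}, ensure_ascii=False)
```

Keeping one entry per target speaker lets the receiving side associate each translation with its alias and mask.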
[0289] In operation 1111, the processor may provide an interpretation service based on output data. According to one embodiment, the processor may output interpretation data as visual information (e.g., translation text) and / or auditory information (e.g., translation audio). According to one embodiment, if the interpretation data is text data, the processor may display virtual information (e.g., digital content such as translation text) associated with a target subject (e.g., a real subject in the real world) through the display of the interpretation electronic device. According to one embodiment, if the interpretation data is audio data, the processor may output an audio signal (e.g., translation audio or translation voice) in real time in response to a speech by the target subject (e.g., a real subject in the real world) through the speaker of the interpretation electronic device. According to one embodiment, when a processor provides interpretation data, it may set a predetermined mask (e.g., edge-based masking of the subject, color masking of the subject) for each designated target subject (e.g., target speaker) to provide ease of identification for the target subject and / or the speaker currently speaking among the target subjects, and may provide source language information, translation information, and alias information associated with the target subject together.
[0290] FIG. 12 is a flowchart illustrating a method of operation of an electronic device according to one embodiment of the present disclosure.
[0291] According to one embodiment, FIG. 12 may illustrate an example of a method in which, by mutual interaction between a first electronic device (e.g., a wearable electronic device (501)) and a second electronic device (e.g., an electronic device (601) and / or an intelligent server (300)) that supports an interpretation service according to one embodiment, input data related to the interpretation service is provided by the first electronic device, and an interpretation service is provided through artificial intelligence (e.g., an interpretation service engine (800)) in a second electronic device in which an interpretation service engine (800) is implemented. According to one embodiment, the method of operation of an electronic device according to the embodiment of FIG. 12 may correspond to the operation of a first electronic device (e.g., the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2, or the wearable electronic device (501) of FIG. 5a to FIG. 7) (hereinafter referred to as the wearable electronic device (501)) which operates by mutual interoperability between electronic devices according to a second service form, a third service form, or a fourth service form.
[0292] A method of operation performed in a wearable electronic device (501) according to one embodiment of the present disclosure may be performed, for example, according to the flowchart illustrated in FIG. 12. The flowchart illustrated in FIG. 12 is an example according to one embodiment of the operation of the wearable electronic device (501), and at least some of the operations may be changed, performed in parallel, performed as independent operations, or at least some other operations may be performed complementarily to at least some of the operations. According to one embodiment of the present disclosure, operations 1201 to 1213 may be performed by at least one processor of the wearable electronic device (501).
[0293] According to one embodiment, the operations described in FIG. 12 may be performed heuristically in combination with the operations described in FIG. 8 to FIG. 11; for example, they may replace at least some of those operations and be performed heuristically in combination with at least some of the other operations, or may be performed heuristically as detailed operations of at least some of those operations.
[0294] As illustrated in FIG. 12, the operation method performed by a wearable electronic device (501) according to one embodiment may include an operation of detecting a user's voice command (operation 1201), an operation of executing an interpretation service (operation 1203), an operation of acquiring input data through a camera and a microphone (operation 1205), an operation of generating a prompt containing input data to generate output data for the input data (operation 1207), an operation of transmitting the prompt to an external device (operation 1209), an operation of receiving output data (operation 1211), and an operation of providing an interpretation service based on the output data (operation 1213).
[0295] Referring to FIG. 12, in operation 1201, the processor of the wearable electronic device (501) can detect a voice command from a user. In one embodiment, the user's voice command may include, for example, a voice command in which the user describes (or explains) a target speaker. According to one embodiment, the user's voice command may include a defined wake-up command for initiating the operation of the interpretation service (e.g., wake-up).
[0296] In operation 1203, the processor may execute an interpretation service. According to one embodiment, the processor may identify the initiation of the interpretation service in response to the detection of a user's voice command. According to one embodiment, the processor may activate (e.g., turn on) related components (e.g., camera, microphone, display, and / or speaker) in response to the execution of the interpretation service. According to one embodiment, the processor may execute the interpretation service (e.g., turn on the camera) and, in response, request an external device (e.g., electronic device (601) or intelligent server (300)) connected to the wearable electronic device (501) to initiate the interpretation service.
[0297] In operation 1205, the processor may acquire input data through a camera and a microphone. According to one embodiment, the processor may acquire image data through a camera and voice data through a microphone. According to one embodiment, the processor may configure input data including image data and voice data. According to one embodiment, a segmented image (e.g., image data) and voice (e.g., voice data) corresponding to the image (e.g., image data) may be used together as input for interpretation (or translation).
[0298] In operation 1207, the processor may generate a prompt containing input data to generate output data for the input data. In one embodiment, the prompt may represent an input value (or command or instruction) to be passed as input to a generative artificial intelligence (or AI model) (e.g., an interpretation service engine (800)) for generating output data. According to one embodiment, the processor may generate a prompt by using input data (e.g., video data and voice data) related to generating output data as a prompt source. For example, the processor may generate a prompt requesting that output data corresponding to the input data (e.g., interpretation data for a target speaker) be generated and provided.
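As an illustration of packaging input data into a prompt, the Python sketch below encodes image and voice bytes as base64 inside a JSON document and shows the inverse step a receiving device might apply. The JSON layout, the field names, and the functions `build_prompt` and `read_prompt` are assumptions; the disclosure does not define a prompt format.

```python
import base64
import json


def build_prompt(command_text, image_bytes, audio_bytes):
    # Binary inputs are base64-encoded so the prompt can travel as JSON text.
    return json.dumps({
        "instruction": command_text,
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
    })


def read_prompt(prompt):
    # Inverse of build_prompt, as the receiving device might apply it.
    fields = json.loads(prompt)
    return (
        fields["instruction"],
        base64.b64decode(fields["image"]),
        base64.b64decode(fields["audio"]),
    )
```

A text-safe encoding is only one option; a transport that carries binary data natively would not need the base64 step.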
[0299] In operation 1209, the processor may transmit a prompt to an external device. According to one embodiment, the processor may provide the prompt to a generative artificial intelligence (e.g., an interpretation service engine (800)) of a designated external device (e.g., an electronic device (601) or an intelligent server (300)). According to one embodiment, the processor may provide the prompt to the external device so that the external device provides output data based on the prompt. According to one embodiment, the external device may process an operation for an interpretation service based on the prompt, for example, as an operation corresponding to the artificial intelligence-based operations described with reference to FIGS. 9 and 10.
[0300] In operation 1211, the processor may receive output data. According to one embodiment, the processor may obtain output data in relation to a prompt from an external device. According to one embodiment, the processor may obtain interpretation data, which is output data, in relation to a prompt (or instruction). According to one embodiment, the output data obtained in relation to a prompt may include, in addition to the interpretation data, information on which a masking (e.g., edge-based masking of the subject, color masking of the subject) is set for each designated target subject (e.g., target speaker), original language information associated with the target subject, translation information, and alias information.
[0301] In operation 525, the processor (120) may display an interface containing a sound source. According to one embodiment, the processor (120) may provide to the user by displaying on a display an interface containing a sound source (or information about the sound source) which is a result (e.g., result data) of a user's request to create music. For example, the processor (120) may display, in the interface showing the result, an album cover, a title for the music, a description of the music, and / or lyrics information.
[0302] In operation 1213, the processor may provide an interpretation service based on output data. According to one embodiment, the processor may output interpretation data as visual information (e.g., translation text) and / or auditory information (e.g., translation audio). According to one embodiment, if the interpretation data is text data, the processor may display virtual information (e.g., digital content such as translation text) associated with a target subject (e.g., a real subject in the real world) through the display of the interpretation electronic device. According to one embodiment, if the interpretation data is audio data, the processor may output an audio signal (e.g., translation audio or translation voice) in real time in response to a speech by the target subject (e.g., a real subject in the real world) through the speaker of the interpretation electronic device. According to one embodiment, when a processor provides interpretation data, it may set a predetermined mask (e.g., edge-based masking of the subject, color masking of the subject) for each designated target subject (e.g., target speaker) to provide ease of identification for the target subject and / or the speaker currently speaking among the target subjects, and may provide source language information, translation information, and alias information associated with the target subject together.
[0303] FIG. 13 is a drawing illustrating an example of an interface providing interpretation services in a wearable electronic device according to one embodiment of the present disclosure.
[0304] FIG. 14 is a drawing illustrating an example of an interface providing interpretation services in a wearable electronic device according to one embodiment of the present disclosure.
[0305] According to one embodiment, the interpretation service according to the present disclosure may be provided by a user of a wearable electronic device (501) describing (or explaining) the target speaker by voice command through the wearable electronic device (501) to specify the target speaker, performing interpretation / translation on the target speaker's utterance (e.g., voice signal) through artificial intelligence (e.g., interpretation service engine (800)) of an on-device and / or external device (e.g., electronic device (601) and / or intelligent server (300)), and outputting interpretation data through the wearable electronic device (501).
[0306] Referring to FIGS. 13 and 14, a user (1301) wearing a wearable electronic device (501) may specify at least one target speaker based on a voice command (1303, 1403) (e.g., a natural language-based voice command related to speaker description) for selecting a target speaker (or target subject) to be interpreted in the real world (or real space) (1307) (e.g., a real object (1310, 1320, 1330, 1340)), a user gesture that selects the target speaker (or target subject) with the hand (1305) of the user (1301), and / or an eye gaze of the user (1301) that selects the target speaker (or target subject) in the real world (or real space) (1307).
[0307] For example, a user (1301) may specify a target speaker through voice command input (e.g., natural language speech) describing the target speaker to be interpreted among the real subjects (1310, 1320, 1330, 1340) in the real world (1307). According to one embodiment, a wearable electronic device (501) may, in response to a voice command, transmit video data (or video information) (e.g., images and / or videos) acquired through a camera and the user's voice data (or voice information) (e.g., voice signals or audio signals) acquired through a microphone to an external device (e.g., electronic device (601) or intelligent server (300)).
[0308] For example, a user (1301) may designate a target speaker through user gesture input using the user's hand (1305) on a real subject (1310, 1320, 1330, 1340) in the real world (1307). According to one embodiment, a wearable electronic device (501) may transmit image data (or image information) (801) (e.g., an image and / or video including the user's hand (805) designating the target subject (801b)) acquired through a camera in response to a user gesture to an external device (e.g., an electronic device (601) or an intelligent server (300)).
[0309] According to one embodiment, a user (1301) may designate at least one speaker among a plurality of speakers (1310, 1320, 1330, 1340) in the real world (1307) as the target speaker for interpretation.
[0310] For example, the user (1301) can specify a single target speaker with a voice command (1303, 1403) that specifies a single target speaker, such as “Translate what the woman wearing sunglasses says into Korean.” For example, the user (1301) can specify a single target speaker with a user gesture (e.g., a single gesture pointing to (or indicating) a target speaker) using a hand (1305).
[0311] For example, the user (1301) can specify multiple target speakers using voice commands (1303, 1403) such as, “The woman on the far left wearing a white shirt is Spanish, the woman wearing glasses is French, the man wearing glasses is Arab, and the person on the far right is German, translate it.” For example, the user (1301) can specify multiple target speakers using a user gesture (e.g., a repeating gesture pointing to (or indicating) multiple target speakers) by using a hand (1305).
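To show the kind of structured result such a multi-speaker command could yield, the Python sketch below extracts (description, language word) pairs with a regular expression. This is a toy stand-in chosen for illustration: the disclosure describes natural-language understanding (for example, with an LLM), not pattern matching, and the pattern and the function `parse_targets` are hypothetical.

```python
import re

# Toy pattern: "<the + description> is <Language>", one clause per comma.
_PATTERN = re.compile(r"\b(the\s[^,]+?)\s+is\s+(\w+)", re.IGNORECASE)


def parse_targets(command):
    # Return {description: language word} for each clause that matches.
    return {desc.lower(): lang for desc, lang in _PATTERN.findall(command)}
```

Each key could then serve as a candidate alias, with the paired word indicating the language to interpret from.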
[0312] According to one embodiment, the target speaker may be designated by a natural language-based voice command (1303, 1403) as in the example described above, and / or selected by a point designation based on the user's gesture (e.g., hand (1305) gesture or eye gaze).
[0313] According to one embodiment, a wearable electronic device (501) may provide input data (e.g., a prompt) based on video data and voice data to a connected external device (e.g., an electronic device (601) or an intelligent server (300)) in response to a target speaker designation, and may obtain output data (or result data) (e.g., interpretation data) from the external device in relation to the input data.
[0314] According to one embodiment, the output data obtained in relation to the input data may include mapping information in which masking information and alias information are mapped, in addition to the interpretation data. For example, an external device (e.g., electronic device (601) or intelligent server (300)) may generate interpretation data based on the voice signal of a target speaker in relation to the input data (e.g., prompt) through artificial intelligence (e.g., generative artificial intelligence or interpretation service engine (800)), and generate the generated interpretation data and additional information related to the interpretation data (e.g., alias information) together, and provide the output data to a wearable electronic device (501).
[0315] According to one embodiment, a wearable electronic device (501) may output an interface containing output data. For example, the wearable electronic device (501) may provide interpretation data, which is a result (e.g., output data) of a user's interpretation service request, to the user by displaying it as visual information on a display or / or providing it to the user by outputting it through a speaker as auditory information. For example, the wearable electronic device (501) may acquire output data from an on-device and / or external device (e.g., electronic device (601) and / or intelligent server (300)) and provide the output data to the user as visual information and / or auditory information according to a predetermined output method. According to one embodiment, an example of providing an interface containing interpretation data to the user by displaying it as visual information on a display is illustrated in FIG. 14.
[0316] As illustrated in FIG. 14, the wearable electronic device (501) can obtain output data from an external device (e.g., electronic device (601) or intelligent server (300)) and output the output data as visual information (e.g., translated text) and / or auditory information (e.g., translated audio). According to one embodiment, when the output data of the wearable electronic device (501) is text data, the wearable electronic device (501) can display virtual information (1315, 1325, 1335, 1345) (e.g., digital content such as translated text) associated with a target subject (e.g., a real subject (1310, 1320, 1330, 1340) of the real world (1307)) through the display of the wearable electronic device (501). According to one embodiment, when the output data of the wearable electronic device (501) is audio data, the wearable electronic device (501) can output an audio signal (e.g., translated audio or translated voice) in real time in response to speech by a target subject (e.g., a real subject (1310, 1320, 1330, 1340) of the real world (1307)) through the speaker of the wearable electronic device (501).
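The text-versus-audio branching described above can be sketched as a small routing function. This is a minimal sketch under assumptions: the item layout (`type`, `subject_id`, `payload`) and the function name `route_output` are hypothetical, and the queues stand in for the display and the speaker.

```python
def route_output(output_items):
    """Split output items into display and speaker queues by data type."""
    display_queue, speaker_queue = [], []
    for item in output_items:
        if item["type"] == "text":
            # Text data is shown as virtual information near the subject.
            display_queue.append((item["subject_id"], item["payload"]))
        elif item["type"] == "audio":
            # Audio data is played back through the speaker.
            speaker_queue.append((item["subject_id"], item["payload"]))
        else:
            raise ValueError(f"unsupported output type: {item['type']}")
    return display_queue, speaker_queue


# Subject ids reuse the reference numerals of the real subjects.
items = [
    {"type": "text", "subject_id": 1310, "payload": "안녕하세요"},
    {"type": "audio", "subject_id": 1320, "payload": b"<pcm>"},
]
display_queue, speaker_queue = route_output(items)
```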
[0317] According to one embodiment, when the wearable electronic device (501) provides output data, it may set a predetermined mask (e.g., edge-based masking of the subject, color masking of the subject) for each designated target subject (e.g., target speaker (1310, 1320, 1330, 1340)) to provide ease of identification for the target subject and / or the speaker currently speaking among the target subjects, and may provide original language information, translation information and alias information associated with the target subject.
[0318] For example, a different language may be set for each of the multiple target speakers (1310, 1320, 1330, 1340), and the wearable electronic device (501) may provide the user with translation results for each of the multiple languages through virtual information (1315, 1325, 1335, 1345) according to the user's voice command (e.g., voice command (1403)) for the multiple target speakers (1310, 1320, 1330, 1340).
[0319] According to one embodiment, a wearable electronic device (501) may translate the native language of each of the plurality of target speakers (1310, 1320, 1330, 1340) into the user's native language (e.g., Korean) and provide the result. For example, the first language of the first target speaker (e.g., English), the second language of the second target speaker (e.g., Spanish), the third language of the third target speaker (e.g., Chinese), and the fourth language of the fourth target speaker (e.g., Japanese) may each be translated into the user's native language (e.g., Korean) and provided.
[0320] According to one embodiment, as exemplified by reference numeral <1400>, the virtual information (1315, 1325, 1335, 1345) may include information regarding an alias (1410) corresponding to a feature described (or explained) by a user's voice command (1403), the voice (1420) of the target speaker (e.g., the source language of the target speaker), and a translation (1430) (e.g., target language translation). For example, the virtual information (1315, 1325, 1335, 1345) may include alias information, source language information, and translation information distinguished by target speaker from the output data.
[0321] A method of operation performed in an electronic device (101, 201, 501, 601) according to one embodiment of the present disclosure may include an operation of executing an interpretation service based on the detection of a user's voice command. According to one embodiment, the method of operation may include an operation of acquiring input data through a camera and a microphone. According to one embodiment, the input data may include image data acquired through the camera and voice data acquired through the microphone. According to one embodiment, the method of operation may include an operation of generating a prompt containing the input data to generate output data for the input data. According to one embodiment, the method of operation may include an operation of providing the prompt to an artificial intelligence of an on-device and / or external device. According to one embodiment, the method of operation may include an operation of acquiring output data in relation to the prompt. According to one embodiment, the output data may include text data and / or audio data of at least one target speaker, information with a predetermined masking set for the target speaker, original language information associated with the target speaker, translation information, and alias information. According to one embodiment, the method of operation may include an operation of providing an interpretation service based on the output data.
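The sequence of operations above (detect the command, acquire input data, generate a prompt, provide it to a model, acquire output data) can be sketched as follows. This is an illustrative sketch only: `Prompt`, `run_interpretation`, the `wake_word` value, and `fake_model` are hypothetical names, and the capture functions and model are injected stubs rather than real camera, microphone, or AI interfaces.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Prompt:
    """Prompt wrapping the input data (hypothetical structure)."""
    image_data: bytes
    voice_data: bytes
    command: str


def run_interpretation(
    command: str,
    capture_image: Callable[[], bytes],
    capture_voice: Callable[[], bytes],
    model: Callable[[Prompt], dict],
    wake_word: str = "interpret",
) -> dict | None:
    """Run one pass: detect command, acquire input, prompt the model."""
    # Execute the service only when the assumed wake-up word is detected.
    if wake_word not in command.lower():
        return None
    # Acquire input data through the camera and the microphone.
    prompt = Prompt(capture_image(), capture_voice(), command)
    # Provide the prompt to an on-device or external model; return its output.
    return model(prompt)


def fake_model(prompt: Prompt) -> dict:
    # Stand-in for the AI model; returns fixed output data for demonstration.
    return {"alias": "woman with glasses", "source": "Hola", "translation": "Hello"}


result = run_interpretation(
    "Interpret the woman with glasses",
    capture_image=lambda: b"<frame>",
    capture_voice=lambda: b"<pcm>",
    model=fake_model,
)
```

Passing the model as a callable mirrors the on-device and / or external-device choice in the text: the same flow runs regardless of where the model is hosted.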
[0322] According to one embodiment, the operation method may include an operation of identifying a target speaker corresponding to the voice command based on image analysis of the video data. According to one embodiment, the voice command may include a voice command in which a user describes the target speaker and a predetermined wake-up command for initiating the operation of the interpretation service.
[0323] According to one embodiment, the operation method may include an operation of generating an alias for the target speaker by fusing feature information based on the image data and feature information described by the user based on the voice command.
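One simple way to picture the fusion of image-based and user-described feature information is a dictionary merge. The disclosure does not define the fusion method, so this sketch is an assumption throughout: the function name `fuse_alias`, the attribute keys, and the rule that user-described values win on conflict are all hypothetical.

```python
def fuse_alias(image_features, described_features):
    """Build an alias by merging image-derived and user-described features.

    Assumption: when both sources provide the same attribute, the value
    described by the user takes priority.
    """
    merged = {**image_features, **described_features}
    # Sort attributes so the same inputs always yield the same alias.
    return ", ".join(f"{key}: {merged[key]}" for key in sorted(merged))


alias = fuse_alias(
    {"clothing": "red shirt", "hair": "short"},       # from image analysis
    {"clothing": "red jacket", "position": "left"},   # from the voice command
)
```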
[0324] According to one embodiment, the operation method may include the operation of separating the voice data of the target speaker from the voice data of the input data, and the operation of performing a translation on the voice data of the target speaker.
[0325] According to one embodiment, the operation method may include, for multiple languages used by a plurality of target speakers, the operation of separating mixed voice data from a plurality of target speakers by each speaker, and the operation of performing translation corresponding to each speaker's language.
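The per-speaker handling above can be sketched as grouping utterances by speaker and then choosing a translator by each speaker's language. This sketch does not perform audio source separation: it assumes the mixed voice data has already been split into text segments labeled by speaker. The function name, the `(source, target)` translator keys, and the toy dictionary translators are hypothetical.

```python
from collections import defaultdict


def translate_per_speaker(segments, speaker_languages, translators, target="ko"):
    """Group labeled segments by speaker, then translate each by its language."""
    by_speaker = defaultdict(list)
    for speaker_id, text in segments:
        by_speaker[speaker_id].append(text)

    results = {}
    for speaker_id, texts in by_speaker.items():
        source = speaker_languages[speaker_id]
        # Pick the translator matching this speaker's language pair.
        translate = translators[(source, target)]
        results[speaker_id] = [translate(text) for text in texts]
    return results


# Toy dictionary lookups stand in for a real translation engine.
toy_translators = {
    ("en", "ko"): {"Hello": "안녕하세요"}.get,
    ("es", "ko"): {"Gracias": "감사합니다"}.get,
}
segments = [(1310, "Hello"), (1320, "Gracias")]
languages = {1310: "en", 1320: "es"}
translated = translate_per_speaker(segments, languages, toy_translators)
```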
[0326] According to one embodiment, the operation method may include the operation of providing the interpretation service by outputting the output data as at least one of visual information or auditory information through a predetermined output device of the electronic device.
[0327] According to one embodiment, the operation method may include, when providing the output data, setting and providing a predetermined mask for the target speaker, and providing original language information, translation information, and alias information associated with the target speaker together.
[0328] An operation method performed in an interpretation electronic device (800) according to one embodiment of the present disclosure may include an operation of receiving input data. According to one embodiment, the input data may include image data comprising a point for a target speaker designated by a voice command or user gesture in which a user describes a target speaker, and at least one speaker object in a real-world space corresponding to the user's field of view (FoV). According to one embodiment, the operation method may include an operation of extracting feature information based on the image data of the input data and feature information described by the user based on the voice command. According to one embodiment, the operation method may include an operation of generating an alias by fusing the feature information based on the image data and the feature information described by the user based on the voice command. According to one embodiment, the operation method may include an operation of masking the target speaker based on the image data of the input data. According to one embodiment, the operation method may include an operation of mapping and storing mask information and alias information. According to one embodiment, the operation method may include an operation of performing translation on the voice signal of the target speaker based on the language information of the target speaker. According to one embodiment, the operation method may include an operation of providing result data corresponding to the performed translation.
[0329] A non-transitory computer-readable recording medium according to one embodiment of the present disclosure may store instructions that, when executed by a processor of an electronic device (101, 201, 501, 601), cause the processor to perform operations. The operations may include: an operation of executing an interpretation service based on the detection of a user's voice command; an operation of acquiring input data through a camera and a microphone, the input data including image data acquired through the camera and voice data acquired through the microphone; an operation of generating a prompt including the input data to generate output data for the input data; an operation of providing the prompt to an artificial intelligence of an on-device and / or external device; an operation of acquiring output data in relation to the prompt, the output data including text data and / or audio data of at least one target speaker, information with a predetermined mask set for the target speaker, original language information associated with the target speaker, translation information, and alias information; and an operation of providing an interpretation service based on the output data.
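Resolving a designated point to one of the speaker objects in the image can be pictured as a hit test. This is a sketch under stated assumptions: the disclosure does not say how speaker objects are represented, so axis-aligned bounding boxes, the function name `select_target`, and the rule that the smallest containing box wins are all hypothetical.

```python
def select_target(point, speaker_boxes):
    """Return the id of the speaker whose box contains the point, else None.

    speaker_boxes maps a speaker id to (left, top, right, bottom) in pixels.
    Assumption: if several boxes contain the point, the smallest one wins.
    """
    x, y = point
    hits = [
        ((right - left) * (bottom - top), speaker_id)
        for speaker_id, (left, top, right, bottom) in speaker_boxes.items()
        if left <= x <= right and top <= y <= bottom
    ]
    # min() compares by area first, so the tightest box is selected.
    return min(hits)[1] if hits else None


boxes = {1310: (0, 0, 100, 200), 1320: (120, 0, 220, 200)}
target = select_target((150, 50), boxes)
```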
[0330] It will be understood that the foregoing embodiments and their technical features may be combined with one another in any combination, provided there is no conflict between the two embodiments or features. For example, any combination of two or more of the foregoing embodiments may be conceived and included within the present disclosure. One or more features from any embodiment may be incorporated into any other embodiment and may provide corresponding advantages or benefits.
[0331] The electronic device according to the various embodiments disclosed in this document may be of various forms. The electronic device may include, for example, a portable communication device (e.g., a smartphone), a computer device, a portable multimedia device, a portable medical device, a camera, a wearable device, or a consumer electronics device. The electronic device according to the embodiments of this document is not limited to the devices described above.
[0332] The various embodiments of this document and the terms used therein are not intended to limit the technical features described in this document to specific embodiments, and should be understood to include various modifications, equivalents, or substitutions of said embodiments. In connection with the description of the drawings, similar reference numerals may be used for similar or related components. The singular form of a noun corresponding to an item may include one or more of said items unless the relevant context clearly indicates otherwise. In this document, phrases such as "A or B," "at least one of A and B," "at least one of A or B," "A, B or C," "at least one of A, B and C," and "at least one of A, B, or C" may each include any one of the items listed together in the corresponding phrase, or all possible combinations thereof. Terms such as "1st," "2nd," "first," or "second" may be used simply to distinguish a component from other components and do not limit the components in any other aspect (e.g., importance or order). Where a (e.g., first) component is referred to as "coupled" or "connected" to another (e.g., second) component, with or without the terms "functionally" or "communicatively," it means that the component may be connected to the other component directly (e.g., via a wire), wirelessly, or through a third component.
[0333] The term “module” as used in the various embodiments of this document may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit, for example. A module may be a component formed integrally, or a minimum unit of said component or a part thereof that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).
[0334] Various embodiments of the present document may be implemented as software (e.g., program (140)) comprising one or more instructions stored in a storage medium (or recording medium) (e.g., internal memory (136) or external memory (138)) readable by a machine (e.g., electronic device (101)). For example, a processor (e.g., processor (120)) of the machine (e.g., electronic device (101)) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one called instruction. The one or more instructions may include code generated by a compiler or code that can be executed by an interpreter. The storage medium readable by the machine may be provided in the form of a non-transitory storage medium. Here, "non-transitory" simply means that the storage medium is a tangible device and does not contain a signal (e.g., electromagnetic waves), and the term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily.
[0335] According to one embodiment, the method according to the various embodiments disclosed herein may be provided by being included in a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or may be distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product may be temporarily stored or temporarily created on a device-readable storage medium (or recording medium), such as the memory of a manufacturer's server, an application store's server, or a relay server.
[0336] According to various embodiments, each component (e.g., module or program) of the components described above may include a single entity or multiple entities, and some of the multiple entities may be separated and placed in other components. According to various embodiments, one or more of the aforementioned components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the multiple components in the same or similar manner as those performed by the corresponding component among the multiple components prior to integration. According to various embodiments, operations performed by the module, program, or other components may be executed sequentially, in parallel, iteratively, or heuristically, or one or more of the operations may be executed in a different order or omitted, or one or more other operations may be added.
[0337] The various embodiments of the present disclosure disclosed in this specification and drawings are provided as specific examples to facilitate the explanation of the technical content of the present disclosure and to aid in understanding the present disclosure, and are not intended to limit the scope of the present disclosure. Accordingly, the scope of the present disclosure should be interpreted to include all modifications or variations derived based on the technical concept of the present disclosure, in addition to the embodiments disclosed herein.
Claims
1. An electronic device (101, 201, 501, 601) comprising: a camera; a microphone; at least one processor including processing circuitry; and memory storing instructions, wherein the instructions, when executed individually and / or collectively by the at least one processor, cause the electronic device to: execute an interpretation service based on detection of a user's voice command; acquire input data through the camera and the microphone, the input data including image data acquired through the camera and voice data acquired through the microphone; generate a prompt containing the input data in order to generate output data for the input data; input the prompt into a trained artificial intelligence model stored on-device and / or a trained artificial intelligence model of an external device; obtain output data from the trained artificial intelligence model based on the input data, the output data including text data and / or audio data of at least one target speaker, information with a predetermined mask set for the target speaker, original language information associated with the target speaker, translation information, and alias information; and generate visual information and / or auditory information for providing the interpretation service based on the output data.
2. The electronic device of claim 1, wherein the voice command comprises a voice command in which the user describes a target speaker and a predetermined wake-up command for initiating operation of the interpretation service.
3. The electronic device of claim 2, wherein the instructions, when executed individually and / or collectively by the at least one processor, cause the electronic device to identify a target speaker corresponding to the voice command based on image analysis of the video data.
4. The electronic device of claim 2, wherein the instructions, when executed individually and / or collectively by the at least one processor, cause the electronic device to generate an alias for the target speaker by fusing feature information based on the image data and feature information described by the user based on the voice command.
5. The electronic device of claim 2, wherein the instructions, when executed individually and / or collectively by the at least one processor, cause the electronic device to: separate the voice data of the target speaker from the voice data of the input data; and perform translation on the voice data of the target speaker.
6. The electronic device of claim 2, wherein the instructions, when executed individually and / or collectively by the at least one processor, cause the electronic device to: for multiple languages used by a plurality of target speakers, separate mixed voice data of the plurality of target speakers by speaker; and perform translation corresponding to each speaker's language.
7. The electronic device of claim 2, wherein the instructions, when executed individually and / or collectively by the at least one processor, cause the electronic device to output the visual information and / or the auditory information through a corresponding output device of the electronic device.
8. The electronic device of claim 7, wherein the instructions, when executed individually and / or collectively by the at least one processor, cause the electronic device to: when the output data is text data, display virtual information in association with the target speaker through a display of the electronic device; and when the output data is audio data, output an audio signal in real time in response to an utterance of the target speaker through a speaker of the electronic device.
9. The electronic device of claim 7, wherein the instructions, when executed individually and / or collectively by the at least one processor, cause the electronic device to, when providing the output data, set and provide a predetermined mask for the target speaker, and provide original language information, translation information, and alias information associated with the target speaker together.
10. The electronic device of claim 2, wherein the instructions, when executed individually and / or collectively by the at least one processor, cause the electronic device to select the target speaker based on hand gesture or eye gaze recognition.
11. The electronic device of claim 2, wherein the video data is provided continuously via streaming while the interpretation service is running.
12. An interpretation electronic device (800) comprising: at least one processor including processing circuitry; and memory storing instructions, wherein the instructions, when executed individually and / or collectively by the at least one processor, cause the interpretation electronic device to: receive input data, the input data including a point for a target speaker designated by a voice command or user gesture describing the target speaker, and image data including at least one speaker object in a real-world space corresponding to the user's field of view (FoV); extract feature information based on the image data of the input data and feature information described by the user based on the voice command; generate an alias by fusing the feature information based on the image data and the feature information described by the user based on the voice command; perform masking of the target speaker based on the image data of the input data; map and store mask information and alias information; perform translation on the voice signal of the target speaker based on language information of the target speaker; and provide result data corresponding to the performed translation.
13. A method of operating an electronic device (101, 201, 501, 601), the method comprising: an operation of executing an interpretation service based on detection of a user's voice command; an operation of acquiring input data through a camera and a microphone, wherein the input data includes image data acquired through the camera and voice data acquired through the microphone; an operation of generating a prompt containing the input data to generate output data for the input data; an operation of providing the prompt to a trained artificial intelligence model stored on-device and / or a trained artificial intelligence model of an external device; an operation of obtaining output data from the trained artificial intelligence model based on the provided data, wherein the output data comprises text data and / or audio data of at least one target speaker, information having a predetermined mask set for the target speaker, original language information associated with the target speaker, translation information, and alias information; and an operation of generating visual information and / or auditory information for providing the interpretation service based on the output data.
14. The method of claim 13, wherein the voice command includes a voice command in which the user describes a target speaker and a predetermined wake-up command for initiating operation of the interpretation service, and wherein the method further comprises: an operation of identifying a target speaker corresponding to the voice command based on image analysis of the video data; an operation of separating the voice data of the target speaker from the voice data of the input data; and an operation of performing translation on the voice data of the target speaker.
15. A non-transitory computer-readable recording medium storing instructions that, when executed by a processor of an electronic device, cause the processor to perform operations comprising: an operation of executing an interpretation service based on detection of a user's voice command; an operation of acquiring input data through a camera and a microphone, wherein the input data includes image data acquired through the camera and voice data acquired through the microphone; an operation of generating a prompt containing the input data to generate output data for the input data; an operation of providing the prompt to a trained artificial intelligence model stored on-device and / or a trained artificial intelligence model of an external device; an operation of obtaining output data from the trained artificial intelligence model based on the input data, wherein the output data comprises text data and / or audio data of at least one target speaker, information having a predetermined mask set for the target speaker, original language information associated with the target speaker, translation information, and alias information; and an operation of generating visual information and / or auditory information for providing the interpretation service based on the output data.