Application execution method and wearable device supporting same
The wearable device addresses the challenge of integrating voice recognition and display by determining virtual object placement and shape based on environmental and user context, enhancing the augmented or mixed reality experience through accurate and context-aware object display.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- SAMSUNG ELECTRONICS CO LTD
- Filing Date
- 2025-12-15
- Publication Date
- 2026-06-25
AI Technical Summary
Existing wearable devices struggle to effectively integrate voice recognition and display virtual objects in augmented or mixed reality environments, particularly in determining the optimal placement and shape of virtual objects based on user context and environmental factors.
A wearable device with a processor and memory that executes a voice recognition application, determines virtual objects based on external device information, recognized objects, and user history, and displays them accordingly, utilizing a multimodal language model to analyze user inputs and environmental context.
Enhances the integration of voice-activated applications by accurately placing and shaping virtual objects in the user's field of view, providing a more intuitive and context-aware augmented or mixed reality experience.
Description
Method for running an application and a wearable device supporting it
[0001] The embodiments disclosed in this document relate to a method for executing an application and a wearable device that supports the same.
[0002] Wearable devices such as smart glasses or head-mounted displays (HMDs) that support VST (video see-through) are being used. The wearable device is worn on the user's body and can provide an image (hereinafter, FOV image) corresponding to the user's FOV (field of view). The wearable device can recognize the user's external environment and provide the user with augmented reality, virtual reality, or mixed reality that reflects this.
[0003] Augmented reality (AR) is a technology that overlays virtual information onto real-world images displayed on a screen. Virtual reality (VR) is a technology that displays virtual information and / or previously captured preview images on a screen. Mixed reality (MR) is a technology that outputs complex content combining VR and AR. For example, in the case of augmented reality, a wearable device can detect real-world objects in the surroundings and add virtual objects to the real-world objects to provide them as an FOV image.
[0004] A wearable device can receive and process a user's speech input through a voice recognition application. The voice recognition application can run an application other than the voice recognition application to display a user interface corresponding to the user's speech input. For example, if the speech input is "Set a 3-minute timer," the wearable device can analyze keywords included in the speech input and determine the application corresponding to the speech input as a watch application. The wearable device can display a 3-minute timer executed by the watch application as a virtual object. In this case, the wearable device can display the timer in the center of the display according to default settings.
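The keyword-based routing described in this paragraph can be illustrated with a minimal Python sketch. The keyword table, the `VirtualObject` type, and the `route_speech` function are hypothetical names introduced here for illustration only; the document itself gives only the "Set a 3-minute timer" example, the watch application, and the default center placement.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword table. The source only gives the example of a
# timer request being routed to the watch application.
KEYWORD_TO_APP = {
    "timer": "watch",
}


@dataclass
class VirtualObject:
    app: str        # application that executes the virtual object
    utterance: str  # speech input that triggered it
    position: str   # where on the display it is drawn


def route_speech(utterance: str) -> Optional[VirtualObject]:
    """Pick an application from keywords in the speech input.

    Returns a virtual object placed at the display center (the default
    setting described in the text), or None if no keyword matches.
    """
    words = re.findall(r"[a-z0-9\-]+", utterance.lower())
    for word in words:
        app = KEYWORD_TO_APP.get(word)
        if app is not None:
            return VirtualObject(app=app, utterance=utterance, position="center")
    return None
```

Calling `route_speech("Set a 3-minute timer")` in this sketch yields a watch-application object at the center of the display, which is the default behavior that the later embodiments refine using context.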
[0005] A wearable device according to one embodiment may include a display, a memory, and at least one processor. The memory may store instructions that, when executed individually or collectively by the at least one processor, cause the wearable device to: execute a voice recognition application that processes a user's voice input; receive a first speech input of the user through the voice recognition application; determine a first virtual object that corresponds to the first speech input and is executed in an application other than the voice recognition application; determine a location on the display where the first virtual object is to be displayed, or a shape of the first virtual object, based on at least one of external device information around the wearable device, information regarding an object recognized in images that are currently being output through the display or have a history of being output through the display, or information related to the user; and display the first virtual object on the display according to the location or shape.
[0006] A method for executing an application according to one embodiment may include: executing a voice recognition application that processes a user's voice input; receiving a first speech input of the user through the voice recognition application; determining a first virtual object corresponding to the first speech input, which is executed in an application other than the voice recognition application; determining a position on a display where the first virtual object is to be displayed or a shape of the first virtual object based on at least one of external device information around the wearable device, information regarding an object recognized in images that are currently being output through the wearable device display or have a history of being output through the display previously, or information related to the user; and displaying the first virtual object on a display according to the position or shape.
[0007] A computer-readable storage medium according to one embodiment may store instructions executable by a processor. When the instructions are executed, the processor of a wearable device may perform the following operations: executing a voice recognition application that processes a user's voice input; receiving a first speech input of the user through the voice recognition application; determining a first virtual object corresponding to the first speech input, which is executed in an application other than the voice recognition application; determining a location on a display where the first virtual object is to be displayed or a shape of the first virtual object based on at least one of external device information around the wearable device, information regarding an object recognized in images that are currently being output through the wearable device display or have a history of being output through the display, or information related to the user; and displaying the first virtual object on the display according to the location or shape.
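The determining step shared by the three embodiments above can be sketched as follows. This is only an illustration under stated assumptions: the `Context` and `Placement` types, the `RELATED_OBJECT` table, the `preferred_anchor` key, the shape names, and the order in which the three kinds of information are consulted are all hypothetical. The document states only that the location or shape is based on at least one of the three kinds of information.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Context:
    # The three kinds of information named in the embodiments.
    external_devices: List[str] = field(default_factory=list)
    recognized_objects: List[str] = field(default_factory=list)
    user_info: Dict[str, str] = field(default_factory=dict)


@dataclass
class Placement:
    anchor: str  # where the virtual object is displayed
    shape: str   # how the virtual object is drawn


# Hypothetical association between a virtual object and a real object.
RELATED_OBJECT = {"timer": "microwave"}


def decide_placement(kind: str, ctx: Context) -> Placement:
    """Choose a location and shape from whichever context is available."""
    related = RELATED_OBJECT.get(kind)
    if related is not None and related in ctx.recognized_objects:
        # A related object was recognized in the output images.
        return Placement(anchor=f"near:{related}", shape="compact")
    if ctx.external_devices:
        # Fall back to a nearby external device.
        return Placement(anchor=f"near:{ctx.external_devices[0]}", shape="panel")
    preferred = ctx.user_info.get("preferred_anchor")
    if preferred:
        # Fall back to user-related information.
        return Placement(anchor=preferred, shape="panel")
    # No usable context: default to the center of the display.
    return Placement(anchor="center", shape="panel")
```

With no context at all, the sketch falls back to the center placement described in paragraph [0004]; any single source of context is enough to move the object elsewhere.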
[0008] FIG. 1 is a block diagram of an electronic device in a network environment according to various embodiments.
[0009] FIG. 2 illustrates an example of a block diagram of a wearable device according to one embodiment.
[0010] FIG. 3 is a configuration diagram of a wearable device according to one embodiment.
[0011] FIG. 4a is a flowchart illustrating an application execution method according to one embodiment.
[0012] FIG. 4b shows an example of a multi-parameter according to one embodiment.
[0013] FIG. 5a shows a representation of a context-based first virtual object according to one embodiment.
[0014] FIG. 5b shows a display of a first virtual object among other virtual object displays according to one embodiment.
[0015] FIG. 6a shows a representation of a first virtual object associated with another space according to one embodiment.
[0016] FIG. 6b shows the shape of a first virtual object according to one embodiment.
[0017] FIG. 7 is a flowchart regarding the display of a first virtual object inside or outside the FOV according to one embodiment.
[0018] FIG. 8 shows the display of a first virtual object on a large screen display according to one embodiment.
[0019] FIG. 9 shows the arrangement of content generated using generative AI according to one embodiment.
[0020] FIG. 10 shows a change in a first virtual object over time according to one embodiment.
[0021] FIG. 11 shows a change in a first virtual object according to the movement of a user according to one embodiment.
[0022] FIG. 12 shows a change in the first virtual object according to the movement of the user and the placement of the second virtual object according to one embodiment.
[0023] In relation to the description of the drawings, the same or similar reference numerals may be used for identical or similar components.
[0024] Hereinafter, various embodiments of this document are described with reference to the accompanying drawings. However, this is not intended to limit the technology described in this document to specific embodiments and should be understood to include various modifications, equivalents, and / or alternatives to the embodiments of this document. In relation to the description of the drawings, similar reference numerals may be used for similar components.
[0025] FIG. 1 is a block diagram of an electronic device (101) in a network environment (100) according to various embodiments. Referring to FIG. 1, in the network environment (100), the electronic device (101) may communicate with an electronic device (102) through a first network (198) (e.g., a short-range wireless communication network) or with an electronic device (104) or a server (108) through a second network (199) (e.g., a long-range wireless communication network). According to one embodiment, the electronic device (101) may communicate with the electronic device (104) through a server (108). According to one embodiment, the electronic device (101) may include a processor (120), memory (130), input module (150), sound output module (155), display module (or display) (160), audio module (170), sensor module (176), interface (177), connection terminal (178), haptic module (179), camera module (180), power management module (188), battery (189), communication module (190), subscriber identification module (196), or antenna module (197). In some embodiments, at least one of these components (e.g., connection terminal (178)) may be omitted from the electronic device (101), or one or more other components may be added. In some embodiments, some of these components (e.g., sensor module (176), camera module (180), or antenna module (197)) may be integrated into a single component (e.g., display module (160)).
[0026] The processor (120) can control at least one other component (e.g., hardware or software component) of the electronic device (101) connected to the processor (120) by executing software (e.g., program (140)), and can perform various data processing or operations. According to one embodiment, as at least part of the data processing or operations, the processor (120) can store commands or data received from other components (e.g., sensor module (176) or communication module (190)) in volatile memory (132), process the commands or data stored in volatile memory (132), and store the resulting data in non-volatile memory (134). According to one embodiment, the processor (120) may include a main processor (121) (e.g., central processing unit or application processor) or an auxiliary processor (123) that can operate independently or together with it (e.g., graphics processing unit, neural processing unit (NPU), image signal processor, sensor hub processor, or communication processor). For example, if the electronic device (101) includes a main processor (121) and an auxiliary processor (123), the auxiliary processor (123) may be configured to use less power than the main processor (121) or to be specialized for a designated function. The auxiliary processor (123) may be implemented separately from the main processor (121) or as part thereof.
[0027] The auxiliary processor (123) may control at least some of the functions or states associated with at least one component of the electronic device (101) (e.g., display module (160), sensor module (176), or communication module (190)) on behalf of the main processor (121) while the main processor (121) is in an inactive (e.g., sleep) state, or together with the main processor (121) while the main processor (121) is in an active (e.g., application execution) state. According to one embodiment, the auxiliary processor (123) (e.g., image signal processor or communication processor) may be implemented as part of another functionally related component (e.g., camera module (180) or communication module (190)). According to one embodiment, the auxiliary processor (123) (e.g., neural processing unit) may include a hardware structure specialized for processing an artificial intelligence model. The artificial intelligence model may be generated through machine learning. Such learning may be performed, for example, on the electronic device (101) itself, where the artificial intelligence model is executed, or through a separate server (e.g., server (108)). The learning algorithm may include, for example, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning, but is not limited to these examples. The artificial intelligence model may include a plurality of artificial neural network layers. An artificial neural network may be a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), a deep Q-network, or a combination of two or more of the above, but is not limited to these examples. In addition to the hardware structure, the artificial intelligence model may additionally or alternatively include a software structure.
[0028] The memory (130) can store various data used by at least one component of the electronic device (101) (e.g., processor (120) or sensor module (176)). The data may include, for example, input data or output data for software (e.g., program (140)) and related commands. The memory (130) may include volatile memory (132) or non-volatile memory (134).
[0029] The program (140) may be stored as software in memory (130) and may include, for example, an operating system (142), middleware (144), or an application (146).
[0030] The input module (150) can receive commands or data to be used for a component of the electronic device (101) (e.g., processor (120)) from outside the electronic device (101) (e.g., user). The input module (150) may include, for example, a microphone, a mouse, a keyboard, a key (e.g., a button), or a digital pen (e.g., a stylus pen).
[0031] The sound output module (155) can output a sound signal to the outside of the electronic device (101). The sound output module (155) may include, for example, a speaker or a receiver. The speaker may be used for general purposes, such as multimedia playback or recording playback. The receiver may be used to receive incoming calls. According to one embodiment, the receiver may be implemented separately from the speaker or as part thereof.
[0032] The display module (160) can visually provide information to the outside (e.g., a user) of the electronic device (101). The display module (160) may include, for example, a display, a holographic device, or a projector, and a control circuit for controlling the corresponding device. According to one embodiment, the display module (160) may include a touch sensor configured to detect a touch, or a pressure sensor configured to measure the intensity of the force generated by the touch.
[0033] The audio module (170) can convert sound into an electrical signal or, conversely, convert an electrical signal into sound. According to one embodiment, the audio module (170) can acquire sound through the input module (150) or output sound through the sound output module (155) or an external electronic device (e.g., electronic device (102)) (e.g., speaker or headphones) connected directly or wirelessly to the electronic device (101).
[0034] The sensor module (176) can detect the operating state of the electronic device (101) (e.g., power or temperature) or the external environmental state (e.g., user state) and generate an electrical signal or data value corresponding to the detected state. According to one embodiment, the sensor module (176) may include, for example, a gesture sensor, a gyroscope sensor, a barometric pressure sensor, a magnetic sensor, an acceleration sensor, a grip sensor, a proximity sensor, a color sensor, an IR (infrared) sensor, a biosensor, a temperature sensor, a humidity sensor, or an illuminance sensor.
[0035] The interface (177) may support one or more specified protocols that can be used for the electronic device (101) to be connected directly or wirelessly to an external electronic device (e.g., electronic device (102)). According to one embodiment, the interface (177) may include, for example, a high definition multimedia interface (HDMI), a universal serial bus (USB) interface, an SD card interface, or an audio interface.
[0036] The connection terminal (178) may include a connector through which the electronic device (101) can be physically connected to an external electronic device (e.g., electronic device (102)). According to one embodiment, the connection terminal (178) may include, for example, an HDMI connector, a USB connector, an SD card connector, or an audio connector (e.g., a headphone connector).
[0037] The haptic module (179) can convert an electrical signal into a mechanical stimulus (e.g., vibration or movement) or an electrical stimulus that the user can perceive through tactile or kinesthetic senses. According to one embodiment, the haptic module (179) may include, for example, a motor, a piezoelectric element, or an electric stimulation device.
[0038] The camera module (180) can capture still images and video. According to one embodiment, the camera module (180) may include one or more lenses, image sensors, image signal processors, or flashes.
[0039] The power management module (188) can manage the power supplied to the electronic device (101). According to one embodiment, the power management module (188) can be implemented, for example, as at least part of a power management integrated circuit (PMIC).
[0040] The battery (189) can supply power to at least one component of the electronic device (101). According to one embodiment, the battery (189) may include, for example, a non-rechargeable primary battery, a rechargeable secondary battery, or a fuel cell.
[0041] The communication module (190) can support the establishment of a direct (e.g., wired) communication channel or a wireless communication channel between the electronic device (101) and an external electronic device (e.g., electronic device (102), electronic device (104), or server (108)), and the performance of communication through the established communication channel. The communication module (190) may include one or more communication processors that operate independently of the processor (120) (e.g., application processor) and support direct (e.g., wired) communication or wireless communication. According to one embodiment, the communication module (190) may include a wireless communication module (192) (e.g., cellular communication module, short-range wireless communication module, or GNSS (global navigation satellite system) communication module) or a wired communication module (194) (e.g., LAN (local area network) communication module, or power line communication module). The corresponding communication module among these communication modules can communicate with the external electronic device (104) through a first network (198) (e.g., a short-range communication network such as Bluetooth, WiFi (wireless fidelity) direct, or IrDA (infrared data association)) or a second network (199) (e.g., a long-range communication network such as a legacy cellular network, a 5G network, a next-generation communication network, the Internet, or a computer network (e.g., a LAN or WAN)). These various types of communication modules may be integrated into a single component (e.g., a single chip) or implemented as multiple separate components (e.g., multiple chips). The wireless communication module (192) can identify or authenticate the electronic device (101) within a communication network such as the first network (198) or the second network (199) using subscriber information (e.g., International Mobile Subscriber Identity (IMSI)) stored in the subscriber identification module (196).
[0042] The wireless communication module (192) can support a 5G network following a 4G network and next-generation communication technology, for example, new radio (NR) access technology. NR access technology can support high-speed transmission of high-capacity data (enhanced mobile broadband (eMBB)), minimization of terminal power and connection of multiple terminals (massive machine type communications (mMTC)), or high reliability and low latency (ultra-reliable and low-latency communications (URLLC)). The wireless communication module (192) can support a high-frequency band (e.g., mmWave band), for example, to achieve a high data transmission rate. The wireless communication module (192) can support various technologies for securing performance in the high-frequency band, such as beamforming, massive MIMO (multiple-input and multiple-output), full-dimensional MIMO (FD-MIMO), array antenna, analog beamforming, or large-scale antenna. The wireless communication module (192) can support various requirements specified for the electronic device (101), an external electronic device (e.g., electronic device (104)), or a network system (e.g., second network (199)). According to one embodiment, the wireless communication module (192) can support a peak data rate (e.g., 20 Gbps or more) for realizing eMBB, loss coverage (e.g., 164 dB or less) for realizing mMTC, or U-plane latency (e.g., 0.5 ms or less for each of downlink (DL) and uplink (UL), or 1 ms or less round trip) for realizing URLLC.
[0043] The antenna module (197) can transmit a signal or power to the outside (e.g., an external electronic device) or receive a signal or power from the outside. According to one embodiment, the antenna module (197) may include an antenna comprising a radiator made of a conductor or a conductive pattern formed on a substrate (e.g., a PCB). According to one embodiment, the antenna module (197) may include a plurality of antennas (e.g., an array antenna). In this case, at least one antenna suitable for a communication method used in a communication network, such as the first network (198) or the second network (199), may be selected from the plurality of antennas, for example, by the communication module (190). A signal or power may be transmitted or received between the communication module (190) and an external electronic device through the selected at least one antenna. According to some embodiments, in addition to the radiator, other components (e.g., a radio frequency integrated circuit (RFIC)) may be additionally formed as part of the antenna module (197).
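The selection of antennas suited to a communication method can be pictured with a short Python sketch. The `Antenna` type, the method names, and the `select_antennas` function are hypothetical; the document does not specify how the communication module (190) makes this selection.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Antenna:
    name: str
    methods: FrozenSet[str]  # communication methods this antenna suits


def select_antennas(antennas: List[Antenna], method: str) -> List[Antenna]:
    """Return every antenna suitable for the given communication method."""
    return [antenna for antenna in antennas if method in antenna.methods]


# Hypothetical array of two antennas.
ARRAY = [
    Antenna("ant0", frozenset({"bluetooth", "wifi"})),
    Antenna("ant1", frozenset({"5g_mmwave"})),
]
```

In this sketch, `select_antennas(ARRAY, "5g_mmwave")` returns only `ant1`, and an unsupported method returns an empty list, leaving the caller to decide what to do when no antenna fits.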
[0044] According to various embodiments, the antenna module (197) may form a mmWave antenna module. According to one embodiment, the mmWave antenna module may include a printed circuit board, an RFIC disposed on or adjacent to a first surface (e.g., bottom surface) of the printed circuit board and capable of supporting a specified high frequency band (e.g., mmWave band), and a plurality of antennas (e.g., array antennas) disposed on or adjacent to a second surface (e.g., top surface or side surface) of the printed circuit board and capable of transmitting or receiving a signal of the specified high frequency band.
[0045] At least some of the above components can be connected to each other via a communication method between peripheral devices (e.g., bus, GPIO (general purpose input and output), SPI (serial peripheral interface), or MIPI (mobile industry processor interface)) and exchange signals (e.g., commands or data) with each other.
[0046] According to one embodiment, commands or data may be transmitted or received between the electronic device (101) and an external electronic device (104) through a server (108) connected to a second network (199). Each of the external electronic devices (102 or 104) may be a device of the same type as, or a different type from, the electronic device (101). According to one embodiment, all or part of the operations performed on the electronic device (101) may be performed on one or more of the external electronic devices (102, 104, or 108). For example, if the electronic device (101) needs to perform a function or service automatically or in response to a request from a user or another device, the electronic device (101) may request one or more external electronic devices to perform at least part of the function or service instead of, or in addition to, performing the function or service itself. The one or more external electronic devices that receive the request may execute at least part of the requested function or service, or an additional function or service related to the request, and transmit the result of the execution to the electronic device (101). The electronic device (101) may provide the result, as is or after additional processing, as at least part of the response to the request. For this purpose, for example, cloud computing, distributed computing, mobile edge computing (MEC), or client-server computing technology may be used. The electronic device (101) may provide ultra-low latency services using, for example, distributed computing or mobile edge computing. In another embodiment, the external electronic device (104) may include an Internet of Things (IoT) device. The server (108) may be an intelligent server using machine learning and / or neural networks. According to one embodiment, the external electronic device (104) or the server (108) may be included within a second network (199). The electronic device (101) can be applied to intelligent services (e.g., smart home, smart city, smart car, or healthcare) based on 5G communication technology and IoT-related technology.
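The request-and-respond flow in this paragraph can be sketched in Python as follows. The `run_service` function and its parameters are hypothetical names chosen for illustration; the sketch covers only the case where the function is delegated instead of being run locally, and it stands in for the external device with an ordinary callable rather than any real network API.

```python
from typing import Callable, Optional


def run_service(
    request: str,
    remote: Callable[[str], str],
    local: Optional[Callable[[str], str]] = None,
    post_process: Optional[Callable[[str], str]] = None,
) -> str:
    """Execute a function or service, delegating when it cannot run locally.

    local        -- on-device implementation, or None to delegate
    remote       -- stand-in for an external electronic device or server
    post_process -- optional additional processing of the result
    """
    if local is not None:
        result = local(request)
    else:
        # Ask the external device to execute the function and return the result.
        result = remote(request)
    # Provide the result as is, or after additional processing.
    return post_process(result) if post_process is not None else result
```

For example, `run_service("translate", remote=lambda r: r.upper())` returns the external result unchanged, while passing `post_process` models the case where the device further processes the result before responding.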
[0047]
[0048] FIG. 2 illustrates an example of a block diagram of a wearable device according to one embodiment.
[0049] Referring to FIGS. 1 and 2, a wearable device (201) according to one embodiment may include at least one of a processor (210), a memory (215), a display (220), a camera (225), a sensor (230), or a communication circuit (235). The processor (210), memory (215), display (220), camera (225), sensor (230), and communication circuit (235) may be electrically and / or operably coupled with each other by an electronic component such as a communication bus (202).
[0050] The type and / or number of hardware components included in the wearable device (201) are not limited to those shown in FIG. 2. For example, the wearable device (201) may include only some of the hardware components shown in FIG. 2. The elements within the memory (215) (e.g., layers and / or modules) described below may be logically separated. The elements within the memory (215) may be included within a hardware component that is separate from the memory (215). An operation performed by the processor (210) using each of the elements within the memory (215) is one embodiment, and the processor (210) may perform an operation different from the above operation through at least one of the elements within the memory (215).
[0051] A processor (210) of a wearable device (201) according to one embodiment may include a hardware component for processing data based on one or more instructions. The hardware component for processing data may include, for example, an arithmetic and logic unit (ALU), a field programmable gate array (FPGA), and / or a central processing unit (CPU). The number of processors (210) may be one or more. For example, the processor (210) may have the structure of a multi-core processor such as a dual core, a quad core, or a hexa core.
[0052] A memory (215) of a wearable device (201) according to one embodiment may include a hardware component for storing data and / or instructions that are input and / or output to a processor (210). The memory (215) may include, for example, volatile memory such as random-access memory (RAM) and / or non-volatile memory such as read-only memory (ROM). Volatile memory may include, for example, at least one of dynamic RAM (DRAM), static RAM (SRAM), cache RAM, and pseudo SRAM (PSRAM). Non-volatile memory may include, for example, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), flash memory, hard disk, compact disk, and embedded multi-media card (eMMC).
[0053] In one embodiment, a display (220) of a wearable device (201) can output visualized information to a user of the wearable device (201). For example, the display (220) can be controlled by a processor (210) including a circuit such as a GPU (graphics processing unit) to output visualized information to a user. The display (220) may include a flat panel display (FPD) and / or electronic paper. The FPD may include a liquid crystal display (LCD), a plasma display panel (PDP), and / or one or more light emitting diodes (LEDs). The LED may include an organic LED (OLED).
[0054] In one embodiment, the camera (225) of the wearable device (201) may include one or more light sensors (e.g., a charge-coupled device (CCD) sensor or a complementary metal oxide semiconductor (CMOS) sensor) that generate an electrical signal indicating the color and / or brightness of light. The plurality of light sensors included in the camera (225) may be arranged in the form of a two-dimensional grid (two-dimensional array). The camera (225) may acquire the electrical signals of each of the plurality of light sensors substantially simultaneously to generate two-dimensional frame data corresponding to the light reaching the light sensors of the two-dimensional grid.
[0055] For example, photo data captured using the camera (225) may mean one two-dimensional frame data obtained from the camera (225).
[0056] For example, video data captured using a camera (225) may mean a sequence of multiple two-dimensional frame data obtained from the camera (225) at a frame rate.
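The relationship between frame data, photo data, and video data described above can be modeled with a small Python sketch. The `Frame` alias, the `Video` type, and the `capture_frame` helper are hypothetical, and each grid cell here holds a single brightness value for simplicity.

```python
from dataclasses import dataclass
from typing import List

# One two-dimensional frame: rows of per-sensor brightness values.
Frame = List[List[int]]


def capture_frame(width: int, height: int, value: int = 0) -> Frame:
    """Build one frame as if all light sensors were read at the same time."""
    return [[value for _ in range(width)] for _ in range(height)]


@dataclass
class Video:
    frames: List[Frame]  # sequence of two-dimensional frame data
    frame_rate: float    # frames per second

    def duration_seconds(self) -> float:
        return len(self.frames) / self.frame_rate


# Photo data is a single frame; video data is a sequence of frames.
photo = capture_frame(4, 3)
video = Video(frames=[capture_frame(4, 3) for _ in range(60)], frame_rate=30.0)
```

In this sketch, `photo` is a grid of 3 rows by 4 columns, and `video.duration_seconds()` evaluates to 2.0 because 60 frames are played back at 30 frames per second.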
[0057] The camera (225) may further include a flash light for outputting light in the direction in which the camera (225) receives light.
[0058] According to one embodiment, a wearable device (201) may include a plurality of cameras arranged facing different directions as an example of a camera (225). Among the plurality of cameras, a first camera may be referred to as a motion recognition camera, and a second camera may be referred to as an eye tracking camera.
[0059] The wearable device (201) can identify the position, shape, and / or gesture of the hand using an image obtained using the first camera.
[0060] The wearable device (201) can identify the direction of gaze of a user wearing the wearable device (201) by using an image obtained using the second camera. For example, the direction in which the first camera is facing and the direction in which the second camera is facing may be opposite.
[0061] According to one embodiment, a sensor (230) of a wearable device (201) can generate electrical information that can be processed by a processor (210) and / or memory (215) of a wearable device (201) from non-electronic information associated with the wearable device (201). The information may be referred to as sensor data.
[0062] The sensor (230) may include a GPS (global positioning system) sensor for detecting the geographic location of the wearable device (201), an image sensor, an illuminance sensor and / or a ToF (time-of-flight) sensor, and an IMU (inertial measurement unit) for detecting the physical motion of the wearable device (201).
[0063] In one embodiment, the communication circuit (235) of the wearable device (201) may include hardware components to support the transmission and / or reception of electrical signals between the wearable device (201) and an external electronic device. The communication circuit (235) may include, for example, at least one of a modem, an antenna, and an optic / electronic converter. The communication circuit (235) may support the transmission and / or reception of electrical signals based on various types of protocols such as Ethernet, LAN (local area network), WAN (wide area network), WiFi (wireless fidelity), Bluetooth, BLE (Bluetooth low energy), ZigBee, LTE (long term evolution), 5G NR (new radio) and / or 6G.
[0064] According to one embodiment, within the memory (215) of a wearable device (201), one or more instructions (or commands) representing computations and / or operations to be performed on data by the processor (210) of the wearable device (201) may be stored. A set of one or more instructions may be referred to as firmware, an operating system, a process, a routine, a sub-routine, and / or an application. For example, the wearable device (201) and / or the processor (210) may perform at least one operation when a set of a plurality of instructions distributed in the form of an operating system, firmware, a driver, and / or an application is executed.
[0065] In the following, the statement that an application is installed in a wearable device (201) may mean that one or more instructions provided in the form of an application are stored in memory (215), and that the one or more instructions are stored in a format executable by the processor (210) (e.g., a file having an extension specified by the operating system of the wearable device (201)). For example, the application may include a program and / or library related to a service provided to a user.
[0066] Referring to FIG. 2, programs installed on a wearable device (201) can be classified into any one of different layers, including an application layer (240), a framework layer (250), and / or a hardware abstraction layer (HAL) (280), based on the target.
[0067] For example, within the hardware abstraction layer (280), programs (e.g., modules, or drivers) designed to target the hardware of the wearable device (201) (e.g., display (220), camera (225), and / or sensor (230)) may be classified. The framework layer (250) may be referred to as an XR framework layer in that it includes one or more programs for providing XR (extended reality) services. For example, FIG. 2 illustrates the layers as separated within memory (215), but the layers may be logically separated. However, it is not limited thereto. According to an embodiment, the layers may be stored in a designated area within memory (215).
[0068] For example, within the framework layer (250), programs designed to target at least one of the hardware abstraction layer (280) and / or the application layer (240) (e.g., location tracker (271), spatial recognizer (272), gesture tracker (273), eye tracker (274), and / or face tracker (275)) may be classified. Programs classified into the framework layer (250) may provide an application programming interface (API) that is executable based on other programs.
[0069] For example, within the application layer (240), programs designed to target a user controlling a wearable device (201) may be classified. Examples of programs classified into the application layer (240) include an XR (extended reality) system UI (user interface) (241) and / or an XR application (242), but embodiments are not limited thereto. For example, programs classified into the application layer (240) (e.g., software applications) may call an API to cause the execution of functions supported by programs classified into the framework layer (250).
[0070] For example, the wearable device (201) may display one or more visual objects on the display (220) to perform interaction with a user for using a virtual space based on the execution of the XR system UI (241). A visual object may mean an object that can be deployed within the screen for the transmission of information and / or interaction, such as text, images, icons, videos, buttons, checkboxes, radio buttons, text boxes, sliders, and / or tables. A visual object may be referred to as a visual guide, a virtual object, a visual element, a UI element, a view object, and / or a view element. The wearable device (201) may provide the user with a service to control functions available within the virtual space based on the execution of the XR system UI (241).
[0071] Although the XR system UI (241) is illustrated to include a lightweight renderer (243) and / or an XR plugin (244), it is not limited thereto. For example, the XR system UI (241) may cause the execution of supported functions in the lightweight renderer (243) and / or XR plugin (244) included within the application layer (240).
[0072] For example, a wearable device (201) may acquire resources (e.g., APIs, system processes and / or libraries) used to define, create, and / or execute a rendering pipeline, which is permitted to be partially modified, based on the execution of a lightweight renderer (243). The lightweight renderer (243) may be referred to as a lightweight render pipeline in terms of defining a rendering pipeline, which is permitted to be partially modified. The lightweight renderer (243) may include a renderer built prior to the execution of a software application (e.g., a pre-built renderer). For example, the wearable device (201) may acquire resources (e.g., APIs, system processes and / or libraries) used to define, create, and / or execute the entire rendering pipeline based on the execution of an XR plugin (244). The XR plugin (244) may be referred to as an open XR native client in terms of defining (or setting) the entire rendering pipeline.
[0073] For example, the wearable device (201) may display a screen representing at least a portion of a virtual space on the display (220) based on the execution of the XR application (242). The XR plugin (241-1) included in the XR application (242) may be referenced by the XR plugin (244) of the XR system UI (241). Descriptions of the XR plugin (241-1) that overlap with descriptions of the XR plugin (244) may be omitted. The wearable device (201) may trigger the execution of a screen composition manager (251) based on the execution of the XR application (242).
[0074] According to one embodiment, a wearable device (201) may provide a virtual space service based on the execution of a screen composition manager (251). For example, the screen composition manager (251) may include a platform (e.g., an Android platform) for supporting the virtual space service. Based on the execution of the screen composition manager (251), the wearable device (201) may display on a display the posture of a virtual object representing a rendered user's posture using data acquired through a sensor (230). The screen composition manager (251) may be referred to as a composition presentation manager (CPM).
[0075] For example, the screen composition manager (251) may include a runtime service (252). In one example, the runtime service (252) may be referred to as an OpenXR runtime module. A wearable device (201) may be used to provide at least one of a user pose prediction function, a frame timing function, and / or a spatial input function through the wearable device (201) based on the execution of the runtime service (252). In one example, the wearable device (201) may be used to perform rendering for a virtual space service for the user based on the execution of the runtime service (252). For example, an application (e.g., Unity or OpenXR native application) may be implemented based on the execution of the runtime service (252).
[0076] For example, the screen composition manager (251) may include a pass-through library (253). The wearable device (201) may, based on the execution of the pass-through library (253), display another screen representing real space acquired through a camera (225) overlaid on at least a portion of the screen while displaying a screen representing virtual space on the display (220).
[0077] For example, the screen composition manager (251) may include a renderer. The wearable device (201) can render a screen to be displayed on a display by compositing virtual layers (or virtual nodes) rendered based on sensor data (e.g., sensing data obtained through a camera (225) or sensor (230)) and pass-through layers (or pass-through nodes) obtained through a pass-through library (253) through the screen composition manager (251) using the renderer. The virtual layers may be referred to as virtual nodes and / or virtual surfaces. The wearable device (201) can render each of the virtual layers or render all of the virtual layers through the screen composition manager (251).
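As a rough illustration of the compositing step described above, the following Python sketch blends layers back to front with the standard "over" operator. The pixel format (straight-alpha RGBA floats in the range 0 to 1), the back-to-front ordering, and the function names `over` and `composite` are assumptions made for illustration; the document does not specify how the renderer combines virtual layers and pass-through layers.

```python
def over(top, bottom):
    """Blend one RGBA pixel (floats 0..1, straight alpha) over another."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1 - ta)
    if out_a == 0:
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t, b):
        # Weighted average of the two colors by their alpha contribution.
        return (t * ta + b * ba * (1 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def composite(layers):
    """Composite layers given back to front; each layer is a list of RGBA pixels."""
    result = list(layers[0])
    for layer in layers[1:]:
        result = [over(t, b) for t, b in zip(layer, result)]
    return result


# Hypothetical one-pixel example: an opaque pass-through pixel under a
# half-transparent virtual layer.
pass_through = [(1.0, 0.0, 0.0, 1.0)]
virtual = [(0.0, 1.0, 0.0, 0.5)]
frame = composite([pass_through, virtual])
```

In this sketch a fully transparent virtual pixel leaves the pass-through image visible, which matches the idea of showing real space through part of the virtual screen.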
[0078] For example, the screen composition manager (251) may include an input manager (254). The wearable device (201) may identify acquired data (e.g., sensor data) by executing one or more programs included in the recognition service layer (270) based on the execution of the input manager (254). The wearable device (201) may initiate the execution of at least one of the functions of the wearable device (201) using the acquired data.
[0079] For example, a perception abstract layer (260) may be used for data exchange between the screen composition manager (251) and the recognition service layer (270). In terms of being used for data exchange between the screen composition manager (251) and the recognition service layer (270), the perception abstract layer (260) may be referred to as an interface. As an example, the perception abstract layer (260) may be referred to as OpenPX and / or PPAL (perception platform abstract layer). The perception abstract layer (260) may be used for a perception client and a perception service.
[0080] According to one embodiment, the recognition service layer (270) may include one or more programs for processing data obtained from a sensor (230) (or a camera (225)). The one or more programs may include at least one of a location tracker (271), a spatial recognizer (272), a gesture tracker (273), an eye tracker (274), and / or a face tracker (275). The type and / or number of the one or more programs included in the recognition service layer (270) are not limited to those shown in FIG. 2.
[0081] For example, the wearable device (201) can identify the posture of the wearable device (201) using the sensor (230) based on the operation of the location tracker (271). The wearable device (201) can identify the 6 degrees of freedom pose (6 DOF pose) of the wearable device (201) using data acquired using the camera (225) and the IMU based on the operation of the location tracker (271). The location tracker (271) may be referred to as a head tracking (HeT) module.
[0082] For example, the wearable device (201) may be used to construct the surrounding environment of the wearable device (201) (or the user of the wearable device (201)) into a three-dimensional virtual space based on the execution of the spatial recognizer (272). The wearable device (201) may reconstruct the surrounding environment of the wearable device (201) in three dimensions using data acquired through the camera (225) based on the execution of the spatial recognizer (272). The wearable device (201) may identify at least one of a plane, an incline, or a staircase based on the surrounding environment of the wearable device (201) reconstructed in three dimensions based on the execution of the spatial recognizer (272). The spatial recognizer (272) may be referred to as a scene understanding (SU) module.
[0083] For example, the wearable device (201) may be used to identify (or recognize) the pose and / or gesture of the hand of the user of the wearable device (201) based on the execution of the gesture tracker (273). For example, the wearable device (201) may identify the pose and / or gesture of the user's hand using data acquired from the sensor (230) based on the execution of the gesture tracker (273). For example, the wearable device (201) may identify the pose and / or gesture of the user's hand based on data (or images) acquired using the camera (225) based on the execution of the gesture tracker (273). The gesture tracker (273) may be referred to as a hand tracking (HaT) module and / or a gesture tracking module.
[0084] For example, the wearable device (201) can identify (or track) the movement of the eyes of the user of the wearable device (201) based on the execution of the eye tracker (274). For example, the wearable device (201) can identify the movement of the user's eyes using data obtained from at least one sensor based on the execution of the eye tracker (274). For example, the wearable device (201) can identify the movement of the user's eyes based on data obtained using a camera (225) and / or an IR LED (infrared light emitting diode) based on the execution of the eye tracker (274). The eye tracker (274) may be referred to as an eye tracking (ET) module and / or a gaze tracking module.
[0085] For example, the recognition service layer (270) of the wearable device (201) may further include a face tracker (275) for tracking the user's face. For example, the wearable device (201) may identify (or track) the movement of the user's face and / or the user's facial expression based on the execution of the face tracker (275). The wearable device (201) may estimate the user's facial expression based on the movement of the user's face based on the execution of the face tracker (275). For example, the wearable device (201) may identify the movement of the user's face and / or the user's facial expression based on data (e.g., an image) acquired using a camera based on the execution of the face tracker (275).
[0086]
[0087] FIG. 3 is a configuration diagram of a wearable device according to one embodiment. FIG. 3 is classified according to functions related to the output of a context-based virtual object, but is not limited thereto. The configurations of FIG. 3 may be partially integrated or separated. The operation of each configuration in FIG. 3 may be the operation of the processor (210) of FIG. 2.
[0088] Referring to FIG. 3, a wearable device (301) (e.g., the wearable device (201) of FIG. 2) may include a parameter extraction unit (305), a multimodal language model (or a multimodal large language model) (310), a position determination unit (320), an evaluation unit (330), and an output unit (340).
[0089] The parameter extraction unit (305) can extract multi-parameters from a plurality of field of view (FOV) images (or a plurality of scenes, hereinafter the same). The multi-parameters (or context) may include state information of surrounding external devices (e.g., IoT devices), information regarding actual objects included in the FOV image, information regarding virtual objects included in the FOV image, or user pattern information. Additional information regarding multi-parameters may be provided through FIG. 4b.
[0090] According to one embodiment, the parameter extraction unit (305) can extract multi-parameters not only from the FOV image of the direction of gaze the user is currently looking at (hereinafter, current FOV image), but also from FOV images of the direction of gaze the user has previously looked at. Alternatively, the parameter extraction unit (305) can extract multi-parameters from images of an area adjacent to the FOV of the direction of gaze the user is currently looking at.
[0091] According to one embodiment, the parameter extraction unit (305) can periodically acquire FOV images and extract multi-parameters according to the movement of the user's gaze direction. For example, the parameter extraction unit (305) can store FOV images for a certain period of time in volatile memory for scene understanding. The parameter extraction unit (305) can extract multi-parameters from each of the stored FOV images and transmit them to the multimodal language model (310).
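The periodic storage of FOV images for a certain period of time can be pictured as a time-bounded buffer. The following Python sketch is only an illustration under stated assumptions: the names `FovFrame` and `FovBuffer`, the `(yaw, pitch)` gaze representation, and the 10-second default retention are hypothetical and do not come from the document.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FovFrame:
    timestamp: float                 # capture time in seconds
    gaze_direction: tuple            # hypothetical (yaw, pitch) in degrees
    multi_parameters: dict = field(default_factory=dict)


class FovBuffer:
    """Keeps only frames captured within the last `retention_s` seconds."""

    def __init__(self, retention_s: float = 10.0):
        self.retention_s = retention_s
        self._frames = deque()

    def add(self, frame: FovFrame) -> None:
        self._frames.append(frame)
        self._evict(frame.timestamp)

    def _evict(self, now: float) -> None:
        # Drop the oldest frames once they fall outside the retention window.
        while self._frames and now - self._frames[0].timestamp > self.retention_s:
            self._frames.popleft()

    def frames(self) -> list:
        return list(self._frames)


buffer = FovBuffer(retention_s=10.0)
for t in (0.0, 5.0, 12.0):
    buffer.add(FovFrame(timestamp=t, gaze_direction=(0.0, 0.0)))
kept = [f.timestamp for f in buffer.frames()]
```

Each retained frame carries its own multi-parameters, so they can be passed on frame by frame to a downstream model.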
[0092] The multimodal language model (310) can receive multi-parameters obtained from each of the multiple FOV images. The multimodal language model (310) can comprehensively analyze the multi-parameters of each FOV image and generate a prompt (315) for each FOV image. The multimodal language model (310) can transmit the generated prompt (315) to the position determination unit (320).
[0093] According to one embodiment, when a speech input (hereinafter referred to as the first speech input) related to the execution of another application (hereinafter referred to as the first application) is received through a speech recognition application that recognizes the speech input of a user, the multimodal language model (310) can select an FOV image at the time of occurrence of the first speech input, or FOV images that overlap at least partially with the FOV image at the time of the first speech input. The multimodal language model (310) can generate a prompt (315) using the multi-parameters of the selected FOV images.
[0094] The multimodal language model (310) can analyze the multi-parameters extracted by the parameter extraction unit (305) and the first speech input to generate a prompt (315) requesting the determination of the location / shape of a virtual object (hereinafter, the first virtual object) associated with the first speech input in the first application.
[0095] According to one embodiment, the prompt (315) may be text containing a description in a form similar to human speech (e.g., natural language). The prompt may include multi-parameter based conditions and output requests expressed in more detail than the first speech input. The prompt may be processed by a multimodal large language model. Additional information regarding the prompt may be provided through FIGS. 5a through 6b.
[0096] According to one embodiment, the time at which FOV images are stored and the time at which the first speech input occurs (hereinafter, the first speech time) may be different from each other. In this case, at the first speech time, the multi-parameters of the FOV images may differ from the multi-parameters of the stored FOV images. If the multi-parameters associated with the first speech input have not changed, the multimodal language model (310) may use the stored FOV images for prompt generation. If the multi-parameters associated with the first speech input have changed, the multimodal language model (310) may use the newly extracted multi-parameters for prompt generation if the multi-parameters can be extracted at the first speech time. Alternatively, if the multimodal language model (310) cannot extract the multi-parameters at the first speech time, the stored FOV images may not be used for prompt generation.
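The three cases in the preceding paragraph amount to a small decision rule. The following Python sketch restates it under assumptions: the function name `select_prompt_parameters`, the boolean `changed` flag, and the use of `None` to mean "do not use the stored data" are hypothetical choices, and how a change is detected is left outside the sketch.

```python
from typing import Optional


def select_prompt_parameters(
    stored: dict,
    changed: bool,
    fresh: Optional[dict] = None,
) -> Optional[dict]:
    """Choose which multi-parameters feed prompt generation.

    stored  -- multi-parameters extracted from the stored FOV images
    changed -- whether the parameters relevant to the speech input changed
    fresh   -- parameters extracted at the time of the speech input, or
               None when extraction was not possible at that time
    """
    if not changed:
        return stored      # stored FOV data is still valid
    if fresh is not None:
        return fresh       # prefer the newly extracted parameters
    return None            # stale stored data is not used


stored = {"induction_range_on": False}
fresh = {"induction_range_on": True}
choice = select_prompt_parameters(stored, changed=True, fresh=fresh)
```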
[0097] The position determination unit (320) can output multiple position / shape candidates (325) regarding the position / shape of the first virtual object using the prompt generated by the multimodal language model (310) in relation to the first speech input. The position determination unit (320) may be a multimodal large language model capable of processing prompts. For example, the position determination unit (320) may output multiple position / shape candidates (325) that may be the optimal position / shape of the first virtual object through prompt chaining.
[0098] According to one embodiment, the position determination unit (320) may output a plurality of position / shape candidates (325) together with information regarding the reason for determining the plurality of position / shape candidates (325) (hereinafter, candidate determination information). Additional information regarding the determination of the plurality of position / shape candidates (325) may be provided through FIGS. 5a to 6b.
[0099] The evaluation unit (330) can receive a plurality of position / shape candidates (325) output from the position determination unit (320), and candidate determination information for each of the plurality of position / shape candidates (325). The evaluation unit (330) may be a multimodal large language model capable of processing prompts. The evaluation unit (330) can calculate an evaluation score for each of the plurality of position / shape candidates (325) of the first virtual object according to a pre-stored evaluation criteria table.
[0100] According to one embodiment, the evaluation criteria table may include validity scores regarding relevance to the task currently being performed by the user, placement appropriateness based on spatial perception, and safety and comfort, as follows.
[0101] [Evaluation Criteria: Validity Score 1 to 5 points]
[0102] Validity 1 point (1. Task Relevance: Completely irrelevant to current work, 2. Placement Appropriateness: Inappropriately placed without considering the surrounding environment, 3. Safety and Comfort: Threatens user safety or causes serious discomfort)
[0103] Validity 2 points (1. Task Relevance: Relevant to current work but generally inappropriate, 2. Placement Appropriateness: Conflicts with the surrounding environment or is in an unnatural position, 3. Safety and Comfort: Causes minor safety risks or significant discomfort)
[0104] Validity 3 points (1. Task Relevance: Relevant to current work and appropriate at a general level, 2. Placement Appropriateness: Placed considering the surrounding environment but not optimized, 3. Safety and Comfort: Provides basic safety and comfort but leaves room for improvement)
[0105] Validity 4 points (1. Task Relevance: Highly relevant to current work and efficiently integrated, 2. Placement Appropriateness: Harmoniously placed with the surrounding environment and mostly optimized, 3. Safety and Comfort: Safe and comfortable to use)
[0106] Validity 5 points (1. Task Relevance: Perfectly aligns with current tasks and significantly improves work efficiency, 2. Placement Appropriateness: Accurately perceives the surrounding environment and is perfectly placed in the optimal location, 3. Safety and Comfort: User safety and comfort are fully considered)
[0107] According to one embodiment, the evaluation unit (330) can fine-tune the evaluation criteria table in advance using basic operation criteria common to all applications or specific operation criteria tailored to a specific application.
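One way to picture the evaluation criteria table is as a per-criterion rating that is reduced to a single validity score. The Python sketch below is an assumption-laden illustration: the document describes the evaluation unit as a language model applying the table and does not state how the three criteria are combined, so taking the minimum of the three ratings (the weakest criterion decides) is a hypothetical rule, as are the names `CRITERIA` and `validity_score`.

```python
# The three criteria named in the evaluation criteria table.
CRITERIA = ("task_relevance", "placement_appropriateness", "safety_and_comfort")


def validity_score(ratings: dict) -> int:
    """Reduce per-criterion ratings (1 to 5) to one validity score.

    Assumption: the overall score is the lowest of the three ratings, so a
    candidate that is unsafe cannot score well just by being relevant.
    """
    for name in CRITERIA:
        rating = ratings[name]
        if not 1 <= rating <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {rating}")
    return min(ratings[name] for name in CRITERIA)


score = validity_score({
    "task_relevance": 5,
    "placement_appropriateness": 4,
    "safety_and_comfort": 3,
})
```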
[0108] The output unit (340) can receive an evaluation score for each of the multiple position / shape candidates (325) calculated by the evaluation unit (330). The output unit (340) can determine the final position / shape of the first virtual object based on the evaluation score and display it on the display (220).
[0109] According to one embodiment, the output unit (340) may display the first virtual object according to the candidate with the highest evaluation score, or may display the first virtual object according to a candidate of a different rank according to a predefined criterion. For example, the output unit (340) may display the first virtual object by giving weight to a candidate that is highly relevant to the current FOV image, or may display the first virtual object in a form that is highly relevant to the user's gesture (e.g., giving high weight to a point indicated by the user's gesture).
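The selection described above can be sketched as a weighted maximum over the scored candidates. In this Python sketch the `Candidate` fields, the additive weighting, and the weight values are assumptions chosen for illustration; the document only states that a candidate other than the top-scoring one may be chosen according to a predefined criterion.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str
    validity: int                     # 1 to 5, from the evaluation unit
    in_current_fov: bool = False      # relevant to the current FOV image
    near_gesture_point: bool = False  # close to the point the user indicates


def pick_candidate(candidates, fov_weight=0.0, gesture_weight=0.0):
    """Return the candidate with the highest weighted score."""
    def weighted(c):
        # Booleans count as 0 or 1, so each weight is a fixed bonus.
        return (c.validity
                + fov_weight * c.in_current_fov
                + gesture_weight * c.near_gesture_point)
    return max(candidates, key=weighted)


candidates = [
    Candidate("left of induction range", validity=4),
    Candidate("top of pot", validity=5, in_current_fov=True),
    Candidate("right of induction range", validity=3, near_gesture_point=True),
]
default_choice = pick_candidate(candidates)
gesture_choice = pick_candidate(candidates, gesture_weight=3.0)
```

With no weights the highest evaluation score wins; a large enough gesture weight lets a lower-ranked candidate near the indicated point be selected instead.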
[0110] According to one embodiment, the output unit (340) can execute the first application through a voice recognition application and display the first virtual object (360).
[0111] According to one embodiment, when the output unit (340) determines the final position or shape of the first virtual object (360), it can perform the task of converting it into a coordinate system that can be recognized and output by an XR interface (e.g., XR system UI (241) of FIG. 2).
[0112]
[0113] FIG. 4a is a flowchart illustrating an application execution method according to one embodiment.
[0114] Referring to FIG. 2 and FIG. 4a, in operation 381, the processor (210) can execute a voice recognition application that processes the user's voice input. For example, the processor (210) can execute a voice recognition application when the user wears a wearable device (201).
[0115] In operation 383, the processor (210) can receive a first speech input from a user through a voice recognition application. The first speech input may be an input that causes the user interface of another application to be executed through the voice recognition application.
[0116] In operation 385, the processor (210) can determine a first virtual object that corresponds to the first speech input and is executed in an application other than the voice recognition application. For example, if the first speech input is "set a 3-minute timer," the processor (210) can analyze keywords included in the first speech input to determine a clock application as the application corresponding to the first speech input, and determine a 3-minute timer executed by the clock application as the first virtual object.
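The keyword analysis in operation 385 can be illustrated with a simple lookup. The Python sketch below is a minimal example under assumptions: the keyword table `APP_KEYWORDS`, the application names, and the regular expression for the duration are hypothetical, and an actual device may rely on a language model rather than fixed keywords.

```python
import re
from typing import Optional

# Hypothetical keyword table mapping applications to trigger words.
APP_KEYWORDS = {
    "clock": {"timer", "alarm", "stopwatch"},
    "music": {"play", "song", "music"},
}


def resolve_application(utterance: str) -> Optional[str]:
    """Return the first application whose keywords appear in the utterance."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    for app, keywords in APP_KEYWORDS.items():
        if words & keywords:
            return app
    return None


def timer_minutes(utterance: str) -> Optional[int]:
    """Extract a duration such as '3-minute' or '3 minutes', if present."""
    match = re.search(r"(\d+)\s*-?\s*minute", utterance.lower())
    return int(match.group(1)) if match else None


app = resolve_application("Set a 3-minute timer")
minutes = timer_minutes("Set a 3-minute timer")
```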
[0117] In operation 387, the processor (210) can determine the location or shape of the first virtual object based on multi-parameters. The multi-parameters may include location information, peripheral device information, actual object information, virtual object information, or user pattern information. The processor (210) can extract multi-parameters using information recognized through a sensor or camera, information received through a peripheral device or server, or information obtained by analyzing an FOV image. Additional information regarding multi-parameters may be provided through FIG. 4b.
[0118] According to one embodiment, the processor (210) can determine the location or shape of the first virtual object by inputting multi-parameters into a large language model. Additional information regarding the determination of the location or shape of the first virtual object may be provided through the drawings below.
[0119] In operation 389, the processor (210) may display the first virtual object on the display (220) according to a determined location or shape. The processor (210) may determine the location or shape in which the first virtual object is displayed by reflecting various contexts (multi-parameters) around the user (or around the wearable device (201)). The processor (210) may change the location or shape of the displayed first virtual object by reflecting information that changes over time or as the user moves (see FIG. 11 and FIG. 12).
[0120]
[0121] FIG. 4b shows an example of a multi-parameter according to one embodiment.
[0122] Referring to FIGS. 3 and 4b, the parameter extraction unit (305) can extract multi-parameters using information recognized through a sensor or camera, information received through a peripheral device or server, and information obtained by analyzing an FOV image. The multi-parameters may include location information, peripheral device information, actual object information, virtual object information, or user pattern information.
[0123] For example, the multi-parameter (401) may include location information (410), peripheral device information (420), virtual object information (430, 440), and user pattern information (450, 460).
[0124] The location information (410) may include information about the location where the user (or wearable device (201)) is located (e.g., kitchen, living room, room 1). The location information (410) may include information about the location on a map where the wearable device (201) is placed.
[0125] Peripheral device information (420) may include information regarding the on / off status, on time, and operating status of IoT devices around the user (or wearable device (201)).
[0126] The virtual object information (430, 440) may include information about the virtual object currently being displayed (e.g., application name, screen size, screen distance, focus status).
[0127] User pattern information (450, 460) may include information on recently executed applications (450) and information (460) regarding the user's behavioral patterns.
[0128] FIG. 4b is exemplary and is not limited thereto. For example, the multi-parameter (401) may further include information about the actual object through scene understanding.
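The multi-parameter of FIG. 4b can be thought of as a structured record. The following Python sketch shows one possible layout; all class and field names (`MultiParameter`, `PeripheralDevice`, `VirtualObjectInfo`, and so on) and the JSON serialization are assumptions for illustration, not a format defined by the document.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PeripheralDevice:
    name: str
    is_on: bool
    on_time_s: float = 0.0        # how long the device has been on
    operating_status: str = ""


@dataclass
class VirtualObjectInfo:
    app_name: str
    screen_size: str
    screen_distance_m: float
    focused: bool


@dataclass
class MultiParameter:
    location: str                                           # e.g. "kitchen"
    peripheral_devices: list = field(default_factory=list)
    virtual_objects: list = field(default_factory=list)
    recent_apps: list = field(default_factory=list)         # user pattern
    behavior_patterns: list = field(default_factory=list)   # user pattern
    real_objects: list = field(default_factory=list)        # scene understanding

    def to_json(self) -> str:
        # asdict converts nested dataclasses inside the lists as well.
        return json.dumps(asdict(self), sort_keys=True)


params = MultiParameter(
    location="kitchen",
    peripheral_devices=[PeripheralDevice("induction range", True, 60.0, "heating")],
    real_objects=["pot"],
)
serialized = params.to_json()
```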
[0129]
[0130] FIG. 5a shows a representation of a context-based first virtual object according to one embodiment.
[0131] Referring to FIGS. 2, 3, and 5a, the processor (210) can display an FOV image (510) on a display (220). The FOV image (510) can correspond to the range that a user looks at while wearing the wearable device (201).
[0132] According to one embodiment, the processor (210) may execute a voice recognition application (505) to receive a user's speech input and output a response corresponding to the speech input. The response corresponding to the speech input may include an operation of executing another application to display a first virtual object on an FOV image (510).
[0133] For example, the processor (210) may receive a first speech input (505a) through a voice recognition application (505). The first speech input (505a) may be "Set a 3-minute timer." The processor (210) may analyze keywords included in the first speech input (505a) to determine a first application (e.g., a clock application) corresponding to the first speech input (505a), and determine a first virtual object (or first user interface) (e.g., a 3-minute timer) (518) displayed by the first application.
[0134] According to one embodiment, the processor (210) can determine the location or shape in which the first virtual object (518) is displayed by reflecting various contexts around the user (or around the wearable device (201)). The processor (210) can check information that the induction range (511) is activated and information that a pot (512) is placed on the induction range (511) immediately before the first speech input (505a) occurs. For example, if the induction range (511) is an IoT device, the processor (210) can receive information regarding the on / off state and the turn-on time of the induction range (511) from the induction range (511) or an external server. The processor (210) can recognize the state in which a pot (512), which is a real object, is placed on the induction range (511) through scene understanding of the FOV image (510).
[0135] The processor (210) can determine the location or shape of the first virtual object (518) based on keywords (e.g., 3 minutes, timer, set) included in the first speech input (505a), information about the induction range (511) which is an IoT device, and information about the pot (512) which is a real object.
[0136] According to one embodiment, the processor (210) can generate a prompt corresponding to the first speech input (505a) using multi-parameters corresponding to various contexts. The processor (210) can determine the location or shape of the first virtual object (518) using the generated prompt. For example, when using the multimodal language model (310) of FIG. 3, the multimodal language model (310) can generate the following prompt for the FOV image (510).
[0137] Prompt output of the multimodal language model (310): "The user is currently located in the kitchen. There is an induction range within the current line of sight. The power to the induction range is on, and there is a pot on the induction range. The user mainly uses the induction range for about 3 minutes when cooking ramen. At this time, the user requested that a timer app be launched for 3 minutes. Considering this situation, determine where it is best to launch the timer app."
[0138] The prompt may be text containing a description in a form similar to human speech (e.g., natural language). The prompt may express the first speech input (505a) in more detail and may include multi-parameter based conditions and output requests. The prompt may be processed by a separate large language model (e.g., the position determination unit (320) of FIG. 3).
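To make the relationship between multi-parameters and the prompt text concrete, the following Python sketch assembles a natural-language prompt from a few context values. It is a template-based stand-in, not the method of the document, in which the prompt is generated by the multimodal language model (310). The function name `build_prompt`, its arguments, and the sentence templates are hypothetical.

```python
def build_prompt(location, devices, real_objects, usage_pattern, request):
    """Assemble a natural-language prompt from multi-parameters.

    All sentence templates are illustrative; only the overall shape
    (context conditions followed by an output request) follows the text.
    """
    lines = [f"The user is currently located in the {location}."]
    for device in devices:
        power = "on" if device["is_on"] else "off"
        lines.append(
            f"Device in the current line of sight: {device['name']} (power {power})."
        )
    for obj in real_objects:
        lines.append(f"Real object recognized in the scene: {obj}.")
    if usage_pattern:
        lines.append(f"Observed user pattern: {usage_pattern}.")
    lines.append(f'The user requested: "{request}".')
    lines.append(
        "Considering this situation, determine the best position and shape "
        "for the requested virtual object."
    )
    return " ".join(lines)


prompt = build_prompt(
    location="kitchen",
    devices=[{"name": "induction range", "is_on": True}],
    real_objects=["pot on the induction range"],
    usage_pattern="uses the induction range for about 3 minutes when cooking ramen",
    request="Set a 3-minute timer",
)
```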
[0139] According to one embodiment, the processor (210) can determine a plurality of location / shape candidates for the first virtual object (518) using a prompt corresponding to the first speech input (505a). When a plurality of location / shape candidates are determined, the processor (210) can determine the final location or shape of the first virtual object (518) through scoring and evaluation.
[0140] For example, the left or right side of the induction range (511), or the top of the pot (512), may be determined as candidate locations for the first virtual object (518). The processor (210) may determine the top point of the pot (512) as the final location of the first virtual object (518) by reflecting the high correlation score with the pot (512).
[0141] When there is no execution of a virtual object by another application, the processor (210) can determine the size of the first virtual object (518) according to the default settings. The processor (210) can display the first virtual object (518) in the determined position and shape. The first virtual object (518) can be displayed on the top of the pot (512) with the size according to the default settings. When the FOV changes due to the user's gaze movement, the first virtual object (518) can be continuously displayed on the top of the pot (512) to inform the user whether three minutes have elapsed.
[0142] According to one embodiment, the processor (210) may determine information regarding the location / shape of the first virtual object as a text-based output as follows: Information regarding the location / shape of the first virtual object (518): 1. The space directly above the cooking utensil (which is close to the user's main work area and therefore highly visible. It is a location where the user can easily check the timer while cooking. It is placed 30 cm to 50 cm away from the cooking utensil), 2. Size (240*160 pixels). Subsequently, the processor (210) may output the first virtual object by performing the task of converting the text-based output into a coordinate system that can be recognized and output by an XR interface (e.g., XR system UI (241) of FIG. 2).
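The conversion of such a text-based output into values that an XR interface could consume can be sketched as follows. This is only an illustration under stated assumptions: the function name `parse_placement`, the centimetre-based world coordinates with an upward y axis, and the choice of the midpoint of the stated distance range are all hypothetical and are not taken from the embodiment.

```python
import re

def parse_placement(text, anchor_cm):
    """Turn a text-based placement description into numeric values.

    anchor_cm is the (x, y, z) position of the anchor object in centimetres,
    in a hypothetical world coordinate system whose y axis points up.
    """
    size = re.search(r"(\d+)\s*\*\s*(\d+)\s*pixels", text)
    gap = re.search(r"(\d+)\s*cm\s+to\s+(\d+)\s*cm", text)
    if size is None or gap is None:
        raise ValueError("placement text lacks a size or a distance range")
    # Assumption: use the midpoint of the stated distance range.
    offset_cm = (int(gap.group(1)) + int(gap.group(2))) // 2
    x, y, z = anchor_cm
    return {
        "position_cm": (x, y + offset_cm, z),
        "size_px": (int(size.group(1)), int(size.group(2))),
    }

text = ("1. The space directly above the cooking utensil (placed 30 cm to 50 cm "
        "away from the cooking utensil), 2. Size (240*160 pixels)")
# Hypothetical anchor: the top of the pot at (20, 90, 150) cm.
placement = parse_placement(text, anchor_cm=(20, 90, 150))
print(placement)
```

Keeping all lengths as integer centimetres avoids floating-point rounding in the sketch; a real XR coordinate system would likely use its own units and axes.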
[0143]
[0144] FIG. 5b shows a display of a first virtual object among other virtual object displays according to one embodiment.
[0145] Referring to FIG. 2 and FIG. 5b, the processor (210) can display an FOV image (520) on a display (220). The FOV image (520) can correspond to the range that a user looks at while wearing the wearable device (201).
[0146] The processor (210) can receive a first speech input (506b) through a voice recognition application (506). The first speech input (506b) may be "Set a 3-minute timer." The processor (210) can analyze keywords included in the first speech input (506b) to determine a first application (e.g., a clock application) corresponding to the first speech input (506b), and determine a first virtual object (e.g., a 3-minute timer) (528) displayed by the first application.
[0147] According to one embodiment, the processor (210) can determine the location or form in which the first virtual object (528) is displayed by reflecting various contexts around the user (or around the wearable device (201)). The processor (210) can check information that the induction range (521) is activated and information that a pot (522) is placed on the induction range (521) immediately before the first utterance input (506b) occurs.
[0148] In addition, unlike FIG. 5a, the processor (210) may check information that a second virtual object (525) is running in the FOV image (520). For example, the second virtual object (525) may be a user interface of an application related to video playback. The processor (210) may check information regarding the execution time or execution size of the second virtual object (525).
[0149] The processor (210) can determine the location or shape of the first virtual object (528) based on keywords (e.g., 3 minutes, timer, set) included in the first speech input (506b), information about the induction range (521) which is an IoT device, information about the pot (522) which is a real object obtained by analyzing the FOV image (520), and information about the execution time and execution size of the second virtual object (525).
[0150] According to one embodiment, the processor (210) can generate a prompt corresponding to the first utterance input (506b) using multi-parameters corresponding to various contexts. The processor (210) can determine the location or shape of the first virtual object (528) using the generated prompt. For example, when using the multimodal language model (310) of FIG. 3, the multimodal language model (310) can generate the following prompt for the FOV image (520).
[0151] Prompt output of the multimodal language model (310): "The user is currently located in the kitchen. There is an induction range within the current line of sight. The power to the induction range is on, and there is a pot on the induction range. The user mainly uses the induction range for about 3 minutes when cooking ramen. YouTube is running at the top of the user's interface at a specific size. At this time, the user requested that a timer app be launched for 3 minutes. Considering these circumstances, determine where it is best to launch the timer app."
[0152] The prompt may be text containing a description in a form similar to human speech (e.g., natural language). The prompt may express the first speech input (506b) in more detail and may include multi-parameter based conditions and output requests. The prompt may be processed by a separate large language model (e.g., the positioning unit (320) of FIG. 3).
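How multi-parameter context could be assembled into a natural-language prompt of this kind can be sketched with a plain string template. This is an assumption-laden stand-in: in the embodiment the prompt is produced by the multimodal language model (310), whereas the hypothetical `build_prompt` helper below merely concatenates already-extracted context sentences.

```python
def build_prompt(location, observations, request):
    """Join context sentences and the user's request into one prompt string."""
    sentences = [f"The user is currently located in the {location}."]
    sentences.extend(observations)  # one sentence per context parameter
    sentences.append(f"At this time, the user requested that {request}.")
    sentences.append(
        "Considering these circumstances, determine where it is best "
        "to launch the app."
    )
    return " ".join(sentences)

prompt = build_prompt(
    location="kitchen",
    observations=[
        "There is an induction range within the current line of sight.",
        "The power to the induction range is on, and there is a pot on it.",
        "YouTube is running at the top of the user's interface.",
    ],
    request="a timer app be launched for 3 minutes",
)
print(prompt)
```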
[0153] The processor (210) can determine multiple location / shape candidates using a prompt corresponding to the first speech input (506b). When multiple candidates are determined, the processor (210) can determine the final location or shape of the first virtual object (528) through scoring and evaluation. For example, the left or right side of the induction range (521), or the top of the pot (522), may be determined as candidate locations for the first virtual object (528). The processor (210) may determine the top point of the pot (522) as the location of the first virtual object (528) by reflecting the high correlation score with the pot (522). The second virtual object (525) related to video playback may be set to have a higher priority than the first virtual object (528). The processor (210) may display the first virtual object (528) so that the first virtual object (528) does not obscure the second virtual object (525).
[0154] According to one embodiment, the processor (210) may determine information regarding the location / shape of the first virtual object (528) as a text-based output as follows: Information regarding the location / shape of the first virtual object (528): 1. Space directly above the cooking utensil (it receives high attention because it is close to the user's main work area. It is a location where the user can easily check the timer while cooking. It is displayed 30 cm to 50 cm away from the cooking utensil), 2. Size (240*160 pixels), 3. Relationship with other apps (lower priority than the YouTube UI). Subsequently, the processor (210) may output the first virtual object (528) by performing the task of converting the text-based output into a coordinate system that can be recognized and output by an XR interface (e.g., XR system UI (241) of FIG. 2).
[0155] FIG. 5b is exemplary and is not limited thereto. Unlike FIG. 5b, the processor (210) may display the first virtual object (528) so that it partially overlaps with the pot (522), thereby displaying the first virtual object (528) and the second virtual object (525) so that they do not overlap. Alternatively, the processor (210) may display the first virtual object (528) and the second virtual object (525) so that they do not overlap, by making the size of the first virtual object (528) smaller than the size set by default. As another example, the processor (210) may display the first virtual object (528) in the left space of the pot (522).
[0156] When the FOV changes due to the user's gaze movement, the first virtual object (528) is continuously displayed on the top of the pot (522) to inform the user whether 3 minutes have passed.
[0157] According to one embodiment, the processor (210) may change the position or shape of the first virtual object (528) in response to the passage of time or a change in the state of the second virtual object (525). For example, when the 3-minute timer has 1 minute remaining, or when the playback of the video in the second virtual object (525) ends, the processor (210) may display the first virtual object (528) on a higher layer than the second virtual object (525) so that the first virtual object (528) partially obscures the second virtual object (525).
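The layer change described above can be expressed as a simple rule. The sketch below is illustrative only; the function name `first_object_on_top` and the 60-second threshold parameter are assumptions chosen to mirror the "1 minute remaining" example.

```python
def first_object_on_top(remaining_s, video_playing, threshold_s=60):
    """Return True when the timer should be drawn above the video UI.

    The timer is promoted when little time remains or when playback ended.
    """
    return remaining_s <= threshold_s or not video_playing

# Timer just started while the video plays: the video keeps priority.
start_state = first_object_on_top(remaining_s=180, video_playing=True)
# One minute left: the timer may partially obscure the video.
last_minute = first_object_on_top(remaining_s=60, video_playing=True)
# Playback ended with two minutes left: the timer is promoted as well.
video_ended = first_object_on_top(remaining_s=120, video_playing=False)
print(start_state, last_minute, video_ended)
```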
[0158] According to one embodiment, the processor (210) can provide voice feedback (506c) related to the placement of the first virtual object (528) through the voice recognition application (506).
[0159]
[0160] FIG. 6a shows a representation of a first virtual object associated with another space according to one embodiment.
[0161] Referring to FIG. 2 and FIG. 6a, the processor (210) can display an FOV image (610) on a display (220). The FOV image (610) can correspond to the range that a user looks at while wearing the wearable device (201).
[0162] The processor (210) can receive a user's speech input through a voice recognition application (605) and output a response corresponding to the speech input. The response corresponding to the speech input may include an operation to display a virtual object on the FOV image (610) by executing another application.
[0163] For example, the processor (210) may receive a first speech input (605a) through a voice recognition application (605). The first speech input (605a) may be "Set a 3-minute timer." The processor (210) may analyze keywords included in the first speech input (605a) to determine a first application (e.g., a clock application) corresponding to the first speech input (605a), and determine a first virtual object (or first user interface) (e.g., a 3-minute timer) (630) displayed by the first application.
[0164] According to one embodiment, the processor (210) can determine the location or form in which the first virtual object (630) is displayed by reflecting various contexts around the user (or around the wearable device (201)).
[0165] For example, the processor (210) can check information that a second virtual object (611) related to video playback (e.g., a YouTube user interface) is running before the first speech input (605a) occurs. The processor (210) can check information regarding the execution time and execution size of the second virtual object (611). Additionally, the processor (210) can check information that an induction range is running in the space where the user (or wearable device (201)) is located, even if it is not within the FOV image (610). The processor (210) can check information that a pot is placed on the induction range in another FOV image that was previously displayed. The processor (210) can determine the location or shape of the first virtual object (630) based on keywords (e.g., 3 minutes, timer, set) included in the first speech input (605a), information regarding the induction range which is an IoT device, and information regarding the execution time and execution size of the second virtual object (611).
[0166] According to one embodiment, the processor (210) may determine a plurality of location / shape candidates for a first virtual object (630) corresponding to a first utterance input (605a). When a plurality of location / shape candidates are determined, the processor (210) may determine the final location or shape of the first virtual object (630) through scoring and evaluation.
[0167] For example, the left or right side of the second virtual object (611) may be determined as a candidate location for the first virtual object (630). In a context where the timer is determined to be highly relevant to the operation of the induction range (e.g., the induction range was just turned ON, and there is a history of a user requesting that the timer be displayed when the induction range is turned ON), the left side of the second virtual object (611), which is a point close to the induction range, may be determined as the location for the first virtual object (630). The second virtual object (611) related to video playback may be set to have a higher priority than the first virtual object (630). The processor (210) may display the first virtual object (630) so that the first virtual object (630) does not obscure the second virtual object (611).
[0168] According to one embodiment, if the display of the first virtual object (630) relates to a space or area that does not correspond to the FOV image (610), the processor (210) may set a different method of displaying the first virtual object (630). For example, a first part (630a) of the first virtual object (630) may be displayed outside the FOV image (610), and a second part (630b) of the first virtual object (630) may be displayed inside the FOV image (610). Through this, the processor (210) may indirectly inform the user of the state in which the first virtual object (630) is displayed in relation to an induction range located outside the FOV image (610). Alternatively, as another example, the processor (210) may display an auxiliary image (e.g., a pot-shaped icon) related to the induction range outside the FOV image (610) together with the first virtual object (630). Through this, the user can be clearly informed that the first virtual object (630) is displayed in relation to the induction range outside the FOV image (610).
[0169] According to one embodiment, the processor (210) can provide voice feedback (605b) related to the placement of the first virtual object (630) through a voice recognition application (605).
[0170]
[0171] FIG. 6b shows the shape of a first virtual object according to one embodiment.
[0172] Referring to FIG. 2 and FIG. 6b, the processor (210) can display an FOV image (620) on a display (220). The FOV image (620) can correspond to the range that a user looks at while wearing the wearable device (201).
[0173] The processor (210) can receive a user's speech input through a voice recognition application (606) and output a response corresponding to the speech input. The response corresponding to the speech input may include an operation to display a virtual object on the FOV image (620) by executing another application.
[0174] For example, the processor (210) may receive a first speech input (606b) through a voice recognition application (606). The first speech input (606b) may be "Connect me to a video call with Tom." The processor (210) may analyze keywords included in the first speech input (606b) to determine a first application (e.g., a video call application) corresponding to the first speech input (606b), and determine a first virtual object (or first user interface) (e.g., a video call UI) (680) displayed by the first application.
[0175] According to one embodiment, the processor (210) can determine the location or form in which the first virtual object (680) is displayed by reflecting various contexts around the user (or around the wearable device (201)).
[0176] For example, the processor (210) can check information that a second virtual object (621) related to video playback is running before the first speech input (606b) occurs. The processor (210) can determine the location or form of the first virtual object (680) based on keywords (e.g., Tom, video call, connection) included in the first speech input (606b) and information regarding the execution time and execution size of the running second virtual object (621).
[0177] According to one embodiment, the processor (210) can generate a prompt corresponding to the first utterance input (606b) using multi-parameters corresponding to various contexts. The processor (210) can determine the location or shape of the first virtual object (680) using the generated prompt. For example, when using the multimodal language model (310) of FIG. 3, the multimodal language model (310) can generate the following prompt for the FOV image (620).
[0178] Prompt output of the multimodal language model (310): "The user is currently located in the living room. The user is currently watching a YouTube video on a large screen. It has been about 10 minutes since the user started watching the video. At this time, the user requested to make a video call to Tom's contact. Considering this situation, determine where it is best to launch the video call app."
[0179] The processor (210) can determine multiple position / shape candidates using a prompt corresponding to the first speech input (606b). For example, the upper, left, or right side of the second virtual object (621) may be determined as a candidate position of the first virtual object (680). The horizontal or vertical arrangement of the user-side image and the counterpart-side image included in the first virtual object (680) may be determined as a candidate shape.
[0180] When multiple location / shape candidates are determined, the processor (210) can determine the final location or shape of the first virtual object (680) through scoring and evaluation. For example, the upper, left, or right side of the second virtual object (621) may be determined as a candidate location for the first virtual object (680). If the importance of the recognized actual object (e.g., a picture frame) is not high, the processor (210) may determine the left side of the second virtual object (621), which matches the user's eye level, as the location for the first virtual object (680). Additionally, the processor (210) may arrange the user-side image and the counterpart-side image included in the first virtual object (680) vertically, taking into account the size of the space on the left side of the second virtual object (621).
[0181] According to one embodiment, the second virtual object (621) associated with video playback may be set to have a lower priority than the first virtual object (680). In this case, unlike the form illustrated in FIG. 6b, the processor (210) may display the first virtual object (680) such that the first virtual object (680) covers at least a part (e.g., the edge) of the second virtual object (621) and the size of the second virtual object (621) is maintained at a size greater than or equal to a specified size.
[0182]
[0183] FIG. 7 is a flowchart regarding the display of a first virtual object inside or outside the FOV according to one embodiment.
[0184] Referring to FIGS. 2, 3, and 7, in operation 710, the processor (210) can determine a plurality of location / shape candidates for the first virtual object. For example, the processor (210) can determine the plurality of location / shape candidates for the first virtual object using a prompt generated in relation to the first utterance input.
[0185] In operation 720, the processor (210) can calculate an evaluation score for each of the multiple location / shape candidates of the first virtual object according to a pre-stored evaluation criteria table. For example, the evaluation criteria table may include validity scores regarding relevance to the task the user is currently working on, placement appropriateness based on spatial awareness, stability, and comfort.
[0186] In operation 730, the processor (210) can determine whether the first position / shape candidate with the highest validity score is included in the user's current FOV image.
[0187] In operation 740, if the first position / shape candidate is included in the current FOV image (operation 730-YES), the processor (210) may display the first virtual object directly within the current FOV image according to the first position / shape candidate. The processor (210) may output voice feedback regarding the position / shape of the first virtual object.
[0188] In operation 750, if the first position / shape candidate is not included in the current FOV image (operation 730-NO), the processor (210) may display a user interface to receive consent from the user. For example, the user interface may include a pop-up and a voice notification.
[0189] In operation 760, the processor (210) may display a first virtual object according to a first position / shape candidate in an area other than the current FOV image, depending on whether the user consents.
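Operations 720 to 760 can be sketched as a short decision routine. This is a minimal illustration under stated assumptions: the criteria names, the integer scores, the unweighted sum, and the `ask_consent` callback are hypothetical, and the behavior when the user declines (nothing is displayed) is not specified in the description.

```python
CRITERIA = ("relevance", "placement", "stability", "comfort")

def total_score(scores):
    """Operation 720: sum the per-criterion scores (assumed unweighted)."""
    return sum(scores[name] for name in CRITERIA)

def choose_and_display(candidates, fov_labels, ask_consent):
    """Return (label, status) for the best location / shape candidate."""
    best = max(candidates, key=lambda c: total_score(c["scores"]))
    if best["label"] in fov_labels:            # operation 730-YES
        return best["label"], "displayed_in_fov"        # operation 740
    if ask_consent(best["label"]):             # operation 750
        return best["label"], "displayed_outside_fov"   # operation 760
    return None, "not_displayed"               # assumption: user declined

candidates = [
    {"label": "top of pot",
     "scores": {"relevance": 5, "placement": 4, "stability": 4, "comfort": 3}},
    {"label": "left of video",
     "scores": {"relevance": 3, "placement": 4, "stability": 4, "comfort": 4}},
]
# The pot is outside the current FOV, so consent is requested first.
outside = choose_and_display(candidates, {"left of video"}, lambda label: True)
inside = choose_and_display(candidates, {"top of pot"}, lambda label: False)
declined = choose_and_display(candidates, {"left of video"}, lambda label: False)
print(outside, inside, declined)
```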
[0190]
[0191] FIG. 8 shows the display of a first virtual object on a large screen display according to one embodiment.
[0192] Referring to FIG. 2 and FIG. 8, when the display (220) is a large screen, the processor (210) may include a front scene (810) and a side scene (820) in the FOV image. The FOV image may include an intermediate area (815) where at least a portion of the front scene (810) and the side scene (820) overlap.
[0193] The processor (210) can cause at least a portion of the first virtual object (850) corresponding to the first speech input to be displayed in the intermediate area (815). For example, if the first virtual object (850) is a video call, the processor (210) can place most of the other party's video (851) in the intermediate area (815) and place the user's video (852) in the side scene (820) rather than the intermediate area (815). Through this, the user can perceive the execution of the first virtual object (850) without interfering with the viewing of the second virtual object (830) displayed through the front scene (810).
[0194]
[0195] FIG. 9 shows the arrangement of content generated using generative AI according to one embodiment.
[0196] Referring to FIGS. 2 and 9, the processor (210) can display an FOV image (910) on a display (220). The FOV image (910) can correspond to the range that a user looks at while wearing the wearable device (201).
[0197] The processor (210) can receive a user’s speech input through a voice recognition application (905) and output a response corresponding to the speech input. The response corresponding to the speech input may include an operation to display a virtual object on the FOV image (910) by executing another application.
[0198] For example, the processor (210) can receive a first speech input (905a) through a voice recognition application (905). The first speech input (905a) may be "Summarize this video and make it into a text file."
[0199] The processor (210) can check information regarding the execution time or execution size of the second virtual object (911) of the application related to video playback before the first utterance input (905a) occurs. For example, the second virtual object (911) may be a user interface related to video playback.
[0200] The processor (210) can determine a first application (e.g., generative AI) corresponding to the first utterance input (905a) by analyzing keywords included in the first utterance input (905a). The processor (210) can determine the location / form of a first virtual object (e.g., shortcut icon) (920) associated with a file (e.g., video summary file) generated by the operation of the first application by reflecting various contexts.
[0201] The processor (210) can generate the first virtual object (920) and the second virtual object (911) to have similar forms (e.g., setting the icon of the summary file to the representative image of the video) when the first virtual object (920) is associated with the second virtual object (911).
[0202] For example, the processor (210) may place the first virtual object (e.g., a shortcut icon) (920) to the left or right of the second virtual object (911) so that the first virtual object (e.g., a shortcut icon) (920) does not overlap with the second virtual object (911). Additionally, the processor (210) may place the first virtual object (e.g., a shortcut icon) (920) at the height of the midpoint of the FOV image (910) so that it is convenient for the user to execute.
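The placement rule described above (beside the video UI, no overlap, at mid-height of the FOV image) can be sketched with axis-aligned rectangles. The `Rect` type, the pixel coordinates with a top-left origin, the 16-pixel gap, and the left-before-right preference are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int
    h: int

def overlaps(a, b):
    """True when two axis-aligned rectangles share any area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)

def place_icon(fov, video, icon_w, icon_h, gap=16):
    """Try the left, then the right side of the video at FOV mid-height."""
    y = fov.y + fov.h // 2 - icon_h // 2
    left = Rect(video.x - gap - icon_w, y, icon_w, icon_h)
    right = Rect(video.x + video.w + gap, y, icon_w, icon_h)
    for cand in (left, right):
        inside = cand.x >= fov.x and cand.x + cand.w <= fov.x + fov.w
        if inside and not overlaps(cand, video):
            return cand
    return None  # no free side; the caller would need another strategy

fov = Rect(0, 0, 1920, 1080)
video = Rect(100, 200, 1200, 675)
icon = place_icon(fov, video, icon_w=96, icon_h=96)
print(icon)
```

In this example the left side would fall outside the FOV image, so the icon lands to the right of the video.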
[0203]
[0204] FIG. 10 shows a change in a first virtual object over time according to one embodiment.
[0205] Referring to FIG. 2 and FIG. 10, the first virtual object (1030) may be displayed by the first utterance input. At the time the first utterance input occurs, the second virtual object (1010) and the third virtual object (1020) may be in an active state.
[0206] According to one embodiment, the processor (210) can set the importance between each virtual object. If the importance (or priority) between virtual objects changes over time, the processor (210) can change the position / shape of the first virtual object (1030) to reflect the change in importance. For example, at the time of the first output of the first virtual object (1030), the first virtual object (1030) may be displayed on a lower layer than the second virtual object (1010) and the third virtual object (1020).
[0207] Subsequently, when the importance changes in response to a change in the state of the second virtual object (1010) and the third virtual object (1020), or the first virtual object (1030), the processor (210) may change the position or shape of the first virtual object (1030) to reflect the changed importance. For example, if no user input occurs for more than a specified time in the third virtual object (1020), the first virtual object (1030) may be displayed on a higher layer than the third virtual object (1020). As another example, when the timer is set to 3 minutes, if the remaining time becomes less than 1 minute, the first virtual object (1030) may be displayed on a higher layer than the third virtual object (1020).
[0208]
[0209] FIG. 11 shows a change in a first virtual object according to the movement of a user according to one embodiment.
[0210] Referring to FIG. 2 and FIG. 11, the processor (210) can display an FOV image (1110) on a display (220). The FOV image (1110) can correspond to the range that a user looks at while wearing the wearable device (201).
[0211] According to one embodiment, the processor (210) may determine the left or right side of the induction range (1111), or the top of the pot (1112), as the candidate location of the first virtual object (1130). The processor (210) may determine the top point of the pot (1112) as the final location of the first virtual object (1130) by reflecting the high correlation score with the pot (1112).
[0212] Subsequently, when the user moves and the FOV image changes, the processor (210) can display a modified virtual object (1135) associated with the first virtual object (1130). The modified virtual object (1135) can be generated through generative AI in a form that matches the space the user moved to.
[0213] According to one embodiment, the processor (210) may display an auxiliary image (e.g., a pot-shaped icon) associated with an induction range together with a modified virtual object (1135). This allows the user to clearly be informed that the modified virtual object (1135) is displayed in relation to an induction range (1111) in a different space.
[0214]
[0215] FIG. 12 shows a change in the first virtual object according to the movement of the user and the placement of the second virtual object according to one embodiment.
[0216] Referring to FIG. 2 and FIG. 12, the processor (210) can display an FOV image (1210) on a display (220). The FOV image (1210) can correspond to the range that a user looks at while wearing the wearable device (201).
[0217] According to one embodiment, the processor (210) may determine the left or right side of the induction range (1211), or the top of the pot (1212), as the candidate location of the first virtual object (1230). The processor (210) may determine the top point of the pot (1212) as the final location of the first virtual object (1230) by reflecting the high correlation score with the pot (1212).
[0218] Subsequently, when the user moves to look at a different area, the processor (210) can display the FOV image (1220) on the display (220). The processor (210) can check information regarding the execution time or execution size of the second virtual object (1221) in the FOV image (1220). For example, the second virtual object (1221) may be a user interface of an application related to video playback.
[0219] The processor (210) can display a modified virtual object (1235) associated with the first virtual object (1230) by reflecting the execution time or execution size of the second virtual object (1221). The modified virtual object (1235) can be generated through generative AI in a form that matches the space the user has moved to. The modified virtual object (1235) can be positioned so as not to obscure the second virtual object (1221).
[0220] According to one embodiment, the processor (210) may display an auxiliary image (e.g., a pot-shaped icon) associated with an induction range together with a modified virtual object (1235). This allows the user to clearly be informed that the modified virtual object (1235) is displayed in relation to an induction range in a different space.
[0221]
[0222] A wearable device can receive and process a user's speech input through a voice recognition application. The voice recognition application can execute an application other than the voice recognition application to display a user interface corresponding to the user's speech input. For example, if the speech input is "Set a 3-minute timer," the wearable device can analyze keywords included in the speech input to determine the clock application as the application corresponding to the speech input. The wearable device can display the 3-minute timer executed by the clock application as a virtual object. In this case, the wearable device displays the virtual object in the center of the display based on default settings without reflecting the context around the user. Consequently, the user experiences the inconvenience of having to move the virtual object again or reconfigure it.
[0223]
[0224] A wearable device according to one embodiment may include a display, a memory, and at least one processor. The memory may store instructions that, when executed individually or collectively by the at least one processor, cause the wearable device to: execute a voice recognition application that processes a user's voice input; receive a first speech input of the user through the voice recognition application; determine a first virtual object corresponding to the first speech input that is executed in an application other than the voice recognition application; determine a location on the display where the first virtual object is to be displayed or a shape of the first virtual object based on at least one of external device information around the wearable device, information regarding an object recognized in images that are currently being output through the display or have a history of being output through the display, or information related to the user; and display the first virtual object on the display according to the location or shape.
[0225] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the wearable device may store the images in advance prior to the occurrence of the first speech input and determine the location or the shape using the stored images.
[0226] According to one embodiment, the information regarding the object may include information regarding the actual object included in the images or information regarding the second virtual object running in the images.
[0227] According to one embodiment, the information regarding the user may include information regarding the execution method or execution pattern of an application executed in relation to the user.
[0228] According to one embodiment, the external device information may include information regarding the operating status or operating history of an IoT device within a specified distance from the wearable device.
[0229] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the wearable device may use a first large language model to generate a prompt corresponding to the first speech input, the external device information, the information regarding the object, or the information regarding the user to determine the location or the form.
[0230] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the wearable device may determine a plurality of candidates for the location or the form based on the prompt using a second large language model.
[0231] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the wearable device may use a third large language model to calculate an evaluation score for each of the plurality of candidates and use the calculated evaluation score to determine the location or the shape.
[0232] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, and a first candidate having the highest evaluation score among the plurality of candidates is included in an area corresponding to the image being output through the display, the wearable device may display the first virtual object at the position or in the shape corresponding to the first candidate.
[0233] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, and the first candidate having the highest evaluation score among the plurality of candidates is not included in the area corresponding to the image being output through the display, the wearable device may receive a consent input from the user through a separate user interface and then display the first virtual object at the position or in the shape corresponding to the first candidate.
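As a non-limiting illustration, the branch between paragraphs [0232] and [0233] can be sketched as follows. The `Rect` type, the `ask_consent` callback, and the choice to return `None` when consent is refused are assumptions made for this example; the document does not state what happens when the user declines.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Position = tuple[float, float]


@dataclass(frozen=True)
class Rect:
    # Area corresponding to the image currently output through the display (assumed axis-aligned).
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def resolve_display(
    scored_candidates: list[tuple[Position, float]],
    visible_area: Rect,
    ask_consent: Callable[[Position], bool],
) -> Optional[Position]:
    """Pick the top-scoring candidate; ask for consent only if it lies outside the visible area."""
    (x, y), _score = max(scored_candidates, key=lambda item: item[1])
    if visible_area.contains(x, y):
        return (x, y)
    # Outside the visible area: display only after the user agrees (assumed behaviour on refusal: no display).
    return (x, y) if ask_consent((x, y)) else None


visible = Rect(0.0, 0.0, 1.0, 1.0)
candidates = [((0.5, 0.5), 0.6), ((1.4, 0.5), 0.9)]
print(resolve_display(candidates, visible, ask_consent=lambda pos: True))
```

Here the top-scoring candidate lies outside the visible area, so the consent callback is consulted before the position is returned.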
[0234] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the wearable device may change the position or the shape when the importance of the object or the importance of the first virtual object changes over time.
[0235] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the wearable device may assign a first weight to a first image currently being output through the display, assign a second weight to a second image that was previously output through the display, and determine the position or the shape based on the first weight and the second weight.
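As a non-limiting illustration, one possible way of combining the first weight and the second weight is a weighted sum of per-object relevance values taken from the current image and from a previously output image. The document does not disclose how the weights are combined, so the weighted sum, the relevance values, and the `weighted_object_importance` helper are all assumptions made for this example.

```python
def weighted_object_importance(current_objects, history_objects, first_weight, second_weight):
    """Combine per-object relevance from the current image and a previously output image.

    current_objects / history_objects map an object label to an assumed relevance value in [0, 1].
    """
    combined = {}
    for name, value in current_objects.items():
        combined[name] = combined.get(name, 0.0) + first_weight * value
    for name, value in history_objects.items():
        combined[name] = combined.get(name, 0.0) + second_weight * value
    return combined


current = {"stove": 0.9, "tv": 0.2}   # objects recognized in the first (current) image
previous = {"tv": 0.8, "sofa": 0.5}   # objects recognized in the second (previously output) image
importance = weighted_object_importance(current, previous, first_weight=0.7, second_weight=0.3)
anchor_object = max(importance, key=importance.get)
print(anchor_object, importance)
```

With a larger first weight, objects in the current image dominate the result, while objects seen only in earlier images still contribute; the virtual object could then be placed relative to the object with the highest combined value.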
[0236] A method for executing an application according to one embodiment may include: executing a voice recognition application that processes a user's voice input; receiving a first speech input of the user through the voice recognition application; determining a first virtual object that corresponds to the first speech input and is executed in an application other than the voice recognition application; determining a position on a display of a wearable device where the first virtual object is to be displayed, or a shape of the first virtual object, based on at least one of external device information around the wearable device, information regarding an object recognized in images that are currently being output through the display or were previously output through the display, or information related to the user; and displaying the first virtual object on the display according to the position or the shape.
[0237] According to one embodiment, the operation of determining the position or the shape may include determining the position or the shape using images stored before the first speech input occurs.
[0238] According to one embodiment, the information regarding the object may include information regarding a real object included in the images or information regarding a second virtual object running in the images.
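As a non-limiting illustration, storing images before a speech input occurs can be sketched with a fixed-size buffer of recent frames. The `Frame` record, the buffer capacity, and the `FrameHistory` class are hypothetical and are introduced only for this example; the document does not specify how or for how long images are stored.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    timestamp: float          # seconds, assumed monotonic clock
    objects: tuple[str, ...]  # labels of objects recognized in the frame


class FrameHistory:
    """Keeps only the most recent frames so they are available when speech arrives."""

    def __init__(self, capacity: int):
        self._frames = deque(maxlen=capacity)  # oldest frames are dropped automatically

    def record(self, frame: Frame) -> None:
        self._frames.append(frame)

    def frames_before(self, speech_time: float) -> list[Frame]:
        return [f for f in self._frames if f.timestamp < speech_time]


history = FrameHistory(capacity=3)
for t, objs in [(1.0, ("sofa",)), (2.0, ("tv",)), (3.0, ("tv", "stove")), (4.0, ("stove",))]:
    history.record(Frame(t, objs))

stored = history.frames_before(speech_time=3.5)
print([f.timestamp for f in stored])
```

Because the buffer holds three frames, the frame recorded at 1.0 s has already been discarded by the time the speech input arrives in this sketch.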
[0239] According to one embodiment, the information regarding the user may include information regarding the execution method or execution pattern of an application executed in relation to the user.
[0240] According to one embodiment, the external device information may include information regarding the operating status or operating history of an IoT device within a specified distance from the wearable device.
[0241] According to one embodiment, the operation of determining the position or the shape may include determining the position or the shape by using a first large language model to generate a prompt corresponding to the first speech input, the external device information, the information regarding the object, or the information regarding the user.
[0242] According to one embodiment, the operation of determining the position or the shape may include determining a plurality of candidates for the position or the shape based on the prompt using a second large language model.
[0243] According to one embodiment, the operation of determining the position or the shape may include calculating an evaluation score for each of the plurality of candidates using a third large language model, and determining the position or the shape using the calculated evaluation scores.
[0244]
[0245] A wearable device according to one embodiment disclosed in this document can output a response corresponding to a user's speech input by reflecting various contexts around the user.
[0246] A wearable device according to embodiments disclosed in this document can extract various contexts around a user as multiple parameters and input them into a multimodal large language model. Using a prompt output from the multimodal large language model, the wearable device can output a virtual object at a position or in a shape optimized for the user's speech input.
[0247]
[0248] The various embodiments of this document and the terms used therein are not intended to limit the technical features described in this document to specific embodiments, and should be understood to include various modifications, equivalents, or substitutions of those embodiments. In connection with the description of the drawings, similar reference numerals may be used for similar or related components. The singular form of a noun corresponding to an item may include one or more of the items unless the relevant context clearly indicates otherwise. In this document, phrases such as "A or B," "at least one of A and B," "at least one of A or B," "A, B, or C," "at least one of A, B, and C," and "at least one of A, B, or C" may each include any one of the items listed together in the corresponding phrase, or all possible combinations thereof. Terms such as "first" and "second" may be used simply to distinguish one component from another and do not limit the components in any other aspect (e.g., importance or order). Where a (e.g., first) component is referred to as "coupled" or "connected" to another (e.g., second) component, with or without the terms "functionally" or "communicatively," it means that the component may be connected to the other component directly (e.g., by wire), wirelessly, or through a third component.
[0249] The term “module” as used in the various embodiments of this document may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit, for example. A module may be a component formed integrally, or a minimum unit of said component or a part thereof that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).
[0250] Various embodiments of the present document may be implemented as software (e.g., program (140)) comprising one or more instructions stored in a storage medium (e.g., internal memory (136) or external memory (138)) readable by a machine (e.g., electronic device (101)). For example, a processor (e.g., processor (120)) of the machine (e.g., electronic device (101)) may invoke at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one invoked instruction. The one or more instructions may include code generated by a compiler or code that can be executed by an interpreter. The storage medium readable by the machine may be provided in the form of a non-transitory storage medium. Here, "non-transitory" simply means that the storage medium is a tangible device and does not contain a signal (e.g., electromagnetic waves), and the term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily.
[0251] According to one embodiment, the method according to the various embodiments disclosed herein may be provided as part of a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product may be temporarily stored or temporarily created on a machine-readable storage medium, such as the memory of a manufacturer's server, an application store's server, or a relay server.
[0252] According to various embodiments, each of the components described above (e.g., a module or program) may include a single entity or multiple entities, and some of the multiple entities may be separated and placed in other components. According to various embodiments, one or more of the aforementioned components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the multiple components in the same or similar manner as they were performed by the corresponding component among the multiple components prior to integration. According to various embodiments, operations performed by the module, program, or other components may be executed sequentially, in parallel, iteratively, or heuristically, or one or more of the operations may be executed in a different order or omitted, or one or more other operations may be added.
Claims
1. A wearable device comprising: a display; memory; and at least one processor comprising processing circuitry, wherein the memory stores instructions that, when executed individually or collectively by the at least one processor, cause the wearable device to: execute a voice recognition application that processes a user's voice input; receive a first speech input of the user through the voice recognition application; determine a first virtual object that corresponds to the first speech input and is executed in an application other than the voice recognition application; determine a position on the display where the first virtual object is to be displayed, or a shape of the first virtual object, based on at least one of external device information around the wearable device, information regarding an object recognized in images currently being output through the display or previously output through the display, or information related to the user; and display the first virtual object on the display according to the position or the shape.
2. The wearable device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the wearable device to: store the images before the first speech input occurs; and determine the position or the shape using the stored images.
3. The wearable device of claim 1, wherein the information regarding the object includes information regarding a real object included in the images or information regarding a second virtual object running in the images.
4. The wearable device of claim 1, wherein the information related to the user includes information regarding an execution method or an execution pattern of an application executed in relation to the user.
5. The wearable device of claim 1, wherein the external device information includes information regarding an operating status or an operating history of an IoT device within a specified distance from the wearable device.
6. The wearable device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the wearable device to determine the position or the shape by using a first large language model to generate a prompt corresponding to the first speech input, the external device information, the information regarding the object, or the information related to the user.
7. The wearable device of claim 6, wherein the instructions, when executed individually or collectively by the at least one processor, cause the wearable device to determine a plurality of candidates for the position or the shape based on the prompt using a second large language model.
8. The wearable device of claim 7, wherein the instructions, when executed individually or collectively by the at least one processor, cause the wearable device to: calculate an evaluation score for each of the plurality of candidates using a third large language model; and determine the position or the shape using the calculated evaluation scores.
9. The wearable device of claim 8, wherein the instructions, when executed individually or collectively by the at least one processor, cause the wearable device to, when a first candidate having the highest evaluation score among the plurality of candidates is included in an area corresponding to an image being output through the display, display the first virtual object at the position or in the shape corresponding to the first candidate.
10. The wearable device of claim 8, wherein the instructions, when executed individually or collectively by the at least one processor, cause the wearable device to, when a first candidate having the highest evaluation score among the plurality of candidates is not included in an area corresponding to an image being output through the display, receive a consent input from the user through a separate user interface and display the first virtual object at the position or in the shape corresponding to the first candidate.
11. The wearable device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the wearable device to change the position or the shape when an importance of the object or an importance of the first virtual object changes over time.
12. The wearable device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the wearable device to: assign a first weight to a first image being output through the display; assign a second weight to a second image that was previously output through the display; and determine the position or the shape based on the first weight and the second weight.
13. A method for executing an application, performed by a wearable device, the method comprising: executing a voice recognition application that processes a user's voice input; receiving a first speech input of the user through the voice recognition application; determining a first virtual object that corresponds to the first speech input and is executed in an application other than the voice recognition application; determining a position on a display of the wearable device where the first virtual object is to be displayed, or a shape of the first virtual object, based on at least one of external device information around the wearable device, information regarding an object recognized in images currently being output through the display or previously output through the display, or information related to the user; and displaying the first virtual object on the display according to the position or the shape.
14. The method of claim 13, wherein determining the position or the shape comprises determining the position or the shape using the images stored before the first speech input occurs.
15. The method of claim 13, wherein the information regarding the object includes information regarding a real object included in the images or information regarding a second virtual object running in the images.